Adding Annotation Qualifiers

Sometimes, the gene annotation is provided in several files. The core data, such as gene model coordinates and transcript names, may be available as GFF3 files, while extra information is supplied in additional files. The command add annotation_qualifier adds extra qualifiers to genes already loaded into the database.

The additional gene properties can be supplied in tab-delimited files. Each line in such a file holds values that can be stored as annotation qualifiers. To link each line to the corresponding gene model in the database, one of the columns must contain the gene name or another unique identifier. This value is used to uniquely identify the gene model. During the test phase, PersephoneShell will try to establish an unambiguous link between the lines in the file and the gene models in the database and confirm whether all genes can be found.
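The kind of check performed in the test phase can be pictured with a short Python sketch. It is only an illustration, not PersephoneShell code: the function name check_links, the sample gene names and the in-memory set standing in for the database are all assumptions made for this example.

```python
def check_links(text, name_index, known_names, skip_header_lines=0):
    """Report names in a tab-delimited text that cannot be linked unambiguously.

    known_names is a hypothetical stand-in for the gene names stored in the database.
    Returns (missing, repeated): names absent from known_names, and names that
    occur on more than one data line.
    """
    missing, repeated, seen = [], [], set()
    for line in text.splitlines()[skip_header_lines:]:
        if not line.strip():
            continue  # ignore empty lines
        name = line.split("\t")[name_index]
        if name not in known_names and name not in missing:
            missing.append(name)
        if name in seen and name not in repeated:
            repeated.append(name)
        seen.add(name)
    return missing, repeated


# Made-up sample: one header line, then three data lines.
sample = (
    "name\tPfam\n"
    "GeneA.1\tPF00001\n"
    "GeneA.1\tPF00002\n"
    "GeneC.1\tPF00003\n"
)
missing, repeated = check_links(sample, 0, {"GeneA.1", "GeneB.1"}, skip_header_lines=1)
print("not found:", missing)
print("repeated:", repeated)
```

In this made-up input, GeneC.1 has no counterpart among the known names and GeneA.1 appears on two lines, so neither can be linked unambiguously.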

Specify the map set (MapSetId or MapSetPath), the track name (TrackName) and the qualifier used as the unique transcript name (GeneNameQualifier). Identify each column to be loaded by its 0-based index in the tab-delimited file, and specify the qualifier name under which its values will be stored in the database. For example, to load the values from the column with index 4 (0-based) into a qualifier called "Pfam", use this instruction:

TextQualifierIndex.4=Pfam
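As a rough illustration of how such instructions map columns to qualifiers, here is a minimal Python sketch. The helper names and the sample row are hypothetical, and the sketch treats everything after the equals sign as the qualifier name, ignoring the optional display text, data type and data format.

```python
PREFIX = "TextQualifierIndex."


def parse_qualifier_columns(ini_lines):
    """Collect {column index: qualifier name} from TextQualifierIndex.N=Name lines."""
    mapping = {}
    for line in ini_lines:
        line = line.strip()
        if not line.startswith(PREFIX):
            continue  # comments, blank lines and other keys are ignored
        key, _, qualifier = line.partition("=")
        mapping[int(key[len(PREFIX):])] = qualifier.strip()
    return mapping


def row_to_qualifiers(row, mapping):
    """Pick the mapped columns from one split line, skipping empty cells."""
    return {name: row[i] for i, name in mapping.items() if i < len(row) and row[i]}


mapping = parse_qualifier_columns([
    "; a comment line",
    "TextQualifierIndex.4=Pfam",
    "TextQualifierIndex.5=Panther",
])
# Made-up row: columns 0-3 are placeholders, columns 4 and 5 hold the values.
row = "c0\tc1\tc2\tc3\tPF00001\tPTHR11111".split("\t")
print(row_to_qualifiers(row, mapping))
```

Because the index is 0-based, index 4 refers to the fifth column of the file.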

Please review the sample files in the Samples/AnnotQualifier folder of PersephoneShell.

For example, here is the control file Wm82.a4_info.ini for parsing a tab-delimited file:

[MapSet]
; Either MapSetId or MapSetPath is required.
; MapSetId: id of a target map set.
;MapSetId=1
; MapSetPath: path of a target map set.
MapSetPath="/Glycine max/Wm82.a4.v1"

[Annotation]
Sources=$DATA/soy/Gmax_508_Wm82.a4.v1.annotation_info.txt
; TrackName (required): Annotations from this track will be modified (qualifiers added)
TrackName="Gene models"
; GeneNameQualifier (required): name of qualifier whose value uniquely identifies a gene (e.g., locus_tag) in db and links it to line in the file
GeneNameQualifier=Name

; TextSkipHeaderLines: the number of header lines to skip when parsing
TextSkipHeaderLines=1
; TextCommentPrefix: lines starting with this prefix are skipped when parsing
;TextCommentPrefix="#"
; TextDelimiter: specify one among Colon(:), Comma(,), Period(.), Hyphen(-), SemiColon(;), Slash(/), Tab(\t), VerticalBar(|)
TextDelimiter=Tab
; GeneNameIndex: 0-based index of the column that contains the gene name
GeneNameIndex=2

; MultipleGenesPerName: If false (default), each name should uniquely identify exactly one gene model.
; If a name is common for multiple genes (e.g., gene_id vs. transcript_id), set MultipleGenesPerName to true and the same value from file will be assigned to multiple genes
MultipleGenesPerName=true

; TextQualifierIndex.INDEX(0-based)=qualifierName((:displayText),dataType,dataFormat)
TextQualifierIndex.4=Pfam
TextQualifierIndex.5=Panther
TextQualifierIndex.6=KOG
TextQualifierIndex.7=ec
TextQualifierIndex.8=KO
TextQualifierIndex.9=GO
TextQualifierIndex.10=Best-hit-arabi-name
TextQualifierIndex.11=arabi-symbol
TextQualifierIndex.12=arabi-defline

; RenameDuplicatedQualifiers: in case there are lines with the same gene and qualifier, the qualifier's name collision can be resolved by automatically
; renaming duplicated qualifiers. For example, two lines like these:
; gene1 value1
; gene1 value2
; contain two different values for the same gene that should be stored under the same qualifier, for example, Qual1.
; If RenameDuplicatedQualifiers=true, psh will try to disambiguate the qualifiers by introducing a new qualifier Qual1_1
; that will store the value 'value2'. The result will be saved as Qual1=value1 and Qual1_1=value2.
; default=false
;RenameDuplicatedQualifiers=true

[QualifierLinks]
; Some qualifiers can be shown as hyper-links.
; Link qualifier name-value to external sources.
; %s in the link will be replaced by the value of qualifier
;QUALIFIER_NAME=PLACEHOLDER_URL
;ID="http://rice.plantbiology.msu.edu/cgi-bin/gbrowse/rice/?name=%s"
; The line above would result in the qualifier "ID" shown as a hyper-link. For example, if ID="Os1g123", the URL would be "http://rice.plantbiology.msu.edu/cgi-bin/gbrowse/rice/?name=Os1g123"

; Optional: a qualifier link can be embedded into a longer text of the qualifier value. For example, 'Dbxref' can contain multiple identifiers, each of them
; can be used to construct a hyperlink. Use a regular expression to extract a substring that will be converted into the hyperlink inside the text. The regular expression
; should be appended using a pipe '|' symbol:
;Dbxref=https://www.ebi.ac.uk/interpro/entry/InterPro/%s/|Interpro:(IPR\d+)
; The command above instructs to analyze the qualifier value stored under key 'Dbxref', find substring that starts with 'Interpro:'
; and extract the part that has 'IPR' as the first letters followed by digits.
; Use the psh command 'add qualifier_link' to add more regular expressions.

KOG=https://www.ncbi.nlm.nih.gov/Structure/cdd/%s|(KOG\d+)
Pfam=https://www.ebi.ac.uk/interpro/entry/pfam//%s|(PF\d+)
Panther=http://www.pantherdb.org/panther/familyList.do?searchType=basic&fieldName=all&organism=all&listType=6&fieldValue=%s|(PTHR[^,]+)
KO=https://www.genome.jp/dbget-bin/www_bget?%s|(K\d+)
GO=https://www.ebi.ac.uk/QuickGO/term/%s|(GO:\d+)
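
The renaming described in the RenameDuplicatedQualifiers comment of the control file above can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, the suffixes beyond _1 (such as _2 for a third value) are an assumption, and raising an error when renaming is switched off is a choice made for this sketch, not documented psh behavior.

```python
def assign_qualifiers(pairs, qualifier, rename_duplicates=False):
    """Store (gene, value) pairs under one qualifier name per gene.

    With rename_duplicates=True, a repeated gene gets extra qualifiers
    named qualifier_1, qualifier_2, ... (suffixes past _1 are an assumption).
    """
    result = {}
    for gene, value in pairs:
        quals = result.setdefault(gene, {})
        if qualifier not in quals:
            quals[qualifier] = value
        elif rename_duplicates:
            n = 1
            while f"{qualifier}_{n}" in quals:
                n += 1
            quals[f"{qualifier}_{n}"] = value
        else:
            # Sketch-only behavior: report the collision instead of guessing.
            raise ValueError(f"{gene}: qualifier {qualifier} already has a value")
    return result


lines = [("gene1", "value1"), ("gene1", "value2")]
print(assign_qualifiers(lines, "Qual1", rename_duplicates=True))
```

For the two lines from the comment, the sketch stores Qual1=value1 and Qual1_1=value2 for gene1.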

Remember that some qualifiers can be displayed as hyperlinks (see the [QualifierLinks] section): %s in the link template will be replaced by the value of the qualifier.

Note that a regular expression can be used to extract the values for the hyperlinks from a longer qualifier value. For example, if your qualifier GO contains several entries, such as 'GO:0003824,GO:0005506', the regular expression (GO:\d+) splits the text into several hyperlinks. Specify it after the link template, separated by a pipe symbol:

https://www.ebi.ac.uk/QuickGO/term/%s|(GO:\d+)

This will result in two separate hyperlinks embedded into one qualifier value: GO:0003824,GO:0005506
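
The substitution can be reproduced with Python's re module. The sketch below only illustrates the idea and is not how PersephoneShell implements it; the helper names are hypothetical, and splitting the entry at the first pipe is an assumption about the entry format.

```python
import re


def parse_link_entry(entry):
    """Split 'URL_TEMPLATE|REGEX' at the first pipe; the regex part is optional."""
    template, _, pattern = entry.partition("|")
    return template, pattern or None


def build_links(value, entry):
    """Return (identifier, url) pairs for every match found in a qualifier value."""
    template, pattern = parse_link_entry(entry)
    if pattern is None:
        # No regular expression: the whole value replaces %s.
        return [(value, template.replace("%s", value))]
    links = []
    for m in re.finditer(pattern, value):
        # Use the captured group if the pattern has one, else the whole match.
        ident = m.group(1) if m.re.groups else m.group(0)
        links.append((ident, template.replace("%s", ident)))
    return links


entry = r"https://www.ebi.ac.uk/QuickGO/term/%s|(GO:\d+)"
for ident, url in build_links("GO:0003824,GO:0005506", entry):
    print(ident, "->", url)
```

Each match of the regular expression yields its own URL, so the two GO identifiers in the value become two separate links.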

Different regular expressions can be applied to the same qualifier. For example, Dbxref can contain identifiers of several kinds. To attach several URLs, each with its own regular expression, to one qualifier, type them on separate lines and mention the qualifier name only once:

Dbxref=https://www.ebi.ac.uk/QuickGO/term/%s|(GO:\d+)
    http://pfam.xfam.org/family/%s|(PF\d+)

For easier reading, indent the lines that belong to the same qualifier.
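
To see how several rules act on one value, here is a small Python sketch that applies both Dbxref rules from the example to a made-up qualifier value. The RULES list, the function name and the sample identifiers are assumptions for illustration and do not reflect PersephoneShell internals.

```python
import re

# The two (URL template, regular expression) rules from the Dbxref example.
RULES = [
    ("https://www.ebi.ac.uk/QuickGO/term/%s", r"(GO:\d+)"),
    ("http://pfam.xfam.org/family/%s", r"(PF\d+)"),
]


def links_for(value, rules):
    """Map every identifier matched by any rule to its URL."""
    links = {}
    for template, pattern in rules:
        for ident in re.findall(pattern, value):
            links[ident] = template.replace("%s", ident)
    return links


# Made-up Dbxref value holding identifiers of two kinds.
for ident, url in links_for("GO:0003824,PF00069", RULES).items():
    print(ident, "->", url)
```

Each rule picks out only the identifiers that match its own regular expression, so one qualifier value can carry links to different resources.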

For more examples of the qualifier links please consult this.