Normally, after adding an organism and its map sets, maps, and sequences using PersephoneShell, you would want to add gene annotation tracks. To add gene annotations,  use the add command (see Add) to process a control file (see Control Files) with instructions how to load the data for the annotations. This page shows the steps necessary to add gene annotations to Glycine max (soybean) genome in your Persephone database.

Please study the sample INI file and run the command add annotation that references this file. As usual, first, run the test (using -t switch) and only if everything is correct load the data to the database, preferably using -v for verbose output. As usual, use auto-complete to speed up typing.

add annotation -c Samples/Annotation/Wm82.a4_phytozome-gff.ini -t

Sample file

Here is an example sample file Wm82.a4_phytozome-gff.ini


[ProcessRun]
; Run description: if specified, a custom description will be used,
;                  otherwise, "Added annotation for {MapSet path} from {Sources}." will be used.
RunDescription="Load annotations for Wm82.a4.v1 from Phytozome"

[MapSet]
; Either MapSetId or MapSetPath is required.
; MapSetId: id of a target map set.
;MapSetId=247848026
; MapSetPath: path of a target map set.
MapSetPath="/Glycine max/Wm82.a4.v1"

[Method]
; To add a new annotation method, specify a method name and a color to be shown in the annotation track.
; You don't need to provide this info if the method already exists in the database.
; Supported color name is available in http://www.flounder.com/csharp_color_table.htm
; If the method already exists, it will be updated. Otherwise, a new method will be added.
;Name={NamedColor|HTML hex code|R,G,B}
; Color examples: Red, #FF0000, or 255,0,0

;mRNA=203,128,147
Phytozome13=100,100,255

[Annotation]
; GFF specification in http://www.sequenceontology.org/gff3.shtml
; Sources (required): GFF or GTF file(s) for annotations, accessible locally or remotely via URL. 
; It is important to keep the file name extension consistent with the file format.
; Files can be compressed with gzip.
; $DATA variable defined in psh.exe.config as DataDir can be used in the path.
Sources=$DATA/soy/Gmax_508_Wm82.a4.v1.gene_exons.gff3.gz 

; Tracks (required): comma delimited. Provide the name of the section below that describes the track
;                    At least one track section need to be specified.
Tracks="Genes"
; MaskedParentTypes: parent GFF types (column 3) that together with their child records should be ignored
; This saves some memory during GFF parsing, especially if the GFF file contains many records of this type.

;MaskedParentTypes="biological_region","repeat_region"
; The GFF files are expected to have their records grouped by map. This means that all records of a chromosome should come
; in one continuous block, followed by a block for another chromosome, etc. The records within the block can be unsorted (typical for Ensembl GFFs),
; the program will sort them in memory.
; SkipInconsistentRecords: indicates how to treat the conflicting records. If SkipLine - try to create annotation without
; individual conflicting lines, SkipMRNA - skip the entire MRNA if at least one line of the record is conflicting others, 
; None (by default) - nothing to skip. No case sensitivity.
;SkipInconsistentRecords=SkipMRNA

;AbuttingExons: exons that have an intron of size 0 between them. We can throw an Error (default), AddBoth or Merge such exons.
;AbuttingExons=AddBoth

; Commit frequency: indicates how often the process commits annotations. Every N annotations.; Commit frequency: indicates how often the process commits annotations. Every N annotations.
CommitFrequency=1000

[Genes]
; Method (required): annotation method. If new, should be specified in METHOD section.
Method=Phytozome13

; Column2: while parsing the gff file, only those lines will be analyzed that have the listed values in the gff column 2. 
; Normally, this column shows the method or the source of the annotation.
; If the gff file has multiple sources in column 2, setting this filter will reduce the amount of data to be parsed and significantly improve the performance.
;Column2="maker"

; Track name (required): name of track
TrackName="Gene models"
; TrackDescription: track description
TrackDescription="Gene models predicted by Phytozome13
Downloaded from https://genome.jgi.doe.gov/portal/pages/dynamicOrganismDownload.jsf?organism=Gmax, file Gmax_508_Wm82.a4.v1.gene_exons.gff3.gz.
\"Best hits\" are defined as the top result returned from BLASTP alignment of this species proteome to the target 
    (A. thaliana, O. sativa, or C. reinhardtii listed above). This was run with blast+ 2.2.26 with parameters:
    blastall -p blastp -F \"mS\" -b 1500 -v 1500 -e 0.001 -M BLOSUM45
    and further filtered with an 1E-3 cutoff e-value."

; Type: GFF type (column 3) of annotation items. If not specified, both exon (SO:0000147) and CDS (SO:0000316) will be parsed.
; This line is typically left commented out
; The types of one-exon features can be defined here. To load multiple types, e.g., lncRNA, tRNA into one track,
; list them separated by comma. These items will be loaded as they are - no child records will be analyzed.
; ParentType should be commented out
;Type=lnc_RNA,tRNA

; Parent type (optional): GFF type of parent items that groups annotation items, 
; for protein-coding genes this is typically "mRNA". The child items, such as "exon" or "CDS", that have this common parent, 
; will be grouped and form gene models. The types like "five_prime_UTR" or "three_prime_UTR" are not required but if present
; will be checked for consistency with the exon and CDS records.
; If Parent type is omitted - parent will be auto detected by taking direct parents of exons, cds
ParentType="mRNA"

; Qualifier type: GFF type of parent items that contain qualifiers
QualifierType="mRNA"

; Qualifier attributes: qualifiers to be loaded from the GFF attribute (the last column of GFF). 
; If QualifierTypes list more than one type, to disambiguate the qualifiers with identical names, 
; use "fully-qualified" name that includes the QualifierType. For example, if "gene" and "mRNA" records, requested in QualifierTypes, both have qualifier "Name",
; to load only "Name" from the mRNA line, type "mRNA:Name"
QualifierAttributes=Name,pacid,longest,ancestorIdentifier

; For the qualifiers that need renaming, use a more flexible way of loading qualifiers engaging QualifierAttributeKey instruction. 
; The attribute key in GFF can be stored as a qualifier with a different name. For example, to disambiguate gene:Name and mRNA:Name, 
; store them as two different qualifiers, "geneName" and "transcriptName" using this construct:
; QualifierAttributeKey.gene:Name=geneName
; QualifierAttributeKey.mRNA:Name=transcriptName
;QualifierAttributeKey.[Type:]AttributeKey=qualifierName
;QualifierAttributeKey.Note=Function
; The line above stores attribute "Note" as qualifier "Function"

; New.: Create new qualifiers by extracting substrings from existing gff attributes using regular expression.
; The general form of the instruction is
;New.newQualifierName=[Type:]attributeKey,regex
;New.Pfam=gene:Dbxref,Pfam:(\w+)
;New.Prosite=gene:Dbxref,Prosite:(\w+)
;New.InterPro=gene:Dbxref,InterPro:(IPR\w+)
;New.GO=gene:Ontology_term,(GO:\w+)
; The line above creates a new qualifier 'GO' by extracting a substring from an attribute 'Ontology_term'.
; Note that the new qualifiers can also be created later, by using the command 'add annotation_qualifier'.

; ParentGroupName: the new track will be placed under a parent node with this name. 
; To reduce the number of track nodes on the top level, group the tracks of similar type.
;ParentGroupName=gene models
; IsShownFirst: if false, the track will not be shown by default when the map is opened for the first time
;IsShownFirst=true
; IsSearchable: If true (default), the track data will be indexed for search. If false, the indexing will be skipped
;IsSearchable=false

[QualifierLinks]
; Some qualifiers can be shown as hyper-links.
; Link qualifier name-value to external sources.
; %s in the link is where a qualifier value is placed.
;QUALIFIER_NAME=PLACEHOLDER_URL[|regex]
;ID="http://rice.plantbiology.msu.edu/cgi-bin/gbrowse/rice/?name=%s"
; The line above would result in the qualifier "ID" shown as a hyper-link. For example, if ID="Os1g123", the URL would be "http://rice.plantbiology.msu.edu/cgi-bin/gbrowse/rice/?name=Os1g123"
Name="https://phytozome-next.jgi.doe.gov/report/transcript/Gmax_Wm82_a4_v1/%s"

; Optional: a qualifier link can be embedded into into a longer text of the qualifier value. For example, 'Dbxref' can contain multiple identifiers, each of them 
; can be used to construct a hyperlink. Use a regular expression to extract a substring that will be converted into the hyperlink inside the text. The regular expression
; should be appended using a pipe '|' symbol:
;Dbxref=https://www.ebi.ac.uk/interpro/entry/InterPro/%s/|Interpro:(IPR\d+)
;            https://pfam.xfam.org/family/%s|(PF\d+)
; The command above contains two URL-regex pairs applied to the qualifier key Dbxref separated by a new line.
; The first one instructs to analyze the qualifier value stored under key 'Dbxref', find substring that starts with 'Interpro:'
; and extract the part that has 'IPR' as the first letters followed by digits.
; The second one will extract substrings that start with PF and contain digits, and plug them into the placeholder %s of the URL that points to Pfam site.
; Later, you can use the psh command 'add qualifier_link' to add more URLs and regular expressions.


[AnnotationSearches]
; Add qualifier name-value to search term {GeneName, GeneFunction}
; Some qualifiers can have a special meaning: they represent the gene name and gene function. 
; This will help narrow down the text search specifically to the gene name or function in Persephone.
; This section will establish the distinction. The keyword GeneName tells which qualifier is designated as the gene name.
; Important: use the qualifier names here, which can be different from the GFF attribute names. For example, if you have introduced a qualifier "transcriptName" by
; renaming the GFF attribute mRNA:Name, then type GeneName=transcriptName.
;SEARCH_TERM=QUALIFIER_NAME
GeneName="Name"
;GeneFunction=Description

[MapMapping]
; Normally, the map names in the GFF files should correspond (case sensitive) to the map names in the database. 
; In such case, no map name matching is necessary and this section can be empty.
; If map names in file are different from those in DB, you should map each MAP_NAME in the file to a MAP_NAME in DB.
; The manual mapping is below:
;MAP_NAME_IN_FILE=MAP_NAME in DB
;Chr1=Chr.1
;Chr2=Chr.2
;Chr3=Chr.3
; Check 'printmapping' command that may help generating the name mapping tables.

; Alternatively, use MapIdentifiedBy
; MapsIdentifiedBy: if all maps in the file instead of the map name are identified by their alternative IDs like MAP_ID, ACCESSION_NO or GENOME_DNA_ID,
; provide the mapping with just one line using either MapName, MapId, AccessionNo or GenomeDnaId, for example:
;MapsIdentifiedBy=AccessionNo

; LoadListedMapsOnly: if true, only data for the maps listed in this section will be added.
; If false, PersephoneShell will still try to match names from the file to maps in the database
; using MAP_NAME (case-sensitive), and if the map is not found, the annotation line will be skipped
;LoadListedMapsOnly=true


[DbSequences]
; Oracle only
; The ID columns below are used in loading annotations.
; If there is no sequence/trigger assigned to these columns, you must specify a sequence for them.
;PROCESS_RUN.RUN_ID=ID_SEQ
;GDNA_ANNOT.ANNOT_ID=ID_SEQ
;DESCRIPTION.DESCR_ID=ID_SEQ
;TRACK.TRACK_ID=ID_SEQ

Please refer to the comments in the control file, they document the instruction lines in detail.

There are a few points to focus on.

  • The INI file contains a few [sections]. The section [Annotation] has a key Tracks pointing to sections below. Such design allows loading several tracks from the same file using one command. Each section referenced by Tracks= corresponds to a separate track.
  • Some GFF files may have inconsistent records. For example, exons can overlap, or, a 'gene' record can cover a shorter region than the child exons, etc. To allow PersephoneShell load the rest of the records, please consider using the instruction SkipInconsistentRecords:

SkipInconsistentRecords=SkipMRNA

Without this instruction, PersephoneShell will refuse loading the models if there are errors in a GFF. When reading the data from the GFF file, PersephoneShell assembles gene models from records of type 'gene', 'mRNA', 'exon', 'CDS', 'five_prime_UTR', 'three_prime_UTR'. Any inconsistency is reported. If you still want to load the file skipping the wrong records, you will need to use SkipInconsistentRecords instruction.

  • It is worth noting that in case the textual value is provided as one line, the quotation marks are not necessary:

ParentType="mRNA"

and

ParentType=mRNA

will have the same effect. At the same time, if the text contains multiple lines, the value should be enclosed into quotation marks:

TrackDescription="Gene annotation from
 some external resource.
Publication:
Authors
Title"

The text can include simple HTML tags, such as <i>, <b> or <href>. Note, that internal quotation marks needed for the HTML code should be "escaped" by backslash (\):

"Visit the main page at <a href=\"https://google.com\">Google</a>"

Qualifiers

QualifierTypes= instruction tells records of which GFF type(s) should be used as the source of qualifiers. If the qualifiers that you plan to load appear on the line of type 'gene' and 'mRNA', write

QualifierTypes=gene,mRNA


Gene qualifiers, key-value pairs normally supplied in the GFF attribute field, can be loaded under the same names as attributes.

The GFF attributes are listed in the last column of the GFF file:

ID=Glyma.01G000100.Wm82.a4.v1;Name=Glyma.01G000100;ancestorIdentifier=Glyma.01G000100.Wm82.a2.v1

To create qualifiers named the same as the attributes, list them in an instruction like this:

QualifierAttributes=...

This would create qualifiers with the key identical to the attribute name. For example,

QualifierAttributes=ancestorIdentifier

will create a qualifier with the key ancestorIdentifier and corresponding value (Glyma.01G000100.Wm82.a2.v1).

If QualifierTypes lists more than one type, there is a chance that attributes with the same name appear in both types. In such cases, to disambiguate the names, we need to use a fully-qualified form of the attribute including the type prefix, such as gene:Name or mRNA:Name.

The GFF attribute can be stored as a qualifier under different name. In the example below, we refer to the attribute Name from 'gene' line (gene:Name) as qualifier 'geneName'. The qualifier 'Name' from the line 'mRNA' (mRNA:Name) becomes 'transcriptName'. Once the GFF attributes have been assigned to qualifiers, reference them by the qualifier names in the further instructions, such as QualifierLinks or AnnotationSearches:

QualifierAttributeKey.gene:Name=geneName
QualifierAttributeKey.mRNA:Name=transcriptName

[AnnotationSearches]
GeneName=transcriptName
GeneFunction=Note

When running in the test mode, PersephoneShell will display the list of qualifiers from GFF lines of the requested types specified in QualifierTypes= instruction. In our case, we ask to process qualifiers from types gene and mRNA:

QualifierTypes=gene,mRNA

The test mode will print the qualifiers found in the corresponding lines:

PS> add annotation -c Samples/Annotation/Wm82.a4_phytozome-gff.ini -t
...
 List of distinct qualifiers in file (with example values):
  mRNA:
                                 ID (86,256 records): Glyma.01G000100.1.Wm82.a4.v1
    [ind] [included]                Name (86,256 records): Glyma.01G000100.1
          [included]               pacid (86,256 records): 41118016
          [included]             longest (86,256 records): 1
          [included]  ancestorIdentifier (55,193 records): Glyma.01G000100.1.Wm82.a2.v1
                                  Parent (86,256 records): Glyma.01G000100.Wm82.a4.v1

The included qualifiers with example values will be clearly highlighted to help you see what data will be loaded and whether your selection of qualifiers needs any modification. The text [ind] marks the qualifiers that will be indexed for text search (see QualifierFilter fields in the PersephoneShell configuration file).

GFF type

If we are talking about protein-coding genes, three types of gene model tracks can be loaded: 
(a) transcripts with internal CDS; 
(b) CDS only; 
(c) exons only. 

Normally, the gene models are loaded together with UTRs and CDS parts of the gene. In this case, keep the Type record commented out. If you prefer to load the CDS part only, so that the gene models always start with the initiator ATG codon, set Type to CDS

; Type: GFF type (column 3) of annotation items. If not specified, both exon (SO:0000147) and CDS (SO:0000316) will be parsed.
Type="CDS"

If, for some reason, you would like to load only exon structure without the CDS information, specify "exon" as the GFF type to be loaded:

; Type: GFF type (column 3) of annotation items. If not specified, both exon (SO:0000147) and CDS (SO:0000316) will be parsed.
Type="exon"

Usually, Type record is commented out, and both "CDS" and "exon" GFF types are processed.

The Type does not have to be "CDS" or "exon". For example, if you are loading miRNA track, the GFF type could be 

Type="miRNA_primary_transcript"

Important

If Type is set to the value different from "CDS" or "exon", the gene models will always have one-exon features. In this case ParentType will be ignored, as it is used only to group the items that are usually of type 'exon' or 'CDS'.

Please also consult a sample INI file for loading micro-RNAs.

QualifierLinks

This section converts qualifiers into hyperlinks. The qualifier values, either the entire string or its part extracted by using a regular expression, will be converted into a URL.

In simple cases, use the full qualifier value to fill the placeholder %s in the URL. The instruction

Name="https://phytozome-next.jgi.doe.gov/report/transcript/Gmax_Wm82_a4_v1/%s"

will result in the value (e.g., Glyma.01G053300.1) of the qualifier Name placed entirely into the link, like this:

https://phytozome-next.jgi.doe.gov/report/transcript/Gmax_Wm82_a4_v1/Glyma.01G053300.1

Sometimes, a long text of the qualifier value contains multiple identifiers that can be used to form different URLs:

Take a look at the qualifier Ontology_term. It contains multiple identifiers, each of them worth to be presented as a separate hyperlink. To extract the proper sub-string from the text of the qualifier, use regular expressions. To generate multiple URLs from the value of Ontology_term, use this construct:

Ontology_term=http://amigo.geneontology.org/amigo/term/%s|(GO:\w+)

The regular expression (GO:\w+) specified after the pipe symbol will be used to extract the text according to the capture rule enclosed in the parentheses. Each match to GO:\w+ will form a separate link and will be embedded into the text of the value of Ontology_term.

You can apply several URL-regex pairs to the same qualifier. According to the syntax of the INI files, it is not allowed to use the same key under the same section more than once. You cannot have:

Qualifier1=link1
Qualifier1=link2

To work around this limitation, specify the qualifier name only once and provide the URL-regex pairs on separate lines:

Dbxref=http://amigo.geneontology.org/amigo/term/%s|(GO:\w+)
      https://prosite.expasy.org/%s|Prosite:(PS\w+)
      https://pfam.xfam.org/family/%s|(PF\d+)
      https://www.ebi.ac.uk/interpro/entry/InterPro/%s/|(IPR\d+)

Just for reference, here are a few useful examples:

InterPro=https://www.ebi.ac.uk/interpro/entry/InterPro/%s/|(IPR\d+)
Pfam=https://pfam.xfam.org/family/%s|(PF\d+)
Prosite=https://prosite.expasy.org/%s|Prosite:(PS\w+)
GO=http://amigo.geneontology.org/amigo/term/%s|(GO:\w+)
Dbxref=http://amigo.geneontology.org/amigo/term/%s|(GO:\w+)
      https://prosite.expasy.org/%s|Prosite:(PS\w+)
      https://pfam.xfam.org/family/%s|(PF\d+)
      https://www.ebi.ac.uk/interpro/entry/InterPro/%s/|(IPR\d+)

Remember that you can always add more qualifier links after the gene annotation has been loaded. Use the command 'add qualifier_link' or 'edit qualifier_link'

MapMapping

Let's focus on MapMapping section. This division is sometimes necessary if the GFF file references chromosomes using names that are different from their names in the database. For example, the sequences in the database can have names such as 'Chr.1', 'Chr.2', etc., while the records in the GFF file use '1', '2', etc. Or you can imagine when the original FASTA file has complex definition lines that use quite-uninformative accession names together with human-readable strings like 'chromosome 1'. During loading, the sequences are given names like 'chromosome 1' but the GFF file may reference them by accession. The MapMapping section allows to associate the sequence names in the file to the map names in the database. The table may list the names in the file and provide the corresponding name in the database. The maps in the database can be identified by MapName, AccessionNo or MapId. You can generate such translation table by PersephoneShell's command printmapping. If you know that '1' in the file corresponds to the map called 'Chr.1' in the database, put a record

"1"="Chr.1"

Note, the double quotes are not necessary if the names do not contain spaces.

Please see other methods of the map mapping in the Control files section.

  1. Under Annotation, find Sources line and use the correct path to the GFF file.
  2. As usual, first, test the INI file by using '-t' switch:

PS> add annotation -c add_MSU_osa1r7.ini -t

and, if all tests are successful, load the data, usually with the verbose output:

PS> add annotation -c add_MSU_osa1r7.ini -v