Normally, after adding an organism and its map sets, maps and sequences using PersephoneShell, you would want to add gene annotations. To do so, use the add command (see Add) to process a control file (see Control Files) containing instructions on how to load the annotation data. This section shows the steps necessary to add gene annotations to the Oryza sativa (rice) genome in your Persephone database.

  1. An example control file "Samples\Annotation\add_MSU_osa1r7.ini" is included in the PersephoneShell zip file archive and is shown below.

[ProcessRun]
; Run description: if specified, a custom description will be used. Will be ignored if a RunId is specified.
;                  otherwise, "Added annotations for {MapSet Accession No.} from {Sources}." will be used.
RunDescription="Loaded MSU rice MSU_osa1r7 from http://rice.plantbiology.msu.edu"

[MapSet]
; Either MapSetId or MapSetPath is required.
; MapSetId: id of a target map set.
;MapSetId=247582504
; MapSetPath: path of a target map set.
MapSetPath="/Oryza sativa/MSU_osa1r7"

[Method]
; To add a new annotation method, specify a method name, CDS flag and a color to be shown in the annotation track.
; CDS flag is a boolean value that indicates if the annotation method is for CDS or not.
; Supported color name is available in http://www.flounder.com/csharp_color_table.htm
; If a method already exists, it will be updated. Otherwise, a new method will be added.
;Name=IsCds,{NamedColor|HTML hex code|R,G,B}
;mRNA=false,203,128,147
MSU=false,100,200,255

[Annotation]
; GFF specification in http://www.sequenceontology.org/gff3.shtml
; Sources (required): GFF file(s) for annotations located locally or remotely accessible via URL.
Sources="d:\Data\rice\all.gff3"
; Tracks (required): comma delimited. Corresponding section(s) named the same must exist.
;                    At least one track needs to be specified.
Tracks="MSU"
; MaskedParentTypes: parent GFF types (column 3) that do not create a block themselves.
;MaskedParentTypes="biological_region","repeat_region"
; DoSort: indicates whether or not to sort the items.
;         Use only when gene structure in GFF is not properly grouped, e.g. ENSEMBL.
;         Cannot sort correctly when genes overlap, e.g. RNA genes within a protein coding gene.
;DoSort=true
; IgnoreWrongAnnotations: if true, skips loading wrongly-translated annotations (wrong start/stop codon). False by default.
;IgnoreWrongAnnotations=false
; IgnoreConflictingLines: if true, skips lines that come in conflict with previous lines. For example, an exon can overlap with previous exon, and such records can be ignored
;IgnoreConflictingLines=true
; SkipInconsistentRecords: if true, the whole gene record will be skipped if at least one inconsistent line will be detected
SkipInconsistentRecords=true
; Commit frequency: indicates how often the process commits annotations. Every N annotations.
CommitFrequency=1000

[MSU]
; Method (required): annotation method. If new, should be specified in METHOD section.
Method="MSU"
; Track name (required): name of track
TrackName="MSU gene models"
; TrackDescription: track description
TrackDescription="Gene annotation from MSU"
; Type: GFF type (column 3) of annotation items. If not specified, both exon (SO:0000147) and CDS (SO:0000316) will be parsed.
;Type="CDS"
; Parent type (required): GFF type of parent items that groups annotation items
ParentType="mRNA"
; Qualifier type: GFF type of parent items that contains qualifiers
QualifierType="gene"
; Qualifier attributes: qualifiers to be loaded in the GFF attribute
QualifierAttributes="ID","Note"

[QualifierLinks]
; Link qualifier name-value to external sources.
; %s in the link is where a qualifier value is positioned.
;QUALIFIER_NAME=PLACEHOLDER_URL
;ID="http://rice.plantbiology.msu.edu/cgi-bin/gbrowse/rice/?name=%s"

[AnnotationSearches]
; Add qualifier name-value to search term {GeneName, GeneFunction}
;SEARCH_TERM=QUALIFIER_NAME
GeneName="ID"
GeneFunction="Note"

[MapMapping]
; If no mapping is found in this section, it assumes that each MAP_NAME in file exactly matches a MAP_NAME in DB.
; If map names in file are different from those in DB, map each MAP_NAME in file to a MAP_ID or an ACCESSION_NO in DB.
; Otherwise, annotation won't be added.
; LoadListedMapsOnly - if true, only maps listed here will be loaded
LoadListedMapsOnly=true
;MAP_NAME in file=MAP_ID or ACCESSION_NO or MAP_NAME in DB
;Chr1=Chr1
;Chr2=Chr2
;Chr3=Chr3


[DbSequences]
; The ID columns below are used in loading annotations.
; If there is no sequence/trigger assigned to these columns, you must specify a sequence for them.
;PROCESS_RUN.RUN_ID=ID_SEQ
;GDNA_ANNOT.ANNOT_ID=ID_SEQ
;DESCRIPTION.DESCR_ID=ID_SEQ
;TRACK.TRACK_ID=ID_SEQ


Please refer to the comments in the control file; they document each instruction line in detail.
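Before running the file through PersephoneShell, it can help to check it against the requirements stated in its own comments. The following is a minimal sketch, not part of PersephoneShell: the function name check_control_file is hypothetical, and it assumes that Python's configparser reads the file the same way PersephoneShell does, which may not hold in every case. It only checks the keys that the comments mark as required.

```python
import configparser


def check_control_file(text):
    """Return a list of problems found in an annotation control file.

    Only the rules stated in the sample file's comments are checked.
    """
    # '=' is the only delimiter so that URLs in values are left alone;
    # interpolation is off because link templates may contain '%s'.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep key case, e.g. "MapSetPath"
    parser.read_string(text)

    problems = []

    # [MapSet]: either MapSetId or MapSetPath is required.
    mapset = parser["MapSet"] if parser.has_section("MapSet") else {}
    if "MapSetId" not in mapset and "MapSetPath" not in mapset:
        problems.append("[MapSet] needs MapSetId or MapSetPath")

    # [Annotation]: Sources and Tracks are required.
    annotation = parser["Annotation"] if parser.has_section("Annotation") else {}
    if "Sources" not in annotation:
        problems.append("[Annotation] needs Sources")

    raw_tracks = annotation.get("Tracks", "")
    tracks = [t.strip().strip('"') for t in raw_tracks.split(",") if t.strip()]
    if not tracks:
        problems.append("[Annotation] needs at least one track in Tracks")

    # Every track must have a section of the same name with its required keys.
    for track in tracks:
        if not parser.has_section(track):
            problems.append(f"[Annotation] Tracks lists {track} but section [{track}] is missing")
            continue
        for key in ("Method", "TrackName", "ParentType"):
            if key not in parser[track]:
                problems.append(f"[{track}] needs {key}")

    return problems
```

Passing the text of the control file to this function returns an empty list when these checks pass. It does not replace the '-t' test run described at the end of this section, which remains the authoritative check.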

Three types of gene model tracks can be loaded:
(a) a transcript with internal CDS;
(b) CDS only;
(c) exons only.

Normally, gene models are loaded with both the UTRs and the CDS parts of the gene. In this case, keep the Type record commented out. If you prefer to load the CDS part only, so that each gene model starts at the ATG start codon, uncomment the Type record:

; Type: GFF type (column 3) of annotation items. If not specified, both exon (SO:0000147) and CDS (SO:0000316) will be parsed.
Type="CDS"

If you want to load only the exon structure without the CDS information, specify "exon" as the GFF type to be loaded:

; Type: GFF type (column 3) of annotation items. If not specified, both exon (SO:0000147) and CDS (SO:0000316) will be parsed.
Type="exon"

Usually, the Type record is commented out, and both the "CDS" and "exon" GFF types are processed.
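To decide which of these options fits your file, it is useful to see which feature types the GFF file actually contains. The sketch below is an illustrative helper, not a PersephoneShell feature; the function name count_gff_types is hypothetical. It assumes a standard tab-delimited GFF3 file in which column 3 holds the feature type, lines starting with '#' are comments or directives, and a '##FASTA' directive marks the start of embedded sequence.

```python
from collections import Counter


def count_gff_types(lines):
    """Count feature types (GFF3 column 3) in an iterable of GFF3 lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.split("\t")
        if len(columns) >= 3:
            counts[columns[2]] += 1
    return counts
```

For example, opening the GFF file and passing the file object to count_gff_types shows whether both "exon" and "CDS" records are present and which types (such as "mRNA" and "gene") are available for ParentType and QualifierType.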

Now consider the MapMapping section. It is needed when the GFF file refers to chromosomes by names that differ from those in the database. For example, the sequences in the database can have names such as 'Chr.1', 'Chr.2', etc., while the records in the GFF file use '1', '2', etc. Another case is an original FASTA file whose definition lines combine uninformative accession names with human-readable strings like 'chromosome 1': during loading, the sequences are given names like 'chromosome 1', but the GFF file may reference them by accession.

The MapMapping section translates sequence names in the file to names in the database. Each record lists a name in the file and the corresponding map in the database, which can be identified by MapName, AccessionNo or MapId. You can generate such a translation table with PersephoneShell's printmapping command. If you know that '1' in the file designates the map called 'Chr.1' in the database, add the record

"1"="Chr.1"

Note that the double quotes are not necessary if the names do not contain spaces.

For other ways to specify the map mapping, see the Control Files section.
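If you want to see which sequence names the GFF file uses before writing these records, they can be collected from column 1 of the file. The following sketch is an illustration only and is not a substitute for the printmapping command. The function name draft_map_mapping and the rename argument are hypothetical: rename is a naming rule that you supply, and the resulting records must still be checked against the map names in your database.

```python
def draft_map_mapping(gff_lines, rename):
    """Draft MapMapping records for each distinct seqid (GFF3 column 1).

    `rename` is a caller-supplied function (an assumed naming rule) that
    turns a name used in the file into the map name used in the database.
    """
    seen = {}  # dict keeps first-seen order and drops duplicates
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        seqid = line.split("\t", 1)[0]
        seen.setdefault(seqid, None)

    def quote(name):
        # Quotes are only needed when the name contains spaces.
        return f'"{name}"' if " " in name else name

    return [f"{quote(seqid)}={quote(rename(seqid))}" for seqid in seen]
```

With a rule such as `lambda name: "Chr." + name`, a file that uses '1' and '2' yields the records 1=Chr.1 and 2=Chr.2, which can be pasted into the MapMapping section after review.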

  2. Under Annotation, find the Sources line and set it to the correct path to your GFF file.
  3. As usual, first test the INI file with the '-t' switch:

PS> add annotation -c add_MSU_osa1r7.ini -t

and, if all tests succeed, load the data, usually with verbose output:

PS> add annotation -c add_MSU_osa1r7.ini -v