Adding Markers to Sequence Maps from a GFF3 File

After you have added an organism and its map sets, maps, sequences, and annotation with PersephoneShell, you may want to add marker tracks. To add the markers, use the add command (see Add) to load a control file (see Control Files) with the data for the markers. This section shows the steps necessary to add a track with repeats to rice chromosome maps. Note that, in Persephone terms, a marker is a named location on a map. It can be a single point or cover a region of the genome, and it can have a generic name (such as 'repeat', reused in multiple locations) or a specific one. Because the markers are stored in the database, each of them is assigned a unique ID. This allows Persephone to list all locations of a single marker across multiple map sets, which helps with taking inventory of genetic resources.


This section describes how to load markers from GFF files. See Add Markers (Delimited Text Files) for steps to load markers using delimited TXT files.

Please review a copy of the add_MSU_osa1r7_rm.ini control file, which is included in the "Samples\Marker" folder and is shown below.

; RunDescription: if specified, a custom description will be used. Will be ignored if a RunId is specified.
;                 otherwise, "Added markers for {MapSet Accession No.} from {Sources}." will be used.
RunDescription="Added repeats for MSU_osa1r7 from"

; Either MapSetId or MapSetPath is required.
; MapSetId: id of a target map set.
; MapSetPath: path of a target map set.
MapSetPath="/Oryza sativa/MSU_osa1r7"

; To add or update a marker type, specify a type name and description.
;SNP="Used to measure methylation level."
REPEAT="Generic marker used to map internal repeats on genome."

; To add or update a marker name type, specify a type name and description.
FULL_NAME="Full name."

; To add or update a marker sequence type, specify a type name and description.
;SEQ="The full sequence of the marker."

; Source (required): a TXT file or GFF file located locally or remotely accessible via URL.
; Number format culture: specifies a culture name used to parse numbers in data. Default value is en - English.
;                        e.g. de - German, es - Spanish, fr - French. For more cultures,
; FileType: {Text (delimited text file)|Gff}
; Origin (required): database or institution that the file originates from.
; MappingMethod: method to map markers in the file. e.g. BLAST, RepeatMasker
;                if not specified, 'Unknown' is used.
; Tracks (required): comma-delimited list. A corresponding section with the same name must exist for each track.
;                    At least one track needs to be specified.
; SourceOrganismId: markers are suggested to be unique in a source organism.
; BypassLookup: A marker name-id dictionary to check duplication will be built.
;               To bypass this step, set BypassLookup true. Default value is false.
; SearchAliases: Indicates whether names other than the primary name are searched. Default value is false.
; RebuildIndex: indicates if marker indices are rebuilt or not. Default value is false.
; Commit frequency: indicates how often the process commits markers. Every N markers.

; MarkerType (required): type of marker. If new, should be specified in MarkerType section.
; TrackName: track name to be displayed on the plate.
; TrackDescription: track description shared across maps in the MapSet.
TrackDescription="Repeat regions identified by RepeatMasker."
; TrackType: choose one among
;            1. GENERIC_BP_TRACK: markers for physical maps.
;            2. MARKER_TRACK: markers for genetic maps.
;            3. HEAT_TRACK: markers for physical maps. Heatmap of physical distance.
;            4. DENSE_BP_MARKER_TRACK: dense marker track for physical maps. Heatmap will not work.
;            5. CYTOBAND_TRACK: cytoband markers. Each marker has to have 'GieStain' qualifier.    
; TrackColor: {NamedColor|HTML hex code|R,G,B}
; PrimaryMarkerNameType: markers can have multiple names, each name has its type, one of the types should store the primary name;
; The primary name is shown in the marker label
; AddModes: choose a mode to add name, mapping, qualifiers or sequence among
;           1. AddAnyway: Add regardless of duplication. Faster as it does not check.
;           2. AddOrDie: add if not exists; die (throw exception) otherwise.
;           3. AddOrUpdate: add if not exists; update otherwise.
;           4. AddOrSkip: add if not exists; skip otherwise.
; GffSource: Gff column 2. Database name or software that generated these features.
;            if not specified, all the sources will be included.
; GffType: Gff column 3. Feature type. A term from the full Sequence Ontology.
;          if not specified, all the types will be included.

; GffMarkerNameAttributeKey (required at least one): Gff attribute key whose value contains marker name.
; GffMarkerSequenceAttributeKey: Gff attribute key whose value contains marker sequences.
; GffMarkerQualifierAttributeKey: Gff attribute key whose value contains marker qualifiers.
; GffMappigQualifierAttributeKey: Gff attribute key whose value contains mapping qualifiers.

; LoadListedMapsOnly: if true, only maps listed below will be loaded, otherwise psh will try to match map_names
; not listed here to MAP_NAME, MAP_ID or ACCESSION_NO. The map will be automatically skipped if no match is found.

; If no mapping is found in this section, it assumes that each MAP_NAME in file exactly matches a MAP_NAME in DB.
; If map names in file are different from those in DB, map each MAP_NAME in file to its MAP_NAME, MAP_ID or ACCESSION_NO in DB.
; Otherwise, marker will be created without mapping.


Please read the comments provided for each record carefully; they are intended to serve as documentation right at hand.
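The GffSource and GffType comments in the control file describe filters on GFF columns 2 and 3, where an unspecified filter lets everything through. The following Python sketch illustrates that logic only; it is not PersephoneShell code. The function name matches_filters is hypothetical, the second feature row is made up for illustration, and it is an assumption that the comparison is an exact string match.

```python
def matches_filters(columns, gff_source=None, gff_type=None):
    """Return True if a GFF feature passes the optional source/type filters.

    columns: the nine GFF columns of one feature line, as a list of strings.
    A filter left as None is treated as "not specified" and matches everything.
    """
    source, feature_type = columns[1], columns[2]  # GFF columns 2 and 3
    if gff_source is not None and source != gff_source:
        return False
    if gff_type is not None and feature_type != gff_type:
        return False
    return True


features = [
    ["Chr1", "RM", "repeat_region", "1001", "1052", ".", ".", ".", "ID=rm_1"],
    # Hypothetical row from a different source, added only for illustration.
    ["Chr1", "other_tool", "gene", "2000", "3000", ".", "+", ".", "ID=g_1"],
]

# Both filters set: only the RepeatMasker repeat region is kept.
kept = [f for f in features
        if matches_filters(f, gff_source="RM", gff_type="repeat_region")]

# No filters set: every feature is kept.
everything = [f for f in features if matches_filters(f)]

print(len(kept), len(everything))
```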

Here is a sample GFF3 file:

##gff-version 3
Chr1        RM        repeat_region        1001        1052        .        .        .        ID=rm_1;Name=ORSiTRTM00000002;Note=Telomere%20sequence%2C%20putative
Chr1        RM        repeat_region        1053        1086        .        .        .        ID=rm_2;Name=%28CCCTAA%29n
Chr1        RM        repeat_region        1087        1116        .        .        .        ID=rm_3;Name=ORSiTRTM00000002;Note=Telomere%20sequence%2C%20putative
Chr1        RM        repeat_region        2285        2435        .        .        .        ID=rm_4;Name=ORSiTEMT00400001;Note=Explorer%20type%20MITE%2C%20putative
Chr1        RM        repeat_region        2303        2435        .        .        .        ID=rm_5;Name=ORSgTEMT00400007;Note=Explorer-like%20MITE%2C%20putative
Chr1        RM        repeat_region        3310        3340        .        .        .        ID=rm_6;Name=%28CAAAT%29n
Chr1        RM        repeat_region        8878        9124        .        .        .        ID=rm_7;Name=ORSgTEMT00101020;Note=Tourist-like%20MITE%2C%20putative
Chr1        RM        repeat_region        9072        9140        .        .        .        ID=rm_8;Name=ORSgTEMT00101037;Note=Tourist-like%20MITE%2C%20putative
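To make the layout of such a file concrete, here is a minimal Python sketch that splits one line of the sample into its nine columns and decodes the percent-encoded attribute values. It is an illustration only, not how PersephoneShell is implemented; the function name parse_gff3_line and the dictionary keys are hypothetical. Real GFF3 files separate columns with tabs, so the sample line is rebuilt with tabs here.

```python
from urllib.parse import unquote

# One line of the sample file, rebuilt with tab separators as GFF3 requires.
SAMPLE_LINE = "\t".join([
    "Chr1", "RM", "repeat_region", "1001", "1052", ".", ".", ".",
    "ID=rm_1;Name=ORSiTRTM00000002;Note=Telomere%20sequence%2C%20putative",
])


def parse_gff3_line(line):
    """Split one GFF3 feature line into columns and decode its attributes."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")

    # Column 9 holds semicolon-separated key=value pairs with URL-encoded values.
    attributes = {}
    for pair in columns[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key] = unquote(value)

    return {
        "map_name": columns[0],    # column index 0
        "source": columns[1],
        "type": columns[2],
        "start": int(columns[3]),  # column index 3
        "end": int(columns[4]),    # column index 4
        "attributes": attributes,
    }


record = parse_gff3_line(SAMPLE_LINE)
print(record["map_name"], record["start"], record["end"])  # Chr1 1001 1052
print(record["attributes"]["Note"])  # Telomere sequence, putative
```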

When loading markers, we need to know the name of each marker, the name type (there can be multiple names for the same marker), the mapping coordinates, and so on. The GFF3 format has a fixed number of predefined columns, so we know in advance that the map name is in the very first column (index=0) and that the mapping start and end are in the columns with index=3 and 4. The last column contains marker attributes that, if needed, PersephoneShell will parse and store as qualifiers. To specify which attributes should be loaded, and optionally how they should be renamed, use a record of the form:


The record above instructs PersephoneShell to find Note in the list of GFF attributes, for example:

ID=rm_1;Name=ORSiTRTM00000002;Note=Telomere%20sequence%2C%20putative

and save the key-value pair as a qualifier called "note" (the URL-encoded value will be decoded). Alternatively, the qualifier "Note" can be displayed under a different label, for example "Description". To do this, save it under a short qualifier key, for example "D", and create a single record in the database (table QUALIFIER_DISPLAY) that tells Persephone to display the qualifier "D" as "Description". This is helpful when the qualifier key text is long and you want to save the space it would otherwise take up across millions of markers. If you use the proper syntax (below), PersephoneShell will take care of this renaming for you.


By default, all qualifier values have the type "string". You can optionally specify the data type of the value: in addition to "string", a qualifier value can be treated as an integer or a floating-point number. This ensures that sorting and filtering in data grids work correctly. Specify the data type and data format in a form similar to this:


Other recognized formats are "int", "integer", "long".
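Why the data type matters for sorting can be seen in a short Python sketch. The values are hypothetical and the code has nothing to do with PersephoneShell internals; it only shows how the same values sort differently as strings and as integers.

```python
# Hypothetical qualifier values, stored as strings by default.
raw_lengths = ["52", "134", "9", "1020"]

# As strings, values are compared character by character.
as_strings = sorted(raw_lengths)

# Converted to integers first, they sort by numeric value.
as_integers = sorted(int(value) for value in raw_lengths)

print(as_strings)   # ['1020', '134', '52', '9']
print(as_integers)  # [9, 52, 134, 1020]
```

A data grid that treats these values as strings would therefore list 1020 before 9, which is usually not what a user expects.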

As usual, first test the INI file using the '-t' switch:

PS> add marker -c add_MSU_osa1r7_rm.ini -t

and, if all tests pass, load the data, usually with verbose output:

PS> add marker -c add_MSU_osa1r7_rm.ini -v