Adding Markers to Sequence Maps from a GFF3 File

After you have added an organism and its map sets, maps, sequences, and annotation with PersephoneShell, you may want to add marker tracks. To add markers, use the add command (see Add) to load a control file (see Control Files) with the data for the markers. This section shows the steps necessary to add a track with repeats to the rice chromosome maps. Note that, in Persephone terms, a marker is a named location on a map. It can be a single point or can cover a region of the genome. Because markers are stored in the database, each of them is assigned a unique ID. This allows Persephone to list all locations of a single marker across multiple map sets, which helps with the inventory of genetic resources.

Please read an important note on marker identity that is based on the combination of marker name and SourceOrganismId.
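
To make the identity rule concrete, here is a minimal Python sketch of how a marker keyed by the combination of name and SourceOrganismId can collect several mappings. It is only an illustration, not PersephoneShell's implementation: the class MarkerRegistry, its methods, the organism ID 9999, and the map set names are all hypothetical.

```python
from collections import defaultdict


class MarkerRegistry:
    """Toy in-memory model (hypothetical): one marker per (organism ID, name) pair."""

    def __init__(self):
        self._ids = {}                      # (organism_id, name) -> marker_id
        self._mappings = defaultdict(list)  # marker_id -> list of locations
        self._next_id = 1

    def add_mapping(self, organism_id, name, map_set, map_name, start, end):
        """Reuse the marker if (organism_id, name) is known, else create it."""
        key = (organism_id, name)
        if key not in self._ids:
            self._ids[key] = self._next_id
            self._next_id += 1
        marker_id = self._ids[key]
        self._mappings[marker_id].append((map_set, map_name, start, end))
        return marker_id

    def locations(self, organism_id, name):
        """Return every stored location of the marker, across all map sets."""
        marker_id = self._ids.get((organism_id, name))
        return list(self._mappings[marker_id]) if marker_id is not None else []


registry = MarkerRegistry()
# Same name and same source organism: treated as one marker with two locations.
first = registry.add_mapping(1534, "M1", "MapSetA", "Chr1", 100, 200)
second = registry.add_mapping(1534, "M1", "MapSetB", "Chr1", 150, 250)
# Same name but a different source organism: treated as a different marker.
other = registry.add_mapping(9999, "M1", "MapSetA", "Chr1", 100, 200)
print(first == second, first == other)
print(registry.locations(1534, "M1"))
```

In this sketch the two mappings of "M1" for organism 1534 share one marker ID, which is what makes it possible to list all locations of a marker in multiple map sets.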

Note

This section describes how to load markers from GFF files. See Add Markers (Delimited Text Files) for the steps to load markers using delimited TXT files.

Note

Loading the repeat regions is given here as an example of loading markers. If the name of each repeat region is not important, it is much more practical to load the repeat information in the form of a quantitative track: the compression ratio for the data and the loading speed would be orders of magnitude better.

Please review a copy of the Wm82.a4.v1_rm-phytozome-gff.ini control file, which is included in the "Samples\Marker" folder and is shown below.


[ProcessRun]
; RunDescription: if specified, a custom description will be used. Will be ignored if a RunId is specified.
;                 otherwise, "Added markers for {MapSet Accession No.} from {Sources}." will be used.
RunDescription="Added repeats for Wm82.a4.v1 from http://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Gmax."

[MapSet]
; Either MapSetId or MapSetPath is required.
; MapSetId: id of a target map set.
;MapSetId=12345
; MapSetPath: path of a target map set.
MapSetPath="/Glycine max/Wm82.a4.v1"

[MarkerType]
; To add or update a marker type, specify a type name and description.
;TypeName=Description
;SNP="Used to measure methylation level."
REPEAT="Generic marker used to map internal repeats on genome."

[Marker]
; Source (required): a TXT file or GFF file located locally or remotely accessible via URL.
Source=$DATA/soybean/Gmax_508_Wm82.a4.v1.repeatmasked_assembly_v4.0.gff3.gz 
; Number format culture: specifies a culture name used to parse numbers in data. Default value is en - English.
;                        e.g. de - German, es - Spanish, fr - French. For more cultures, https://msdn.microsoft.com/en-us/goglobal/bb896001.aspx
;NumberFormatCulture="fr"
; FileType: {Text (delimited text file)|Gff}
FileType=Gff
; Origin (required): database or institution that the file originates from
Origin="Phytozome"
; Coordinate system: 1 (one-based) / 0 (zero-based). Default value is 1.
CoordinateSystem=1
; MappingMethod: method to map markers in the file. e.g. BLAST, RepeatMasker
;                if not specified, 'Unknown' is used.
MappingMethod="RM"
; Tracks (required): comma delimited. Corresponding section(s) with the same name(s) must exist.
;                    At least one track needs to be specified.
Tracks="Wm82.a4.v1_repeat"
; SourceOrganismId: markers are suggested to be unique in a source organism.
;                   Specify a source organism if you want to lookup markers belonging to the organism.
;                   Otherwise, inferred by target MapSet.
;SourceOrganismId=1534
; BypassLookup: When adding new markers, PSH will try to reuse the markers with the same OrganismId and name, assuming they are the same markers
;               mapped onto different positions. To reuse them, a lookup table with existing markers is built.
;               If BypassLookup=true, PSH will skip this step and will issue a database query for each inserted marker in order to 
;               find an identical marker with the same OrganismId and name. When adding just a few markers
;               while the database contains millions of markers with the same OrganismId you can bypass the lookup and 
;               save time and memory. On the contrary, if you insert a large number of markers, leave
;               BypassLookup=false (default), and the lookup will be built from the markers in the database that have
;               the same OrganismId. To reduce the size of the lookup table, use NamePrefixesForLookup (see below).
;BypassLookup=true
; NamePrefixesForLookup: When the database contains millions of markers for this organism, the lookup table
;                 can take extra memory. To narrow down the list of markers to look up, provide
;                 the common prefix(es) for the marker names to be loaded, if possible. Only markers that have names beginning with the
;                 provided prefixes will be fetched from the database. Note, that psh will try to derive such name prefixes automatically.
;                 Use this instruction if the suggested prefixes are not specific enough and still result in a large lookup size.
;NamePrefixesForLookup=solcap,Sol
; SearchAliases: Indicates if other names besides primary name are searched or not.  Default value is false.
;SearchAliases=false
; Commit frequency: indicates how often the process commits markers. Every N markers.
CommitFrequency=1000

[Wm82.a4.v1_repeat]
; MarkerType (required): type of marker. If new, should be specified in MarkerType section. 
MarkerType="REPEAT"
; TrackName: track name to be displayed on the plate.
TrackName="RepeatMasker"
; TrackDescription: track description shared across maps in the MapSet.
TrackDescription="Repeat regions identified by RepeatMasker."
; TrackType: choose one among 
;            1. GENERIC_BP_TRACK: markers for physical maps.
;            2. MARKER_TRACK: markers for genetic maps.
;            3. HEAT_TRACK: markers for physical maps. Heatmap of physical distance.
;            4. DENSE_BP_MARKER_TRACK: dense marker track for physical maps. Heatmap will not work.
;            5. CYTOBAND_TRACK: cytoband markers. Each marker has to have 'GieStain' qualifier.    
TrackType=DENSE_BP_MARKER_TRACK
; TrackColor: {NamedColor|HTML hex code|R,G,B}
TrackColor=230,248,243
; GeneratedMarkerNamePrefix: in case there are no appropriate values to be used as marker names, generate names for each marker using this prefix.
; For example, if the feature names are not important, they can be named automatically, like TSS1, TSS2... using prefix 'TSS'.
;GeneratedMarkerNamePrefix=TSS

; AddModes: choose a mode to add name, mapping, qualifiers or sequence among
;           1. AddAnyway: Add regardless of duplication. Faster as it does not check.
;           2. AddOrDie: add if not exists; die (throw exception) otherwise.
;           3. AddOrUpdate: add if not exists; update otherwise.
;           4. AddOrSkip: add if not exists; skip otherwise.
MarkerNameAddMode=AddAnyway
MarkerSequenceAddMode=AddAnyway
MarkerQualifierAddMode=AddAnyway
MappingAddMode=AddAnyway
MappingQualifierAddMode=AddAnyway
; GffSource: Gff column 2. Database name or software that generated these features.
;            if not specified, all the sources will be included. 
;GffSource="RepeatMasker"
; GffType: Gff column 3. Feature type. A term from the full Sequence Ontology.
;          if not specified, all the types will be included. 
;GffType="similarity"
; GffMarkerNameAttributeKey (required at least one): Gff attribute key whose value contains marker name.
GffMarkerNameAttributeKey.ID=FULL_NAME
; GffMarkerSequenceAttributeKey: Gff attribute key whose value contains marker sequences.
;GffMarkerSequenceAttributeKey.Seq=SEQ
; GffMarkerQualifierAttributeKey: Gff attribute key whose value contains marker qualifiers.
;GffMarkerQualifierAttributeKey.AttributeKey=qualifierName((:displayText),dataType,dataFormat)
GffMarkerQualifierAttributeKey.Note="Note"
; GffMappingQualifierAttributeKey: Gff attribute key whose value contains mapping qualifiers.
;GffMappingQualifierAttributeKey.AttributeKey=qualifierName((:displayText),dataType,dataFormat)

; ParentGroupName: the new track will be placed under a parent node with this name. 
; To reduce the number of track nodes on the top level, group the tracks of similar type.
;ParentGroupName=marker tracks

; TrackQualifier: Add qualifiers that can help filtering the large track lists when using the Edit tracks interface in webPersephone
;TrackQualifier.Tissue=Liver
;TrackQualifier.Author=JHU

; IsShownFirst: if false, the track will not be shown by default when the map is opened for the first time
IsShownFirst=false
; IsSearchable: If true (default), the track data will be indexed for search. If false, the indexing will be skipped
;IsSearchable=false

[MapMapping]
; LoadListedMapsOnly: if true, only maps listed below will be loaded. If LoadListedMapsOnly is missing, psh will try to match map_names 
; not listed here to MAP_NAME (case-sensitive). The map will be automatically skipped if no match is found.
;LoadListedMapsOnly=true

; If no mapping is found in this section, it assumes that each MAP_NAME in file exactly matches a MAP_NAME in DB. 
; MapsIdentifiedBy: if all maps in the file instead of the map name are identified by their alternative IDs like MAP_ID, ACCESSION_NO or GENOME_DNA_ID,
; provide the mapping with just one line using either MapName, MapId, AccessionNo or GenomeDnaId, for example:
;MapsIdentifiedBy=AccessionNo

[DbSequences]
; Oracle only. The ID columns below are used in loading markers.
; If there is no sequence/trigger assigned to these columns, you must specify a sequence for them.
;PROCESS_RUN.RUN_ID=ID_SEQ
;ANALYSIS.ANALYSIS_ID=ID_SEQ
;MARKER_TYPE.MARKER_TYPE_ID=ID_SEQ
;DESCRIPTION.DESCR_ID=ID_SEQ
;TRACK.TRACK_ID=ID_SEQ
;TRACK_STYLE.TRACK_STYLE_ID=ID_SEQ
;MARKER.MARKER_ID=ID_SEQ
;MARKER_NAME.MARKER_NAME_ID=ID_SEQ
;MARKER_NAME_TYPE.MARKER_NAME_TYPE_ID=ID_SEQ
;MARKER_SEQUENCE.MARKER_SEQUENCE_ID=ID_SEQ
;MARKER_SEQUENCE_TYPE.MARKER_SEQUENCE_TYPE_ID=ID_SEQ
;MARKER_QUALIFIER.QUALIFIER_ID=ID_SEQ
;MARKER_QUALIFIER_NAME.QUALIFIER_NAME_ID=ID_SEQ
;QUALIFIER_DISPLAY.QUAL_ID=ID_SEQ
;MK_MAPPING.MAPPING_ID=ID_SEQ
;MAPPING_QUALIFIER.MAPPING_QUALIFIER_ID=ID_SEQ

Please read the comments provided for each record carefully. They are meant to serve as documentation that is right at hand.

Here is a sample GFF3 file:

##gff-version 3
Chr1        RM        repeat_region        1001        1052        .        .        .        ID=rm_1;Name=ORSiTRTM00000002;Note=Telomere%20sequence%2C%20putative
Chr1        RM        repeat_region        1053        1086        .        .        .        ID=rm_2;Name=%28CCCTAA%29n
Chr1        RM        repeat_region        1087        1116        .        .        .        ID=rm_3;Name=ORSiTRTM00000002;Note=Telomere%20sequence%2C%20putative
Chr1        RM        repeat_region        2285        2435        .        .        .        ID=rm_4;Name=ORSiTEMT00400001;Note=Explorer%20type%20MITE%2C%20putative
Chr1        RM        repeat_region        2303        2435        .        .        .        ID=rm_5;Name=ORSgTEMT00400007;Note=Explorer-like%20MITE%2C%20putative
Chr1        RM        repeat_region        3310        3340        .        .        .        ID=rm_6;Name=%28CAAAT%29n
Chr1        RM        repeat_region        8878        9124        .        .        .        ID=rm_7;Name=ORSgTEMT00101020;Note=Tourist-like%20MITE%2C%20putative
Chr1        RM        repeat_region        9072        9140        .        .        .        ID=rm_8;Name=ORSgTEMT00101037;Note=Tourist-like%20MITE%2C%20putative

When loading markers, we need to know the name of each marker, the name type (there can be multiple names for the same marker), the mapping coordinates, etc. The GFF3 format assumes a fixed number of predefined columns, so we know in advance that the map name is provided in the very first column (index=0) and that the mapping start and end come in the columns with index=3 and 4. The last column contains marker attributes that, if needed, PersephoneShell will parse and store in the form of qualifiers. To specify which attributes should be loaded, and optionally how they should be renamed, use a record of the form:

;GffMarkerQualifierAttributeKey.AttributeKey=qualifierName((:displayText),dataType,dataFormat)
GffMarkerQualifierAttributeKey.Note="note"

The record above instructs PersephoneShell to find Note in the list of GFF attributes:

ID=rm_1;Name=ORSiTRTM00000002;Note=Telomere%20sequence%2C%20putative

and save the key-value pair as a qualifier called "note" (the URL-encoded value will be decoded). Alternatively, the qualifier "Note" can be displayed under a different label, for example "Description". To do this, save it under a short qualifier key, for example "D", and create a single record in the database (table QUALIFIER_DISPLAY) that tells Persephone to display the qualifier "D" as "Description". This is helpful when the text of the qualifier key is long and you want to save the space needed for millions of markers. If you use the proper syntax (below), PersephoneShell will take care of this renaming for you.

GffMarkerQualifierAttributeKey.Note="D:Description"
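
The parsing steps described above can be sketched in Python as follows. This is a hedged illustration of the GFF3 layout, not PersephoneShell's code: the functions parse_gff3_line and split_qualifier_spec are hypothetical helpers, and the sketch assumes tab-separated columns and single-valued attributes.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into the fields discussed above."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Attribute values are URL-encoded in GFF3, so decode them.
        attributes[unquote(key)] = unquote(value)
    return {
        "map_name": cols[0],    # column index 0
        "start": int(cols[3]),  # column index 3
        "end": int(cols[4]),    # column index 4
        "attributes": attributes,
    }


def split_qualifier_spec(spec):
    """Split 'D:Description' into (stored key, display text); the display text is optional."""
    key, sep, display = spec.partition(":")
    return key, (display if sep else key)


line = ("Chr1\tRM\trepeat_region\t1001\t1052\t.\t.\t.\t"
        "ID=rm_1;Name=ORSiTRTM00000002;Note=Telomere%20sequence%2C%20putative")
record = parse_gff3_line(line)
key, display = split_qualifier_spec("D:Description")
qualifier = {key: record["attributes"]["Note"]}

print(record["map_name"], record["start"], record["end"])  # Chr1 1001 1052
print(display, "=", qualifier[key])  # Description = Telomere sequence, putative
```

The value is stored once under the short key "D", while the longer label "Description" is kept separately for display, which mirrors the space-saving idea described above.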

By default, all qualifier values have the type "string". If you want, you can specify the data type of the value. In addition to "string", the qualifier value can be treated as an integer or a floating-point number. This ensures that sorting and filtering in data grids work correctly. Specify the data type and data format in a form similar to this:

GffMarkerQualifierAttributeKey.bx="brix:Brix","double","0.000"

Other recognized data types are "int", "integer", and "long".
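
The following short Python sketch shows why the data type matters for sorting. The values are made up for illustration, and the reading of the "0.000" data format as "three fixed decimal places" is an assumption, not something confirmed by the PersephoneShell documentation.

```python
# Hypothetical Brix values as they would appear in GFF attributes (always text).
values = ["10.5", "9.25", "100"]

# Sorted as strings, the order is lexicographic and numerically wrong.
as_strings = sorted(values)

# Sorted as doubles, the order is numeric.
as_doubles = sorted(values, key=float)

# Assumption: a data format of "0.000" means three fixed decimal places.
formatted = [format(float(v), ".3f") for v in as_doubles]

print(as_strings)  # ['10.5', '100', '9.25']
print(as_doubles)  # ['9.25', '10.5', '100']
print(formatted)   # ['9.250', '10.500', '100.000']
```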

As usual, first test the INI file by using the '-t' switch:

PS> add marker -c add_MSU_osa1r7_rm.ini -t

and, if all tests are successful, load the data, usually with the verbose output:

PS> add marker -c add_MSU_osa1r7_rm.ini -v