Adding Markers (Delimited Text Files)
This section describes how to add markers from delimited text files to your Persephone database with the add command (see Add). The steps below use the add command with a control file of instructions (see Control Files) to load the markers.
In general Persephone terms, a marker is a named location on a map. It can be a single position or an interval. A marker should have at least one name, which is considered its primary name; additional aliases are optional. Each name has a name type, which helps to distinguish different naming systems.
A marker can also have associated sequences: probe sequences, primers, sequences of related proteins, etc.
Please note that the markers can be loaded without mapping (omit the section [MapSet] in the INI file). Also, you can add marker mappings to existing tracks (ReuseTrack=true).
Marker identity
Markers have a SourceOrganismId that records which organism they originate from. This is more than a housekeeping record: marker identity is based on the marker name AND its SourceOrganismId, which helps distinguish markers with generic names like 'M0001' that are derived from different species.
When adding markers to the database, PersephoneShell first checks whether a marker with an identical name and SourceOrganismId is already present. If it finds such a marker (and if AddMode is set to reuse existing markers), PersephoneShell does not insert a new marker; it references the existing marker by its internal MarkerId and just adds a new mapping location for it. Please keep this logic in mind when adding new markers to several map sets.
When displaying two maps with marker tracks, Persephone checks whether markers with the same MarkerId (read: same name and SourceOrganismId) are present on both maps. Such matching markers are automatically connected with a line. If you see a pair of maps with markers that appear identical but the connecting line is missing, the most likely explanation is that the markers have different SourceOrganismIds.
If you omit SourceOrganismId in the INI file when loading marker mappings, the OrganismId of the maps where the markers are mapped is used as the SourceOrganismId. For example, if you store repeat locations as markers, map the same repeats on maps with different OrganismIds, and do not want the markers to be linked, remove the SourceOrganismId line from the INI file; the markers will then be assigned different SourceOrganismIds.
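The identity rule described above can be illustrated with a small Python sketch. It is not PersephoneShell code: the `MarkerStore` class, its fields and the organism ids are hypothetical, and the sketch only models the idea that a marker is keyed by its name plus SourceOrganismId, with the map's OrganismId used as a fallback.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MarkerStore:
    """Toy in-memory stand-in for the marker tables (hypothetical)."""
    ids: dict = field(default_factory=dict)       # (name, organism) -> marker id
    mappings: list = field(default_factory=list)  # (marker id, map name, position)

    def add_mapping(self, name, map_name, position, map_organism_id,
                    source_organism_id: Optional[int] = None) -> int:
        # When SourceOrganismId is omitted, fall back to the organism of the target map.
        organism = source_organism_id if source_organism_id is not None else map_organism_id
        key = (name, organism)
        # Reuse the existing marker id if this (name, organism) pair is already known.
        marker_id = self.ids.setdefault(key, len(self.ids) + 1)
        self.mappings.append((marker_id, map_name, position))
        return marker_id


store = MarkerStore()
# Same name and an explicit shared SourceOrganismId: one marker, two mappings.
a = store.add_mapping("M0001", "chr1", 100, map_organism_id=10, source_organism_id=10)
b = store.add_mapping("M0001", "scaffold7", 250, map_organism_id=20, source_organism_id=10)
# Same name with SourceOrganismId omitted: each map supplies its own organism,
# so the two records become different markers and would not be linked.
c = store.add_mapping("R1", "chr1", 500, map_organism_id=10)
d = store.add_mapping("R1", "scaffold7", 900, map_organism_id=20)
print(a == b, c == d)
```

In this sketch the first pair shares one marker id, which corresponds to the case where Persephone draws a connecting line, while the second pair does not.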
Names and sequences are stored in dedicated tables in the Persephone database. Other marker properties are stored as key-value pairs called qualifiers. Both the key and the value are stored as strings.
Note
This section describes how to load markers with Delimited TXT files. See Add Markers in the Use Case section for steps to load markers from GFF files.
Markers can be loaded with or without mapping. The command to load the markers is:
add marker -c <controlFile.ini>
Loading markers with mapping
The marker tracks can be added to the sequence or genetic maps. Use the command add genetic_map to create genetic maps with markers. Here, we will show how to create marker tracks in the sequence maps.
A simple case, using a TAB-delimited text file
The minimum data set necessary to create a marker track should provide
- marker name,
- map name,
- position.
The position can contain just a single coordinate:
Column index: 0 | 1 | 2
Primary Name | Map | Start
cfn0529264 | chr1 | 3390803
cfn1095161 | chr1 | 7284162
cfn1103522 | chr1 | 6398799
cfn1111691 | chr1 | 2011511
cfn0530002 | chr1 | 3911208
The corresponding INI file with instructions (abbreviated) is shown below:
[MapSet]
; Either MapSetId or MapSetPath is required.
; MapSetPath: path of a target map set.
MapSetPath=/Triticum aestivum/Wheat IWGSC v2.1
[MarkerType]
; To add or update a marker type, specify a type name and description.
;TypeName=Description
;DArT=Diversity Arrays Technology
[MarkerNameType]
FULL_NAME=Full feature name
[MarkerSequenceType]
;ASSAY_SEQ="Assay sequence"
[Marker]
; Source (required): a TXT file or GFF file accessed locally or remotely via URL.
Sources=$DATA/wheat/tabw280k.txt.gz
; FileType: {Text (delimited text file)|Gff}
FileType=Text
; Origin (required): database or institution that the file originates from
Origin="URGI_INRA"
; Tracks (required): comma delimited. Corresponding section(s) named the same must exist
; At least one track needs to be specified.
Tracks="Track1"
[Track1]
; TrackName: track name to be displayed on the plate.
TrackName="TaBW280K markers"
; TrackDescription: track description shared across all tracks with the same name in the MapSet.
TrackDescription="Axiom genotyping array TaBW280K probes mapped by gmap."
PrimaryMarkerNameType=FULL_NAME
; AddModes: choose a mode to add name, mapping, qualifiers or sequence among
; 1. AddAnyway: Add regardless of duplication. Faster as it does not check.
; 2. AddOrDie: add if not exists; die (throw exception) otherwise.
; 3. AddOrUpdate: add if not exists; update otherwise.
; 4. AddOrSkip: add if not exists; skip otherwise.
MarkerNameAddMode=AddOrSkip
MappingAddMode=AddAnyway
; Either marker type or marker type index to read from table should be provided.
; MarkerType: specify a marker type. (single)
MarkerType=SNP
; Parsing logic
; SkipHeaderLines: the number of lines to skip parsing
TextSkipHeaderLines=1
; Delimiter: specify one among Colon(:), Comma(,), Period(.), Hyphen(-), SemiColon(;), Slash(/), Tab(\t), VerticalBar(|)
TextDelimiter=Tab
; TextMapNameIndex (required for mapping): column index(0-based) for map names.
TextMapNameIndex=1
; StartIndex (required for mapping): column index(0-based) for start positions.
TextStartIndex=2
; EndIndex: column index(0-based) for end positions. Nullable for point markers.
; Left commented out here because the sample file above has no End column.
;TextEndIndex=3
; MarkerNameIndex (required): column index(0-based) for marker names.
TextMarkerNameIndex.0=FULL_NAME
The INI file above has the minimal set of instructions necessary to create the marker track. Please study the comments for each instruction. Most of the optional directives have been removed from the file for the sake of clarity.
The file has three important sections:
[MapSet] - to specify which map set the track will be added to;
[Marker] - to describe the source data and list the track sections to be loaded as new tracks;
[Track1] - the track section referenced by the Tracks= instruction in the [Marker] section. The name of this section is arbitrary. More than one track section can be added to the control file, one section per track to be loaded. Although the design allows loading multiple tracks from one file, we are planning to remove this functionality and simplify the INI file so that it loads only one track; there will then be no need for a separate track section.
The text file has tab-delimited columns with the information needed to place markers on the genome: map name, marker name and position. The Text*Index instructions point to the columns that contain this information. Please note that the indexes are 0-based.
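To make the 0-based column indexes concrete, here is a minimal Python sketch that reads a three-column file like the sample shown earlier. It only mirrors the idea behind TextMarkerNameIndex.0, TextMapNameIndex=1, TextStartIndex=2 and TextSkipHeaderLines=1; the constant names, the `read_markers` helper and the in-memory sample are hypothetical and do not reflect how PersephoneShell parses files internally.

```python
import csv
import io

# Hypothetical in-memory copy of the first rows of the three-column sample.
SAMPLE = (
    "Primary Name\tMap\tStart\n"
    "cfn0529264\tchr1\t3390803\n"
    "cfn1095161\tchr1\t7284162\n"
)

# 0-based column indexes: marker name, map name, start position.
NAME_IDX, MAP_IDX, START_IDX = 0, 1, 2
SKIP_HEADER_LINES = 1


def read_markers(text):
    """Return (name, map, start) tuples, skipping the header line(s)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    markers = []
    for row in rows[SKIP_HEADER_LINES:]:
        markers.append((row[NAME_IDX], row[MAP_IDX], int(row[START_IDX])))
    return markers


markers = read_markers(SAMPLE)
print(markers)
```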
Each marker can have more than one name. The name types are introduced in the section [MarkerNameType] or reused from the database. The instruction
TextMarkerNameIndex.0=FULL_NAME
indicates that the name of type 'FULL_NAME' can be found in the first column (index=0). One of the names should be set as primary. If there is only one possible name, the primary name will be set automatically.
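The indexed form of the instruction (a column number after the dot, a name type after the equals sign) can be read as a small column-to-type table. The Python sketch below shows one way to collect such lines; the `indexed_settings` helper is hypothetical, and the second and third name lines in `track_section` are made up for illustration rather than taken from a real control file.

```python
def indexed_settings(lines, prefix):
    """Collect 'Prefix.N=VALUE' lines into {column_index: value}.

    Lines starting with ';' are treated as comments, as in the INI samples.
    """
    found = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        base, dot, index = key.strip().partition(".")
        if base == prefix and dot and index.isdigit():
            found[int(index)] = value.strip().strip('"')
    return found


track_section = [
    "PrimaryMarkerNameType=FULL_NAME",
    "TextMarkerNameIndex.0=FULL_NAME",
    "TextMarkerNameIndex.6=PROBESET_ID",   # illustrative second name column
    ";TextMarkerNameIndex.7=ALIAS",        # commented out, so ignored
]
name_columns = indexed_settings(track_section, "TextMarkerNameIndex")
print(name_columns)
```

Each key of the resulting dictionary is a 0-based column index and each value is the name type expected in that column.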
A more advanced case, loading markers with qualifiers and extra names.
Column index: 0 | 1 | 2 | 3 | 4 | 5 | 6
Primary Name | Map | Start | End | AA_# | BB_# | ProbeSetId
cfn0529264 | chr1 | 3390803 | 3390803 | 12 | 82 | AX-89310613
cfn1095161 | chr1 | 7284162 | 7284162 | 23 | 72 | AX-89310678
cfn1103522 | chr1 | 6398799 | 6398799 | 14 | 64 | AX-89311776
cfn1111691 | chr1 | 2011511 | 2011511 | 23 | 70 | AX-89312023
cfn0530002 | chr1 | 3911208 | 3911208 | 33 | 60 | AX-89314886
Here is an example of adding the marker track to a map set for the wheat genome. The file, wheat280k-txt.ini, is in the Samples/Marker subfolder of the PersephoneShell package.
[ProcessRun]
; Run description: if specified, a custom description will be used. Will be ignored if a RunId is specified.
; otherwise, "Added markers for {MapSet Accession No.} from {Sources}." will be used.
;RunDescription="DArT markers from http://solanaceae.plantbiology.msu.edu/pgsc_download.shtml"
[MapSet]
; Either MapSetId or MapSetPath is required.
; MapSetId: id of a target map set.
;MapSetId=270702459
; MapSetPath: path of a target map set.
MapSetPath=/Triticum aestivum/Wheat IWGSC v2.1
[MarkerType]
; To add or update a marker type, specify a type name and description.
;TypeName=Description
;DArT=Diversity Arrays Technology
[MarkerNameType]
;BristolAffyCode=Bristol Affy Code
;ALIAS="Alias"
PROBESET_ID=ID for Axiom array
FULL_NAME=Full feature name
[MarkerSequenceType]
;ASSAY_SEQ="Assay sequence"
[Marker]
; Source (required): a TXT file or GFF file located locally or remotely accessible via URL.
Sources=$DATA/wheat/tabw280k.txt.gz
; FileType: {Text (delimited text file)|Gff}
FileType=Text
; Origin (required): database or institution that the file originates from
Origin="URGI_INRA"
; MappingMethod: method to map markers in the file. e.g. BLAST, RepeatMasker
; if not specified, 'Unknown' is used.
MappingMethod="gmap"
; Tracks (required): comma delimited. Corresponding section(s) named the same must exist
; At least one track needs to be specified.
Tracks="Track1"
; SourceOrganismId: markers are suggested to be unique in a source organism.
; Specify a source organism if you want to lookup markers belonging to the organism.
; Otherwise, inferred by target MapSet.
;SourceOrganismId=1534
; BypassLookup: When adding new markers, PSH will try to reuse the markers with the same OrganismId and name, assuming they are the same markers
; mapped onto different positions. To reuse them, a lookup table with existing markers is built.
; If BypassLookup=true, PSH will skip this step and will issue a database query for each inserted marker in order to
; find an identical marker with the same OrganismId and name. When adding just a few markers
; while the database contains millions of markers with the same OrganismId you can bypass the lookup and
; save time and memory. On the contrary, if you insert a large number of markers, leave
; BypassLookup=false (default), and the lookup will be built from the markers in the database that have
; the same OrganismId. To reduce the size of the lookup table, use NamePrefixesForLookup (see below).
;BypassLookup=true
; SearchAliases: Indicates whether names other than the primary name are searched. Default value is false.
;SearchAliases=false
; NamePrefixesForLookup: When the database contains millions of markers for this organism, the lookup table
; can take extra memory. To narrow down the list of markers to look up, provide
; the common prefix(es) for the marker names to be loaded, if possible. Only markers that have names beginning with the
; provided prefixes will be fetched from the database. Note, that psh will try to derive such name prefixes automatically.
; Use this instruction if the suggested prefixes are not specific enough and still result in a large lookup size.
;NamePrefixesForLookup=solcap,Sol
; Commit frequency: indicates how often the process commits to the database. Every N markers.
CommitFrequency=1000
[Track1]
; TrackName: track name to be displayed on the plate.
TrackName="TaBW280K markers"
; ReuseTrack: when true, markers will be added to an existing track with this name
;ReuseTrack=true
; TrackDescription: track description shared across all tracks with the same name in the MapSet.
TrackDescription="Axiom genotyping array TaBW280K probes mapped by gmap."
; TrackColor: {NamedColor|HTML hex code|R,G,B}
TrackColor=255,0,0
PrimaryMarkerNameType=FULL_NAME
; AddModes: choose a mode to add name, mapping, qualifiers or sequence among
; 1. AddAnyway: Add regardless of duplication. Faster as it does not check.
; 2. AddOrDie: add if not exists; die (throw exception) otherwise.
; 3. AddOrUpdate: add if not exists; update otherwise.
; 4. AddOrSkip: add if not exists; skip otherwise.
MarkerNameAddMode=AddOrSkip
MarkerSequenceAddMode=AddOrSkip
MarkerQualifierAddMode=AddAnyway
MappingAddMode=AddAnyway
MappingQualifierAddMode=AddAnyway
; SkipHeaderLines: the number of lines to skip parsing
TextSkipHeaderLines=1
; CommentPrefix: comment prefix to skip parsing
;TextCommentPrefix="##"
; Delimiter: specify one among Colon(:), Comma(,), Period(.), Hyphen(-), SemiColon(;), Slash(/), Tab(\t), VerticalBar(|)
TextDelimiter=Tab
; Either marker type or marker type index to read from table should be provided.
; MarkerType: specify a marker type. (single)
MarkerType=SNP
; TextMarkerTypeIndex: column index(0-based) for marker types. (multiple)
;TextMarkerTypeIndex=2
; TextMapNameIndex (required for mapping): column index(0-based) for map names.
TextMapNameIndex=1
; StartIndex (required for mapping): column index(0-based) for start positions.
TextStartIndex=2
; EndIndex: column index(0-based) for end positions. Nullable for point markers.
TextEndIndex=3
; MarkerNameIndex (required): column index(0-based) for marker names.
TextMarkerNameIndex.0=FULL_NAME
TextMarkerNameIndex.6=PROBESET_ID
; FilterIndex: column index(0-based) for filters delimited comma.
; if not specified, all the items will be included.
;TextFilterIndex=0
;TextFilterValue="Brachypodium distachyon"
; TextMarkerQualifierIndex: used to provide additional information for each marker.
TextMarkerQualifierIndex.4=AA_#
TextMarkerQualifierIndex.5=BB_#
; TrackQualifier: Add qualifiers that can help filtering the large track lists when using the Edit tracks interface in webPersephone
;TrackQualifier.Author=JHU
ParentGroupName=Markers
IsShownFirst=false
[MapMapping]
; If no mapping is found in this section, it assumes that each MAP_NAME in file exactly matches a MAP_NAME in DB.
; If map names in file are different from those in DB, map each MAP_NAME in file to its MAP_ID or ACCESSION_NO in DB.
; Otherwise, marker will be created without mapping.
;MAP_NAME in file=MAP_ID or ACCESSION_NO in DB
;LoadListedMapsOnly=true
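The four add modes listed in the control file comments can be read as the behaviour sketched below in Python. This is an illustration only, not PersephoneShell code: the `AddMode` enum, the `add_record` helper and the dictionary standing in for a database table are hypothetical, and the sketch assumes that "exists" means "a record with the same key is already stored".

```python
from enum import Enum


class AddMode(Enum):
    ADD_ANYWAY = "AddAnyway"
    ADD_OR_DIE = "AddOrDie"
    ADD_OR_UPDATE = "AddOrUpdate"
    ADD_OR_SKIP = "AddOrSkip"


def add_record(table, key, value, mode):
    """Apply one add mode to a dict of lists that stands in for a table."""
    existing = table.setdefault(key, [])
    if mode is AddMode.ADD_ANYWAY or not existing:
        existing.append(value)      # no duplicate check, or nothing to collide with
    elif mode is AddMode.ADD_OR_DIE:
        raise ValueError(f"{key!r} already exists")
    elif mode is AddMode.ADD_OR_UPDATE:
        existing[-1] = value        # overwrite the stored value
    elif mode is AddMode.ADD_OR_SKIP:
        pass                        # keep the stored value untouched
    return table


names = {}
add_record(names, "cfn0529264", "v1", AddMode.ADD_OR_SKIP)    # added
add_record(names, "cfn0529264", "v2", AddMode.ADD_OR_SKIP)    # skipped
add_record(names, "cfn0529264", "v3", AddMode.ADD_OR_UPDATE)  # replaces v1
add_record(names, "cfn0529264", "v4", AddMode.ADD_ANYWAY)     # duplicate kept
print(names)
```

The sketch also shows why AddAnyway can be faster: it is the only branch that never needs to look at what is already stored.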
The column index values are 0-based and specify the meaning of each column. For example,
TextMapNameIndex=4
indicates that the fifth column (index=4) contains map names.
Another version of the column index instruction can be written as:
TextMarkerNameIndex.1=FULL_NAME
This means that the column with index 1 contains marker names of type "FULL_NAME". A marker can have multiple names; each has its own type, depending on the naming system used.
One of the names should be primary. If there is only one name, the primary name will be assigned automatically, but if there is a choice, the ambiguity should be resolved by this instruction:
PrimaryMarkerNameType=FULL_NAME
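The rule for choosing the primary name can be sketched in Python as follows. The `pick_primary` helper is hypothetical, and raising an error for the ambiguous case is an assumption made for the sketch; the text above only says that the ambiguity should be resolved with PrimaryMarkerNameType.

```python
def pick_primary(names_by_type, primary_type=None):
    """Return the primary name from a {name_type: name} dictionary."""
    if primary_type is not None:
        # An explicit PrimaryMarkerNameType decides which name is primary.
        return names_by_type[primary_type]
    if len(names_by_type) == 1:
        # A single name becomes primary automatically.
        return next(iter(names_by_type.values()))
    # Assumption for this sketch: several names and no instruction is an error.
    raise ValueError("several name types present; set PrimaryMarkerNameType")


single = pick_primary({"FULL_NAME": "cfn0529264"})
chosen = pick_primary(
    {"FULL_NAME": "cfn0529264", "PROBESET_ID": "AX-89310613"},
    primary_type="FULL_NAME",
)
print(single, chosen)
```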
The markers and their positions should be listed in the data file (Tab-delimited text or GFF). For the sequence-based maps the coordinates are 1-based.
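Because sequence-map coordinates are 1-based and the end position is optional for point markers, a quick pre-check of the data file can catch 0-based input before loading. The Python sketch below is one possible check; the `parse_position` helper is hypothetical, and treating a missing end as equal to the start is an assumption based on the sample data, where point markers repeat the start value in the End column.

```python
from typing import Optional, Tuple


def parse_position(start_text: str, end_text: Optional[str] = None) -> Tuple[int, int]:
    """Return (start, end); a missing or empty end yields a point marker."""
    start = int(start_text)
    end = int(end_text) if end_text not in (None, "") else start
    if start < 1:
        # Coordinates on sequence maps are 1-based, so 0 signals 0-based input.
        raise ValueError("start must be >= 1 on a sequence map")
    return start, end


point = parse_position("3390803")
interval = parse_position("10", "25")
print(point, interval)
```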
Add the markers in an interactive or command line mode (see Running PersephoneShell). In the interactive mode, enter the following:
PS> add markers -c $DATA/Samples/Marker/wheat280k-txt.ini -t
A verification message will be displayed.
As usual, please use the '-t' switch first to test the files before loading. Then remove '-t' from the command line and, optionally, add '-v' for verbose output.
Loading markers without mapping
Markers can be loaded without mapping. Use the file markersonly.ini placed under Samples/Marker subfolder of PersephoneShell.
To load markers only, make sure that your INI file does not have the section [MapSet].
[MarkerType]
; To add or update a marker type, specify a type name and description.
;TypeName=DescriptionOfTheType
SNP="Single Nucleotide Polymorphism"
[MarkerNameType]
; Markers can have multiple names. Add new name types in this section by providing the type and its description
SNP_ID="SNP ID"
panda_assay_id=panda ID
[MarkerSequenceType]
ASSAY_SEQ="Assay sequence"
REVERSE_PRIMER="reverse primer"
[Marker]
; Source (required): a TXT file located locally or remotely accessible via URL.
Sources="$DATA/marker/markersonly.txt"
; Origin (required): database or institution that the file originates from
Origin="Test"
; TextSkipHeaderLines: the number of header lines to skip before parsing
TextSkipHeaderLines=1
; SourceOrganismId (required): markers are suggested to be unique in a source organism.
; Specify a source organism if you want to lookup markers belonging to the organism.
SourceOrganismId=4565
; BypassLookup: When adding new markers, PSH will try to reuse the markers with the same OrganismId and name.
; To reuse them, a lookup table with existing markers is built.
; If BypassLookup=true, PSH will skip this step and will issue a database query for each inserted marker in order to
; find an identical marker with the same OrganismId and name. When adding just a few markers
; while the database contains millions of markers with the same OrganismId you can bypass the lookup and
; save time and memory. On the contrary, if you insert a large number of markers, leave
; BypassLookup=false (default), and the lookup will be built from the markers in the database that have
; the same OrganismId. To reduce the size of the lookup table, use NamePrefixesForLookup (see below).
;BypassLookup=true
; SearchAliases: Indicates if other names besides primary name are searched or not. Default value is false.
;SearchAliases=false
; NamePrefixesForLookup: a lookup table is built in memory from the marker names stored in the database to speed up finding the existing markers.
; If a marker with identical name and OrganismId is found, its MarkerId will be reused. Building the lookup can be sometimes tricky:
; the size of the lookup can be prohibitively large.
; To reduce the lookup table size, PersephoneShell will try to find the common name prefix in the list of the new markers
; and use it to filter the lookup table. If the common prefix of the markers to be loaded is known in advance, it can be defined here.
;NamePrefixesForLookup="rs"
; Commit frequency: indicates how often the process commits markers. Every N markers.
CommitFrequency=1000
; AddModes: choose a mode to add name, mapping, qualifiers or sequence among
; 1. AddAnyway: Add regardless of duplication. Faster as it does not check.
; 2. AddOrDie: add if not exists; die (throw exception) otherwise.
; 3. AddOrUpdate: add if not exists; update otherwise.
; 4. AddOrSkip: add if not exists; skip otherwise.
;MarkerNameAddMode=AddOrSkip
;MarkerSequenceAddMode=AddAnyway
;MarkerQualifierAddMode=AddAnyway
; Either marker type or marker type index should be provided.
; MarkerType: specify a marker type. (single). All markers will have the same type specified here.
MarkerType="SNP"
; TextMarkerTypeIndex: column index(0-based) for marker types. (multiple). Each marker can have its own type.
; Each type should exist in the database or described in this control file in section [MarkerType]
;TextMarkerTypeIndex=2
; TextMarkerNameIndex (required): column index(0-based) for marker names.
; The following means that column 2 (third from the left) contains marker name of type "panda_assay_id"
TextMarkerNameIndex.2=panda_assay_id
; TextMarkerSequenceIndex: column index(0-based) for a marker sequence.
; The following means that the column 1 contains sequence of type "ASSAY_SEQ"
TextMarkerSequenceIndex.1=ASSAY_SEQ
TextMarkerSequenceIndex.3=REVERSE_PRIMER
; Qualifiers: used to add additional information.
; TextMarkerQualifierIndex.COL_INDEX(0-based)=qualifierName((:displayText),dataType,dataFormat)
; The following means that text in column 4 will be stored as qualifier "extraQ"
TextMarkerQualifierIndex.4=extraQ
TextMarkerQualifierIndex.0=snp_id
The sample data file for this INI file is shown below:
snp_id | assay | panda_assay_id | primer_reverse | extraQ
marker1 | accgggt[t/c]gggacct | 3390803 | ccgtattacgggattcgga | 12
marker2 | accccgt[a/c]gattaat | 7284162 | ttaggcaatcggcataaat | 23
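To show how the column instructions in the markers-only control file relate to this sample, here is a minimal Python sketch that splits each row into names, sequences and qualifiers. It is only an illustration: the dictionaries and the `to_marker` helper are hypothetical, the file is assumed to be tab-delimited, and column 3 is assumed to use the REVERSE_PRIMER type declared in [MarkerSequenceType].

```python
SAMPLE = (
    "snp_id\tassay\tpanda_assay_id\tprimer_reverse\textraQ\n"
    "marker1\taccgggt[t/c]gggacct\t3390803\tccgtattacgggattcgga\t12\n"
    "marker2\taccccgt[a/c]gattaat\t7284162\tttaggcaatcggcataaat\t23\n"
)

# 0-based column index -> type or qualifier name, following the control file.
NAME_COLUMNS = {2: "panda_assay_id"}
SEQUENCE_COLUMNS = {1: "ASSAY_SEQ", 3: "REVERSE_PRIMER"}
QUALIFIER_COLUMNS = {4: "extraQ", 0: "snp_id"}


def to_marker(line):
    """Split one data line into names, sequences and qualifiers (all strings)."""
    cells = line.split("\t")

    def pick(columns):
        return {label: cells[index] for index, label in columns.items()}

    return {
        "names": pick(NAME_COLUMNS),
        "sequences": pick(SEQUENCE_COLUMNS),
        "qualifiers": pick(QUALIFIER_COLUMNS),
    }


# Skip one header line, as TextSkipHeaderLines=1 does.
markers = [to_marker(line) for line in SAMPLE.splitlines()[1:]]
print(markers[0])
```

Note that in this control file the snp_id column is stored as a qualifier, while the marker name comes from the panda_assay_id column.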