Adding Markers (Delimited Text Files)
This section describes how to add markers from delimited text files to your Persephone database with the add command (see Add). The steps below show how to use the add command with the control file (see Control Files) "add_Bdistachyon_Bd3-1xBd21.ini" to load markers (see the table below) for Brachypodium distachyon.
In general Persephone terms, a marker is a named location on a map. It can be a single position or an interval. Each marker must have at least one name, which is treated as its primary name; additional aliases are optional. Each name has a name type, which helps to distinguish different naming systems.
A marker can also have associated sequences: probe sequences, primers, sequences of related proteins, etc.
Marker identity
Each marker has a SourceOrganismId that identifies the organism it originates from. This is not just a housekeeping record; SourceOrganismId plays an important role that needs a detailed explanation. Marker identity is based on the marker name AND its SourceOrganismId, which helps to distinguish markers with somewhat generic names like 'M0001' derived from different species.
When adding markers to the database, PersephoneShell first checks whether a marker with an identical name and SourceOrganismId is already present. If it finds such a marker (and if AddMode is set to reuse existing markers), PersephoneShell does not insert a new marker. Instead, it references the existing marker by its internal MarkerId and just adds a new mapping location for it. Please keep this logic in mind when adding new markers to several map sets.
When displaying two maps with marker tracks, Persephone checks whether markers with the same MarkerId (read: same name and SourceOrganismId) are present on both maps. Such matching markers are automatically connected with a line. If you see a pair of maps with markers that appear identical but the connecting line is missing, the most likely explanation is that the markers have different SourceOrganismIds.
If you omit SourceOrganismId in the INI file when loading the marker mappings, the OrganismId of the maps where the markers are mapped will be used as SourceOrganismId. For example, if you store repeat locations as markers, map the same repeats on different maps with different OrganismIds, and do not want the markers to be linked, remove the line with SourceOrganismId from the INI file. This forces the markers to be assigned to different SourceOrganismIds.
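The identity rule described above can be illustrated with a minimal Python sketch. It is not PersephoneShell code: the `MarkerStore` class, its fields, and the organism IDs are hypothetical stand-ins, and the sketch assumes that existing markers are reused. It only shows how keying markers by (name, SourceOrganismId) decides whether two mappings share one MarkerId.

```python
from dataclasses import dataclass, field


@dataclass
class MarkerStore:
    """Toy in-memory stand-in for the marker tables (hypothetical, not the real schema)."""
    ids: dict = field(default_factory=dict)       # (name, source_organism_id) -> marker_id
    mappings: list = field(default_factory=list)  # (marker_id, map_name, position)

    def add(self, name, map_name, position, map_organism_id, source_organism_id=None):
        # When SourceOrganismId is omitted, fall back to the organism of the target map.
        organism = map_organism_id if source_organism_id is None else source_organism_id
        key = (name, organism)
        # Reuse the existing marker id if this (name, organism) pair is already known.
        marker_id = self.ids.setdefault(key, len(self.ids) + 1)
        self.mappings.append((marker_id, map_name, position))
        return marker_id


store = MarkerStore()
# Same name and same explicit source organism: one marker with two mappings.
a = store.add("M0001", "map_A", 10.0, map_organism_id=1, source_organism_id=1534)
b = store.add("M0001", "map_B", 12.5, map_organism_id=2, source_organism_id=1534)
# Same name, SourceOrganismId omitted, maps from different organisms: two markers.
c = store.add("R0001", "map_A", 3.0, map_organism_id=1)
d = store.add("R0001", "map_B", 4.0, map_organism_id=2)
print(a == b, c == d)  # True False
```

In this sketch the first pair would be connected by a line when both maps are displayed, while the second pair would not, which mirrors the repeat-location example.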
Names and sequences are stored in separate tables in the Persephone database. Other marker properties are stored as key-value pairs called qualifiers. Both the key and the value are stored as strings, but the value can be given a different type, for example, a floating point number. This helps in sorting and filtering qualifiers based on their values.
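To see why a value type matters for sorting, here is a small Python sketch. The `CONVERTERS` table and `typed_value` helper are hypothetical; the type names string, int and double are the ones listed in the control file comments, and the mapping to Python types is an assumption made for illustration.

```python
# Hypothetical converters for the qualifier data types named in the control file.
CONVERTERS = {"string": str, "int": int, "double": float}


def typed_value(raw, data_type="string"):
    """Convert a qualifier value stored as text into its declared type."""
    return CONVERTERS[data_type](raw)


raw_values = ["9", "10", "2"]
# Sorting the stored strings compares them character by character.
as_text = sorted(raw_values)
# Sorting by the typed value gives the numeric order.
as_int = sorted(raw_values, key=lambda v: typed_value(v, "int"))
print(as_text)  # ['10', '2', '9']
print(as_int)   # ['2', '9', '10']
```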
Note
This section describes how to load markers with Delimited TXT files. See Add Markers in the Use Case section for steps to load markers from GFF files.
Species | Marker Name | Marker Type | Population | Linkage Group | Position (cM)
Brachypodium distachyon | BdSSR544 | SSR | Bd3-1 x Bd21 | a | 0
Brachypodium distachyon | CD725461 | COS | Bd3-1 x Bd21 | a | 15.9
Brachypodium distachyon | Wheat61 | COS | Bd3-1 x Bd21 | a | 23.4
Brachypodium distachyon | INTR2-8 | COS | Bd3-1 x Bd21 | a | 35.8
Loading markers with mapping
Review a copy of the "add_Bdistachyon_Bd3-1xBd21.ini" control file, which is included in the "Samples\Marker" folder and is shown below. Note that the data in the table above corresponds to the settings in the INI file below. The column index values (0-based) specify which column contains which category. For example,
TextMapNameIndex=4
indicates that the fifth column (index=4) contains map names.
Another version of column index instruction can be written as:
TextMarkerNameIndex.1=FULL_NAME
This means that the column with index 1 contains marker names that have the type "FULL_NAME". A marker can have multiple names, and each name has its own type, depending on the naming system used.
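As an illustration of how 0-based column indexes select fields from a tab-delimited line, here is a minimal Python sketch. It is not how PersephoneShell parses files; the `read_markers` function and the constant names are hypothetical, and the index values are the ones used in the sample control file.

```python
import csv
import io

# Two sample rows, tab-delimited, preceded by one header line.
SAMPLE = (
    "Species\tMarker Name\tMarker Type\tPopulation\tLinkage Group\tPosition (cM)\n"
    "Brachypodium distachyon\tBdSSR544\tSSR\tBd3-1 x Bd21\ta\t0\n"
    "Brachypodium distachyon\tCD725461\tCOS\tBd3-1 x Bd21\ta\t15.9\n"
)

# 0-based column indexes, mirroring the values in the control file.
NAME_INDEX = 1       # TextMarkerNameIndex.1=FULL_NAME
TYPE_INDEX = 2       # TextMarkerTypeIndex=2
MAP_NAME_INDEX = 4   # TextMapNameIndex=4
START_INDEX = 5      # TextStartIndex=5
SKIP_HEADER_LINES = 1


def read_markers(text):
    """Pick the configured columns from each data line of a tab-delimited text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    markers = []
    for row in rows[SKIP_HEADER_LINES:]:
        markers.append({
            "name": row[NAME_INDEX],
            "name_type": "FULL_NAME",
            "type": row[TYPE_INDEX],
            "map": row[MAP_NAME_INDEX],
            "start": float(row[START_INDEX]),
        })
    return markers


markers = read_markers(SAMPLE)
print(markers[1]["name"], markers[1]["map"], markers[1]["start"])  # CD725461 a 15.9
```

Counting the columns this way before loading is a quick check that each index in the control file points at the intended column.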
The markers and their positions should be listed in a file (tab-delimited text or GFF). The positions reference the maps loaded in the previous step. It is also possible to derive the list of maps from the file with marker positions. In this case, the start and end coordinates of each map will be based on the lowest and the highest marker position listed in that file. The example below assumes that the maps have already been created in the database under a map set called "Bd3-1 x Bd21 - 7".
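The rule for deriving map coordinates from marker positions can be sketched in a few lines of Python. The `map_extents` function is a hypothetical illustration of the min/max rule described above, not PersephoneShell's implementation; the positions are taken from the sample table.

```python
def map_extents(positions):
    """Return {map_name: (start, end)} from an iterable of (map_name, position)."""
    extents = {}
    for map_name, pos in positions:
        # Start from the first position seen for this map, then widen as needed.
        lo, hi = extents.get(map_name, (pos, pos))
        extents[map_name] = (min(lo, pos), max(hi, pos))
    return extents


rows = [("a", 0.0), ("a", 15.9), ("a", 23.4), ("a", 35.8)]
extents = map_extents(rows)
print(extents)  # {'a': (0.0, 35.8)}
```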
[ProcessRun]
; Run description: if specified, a custom description will be used. Will be ignored if a RunId is specified.
; otherwise, "Added markers for {MapSet Accession No.} from {Sources}." will be used.
RunDescription="Added genetic markers for Brachypodium distachyon - population:Bd3-1 x Bd21 from http://pgdbj.jp/kazusa/jsp/mapSelect.do?change=Population&crop_id=9&population_id=10&gene_or_physical=1."
[MapSet]
; Either MapSetId or MapSetPath is required.
; MapSetId: id of a target map set.
;MapSetId=270702459
; MapSetPath: path of a target map set.
MapSetPath="/Brachypodium/Bd3-1 x Bd21 - 7"
[MarkerType]
; To add or update a marker type, specify a type name and description.
;TypeName=Description
SSR="Simple Sequence Repeat."
COS="Conserved Ortholog Set."
IFLP="Intron Fragment Length Polymorphism."
Indel="Insertion/Deletion."
[MarkerNameType]
[MarkerSequenceType]
[Marker]
; Source (required): a TXT file or GFF file located locally or remotely accessible via URL.
Source="Samples/Marker/Bdistachyon_Bd3-1xBd21.txt"
; FileType: {Text (delimited text file)|Gff}
FileType=Text
; Origin (required): database or institution that the file originates from
Origin="PGDB"
; CoordinateSystem (required): 1 (one-based) / 0 (zero-based). Default value is 1.
CoordinateSystem=0
; MappingMethod: method to map markers in the file. e.g. BLAST, RepeatMasker
; if not specified, 'Unknown' is used.
;MappingMethod=""
; Tracks (required): comma delimited. Corresponding section(s) named the same must exist
; At least one track needs to be specified.
Tracks="Markers"
; SourceOrganismId: markers are suggested to be unique in a source organism.
; Specify a source organism if you want to lookup markers belonging to the organism.
; Otherwise, inferred by target MapSet.
;SourceOrganismId=1534
; BypassLookup: When adding new markers, PSH will try to reuse the markers with the same OrganismId and name, assuming they are the same markers
; mapped onto different positions. To reuse them, a lookup table with existing markers is built.
; If BypassLookup=true, PSH will skip this step and will issue a database query for each inserted marker in order to
; find an identical marker with the same OrganismId and name. When adding just a few markers
; while the database contains millions of markers with the same OrganismId you can bypass the lookup and
; save time and memory. On the contrary, if you insert a large number of markers, leave
; BypassLookup=false (default), and the lookup will be built from the markers in the database that have
; the same OrganismId. To reduce the size of the lookup table, use NamePrefixesForLookup (see below).
;BypassLookup=true
; SearchAliases: Indicates whether names other than the primary name are searched. Default value is false.
;SearchAliases=false
; NamePrefixesForLookup: When the database contains millions of markers for this organism, the lookup table
; can take extra memory. To narrow down the list of markers to look up, provide
; the common prefix(es) for the marker names to be loaded, if possible. Only markers that have names beginning with the
; provided prefixes will be fetched from the database. Note, that psh will try to derive such name prefixes automatically.
; Use this instruction if the suggested prefixes are not specific enough and still result in a large lookup size.
;NamePrefixesForLookup=solcap,Sol
; Commit frequency: indicates how often the process commits markers. Every N markers.
CommitFrequency=1000
[Markers]
; TrackName: track name to be displayed on the plate.
TrackName="Markers"
; TrackDescription: track description shared across maps in the MapSet.
TrackDescription="Bd3-1xBd21 markers."
; TrackType: choose one among
; 1. GENERIC_BP_TRACK: markers for physical maps.
; 2. MARKER_TRACK: markers for genetic maps.
; 3. HEAT_TRACK: markers for physical maps. Heatmap of physical distance.
; 4. DENSE_BP_MARKER_TRACK: dense marker track for physical maps. Heatmap will not work.
; 5. CYTOBAND_TRACK: cytoband markers. Each marker has to have 'GieStain' qualifier.
TrackType=MARKER_TRACK
; TrackColor: {NamedColor|HTML hex code|R,G,B} - color of marker glyphs
TrackColor=Cyan
; PrimaryMarkerNameType: markers can have multiple names, each name has its type, one of the types should store the primary name;
; The primary name is shown in the marker's label
PrimaryMarkerNameType=FULL_NAME
; GeneratedMarkerNamePrefix: in case there are no appropriate values to be used as marker names, generate names for each marker using this prefix.
; For example, if the feature names are not important, they can be named automatically, like TSS1, TSS2... using prefix 'TSS'.
;GeneratedMarkerNamePrefix=TSS
; AddModes: choose a mode to add name, mapping, qualifiers or sequence among
; 1. AddAnyway: Add regardless of duplication. Faster as it does not check.
; 2. AddOrDie: add if not exists; die (throw exception) otherwise.
; 3. AddOrUpdate: add if not exists; update otherwise.
; 4. AddOrSkip: add if not exists; skip otherwise.
MarkerNameAddMode=AddOrSkip
MarkerSequenceAddMode=AddOrSkip
MarkerQualifierAddMode=AddOrUpdate
MappingAddMode=AddAnyway
MappingQualifierAddMode=AddOrUpdate
; Commands for text-delimited format are below
; TextSkipHeaderLines: the number of lines to skip parsing
TextSkipHeaderLines=1
; TextCommentPrefix: comment prefix; lines starting with it are not parsed
;TextCommentPrefix="##"
; TextDelimiter: specify one among Colon(:), Comma(,), Period(.), Hyphen(-), SemiColon(;), Slash(/), Tab(\t), VerticalBar(|)
TextDelimiter=Tab
; Either marker type or marker type index should be provided.
; MarkerType: specify a marker type. (single). All markers will have the same type specified here.
;MarkerType="SSR"
; TextMarkerTypeIndex: column index(0-based) for marker types. (multiple). Each marker can have its own type.
; Each type should exist in the database or be described in this control file in section [MarkerType]
TextMarkerTypeIndex=2
; TextMapNameIndex (required for mapping): column index(0-based) for map names.
TextMapNameIndex=4
; TextStartIndex (required for mapping): column index(0-based) for start positions.
TextStartIndex=5
; TextEndIndex: column index(0-based) for end positions. Nullable for point markers.
;TextEndIndex=5
; TextMappingScoreIndex: 0-based column index for mapping score
;TextMappingScoreIndex=6
; TextMarkerNameIndex (required): column index(0-based) for marker names.
; The following means that column 1 (second from the left) contains marker name of type "FULL_NAME"
TextMarkerNameIndex.1=FULL_NAME
;TextMarkerNameIndex.2=ALIAS
; TextMarkerSequenceIndex: column index(0-based) for a marker sequence.
; The following means that the column 10 contains sequence of type "ASSAY_SEQ"
;TextMarkerSequenceIndex.10=ASSAY_SEQ
; TextFilterIndex: column index(0-based) of the column used for filtering lines.
; if not specified, all the items will be included.
;TextFilterIndex=0
; TextFilterValues: only lines containing one of these values (separated by comma) in the column specified above will be considered
;TextFilterValues="Brachypodium distachyon"
; Qualifiers: used to add additional information.
; TextMarkerQualifierIndex.COL_INDEX(0-based)=qualifierName((:displayText),dataType,dataFormat)
; The following means that text in column 3 will be stored as qualifier "alleles"
;TextMarkerQualifierIndex.3=alleles
; The following means that column 14 will be stored as qualifier "OTV", shown to the end users as "Off-Target Variants"
; with the type of 'int' (recognized data types: string, int, double)
;TextMarkerQualifierIndex.14=OTV:Off-Target Variants,int
;TextMappingQualifierIndex.COL_INDEX(0-based)=qualifierName((:displayText),dataType,dataFormat)
; Qualifiers for mapping positions (as opposed to marker qualifiers - a marker can have several mappings)
;TextMappingQualifierIndex.12=uniqueMapping
; ParentGroupName: the new track will be placed under a parent node with this name.
; To reduce the number of track nodes on the top level, group the tracks of similar type.
ParentGroupName=markers
; TrackQualifier: Add qualifiers that can help filtering the large track lists when using the Edit tracks interface in webPersephone
;TrackQualifier.Tissue=Liver
;TrackQualifier.Author=JHU
; IsSearchable: If true (default), the track data will be indexed for search. If false, the indexing will be skipped
;IsSearchable=false
[MapMapping]
; LoadListedMapsOnly: if true, only maps listed below will be loaded, otherwise psh will try to match map_names in the file
; to map names in the db (case-sensitive) and if not found, the map will be skipped.
;LoadListedMapsOnly=true
; MapsIdentifiedBy: if all maps in the file instead of the map name are identified by their alternative IDs like MAP_ID, ACCESSION_NO or GENOME_DNA_ID,
; provide the mapping with just one line using either MapName, MapId, AccessionNo or GenomeDnaId, for example:
;MapsIdentifiedBy=AccessionNo
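The four AddMode values described in the control file comments (AddAnyway, AddOrDie, AddOrUpdate, AddOrSkip) can be sketched in Python as follows. This is only an illustration of the documented behavior applied to a plain dictionary; the `add_item` function and the store layout are hypothetical and do not reflect how PersephoneShell stores data.

```python
def add_item(store, key, value, mode):
    """Apply one add mode to a dict of lists and return the action taken."""
    if mode == "AddAnyway":
        # No duplicate check: always append.
        store.setdefault(key, []).append(value)
        return "added"
    if key not in store:
        # All other modes add the item when it does not exist yet.
        store[key] = [value]
        return "added"
    if mode == "AddOrDie":
        raise ValueError(f"{key!r} already exists")
    if mode == "AddOrUpdate":
        store[key] = [value]
        return "updated"
    if mode == "AddOrSkip":
        return "skipped"
    raise ValueError(f"unknown mode {mode!r}")


store = {}
modes = ("AddOrSkip", "AddOrSkip", "AddOrUpdate", "AddAnyway")
actions = [add_item(store, "BdSSR544", "a:0", mode) for mode in modes]
print(actions)  # ['added', 'skipped', 'updated', 'added']
```

In the sample control file, mappings use AddAnyway while marker names use AddOrSkip, so in terms of this sketch a repeated load would append duplicate mappings but leave existing names untouched.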
Copy the control file "add_Bdistachyon_Bd3-1xBd21.ini" to the directory where you installed PersephoneShell (e.g., C:\PersephoneShell).
Add the markers in an interactive or command line mode (see Running PersephoneShell). In the interactive mode, enter the following:
PS> add markers -c add_Bdistachyon_Bd3-1xBd21.ini
In the command line mode enter (use the proper connection name after -s):
C:\PersephoneShell> psh -s ********** add markers -c add_Bdistachyon_Bd3-1xBd21.ini
A verification message will be displayed.
As usual, please use the '-t' switch first to test the files before loading.