Create and Map Tags

The command 'create tag' cuts short subsequences ("tags") out of the reference genomic DNA and turns them into artificial ("fake") markers. These markers can then be mapped onto other genomes (see the 'create markermapping' command below), which provides a way to link two aligned maps. Persephone automatically connects identical markers mapped on different maps.

Command 'create tags'

To cut out the tag sequences, PersephoneShell needs to know the reference genome, the size of each cut-out, the spacing between the tags, and other parameters. All of this information is supplied in the control INI file:

[ProcessRun]
; Run description: if specified, a custom description will be used
; otherwise, "Created sequence tags for {map set}" will be used.
;RunDescription="Create sequence tags for tomatoes"

[Tags]
; MapSetPath or MapSetId (one is required): map set from which the sequence tags will be cut out
MapSetId=2
;MapSetPath=/Arabidopsis thaliana/TAIR10

; TrackName: If LoadToDb is true, a new marker track with this name will be created in this map set.
; Default: "Sequence tags"
TrackName=Tags

; TrackDescription. By default, "Sequence tags of size {size} cut out at step {step}" will be used
;TrackDescription=

; TagSize: Size of the sequence cut-out. Default: 100
TagSize=100
; TagStep: distance between the cut-out starts. Default: 10000
TagStep=20000

; Temporary output files.
; BaseOutputFileName: (required) Base name for the output files: FASTA (*.fa), marker-mapping (*.tab) and INI (*.ini) files will be created using this name.
; Note, the FASTA file will be needed to map the tags onto other genomes.
BaseOutputFileName=$DATA/tomato/tags
; TagPrefix: This prefix will be added to each tag name, for example, for the default prefix "tag", tag names will be "tag1_1","tag1_1000"...
TagPrefix="tg"
; SampleMappingIniFileName: Create a sample INI file used for mapping the tags.
; If the path is not provided, the file will not be created
SampleMappingIniFileName=$DATA/tomato/tagMapping.ini
; LoadToDb - the tag positions will be loaded into the database. Default: false
LoadToDb=true

As usual, the source map set can be identified by the numeric MapSetId or by its MapSetPath.

The appropriate TagSize depends on the repeat content of the genome. For repeat-rich genomes, selecting longer sequences helps make the matches more specific. The default tag size is 100 bp.
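
To picture how TagSize and TagStep interact, here is a minimal Python sketch that cuts fixed-size tags at regular intervals from a sequence. It is only an illustration, not PersephoneShell's actual implementation: the function cut_tags, the 1-based start coordinate, and the name pattern (prefix, then a map index, then the start position) are assumptions inferred from the example names in the INI comments.

```python
from typing import Iterator, Tuple


def cut_tags(sequence: str, map_index: int, tag_size: int = 100,
             tag_step: int = 10000, prefix: str = "tag") -> Iterator[Tuple[str, int, str]]:
    """Yield (name, start, tag sequence) for every full-length cut-out.

    Hypothetical helper: coordinates are assumed to be 1-based and the
    name pattern "<prefix><map index>_<start>" is an assumption.
    """
    # Stop early enough that the last tag is still tag_size long.
    for offset in range(0, len(sequence) - tag_size + 1, tag_step):
        start = offset + 1
        name = f"{prefix}{map_index}_{start}"
        yield name, start, sequence[offset:offset + tag_size]


# Toy example: a 50 bp sequence, 10 bp tags cut every 20 bp.
toy = "ACGT" * 12 + "AC"
tags = list(cut_tags(toy, map_index=1, tag_size=10, tag_step=20, prefix="tg"))
for name, start, seq in tags:
    print(name, start, seq)
```

With a larger TagStep fewer tags are produced, and with a larger TagSize each tag carries more sequence, which is what makes matches more specific in repeat-rich genomes.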

The process of creating tags will generate several files. Most of them will share a common base name. For example, if BaseOutputFileName is 'file', the following files will be created:

file.fa - FASTA file with the tag sequences.

file.tab - the tab-delimited file with positions of the tags.

file.ini - the INI file with instructions for the 'add markers' command, which loads the markers (tags) into the database.
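
After a run, it can be handy to confirm that all three files exist before moving on to the mapping step. The following Python sketch does that; the helper name missing_tag_outputs is hypothetical, and it expects a base name that has already been resolved to a real path (any variable such as $DATA expanded by the caller).

```python
import os
from typing import List

# Extensions listed above for the files sharing the common base name.
TAG_OUTPUT_EXTENSIONS = (".fa", ".tab", ".ini")


def missing_tag_outputs(base_output_file_name: str) -> List[str]:
    """Return the expected output files that are not present on disk."""
    expected = [base_output_file_name + ext for ext in TAG_OUTPUT_EXTENSIONS]
    return [path for path in expected if not os.path.isfile(path)]
```

An empty list means all three files were found.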

The coordinates of the cut-outs are stored in the *.tab file. If LoadToDb is set to true, the command 'add markers' is issued immediately afterwards, using the *.tab file as input. Following the instructions in the auto-generated INI file, it creates the marker entries in the database together with their positions on the source reference genome sequence.

The command 'create tags' will also prepare a sample INI file, specified as SampleMappingIniFileName, that, after minor editing, can be used to map the generated tag sequences onto a list of other genomes. The only line that needs modification lists MapSetIds for the map sets used in the mapping process:

; MapSetPath or MapSetId (one is required): map sets onto which the markers will be mapped.
; The values should be separated by comma, can be entered on several lines
;MapSetPath="/Arabidopsis thaliana/TAIR10, "
; MapSetId: list of MapSetIds separated by comma for which the mapping should be done.
MapSetId=<list of map sets here>

As always, use the command 'list mapset' to find the MapSetId of each map set. Enter the corresponding values separated by commas and save the INI file. Now you are ready to run the next command, 'create markermapping'.
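
If the sample INI file has to be prepared for many runs, the edit can be scripted. The Python sketch below replaces the value of the first active (uncommented) MapSetId line and leaves all comments untouched. The function name set_map_set_ids is hypothetical, and the sketch assumes the active line starts at the beginning of a line exactly as in the sample above.

```python
import re
from typing import Iterable

# Matches an active "MapSetId=..." line; commented lines start with ';'.
_MAP_SET_ID_LINE = re.compile(r"^MapSetId\s*=.*$", re.MULTILINE)


def set_map_set_ids(ini_text: str, map_set_ids: Iterable[int]) -> str:
    """Return ini_text with the first active MapSetId line rewritten."""
    value = ",".join(str(i) for i in map_set_ids)
    if not value:
        raise ValueError("at least one MapSetId is required")
    new_text, count = _MAP_SET_ID_LINE.subn(
        lambda _match: f"MapSetId={value}", ini_text, count=1)
    if count == 0:
        raise ValueError("no active MapSetId line found")
    return new_text
```

The result can be written back to the file named in SampleMappingIniFileName before running the mapping command.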

Note

Besides sequence tags, the command 'create markermapping' can be applied to any other marker sequences that need to be mapped, provided the sequences are available in FASTA format.

Create the tags by issuing the command that follows this pattern (use switch -t to test the data first):

create tag -c tagTomato.ini -v

Command 'create markermapping'

If you have marker sequences, they can be mapped onto selected genomes. The command 'create tags' produces sequences that can be used for this purpose. Once the markers have been mapped and the corresponding tracks have been loaded into the database, the markers can be used for aligning syntenic regions of selected maps. Finding intervals of consistent tag matching helps to quickly identify larger portions of genomes with good DNA sequence similarity.

Note that when loading a marker, PersephoneShell first checks the database for a marker with an identical name and (!) source organism. If one is found, the marker in the database is reused and only its additional mapping position is recorded. This allows marker positions to be linked: Persephone draws a connector line between locations of the same marker (same internal MARKER_ID). That is why it is important to maintain the correct source organism for each marker.
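
The reuse rule can be pictured with a small in-memory model. The Python sketch below is not PersephoneShell's database code: the classes Marker and MarkerStore are hypothetical, and the sequential marker_id merely stands in for the internal MARKER_ID.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Marker:
    marker_id: int
    name: str
    source_organism_id: int
    # Each position is (map set id, start coordinate).
    positions: List[Tuple[int, int]] = field(default_factory=list)


class MarkerStore:
    """Toy stand-in for the marker table, keyed by (name, source organism)."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, int], Marker] = {}

    def add_mapping(self, name: str, source_organism_id: int,
                    map_set_id: int, start: int) -> Marker:
        key = (name, source_organism_id)
        marker = self._by_key.get(key)
        if marker is None:
            # No marker with this name AND organism yet: create a new one.
            marker = Marker(len(self._by_key) + 1, name, source_organism_id)
            self._by_key[key] = marker
        # Existing or new, the marker gains one more mapping position.
        marker.positions.append((map_set_id, start))
        return marker
```

Two mappings that share both name and source organism end up on one marker and can be linked. The same name with a different source organism produces a separate marker, so no connector would be drawn.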

The marker mapping is done by using either NCBI BLASTN or MagicBLAST.

To run sequence tag mapping, issue the command:

create markermapping -c tagMapping.ini [-t|-v]

Here, tagMapping.ini is the file generated in the first step by the command 'create tag'. Most of the variables inherited from the first step, such as SourceOrganismId or MarkerFASTA, will already be pre-filled. Modify this file to include the correct MapSetIds or MapSetPaths of the genomes whose sequences will be used for mapping. For example,

; Use this file for mapping markers using command 'create markermapping'.
; You will need a file with marker FASTA sequences, their original SourceOrganismId
; and a list of map sets onto which the markers will be mapped.
[ProcessRun]
; Run description: if specified, a custom description will be used
; otherwise, "Created tag mapping for map sets" will be used.
;RunDescription="Map sequence tags"

[Markers]
; Info about the markers to be mapped. We need to know SourceOrganismId for the newly mapped markers:
; for each marker, the program will see if a marker with identical name and SourceOrganismId has been already
; loaded to the database, and, if yes, that marker entry will get a new additional mapping without creating
; a new marker.
SourceOrganismId=436086
; FASTA file with marker sequences. The headers contain marker names only
MarkerFASTA=/home/ubuntu/bin/psh/data/pyrus/tags.fa

[Mapping]
; MinMatchLength: minimal length of match. By default, the tag length is 100, so MinMatchLength is by default 98
MinMatchLength=98
; MinMatchPercent: lowest percent of identity of HSP to be considered
MinMatchPercent=98

; MaxHits: if the number of hits on one map is more than MaxHits, all mappings on that map for a marker will be ignored. Default: 3
;MaxHits=2

; MapSetPath or MapSetId (one is required): map sets onto which the markers will be mapped.
; The values should be separated by comma, can be entered on several lines
;MapSetPath="/Arabidopsis thaliana/TAIR10, "
; MapSetId: list of MapSetIds separated by comma for which the mapping should be done.
MapSetId=278247791,278247792,278247793

; UseMagicBlast: if true, use NCBI MagicBLAST, otherwise use NCBI BLASTN (default)
;UseMagicBlast=true

; TrackName: If LoadToDb is true, a new marker track with this name will be created in this map set.
; Default: "Sequence tags"
TrackName=Tags
; TrackDescription. By default, "Sequence tags mapped by BLASTN" will be used
;TrackDescription=
; Temporary output files.
; BaseOutputFileName (required): Base name for the output files: BLASTN (*.blastn),
; marker-mapping (*.tab) and INI (*.ini) file will be created using this name.
; For each map set, the corresponding file name will have a suffix with marker's SourceMapSetId and
; the map set onto which the markers are mapped, e.g., tags_1_39.tab
BaseOutputFileName=/home/ubuntu/bin/psh/data/pyrus/tags
; BlastParams: parameters of BLASTN, most importantly -num_threads
;BlastParams="-word_size=28 -perc_identity=98 -qcov_hsp_perc=98 -culling_limit=3 -max_target_seqs=3 -max_hsps=3 -num_threads=4"
; LoadToDb - the marker positions will be loaded into the database. Default: false
LoadToDb=true
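
The MaxHits rule from the comments above can be sketched in Python as follows. This is an illustration of the described behavior only. The function filter_by_max_hits and the (marker, map, start) hit tuples are hypothetical and do not reflect PersephoneShell's internal data structures.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Hit = Tuple[str, str, int]  # (marker name, map name, start coordinate)


def filter_by_max_hits(hits: Iterable[Hit], max_hits: int = 3) -> List[Hit]:
    """Drop every hit of a marker on a map where it has more than max_hits hits."""
    hits = list(hits)
    per_marker_and_map = Counter((marker, map_name) for marker, map_name, _ in hits)
    return [h for h in hits if per_marker_and_map[(h[0], h[1])] <= max_hits]


example_hits = [
    ("tg1_1", "chr1", 100),
    ("tg1_1", "chr1", 5000),
    ("tg1_1", "chr1", 9000),   # third hit on chr1 exceeds max_hits=2
    ("tg1_1", "chr2", 40),
    ("tg1_20001", "chr1", 700),
]
kept = filter_by_max_hits(example_hits, max_hits=2)
print(kept)
```

In this toy input the marker tg1_1 loses all three of its chr1 mappings, while its single chr2 mapping and the other marker's mapping are kept.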

Note that BaseOutputFileName is used to generate the output files for each listed genome by appending the corresponding MapSetId. For example, if BaseOutputFileName in the first 'create tag' step was 'tags', the output files for the map sets selected in the example above will have names similar to tags_ABC_278247791. This means that the value of BaseOutputFileName in the generated file can be used without change: PersephoneShell will not overwrite the output files from the first step even if BaseOutputFileName is the same in both steps.

As a result of the two commands, all processed map sets, including the reference genome, will be automatically linked in Persephone by common markers.

Note

Due to the way NCBI BLASTN optimizes the search using "adaptive chunk size", BLASTN may run out of memory or appear frozen when finding hits for a large number of small query sequences, such as the tags. Setting the environment variable BATCH_SIZE to a small value (the minimum is 10000) may solve the problem.
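
One way to set the variable for a single run, without changing the shell profile, is to launch the command from a small Python wrapper. This is a sketch under assumptions: the helper run_with_batch_size is hypothetical, the command line is supplied by the caller, and it assumes that BLASTN inherits the environment of the process that starts it.

```python
import os
import subprocess
from typing import List

MIN_BATCH_SIZE = 10000  # minimum value stated in the note above


def run_with_batch_size(command: List[str],
                        batch_size: int = MIN_BATCH_SIZE) -> subprocess.CompletedProcess:
    """Run command with BATCH_SIZE added to a copy of the current environment."""
    if batch_size < MIN_BATCH_SIZE:
        raise ValueError(f"BATCH_SIZE must be at least {MIN_BATCH_SIZE}")
    env = dict(os.environ, BATCH_SIZE=str(batch_size))
    # check=True raises CalledProcessError if the command exits with an error.
    return subprocess.run(command, env=env, check=True,
                          capture_output=True, text=True)
```

The command list should contain whatever invocation is normally used to start the mapping run. The parent shell's environment is left unchanged.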