Once the Organism record is created, you can start adding map sets. To add map sets to an organism data set, use the add command (see Add), which requires a control file (see Control Files) that specifies how the data files should be processed.

In the Persephone data model, a map set contains maps, and maps contain tracks. Map sets are of two main kinds: genetic (linkage groups) and physical (based on sequence). To create map sets and maps for genomes that have genomic sequences, use the add sequence command. The control INI file for this kind of load contains all the information needed to create a map set with empty maps (no tracks yet). Each map set should have a display name (MapSetName, a label unique under its parent node in the map set tree), an accession (MapSetAccession, unique across the entire database), an OrganismId (which must already exist in the database), a place in the map set tree, and a source (the name of the institution that published the maps).
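The containment described above can be pictured with a small sketch. This is only an illustration of the relationships (map set, maps, tracks); the class and field names are hypothetical and do not reflect the actual Persephone database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Track:
    name: str  # e.g. an annotation track added in a later step


@dataclass
class Map:
    name: str  # e.g. "Chr1"
    tracks: List[Track] = field(default_factory=list)


@dataclass
class MapSet:
    display_name: str  # unique under its parent node in the map set tree
    accession: str     # unique across the entire database
    organism_id: int   # must already exist in the database
    tree_path: str     # place in the map set tree
    source: str        # institution that published the maps
    maps: List[Map] = field(default_factory=list)


# A freshly loaded physical map set: the maps exist but carry no tracks yet.
rice = MapSet(
    display_name="MSU_osa1r7",
    accession="MSU_osa1r7",
    organism_id=1,
    tree_path="/Oryza sativa japonica/MSU_osa1r7",
    source="MSU",
    maps=[Map(f"Chr{i}") for i in range(1, 13)],
)
```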

The map set is placed in a tree structure shown on the left side of the main Persephone application. The tree resembles a folder structure, which suggests how to form the full path of a map set in the tree. For example, "/Arabidopsis thaliana/TAIR10" is the path to the map set TAIR10 placed under the parent node "/Arabidopsis thaliana".
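Because the path follows folder conventions, it can be taken apart the same way a file path would be. The sketch below uses Python's standard posixpath module purely for illustration; the helper name split_map_set_path is hypothetical and is not part of PersephoneShell.

```python
import posixpath


def split_map_set_path(path: str):
    """Split a map set tree path into (parent node path, map set name)."""
    parent, name = posixpath.split(path.rstrip("/"))
    return parent, name


parent, name = split_map_set_path("/Arabidopsis thaliana/TAIR10")
# parent is "/Arabidopsis thaliana", name is "TAIR10"
```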

This section walks through the steps needed to add a map set, maps, and sequences to Oryza sativa japonica in your Persephone database. For a full description of the parameters of the add sequence command, please consult this page.

Open a copy of the "add_MSU_osa1r7.ini" control file, which is included in the PersephoneShell "Samples\Sequence" folder and is shown below.

[ProcessRun]
; Run description: if specified, a custom description will be used. Will be ignored if a RunId is specified.
;                  otherwise, "Added sequences for {MapSet Accession No.} from {Sources}." will be used.
RunDescription="Added sequences for MSU_osa1r7 from http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/all.con"

[MapSet]
;------------------------------------------------------------------------------------------------
; 1. Using existing MapSet
;------------------------------------------------------------------------------------------------
; If either MapSetId or MapSetPath is specified, it adds sequences to the existing MapSet.
; otherwise, a new MapSet should be specified below.
; MapSetId: id of an existing map set
;MapSetId=247582504
; MapSetPath: path of an existing map set.
;MapSetPath="Oryza sativa/MSU_osa1r7"
;------------------------------------------------------------------------------------------------
; 2. Adding new MapSet
;------------------------------------------------------------------------------------------------
; Organism ID (required): organism ID should exist.
OrganismId=1
; Display name (required): a name shown in MapSetTree. Usually an assembly build name.
DisplayName="MSU_osa1r7"
; Description: by default, organism name + display name.
Description="Oryza sativa japonica MSU_osa1r7.
Loaded from http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/all.con"
; AccessionNo: accession of the genome build. See http://ncbi.nlm.nih.gov/genome
AccessionNo="MSU_osa1r7"
; Source ID: database or institution that the MapSet/sequence originate
SourceId="MSU"

[MapSetTree]
;------------------------------------------------------------------------------------------------
; 1. Using existing MapSetTree node
;------------------------------------------------------------------------------------------------
; Node ID: if specified, the MapSet with the new sequences will be placed on this node.
;NodeId=12345
;------------------------------------------------------------------------------------------------
; 2. Adding new MapSetTree node to a parent node
;------------------------------------------------------------------------------------------------
; Parent node ID: if specified, the MapSet with the new sequences will be placed under this parent node as a child.
;ParentNodeId=200313050
;------------------------------------------------------------------------------------------------
; 3. Adding new MapSetTree node under a new root node
;------------------------------------------------------------------------------------------------
; Root node name: usually an organism name. Ignored if the root name already exists.
RootNodeName="Oryza sativa japonica"
; Root node order number: order of the root node in the MapSetTree. By default, 0.
;RootNodeOrderNo=0

[Sequence]
; Sources (required): FASTA file(s) of genomic DNA sequence located locally or remotely accessible via URL.
Sources="d:\Download\rice\all.con"
; Commit frequency: indicates how often the process commits sequence in bp.
CommitFrequency=10000000
; FASTA header starts with '>' and map information delimited by delimiters follows.
; >{0}|{1}|{2}|{3}|{4}
; Delimiter: specify one among Colon(:), Comma(,), Period(.), Hyphen(-), SemiColon(;), Slash(/), Tab(\t), Space( ) and VerticalBar(|)
Delimiter=VerticalBar
; Map name index (required)
MapNameIndex=0
; Map accession index (required)
MapAccessionIndex=0
; Map description index
;MapDescriptionIndex=3
; Chromosome name index: if specified, chromosome table will be filled.
ChromosomeNameIndex=0
; Expected length index: if specified, the expected lengths will be compared to calculated sequence lengths.
;ExpectedLengthIndex=4
; Length filter: used to include only sequences whose length satisfies the criteria. Use '>' or '<' to load sequences longer/shorter than the given number (e.g,"<1000000")
;                if not specified, all the sequences in the source will be included.
;LengthFilter=">100000"
ChromosomeCriteriaRegEx=Chr\d+

[DbSequences]
; The ID columns below are used in loading sequences (physical maps).
; If there is no sequence/trigger assigned to these columns, you must specify a sequence for them.
;TABLE_NAME.COLUMN_NAME=SEQUENCE_NAME
;PROCESS_RUN.RUN_ID=ID_SEQ
;MAP_SET.MAP_SET_ID=ID_SEQ
;MK_MAP.MAP_ID=ID_SEQ
;CHROMOSOME.CHROMOSOME_ID=ID_SEQ
;GENOME_DNA_VW.GENOME_DNA_ID=ID_SEQ
;GENOME_DNA.GENOME_DNA_ID=ID_SEQ
;GDNA.GDNA_ID=SEQ_ID_SEQ
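
The LengthFilter setting in the [Sequence] section of the control file accepts '>' or '<' followed by a number. As a rough illustration of that rule (not PersephoneShell's actual implementation), the sketch below turns such an expression into a test on sequence length. The function name and the sequence names and lengths are made up for the example.

```python
def parse_length_filter(expr: str):
    """Return a predicate on sequence length for expressions like '>100000'."""
    expr = expr.strip().strip('"')
    op, threshold = expr[0], int(expr[1:])
    if op == ">":
        return lambda length: length > threshold
    if op == "<":
        return lambda length: length < threshold
    raise ValueError(f"Length filter must start with '>' or '<': {expr!r}")


# Hypothetical sequence lengths in bp, for illustration only.
lengths = {"seqA": 250_000, "seqB": 52_000}

keep = parse_length_filter(">100000")
selected = [name for name, length in lengths.items() if keep(length)]
```

With the filter ">100000", only seqA would be kept in this made-up data set.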



Important

The SourceId field (set to "MSU" in the example above) cannot contain spaces.

Download the sequence file from http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/all.con and note its local path. Update the Sources record to point to the downloaded file, for example:

Sources="d:\Download\rice\all.con"

The Sources record can reference a stand-alone local file, as in this example. PersephoneShell also recognizes files compressed with gzip. Note that .gz files with a deeply nested folder structure are sometimes not recognized properly. In such a case, uncompress the file with the gzip program and reference the uncompressed file. The Sources record can also list multiple files, separated by commas. Instead of separate files, you can specify a folder, in which case all files in that folder are processed sequentially. It is also possible to load files from remote locations via HTTP or FTP. In this case, make sure that the URL starts with the protocol name, such as 'http://'.
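
If the gzip program is not at hand, a .gz file can also be uncompressed with a few lines of Python. The sketch below uses only the standard gzip and shutil modules; the helper name gunzip and any file paths you pass to it are your own choice, not something PersephoneShell defines.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: str, dst: str) -> Path:
    """Uncompress the gzip file `src` into the plain file `dst`."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Stream the data in chunks so large genome files do not fill memory.
        shutil.copyfileobj(fin, fout)
    return Path(dst)
```

After uncompressing, point the Sources record at the resulting plain file.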

In our exercise, the FASTA file has 14 sequences: 12 regular chromosomes and two additional sequence entries (ChrUn and ChrSy). We want to mark the sequences from Chr1 to Chr12 as chromosomes. This helps Persephone correctly present the rice genome as consisting of 12 chromosomes. You will see the graphical representation in Search or BLAST results. Some map sets contain a mixture of a few chromosomes and thousands of scaffolds, so knowing which sequences represent the core genome is important to avoid cluttering the graphics.

In the current example, we use the ChromosomeCriteriaRegEx record to specify a regular expression that all chromosome FASTA headers should match. This separates the chromosomes from the rest of the sequences. Here, the expression 'Chr\d+' serves the purpose: it matches Chr1 through Chr12 but not ChrUn or ChrSy, which contain no digits.
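
The effect of this expression on the 14 headers of the example file can be checked outside of PersephoneShell. The sketch below uses Python's re module. Whether PersephoneShell matches the whole header or only searches within it is not stated here, so the use of fullmatch is an assumption; for these particular headers both approaches give the same result.

```python
import re

# The expression from the ChromosomeCriteriaRegEx record.
CHROMOSOME_RE = re.compile(r"Chr\d+")

# The 14 sequence names in the example FASTA file.
headers = [f"Chr{i}" for i in range(1, 13)] + ["ChrUn", "ChrSy"]

# Assumption: the whole name must match the expression.
chromosomes = [h for h in headers if CHROMOSOME_RE.fullmatch(h)]
others = [h for h in headers if not CHROMOSOME_RE.fullmatch(h)]
```

Here chromosomes holds Chr1 through Chr12, and others holds ChrUn and ChrSy.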

Verify that the settings in the INI file correctly recognize the parts of the FASTA header, which in our case contains just one field, similar to:

>Chr1

(Please consult another example of parsing a more complicated FASTA header)
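
To see how the Delimiter and index settings of the control file relate to a header, the following sketch splits a header line and picks fields by position. It only mimics the idea; the function parse_header and its parameter names are hypothetical and not part of PersephoneShell.

```python
def parse_header(line: str, delimiter: str = "|",
                 map_name_index: int = 0,
                 map_accession_index: int = 0,
                 chromosome_name_index: int = 0) -> dict:
    """Split a FASTA header on the delimiter and pick fields by index."""
    fields = line.strip().lstrip(">").split(delimiter)
    return {
        "map_name": fields[map_name_index],
        "map_accession": fields[map_accession_index],
        "chromosome_name": fields[chromosome_name_index],
    }


# With a single-field header, index 0 supplies every value.
parsed = parse_header(">Chr1")
```

Since all three indexes are 0 in the example control file, the map name, map accession, and chromosome name all come out as Chr1.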