Maps based on genomic sequences are loaded in the form of a "physical" (sequence-based) map set. The INI control file should specify the location of the sequence files, the name and other attributes of the map set, its placement in the map set tree, and the parameters for parsing FASTA definition lines to extract the map name, accession, etc. The source FASTA files may contain a mixture of chromosomes and scaffolds; the INI file can specify criteria for separating chromosomes from the rest of the sequences.

Each loading process has a ProcessRun record that captures the logistics of the operation: the time stamp, the operator's user name, the type of data loaded, etc. Although it is possible to reuse an existing ProcessRun record by associating the new data with an old RUN_ID, we recommend using a new ProcessRun for every load. Please read the comments in the sample control files; they are an essential part of the documentation. The sample files for loading sequences are stored in the .\Samples\Sequence subfolder.

[ProcessRun]
; Run description: if specified, a custom description will be used;
;                  otherwise, "Added sequences for {MapSet Accession No.} from {Sources}." will be used.
;                  Ignored if a RunId is specified.
RunDescription="Added sequences for IRGSP-1.0.31 from ftp://ftp.gramene.org/pub/gramene/archives/PAST_RELEASES/release50/data/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.31.dna.genome.fa.gz."


Normally, the sequence maps are added to a new map set, for which we should provide the name, accession and organism (note that the corresponding Organism record must already exist in the database).


[MapSet]
;------------------------------------------------------------------------------------------------
; 1. Using existing MapSet
;------------------------------------------------------------------------------------------------
; If either MapSetId or MapSetPath is specified, sequences are added to the existing MapSet;
; otherwise, a new MapSet should be specified below.
; MapSetId: id of an existing map set
;MapSetId=247582504
; MapSetPath: path of an existing map set.
;MapSetPath="Oryza sativa japonica/IRGSP-1.0.31"
;------------------------------------------------------------------------------------------------
; 2. Adding new MapSet
;------------------------------------------------------------------------------------------------
; Organism ID (required): organism ID should exist.
OrganismId=1
; Display name (required): a name shown in MapSetTree. Usually an assembly build name.
DisplayName="IRGSP-1.0.31"
; Description: by default, organism name + display name. Try to specify the source of the data. The new lines will be respected.
Description="Oryza sativa IRGSP-1.0.31
from ftp://ftp.gramene.org/pub/gramene/archives/PAST_RELEASES/release50/data/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.31.dna.genome.fa.gz"
; AccessionNo (required): accession of the genome build, should be unique across the database.
AccessionNo="IRGSP-1.0.31"
; Source ID (required): database or institution that the MapSet/sequence originates from
SourceId="Gramene"



The map set is placed in the map set tree. This can be done by specifying ParentNodeId or RootNodeName.

If you are going to reference an existing parent node in the map set tree, you will need to find its ID by running the command:

PS> list mapsettree -l
Arabidopsis thaliana (NodeId:1)
  Physical:TAIR10 (NodeId:2, MapSetId:1)
Oryza sativa japonica (NodeId:3)
  Physical:MSU_osa1r7 (NodeId:4, MapSetId:2)
Oryza sativa indica (NodeId:5)
  Physical:ASM465v1 (NodeId:14, MapSetId:13)
Sorghum bicolor (NodeId:18)
  Physical:Sorghum v.3.1 (NodeId:20, MapSetId:17)
  Genetic:BTx623-IS320C (NodeId:21, MapSetId:18)

Use NodeId (=3) as ParentNodeId, and the newly created map set will be placed as a child of the Oryza sativa japonica node:


[MapSetTree]
;------------------------------------------------------------------------------------------------
; 1. Adding new MapSetTree node to a parent node
;------------------------------------------------------------------------------------------------
; Parent node ID: if specified, the MapSet with the new sequences will be placed under this parent node as a child.
ParentNodeId=3
;------------------------------------------------------------------------------------------------
; 2. Adding new MapSetTree node under a new root node
;------------------------------------------------------------------------------------------------
; Root node name: usually an organism name. Ignored if the root name already exists.
;RootNodeName="Oryza sativa japonica"
; Root node order number: order of the root node in the MapSetTree. By default, 0.
;RootNodeOrderNo=0

Alternatively, to place the map set under a root-level node, specify that node by name using RootNodeName.

Next, we need to specify how MapName, MapAccession, ChromosomeName and other map features are extracted from the FASTA definition lines.

Suppose we have a FASTA definition line like this:

>Syng_TIGR_043 dna:scaffold scaffold:IRGSP-1.0:Syng_TIGR_043:1:4236:1


We want the map name (and accession) to be Syng_TIGR_043. To parse it out of the definition line, we first split the line by the delimiter (Colon) and then take the fourth value (index=3, 0-based):

[Sequence]
; Sources (required): FASTA file(s) of genomic DNA sequence located locally or remotely accessible via URL.
Sources="ftp://ftp.gramene.org/pub/gramene/archives/PAST_RELEASES/release50/data/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.31.dna.genome.fa.gz"
; Commit frequency: commits after reading this many nucleotides. Larger numbers require a larger rollback segment; smaller numbers result in more frequent transactions.
CommitFrequency=10000000
; FASTA header starts with '>' and provides map information separated by delimiters, as follows (Delimiter=VerticalBar):
; >{0}|{1}|{2}|{3}|{4}
; Delimiter: specify one among Colon(:), Comma(,), Period(.), Hyphen(-), SemiColon(;), Slash(/), Tab(\t), Space( ) and VerticalBar(|)
Delimiter=Colon
; Map name index (required). Tells which part between the delimiters corresponds to map name. 0-based.
MapNameIndex=3
; If the part between delimiters is too long to be stored as the map name, it can be parsed and the map name can be extracted by using this regular expression
;MapNameRegEx=".*chromosome (..).*"
; Map accession index (required). Tells which part between delimiters corresponds to map accession. 0-based.
MapAccessionIndex=3
; If the part between delimiters is too long to be stored as the map accession, it can be parsed and the map accession can be extracted by using this regular expression
;MapAccessionRegEx=".*chromosome (..).*"
; Map description index
;MapDescriptionIndex=0
; Some sequences can be called chromosomes, others scaffolds. Usually, the number of chromosomes is small, which allows showing all of them as a representation of a genome
; ChromosomeCriteriaRegEx - if the fasta header matches the RegEx criteria, the entry will be called a chromosome
;ChromosomeCriteriaRegEx="chromosome"
; ChromosomeCriteriaLength - a formula to separate chromosomes from the rest of the sequences. It can be an alternative
; to ChromosomeCriteriaRegEx
; For example, ChromosomeCriteriaLength=>5000000 will store sequences longer than 5,000,000 bp as chromosomes
;ChromosomeCriteriaLength=>5000000
; In case the sequence passes the criteria to be called a chromosome, the chromosome name will be read from the field with this index:
ChromosomeNameIndex=3
; ChromosomeNameRegEx - a regular expression applied to the field with the given index to extract the chromosome name
;ChromosomeNameRegEx="chromosome (.{2,3}),"
; Expected length index: if specified, the expected length of each sequence will be compared to the actual sequence length.
ExpectedLengthIndex=5
; MapNameFilterRegEx: Regular expression filter based on map name: include only sequences whose map name matches the given pattern.
; if not specified, all the sequences in the source will be included.
; Example below would load only sequences with name that start with 'Chr.'
;MapNameFilterRegEx="^Chr\..*"
; Length filter: used to include only sequences whose length satisfies the criteria. Use '>' or '<' to load sequences longer/shorter than the given number (e.g., "<1000000")
;                if not specified, all the sequences in the source will be included.
;LengthFilter=">100000"
; Sequences can be stored in the database (Oracle) or in the file system (MySql-compatible).
; In case of the file system, please make sure that the storage location is visible from the machine where the loading process is running.
; The path to this location will be also used by the API-server (Cerberus) or, in case of direct database connection, by the main application,
; running on the user's machine.
; When using the API-server, the storage location should be accessible to the server. The machines used for loading and for Cerberus can be different. In such a case,
; use path remapping, specified in PersephoneShell's configuration file (see StorageMapping entry).
; StorageId: If present, specifies storage to add sequences to. Otherwise, default storage will be used. Use 'add storage' command to specify alternative storage locations.
;StorageId=1
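
The delimiter-and-index settings above can be illustrated with a short Python sketch. It is not PersephoneShell code: the helper split_header and the DELIMITERS table are hypothetical names introduced here only to show which field each index selects for the example definition line.

```python
# Illustrative sketch only; PersephoneShell's own parser may behave differently.
# DELIMITERS and split_header are hypothetical names, not part of the tool.
DELIMITERS = {"Colon": ":", "VerticalBar": "|", "Comma": ","}  # subset of the documented names


def split_header(header, delimiter_name):
    """Drop the leading '>' and split the definition line into 0-based fields."""
    if header.startswith(">"):
        header = header[1:]
    return header.split(DELIMITERS[delimiter_name])


header = ">Syng_TIGR_043 dna:scaffold scaffold:IRGSP-1.0:Syng_TIGR_043:1:4236:1"
fields = split_header(header, "Colon")       # Delimiter=Colon

map_name = fields[3]                         # MapNameIndex=3
map_accession = fields[3]                    # MapAccessionIndex=3
expected_length = int(fields[5])             # ExpectedLengthIndex=5

for index, value in enumerate(fields):
    print(index, value)
```

Printing every field with its index is a quick way to choose the right values for MapNameIndex, MapAccessionIndex and ExpectedLengthIndex before editing the control file.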

Alternatively, the map name can be extracted by applying a regular expression to the entry with index=0 (Syng_TIGR_043 dna):

MapNameIndex=0
MapNameRegEx="^(.+) dna"

Similar logic using regular expressions can be applied to MapAccession and ChromosomeName.
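
The regular-expression extraction can be tried outside the loader with a small Python sketch. The function extract is a hypothetical helper, and two assumptions are made: the first capture group is taken as the extracted value, and the whole field is used when the pattern does not match. PersephoneShell's regular expression engine may also differ from Python's in some details.

```python
import re


def extract(field, pattern):
    """Return the first capture group if the pattern matches, else the whole field.

    The fallback to the whole field is an assumption made for this sketch.
    """
    match = re.search(pattern, field)
    return match.group(1) if match else field


# Field 0 of the example definition line, after splitting by Colon.
field0 = "Syng_TIGR_043 dna"
print(extract(field0, r"^(.+) dna"))

# The sample pattern from the control file, applied to a made-up field.
made_up_field = "Oryza sativa chromosome 01, complete sequence"
print(extract(made_up_field, r".*chromosome (..).*"))
```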

The source FASTA file can contain a mixture of chromosomes and scaffolds. Persephone shows all chromosomes together as a graphical representation of the genome. They are also displayed at the top of the map list. There are two ways to define which records should be stored as chromosomes: separate them by name or by size.

To filter chromosomes by name, use ChromosomeCriteriaRegEx.

ChromosomeCriteriaRegEx="chromosome"

The setting above instructs PersephoneShell to treat records that contain "chromosome" in the FASTA definition line as chromosomes.

Similarly, chromosomes can be separated by size using ChromosomeCriteriaLength (here, sequences longer than 500,000 bp):

ChromosomeCriteriaLength=">500000"
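
As a rough illustration of the two criteria, the Python sketch below checks each one separately for two records. Both function names are hypothetical, the first header and its length are made up for the example, and support for '<' in the length expression is an assumption borrowed from the LengthFilter syntax. How PersephoneShell treats a control file that sets both criteria is not covered here.

```python
import re


def matches_criteria_regex(header, pattern):
    """True if the pattern is found anywhere in the FASTA definition line."""
    return re.search(pattern, header) is not None


def matches_criteria_length(length, expression):
    """Evaluate an expression such as '>500000' against a sequence length."""
    operator, threshold = expression[0], int(expression[1:])
    if operator == ">":
        return length > threshold
    if operator == "<":  # assumed to work as in LengthFilter
        return length < threshold
    raise ValueError(f"Unsupported length expression: {expression}")


# (definition line, sequence length); the first record is made up for illustration.
records = [
    (">1 dna:chromosome chromosome:IRGSP-1.0:1:1:40000000:1", 40000000),
    (">Syng_TIGR_043 dna:scaffold scaffold:IRGSP-1.0:Syng_TIGR_043:1:4236:1", 4236),
]

for header, length in records:
    by_name = matches_criteria_regex(header, "chromosome")
    by_size = matches_criteria_length(length, ">500000")
    print(header.split()[0], "by name:", by_name, "by size:", by_size)
```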

Once the control file is ready, run PersephoneShell to test it first:

PS> add sequence -c pathToControlFile.ini -t

and, if all tests are successful, start the real loading:

PS> add sequence -c pathToControlFile.ini -v

After all the sequences have been loaded, it is time to add annotation or other tracks.