The PersephoneShell commands add, edit, and convert use INI-format control files to perform many important tasks, including setting parameters and loading data from local and remote files. INI files are simple text files with a basic structure composed of sections, properties, and values.

Note

For more details about the INI file format, please see the Wikipedia article "INI file".

You can use control files from the installed directories (see Sample Control Files). In addition, you can modify and create control files with a plain text editor like Microsoft Notepad (see Creating and Modifying Control Files). Please note, if you plan to modify or create control files, the data source formats vary depending on the object type (e.g., annotation, organism). See Data Source Formats Supported for more information.

Sample Control Files

In the PersephoneShell file tree there is a folder called Samples that has many sample control files. For example, the Samples folder includes the organism control file "add_Beta_vulgaris.ini", which can be used by the add command to add the organism Beta vulgaris subsp. vulgaris (sugar beet) to your Persephone database.

Creating and Modifying Control Files

You can use a standard text editor, such as Microsoft Notepad, to modify existing control files or create new control files. To modify an existing control file simply open it in your text editor, make your changes, and then save the file.

If you wish to modify existing or create new control files, please keep the following in mind.

  • Refer to the Wikipedia article "INI file" before you begin.
  • Use existing control files as reference, templates, or both.
  • Read the detailed comments (lines that begin with a semicolon [;]) in existing control files for guidelines and syntax rules.
  • Most section names (e.g., "[MapSetTree]") and all property names (e.g., "FileType") are predefined and cannot be modified.
  • An ID value (e.g., "MapSetId") is a 64-bit unsigned integer.
  • A Boolean key (e.g., "SkipInconsistentRecords") can only be set to true or false. Other values such as Y/N or 0/1 are not allowed.
  • A string value that contains blank spaces should be enclosed in double quotes (" "). To enter a double quote mark in a string, put a backslash (\) before it (e.g., "the common name is \"Sugar Beet\"."). A value can contain line breaks, in which case it should always be enclosed in double quotes.
  • The control file must be saved with an ".ini" extension. Other extensions (e.g., ".txt", ".bat") are not supported.
  • The files that contain the actual data are referenced in the INI file using the Source or Sources keyword. Source can point to a single file or to a directory, in which case all files in the directory will be processed. The files can be compressed with gzip and have the extension ".gz". The supported formats are listed for each object type.
  • The Description fields allow some simple HTML tags, such as <b> for bold, <i> for italic, and <a> for hyperlinks. Remember to escape the quotation marks around the URL in the hyperlink <a> tag with a backslash:
    Description="The data downloaded from <a href=\"ftp://ftp.ncbi.nlm.nih.gov\">NCBI</a>"

    The description of a map or map set is shown in the corresponding Properties window. The track description can also be shown in a tooltip that appears when the mouse hovers over the track panel. Long multi-line text is trimmed in this tooltip; only the first line is shown.
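The value rules in the list above can be checked before a control file is handed to PersephoneShell. The following Python sketch is only an illustration and is not part of PersephoneShell: the function names are hypothetical, accepting only lowercase true/false is an assumption (the strictest reading of the rule), and ignoring letter case in the ".ini" extension is likewise an assumption.

```python
MAX_ID = 2**64 - 1  # largest value of a 64-bit unsigned integer


def is_valid_id(text: str) -> bool:
    """True if text is a decimal integer that fits in 64 unsigned bits."""
    return text.isascii() and text.isdigit() and int(text) <= MAX_ID


def is_valid_bool(text: str) -> bool:
    """True only for the literals 'true' and 'false' (lowercase is an assumption)."""
    return text in ("true", "false")


def quote_value(text: str) -> str:
    """Quote a value that contains spaces, line breaks, or double quotes."""
    if any(ch in text for ch in ' \t\n"'):
        # Escape embedded double quotes with a backslash, then wrap the value.
        return '"' + text.replace('"', '\\"') + '"'
    return text


def has_ini_extension(filename: str) -> bool:
    """True if the file name ends with '.ini' (ignoring case is an assumption)."""
    return filename.lower().endswith(".ini")
```

For example, quote_value('the common name is "Sugar Beet".') returns the quoted and escaped form shown in the list above, while a value without spaces, such as FASTA, is returned unchanged. The source does not say how a literal backslash should be written, so the sketch leaves backslashes untouched.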

Data Source Formats Supported

The table below lists data source formats supported by PersephoneShell. Click the object name for sample control file sections that show how the data source is loaded. Please note, comments have been removed for brevity.

Objects                                   Data Source Formats Supported
Sequence                                  FASTA, GenBank
Annotation                                GFF, GenBank
Sequence Database (Sequence + Annotation) GenBank
Variations                                VCF
Marker                                    GFF, GVF, TXT (CSV, TSV)
Organism                                  Command lines in the control file
Alignment                                 GFF, TXT (CSV, TSV), SAM
Synteny                                   GFF, CHAIN
Quantitative Data                         WIG, bedGraph, bigWig
Expression                                TXT
QTL                                       TXT
Orthologs                                 TXT

Common sections

[ProcessRun]

[ProcessRun]
; RunId: if specified, it uses the existing run.
;RunId=12345
; RunDescription: if specified, a custom description will be used; if omitted, the default
; "Added annotations for {MapSet Accession No.} from {Sources}." will be used.
; RunDescription is ignored if a RunId is specified.
RunDescription="Loaded annotations for BrapaFPsc_v1.3 from http://phytozome.jgi.doe.gov/pz/portal.html#!bulk?org=Org_BrapaFPsc"

A process run can be thought of as a batch job with an ID and a description. Objects inserted during one run can later be deleted together using the 'delete run' command. Each loading process creates a record in the PROCESS_RUN database table.

If you plan to load several sets of data that belong to one job, provide a description of the job (RunDescription) in the first load, then reference the newly created RunId in later sessions. In those subsequent loads, comment out the RunDescription field and specify the shared RunId using the RunId= construct. PersephoneShell can list existing runs with the 'list runs -l' command. The '-l' switch displays extra fields in a 'long' output format with a column header.
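The difference between the first load and later loads of the same job can be sketched as a small Python helper that writes the [ProcessRun] section. This is an illustration only: the function name is hypothetical, and only the RunId and RunDescription properties shown in the sample above are used.

```python
from typing import Optional


def process_run_section(run_id: Optional[int] = None,
                        description: Optional[str] = None) -> str:
    """Build the text of a [ProcessRun] section for one load."""
    lines = ["[ProcessRun]"]
    # Escape embedded double quotes so the description stays one quoted value.
    escaped = description.replace('"', '\\"') if description is not None else None
    if run_id is not None:
        # Later loads: reuse the existing run and comment out the description.
        lines.append(f"RunId={run_id}")
        if escaped is not None:
            lines.append(f';RunDescription="{escaped}"')
    elif escaped is not None:
        # First load: a new run is created with this description.
        lines.append(f'RunDescription="{escaped}"')
    return "\n".join(lines) + "\n"
```

Calling process_run_section(description="...") produces the section for the first load. Once the new RunId is known (for example from 'list runs -l'), process_run_section(run_id=...) produces the section for each later load of the same job.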

[MapSet]

[MapSet]
; Either MapSetId or MapSetPath is required.
; MapSetId: id of a target map set.
;MapSetId=247848026
; MapSetPath: path of a target map set.
MapSetPath="/Brassica rapa/BrapaFPsc_v1.3"

Tracks with data of different types are added to the maps of an existing map set. The map set can be referenced by MapSetId (a numeric identifier) or by MapSetPath (a literal path that includes the location of the map set in the map set tree).
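A quick pre-flight check that a control file names its target map set can be sketched with Python's standard configparser module. This is not PersephoneShell's own parser, so it may treat multi-line quoted values and escaped quotes differently; the function name is hypothetical, and the source does not say what happens when both properties are present, so the sketch simply returns whichever ones it finds.

```python
import configparser
from typing import Dict


def map_set_reference(control_text: str) -> Dict[str, str]:
    """Return the MapSetId and/or MapSetPath found in the [MapSet] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep property names case-sensitive
    parser.read_string(control_text)  # lines starting with ';' are comments
    section = parser["MapSet"] if parser.has_section("MapSet") else {}
    found = {
        key: section[key].strip('"')  # drop the surrounding double quotes
        for key in ("MapSetId", "MapSetPath")
        if key in section
    }
    if not found:
        raise ValueError("[MapSet] needs either MapSetId or MapSetPath")
    return found
```

Applied to the sample section above, where MapSetId is commented out, the function returns only the MapSetPath entry.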

Note

PersephoneShell can help you type the full path to the map set. For example, type 'list maps' at the prompt and then (after adding a space) start typing the first root node of the branch that leads to the map set. If the full path of the map set called TAIR10 is "/Arabidopsis thaliana/TAIR10", you can type 'list maps /Ara', then press TAB to auto-complete "/Arabidopsis thaliana", then type '/'. Pressing TAB again cycles through all map sets in the '/Arabidopsis thaliana' node; one of them should be 'TAIR10'. Copy the resulting map set path together with the double quotes and paste it into the control file.

To get the MapSetId, you can use the 'list mapset -l' command. Again, '-l' makes PersephoneShell use the long output format, which prints the column header of the table.

Note

You can use the '-p' command switch with a filter pattern to limit the list of map sets. For example, to show information about 'TAIR10', you can type 'list mapset -p TA*'.

[MapMapping]

[MapMapping]
; If no mapping is found in this section, it assumes that each MAP_NAME in file exactly matches a MAP_NAME in DB.
; If map names in file are different from those in DB, map each MAP_NAME in file to a MAP_ID or an ACCESSION_NO in DB.
; Otherwise, annotation won't be added.
; MapsIdentifiedBy: one of MapId, AccessionNo, GenomeDnaId, MapName (default)
;MapsIdentifiedBy=GenomeDnaId
; LoadListedMapsOnly: if true, only data for the maps listed in this section will be added.
; If false, PersephoneShell will still try to match names from the file to maps in the database
; using MAP_NAME, MAP_ID or ACCESSION_NO and will report an error if no match is found
LoadListedMapsOnly=true
;MAP_NAME in file=MAP_ID or ACCESSION_NO or MAP_NAME in DB
83=chromosome 1
84=chromosome 2
85=chromosome 3
86=chromosome 4
87=chromosome 5
88=chromosome 6
89=chromosome 7
90=chromosome 8
91=chromosome 9
92=chromosome 10

It is quite common for an annotation file to use sequence names that differ from the names in the database. In this case, you should create a name translation table in the MapMapping section. For each sequence name in the file, provide the MAP_NAME, MAP_ID, or ACCESSION_NO of the corresponding map in the database. If the file references a map that is not listed in the table, the outcome depends on the LoadListedMapsOnly flag: if it is true, the data lines for that map are skipped; if it is false, PersephoneShell falls back to matching maps without the translation table and reports an error if no matching map is found (see the comments in the section above).
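The lookup order described above can be summarized in a short Python sketch. It models the behavior as stated in this section and is not PersephoneShell's actual code: the function name is hypothetical, and the database is reduced to a plain set of identifiers for illustration.

```python
from typing import Dict, Optional, Set


def resolve_map(name_in_file: str,
                mapping: Dict[str, str],
                db_maps: Set[str],
                load_listed_maps_only: bool) -> Optional[str]:
    """Return the database map for a name from the data file, or None to skip it."""
    if name_in_file in mapping:
        # The name is listed in [MapMapping]: use the translation table.
        return mapping[name_in_file]
    if load_listed_maps_only:
        # Unlisted map with LoadListedMapsOnly=true: its data lines are skipped.
        return None
    if name_in_file in db_maps:
        # Fall back to a direct match against identifiers in the database.
        return name_in_file
    raise LookupError(f"no map in the database matches {name_in_file!r}")
```

With the sample table above, the file name 83 resolves to "chromosome 1". A name that is missing from the table is skipped when LoadListedMapsOnly is true, and is either matched directly or reported as an error when it is false.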

Alternatively, if you know that the maps are identified by their MapId, AccessionNo, or GenomeDnaId, you can add an instruction stating that the names in the file are a specific kind of identifier:

MapsIdentifiedBy=GenomeDnaId

The marker mapping files can be produced by running BLASTN. The BLAST library files compiled from the output of BlastDbExporter contain GenomeDnaId, a number identifying genomic sequences in the database. The tabular output of BLASTN then contains these subject sequence identifiers (GenomeDnaId), so it is convenient to use the instruction above to tell PersephoneShell that the values in the map name column are actually GenomeDnaIds. In this case, the name translation table is not needed.