Build
If you plan to load multiple map sets whose data share a common format, the command build ini can generate the required INI files for you. It accepts a template file with placeholders and fills them with values from a tab-delimited text file. For example, if your pangenome data set consists of 20 genomes, the data file should contain 20 lines, one line per map set. The same mechanism can be used to add different organisms, sequences, or gene annotations, each of which needs its own set of INI files.
Provide a template file in which placeholders have the form {number}, such as {0}, {1}, etc. The value in each column of a data-file row is inserted into the placeholder with the matching index: column 0 fills {0}, column 1 fills {1}, and so forth.
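Because each placeholder refers to a column by its index, every row of the data file needs at least as many columns as the highest index used in the template. The following Python sketch is not part of build ini; it is a hypothetical pre-flight check (the function names are illustrative) that reports rows with too few columns. The template fragment reuses keys from the organism template shown below, and skipping blank lines is an assumption of the sketch, not documented behavior of the tool.

```python
import re

# Matches placeholders of the form {0}, {1}, {12}, ...
PLACEHOLDER = re.compile(r"\{(\d+)\}")


def max_placeholder_index(template: str) -> int:
    """Return the highest {N} index used in the template, or -1 if there is none."""
    indexes = [int(n) for n in PLACEHOLDER.findall(template)]
    return max(indexes, default=-1)


def short_rows(template: str, data_text: str) -> list[int]:
    """Return the 1-based line numbers of data rows that have too few columns."""
    needed = max_placeholder_index(template) + 1
    bad = []
    for line_no, line in enumerate(data_text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        if len(line.split("\t")) < needed:
            bad.append(line_no)
    return bad


template = "OrganismId={6}\nScientificName={2} {3}\n{7}\n"
data = (
    "Solins1\tv1.1\tSolanum\tinsanum\tSins1\tSinsanum.fasta.gz\t2056095\t;\n"
    # The next row deliberately lacks column 7 to trigger the check.
    "Solgig1\tv1.2\tSolanum\tgiganteum\tSgig1\tSgiganteum.fasta.gz\t374017\n"
)
print(short_rows(template, data))  # [2]
```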
For instance, here is a template file for loading organism information:
[Organism]
; Organism ID (optional, if not specified, it will be autogenerated)
OrganismId={6}
; Look up taxonomy information in http://www.ncbi.nlm.nih.gov/taxonomy
; Taxonomy ID (required)
TaxonomyId={6}
; Alternative ID: user defined ID
;AlternativeId="x"
; Scientific name (required)
ScientificName={2} {3}
; Common name (optional)
{7}
;If plant, specify if the organism is monocot(0) or eudicot(1)
PlantClassification=1
The tab-delimited text data file (organisms.txt) is shown here as a table:
Solins1 | v1.1 | Solanum | insanum | Sins1 | Sinsanum.fasta.gz | 2056095 | ;
Solgig1 | v1.2 | Solanum | giganteum | Sgig1 | Sgiganteum.fasta.gz | 374017 | ;
Solvio1 | v1.1 | Solanum | violaceum | Svio1 | Sviolaceum.fasta.gz | 329803 | ;
Sollin1 | v1.2 | Solanum | linearifolium | Slin1 | Slinnaeanum.fasta.gz | 329777 | CommonName=sodom-apple
Solang8 | v1.1 | Solanum | anguivi | Sang8 | Sanguivi.fasta.gz | 329760 | ;
Solrob1 | v1.2 | Solanum | robustum | Srob1 | Srobustum.fasta.gz | 238982 | ;
Solqui2 | v1.3 | Solanum | quitoense | Squi2 | Squitoense.fasta.gz | 227725 | CommonName=lulo
The column with index 6 contains the TaxonomyId, which is also used as the OrganismId. Its values replace the placeholder {6}, which appears more than once in the template.
Because CommonName is optional and is not available for every organism, the placeholder {7} is filled either with a complete instruction (for example, CommonName=lulo) or with the comment symbol (;).
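To make the substitution concrete, here is a minimal Python sketch that fills the organism template (with comment lines omitted) from two rows of organisms.txt. It only illustrates the rule described above and is not the implementation of build ini; the function name fill_template is hypothetical.

```python
import re


def fill_template(template: str, row: list[str]) -> str:
    """Replace every {N} with row[N]; the same index may be used several times."""
    return re.sub(r"\{(\d+)\}", lambda m: row[int(m.group(1))], template)


# The organism template from above, without its comment lines.
template = (
    "[Organism]\n"
    "OrganismId={6}\n"
    "TaxonomyId={6}\n"
    "ScientificName={2} {3}\n"
    "{7}\n"
    "PlantClassification=1\n"
)

# Two rows of organisms.txt: one without and one with a common name.
no_common_name = (
    "Solins1\tv1.1\tSolanum\tinsanum\tSins1\tSinsanum.fasta.gz\t2056095\t;"
).split("\t")
with_common_name = (
    "Solqui2\tv1.3\tSolanum\tquitoense\tSqui2\tSquitoense.fasta.gz\t227725\tCommonName=lulo"
).split("\t")

print(fill_template(template, no_common_name))
print(fill_template(template, with_common_name))
```

In the first result the {7} line becomes a lone ";", which INI syntax treats as a comment; in the second it becomes the instruction CommonName=lulo.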
The command that builds the INI files needs to know the template file, the data file, how to name the output INI files, the output directory, and so on. Place these parameters into another INI file (buildOrganisms.ini):
[Build]
;FileNameIndex (0-based): define which column contains the base file name for generated INI files. For example, if the column contains 'Solins1',
; the file 'Solins1.ini' will be created in the folder specified by OutputDir
FileNameIndex=0
;OutputDir: specify the output folder for the generated INI files
OutputDir=$DATA/Samples/Organism/pangenome
;DataFile: the tab-delimited text file with lines containing the data for each genome.
DataFile=$DATA/tomato/organisms.txt
;TemplateFile: the template INI file with placeholders in the form of {0},{1},... referencing the corresponding columns of the tab-delimited data file.
TemplateFile=$DATA/tomato/templateOrganisms.ini
;TextDelimiters=Tab
The first instruction, FileNameIndex=0, specifies the column that contains the base file name for the newly created INI files. As a result, files such as Solins1.ini will be created in the output directory $DATA/Samples/Organism/pangenome.
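The way FileNameIndex and OutputDir combine into output file names can be sketched in Python as follows. This is an illustration only, not the tool's code: the function output_paths is hypothetical, the value /data for DATA is a placeholder, and expanding $DATA as an environment variable is an assumption about how the tool resolves it.

```python
import configparser
import os
from pathlib import Path

# The [Build] settings from buildOrganisms.ini, without comment lines.
BUILD_INI = """\
[Build]
FileNameIndex=0
OutputDir=$DATA/Samples/Organism/pangenome
DataFile=$DATA/tomato/organisms.txt
TemplateFile=$DATA/tomato/templateOrganisms.ini
"""


def output_paths(build_ini_text: str, data_text: str) -> list[Path]:
    """Return one output INI path per non-blank row of the data file."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(build_ini_text)
    build = cfg["Build"]
    name_index = build.getint("FileNameIndex")
    # Assumption: $DATA is resolved like an environment variable.
    out_dir = Path(os.path.expandvars(build["OutputDir"]))
    paths = []
    for line in data_text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        columns = line.split("\t")
        paths.append(out_dir / f"{columns[name_index]}.ini")
    return paths


os.environ.setdefault("DATA", "/data")  # placeholder value for this sketch only
data = (
    "Solins1\tv1.1\tSolanum\tinsanum\tSins1\tSinsanum.fasta.gz\t2056095\t;\n"
    "Solgig1\tv1.2\tSolanum\tgiganteum\tSgig1\tSgiganteum.fasta.gz\t374017\t;\n"
)
for path in output_paths(BUILD_INI, data):
    print(path)
```

Changing FileNameIndex to 4 would name the files after the short identifiers in column 4 (Sins1.ini, Sgig1.ini) instead.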
To generate the INI files, run the command:
PS> build ini -c buildOrganisms.ini
A similar approach can be used to generate INI files for loading FASTA sequences and gene annotation.