Adding Organisms
Step 1: Adding Organisms
Build INI files for adding organisms
The metadata for all entries of the pangenome can be downloaded from the supplemental materials of the paper https://www.nature.com/articles/s41588-026-02506-0. The first entries look like this:
| ID | NAME | TAXONOMY | CATEGORY | COUNTRY | REGION |
| AM001 | WI2757 | Cucumis sativus var. sativus | Cultivar | USA | America |
| AM002 | WI7012 | Cucumis sativus var. sativus | Cultivar | USA | America |
| AM003 | WI7037 | Cucumis sativus var. sativus | Cultivar | USA | America |
| AM006 | True Lemon | Cucumis sativus var. sativus | Cultivar | USA | America |
| AM011 | WI7150 | Cucumis sativus var. sativus | Cultivar | USA | America |
| AM014 | WI7167 | Cucumis sativus var. xishuangbannanesis | Xishuangbanna | China | East Asia |
| AM015 | WI7204 | Cucumis sativus var. sativus | Cultivar | Israel | Central/West Asia |
| AM016 | Poinsett 76 | Cucumis sativus var. sativus | Cultivar | USA | America |
...
The assemblies in the cucumber pangenome belong to three different varieties: Cucumis sativus var. hardwickii, Cucumis sativus var. xishuangbannanesis, and Cucumis sativus var. sativus.
The taxonomy ID of each of the three organisms can be looked up on the NCBI taxonomy pages and placed into a tab-delimited text file with two columns: the scientific name and the taxonomy ID.
organisms.txt:
| Cucumis sativus var. sativus | 869827 |
| Cucumis sativus var. xishuangbannanesis | 2219226 |
| Cucumis sativus var. hardwickii | 319220 |
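One possible way to create this file from the command line is sketched below. It assumes a POSIX shell and writes organisms.txt to the current folder; move it to $DATA/cucumber afterwards. The awk check at the end is an optional addition, not part of PersephoneShell.

```shell
# Write the three varieties and their NCBI taxonomy IDs as two tab-separated columns.
# printf reuses the format string for each pair of arguments.
printf '%s\t%s\n' \
  'Cucumis sativus var. sativus' 869827 \
  'Cucumis sativus var. xishuangbannanesis' 2219226 \
  'Cucumis sativus var. hardwickii' 319220 > organisms.txt

# Optional sanity check: report any line that does not have exactly two columns.
awk -F'\t' 'NF != 2 { bad = 1; print "line " NR ": expected 2 columns, found " NF }
            END { exit bad + 0 }' organisms.txt
```

Writing the file with printf avoids editors that silently convert tabs to spaces, which would break the two-column layout.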
To build the INI files for the command add organism, we need a template INI file. We will place it in the cucumber data subfolder and name it templateOrganisms.ini:
templateOrganisms.ini:
[Organism]
; Organism ID (optional, normally the same as TaxonomyId. If not specified, it will be auto-generated)
OrganismId={1}
; Look up taxonomy information in http://www.ncbi.nlm.nih.gov/taxonomy
; Taxonomy ID (optional)
TaxonomyId={1}
; Alternative ID: user defined ID
;AlternativeId=""
; Scientific name (required)
ScientificName={0}
; Common name (optional)
CommonName="cucumber"
;PlantClassification: (optional). If plant, specify if the organism is monocot(0) or eudicot(1).
PlantClassification=1
Columns are addressed by a 0-based index, so the two columns in the file have indexes 0 and 1. The placeholder {0} will be replaced with the value from the first column of the text file, and {1} with the value from the second column. For the first organism entry,
ScientificName={0}
will become:
ScientificName=Cucumis sativus var. sativus
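The substitution can be pictured with a small shell sketch. This is an illustration only and not how build ini is implemented; the variable names col0, col1, and rendered are made up for the example, and only two lines of the template are used.

```shell
# Illustration only: emulate the placeholder substitution for one data line.
line=$(printf 'Cucumis sativus var. sativus\t869827')

# Split the tab-delimited line into its two columns (index 0 and index 1).
col0=$(printf '%s\n' "$line" | cut -f1)
col1=$(printf '%s\n' "$line" | cut -f2)

# Replace {0} and {1} in two lines taken from the template.
rendered=$(printf 'ScientificName={0}\nTaxonomyId={1}\n' |
  sed -e "s/{0}/$col0/g" -e "s/{1}/$col1/g")
printf '%s\n' "$rendered"
```

The sketch prints the two template lines with the scientific name and the taxonomy ID filled in.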
The command build ini has its own INI file, which specifies the output directory, the data file, and the template file:
build-organisms.ini:
[Build]
;FileNameIndex (0-based): define which column contains the base file name for generated INI files. For example, if the column contains 'Solins1',
; the file 'Solins1.ini' will be created in the folder pointed by OutputDir
FileNameIndex=0
;OutputDir: specify the output folder for the generated INI files
OutputDir=$DATA/Samples/Organism/cucumber-pangenome
;DataFile: the tab-delimited text file with lines containing the data for each genome.
DataFile=$DATA/cucumber/organisms.txt
;TemplateFile: the template INI file with placeholders in the form of {0},{1},... referencing the corresponding columns of the tab-delimited data file.
TemplateFile=$DATA/cucumber/templateOrganisms.ini
FileNameIndex defines which column in the data file should be used for naming the generated INI files. With FileNameIndex=0, each file is named after the scientific name in the first column.
Now that all the necessary files are ready, we can run the command to generate the control files for adding organisms:
PS> build ini -c $DATA/cucumber/build-organisms.ini
Output INI files will be placed in /data/Samples/Organism/cucumber-pangenome
Built file /shared/Samples/Organism/cucumber-pangenome/Cucumis sativus var. sativus.ini
Built file /shared/Samples/Organism/cucumber-pangenome/Cucumis sativus var. xishuangbannanesis.ini
Built file /shared/Samples/Organism/cucumber-pangenome/Cucumis sativus var. hardwickii.ini
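Before loading, it can be useful to confirm that one INI file was generated per line of the data file. The helper below is a hypothetical sketch (check_ini_count is not a PersephoneShell command). It assumes a POSIX shell and takes the data file and the output folder as arguments.

```shell
# Hypothetical helper: compare the number of non-empty data lines
# with the number of *.ini files in the output folder.
check_ini_count() {
  data_file=$1
  out_dir=$2
  expected=$(grep -c . "$data_file")
  found=0
  for f in "$out_dir"/*.ini; do
    # The test skips the unexpanded pattern when no INI file exists.
    if [ -e "$f" ]; then
      found=$((found + 1))
    fi
  done
  echo "expected $expected INI files, found $found"
  [ "$expected" -eq "$found" ]
}

# Example call, using the paths from build-organisms.ini:
# check_ini_count "$DATA/cucumber/organisms.txt" "$DATA/Samples/Organism/cucumber-pangenome"
```

The function returns 0 when the counts match, so it can be combined with && or used in an if statement.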
Adding organisms in a batch
To load multiple organisms, we will use the generated INI files and run the command add organism in a loop. Here is an example of executing multiple similar commands with the bash for loop from the Linux command line. When the commands are run from the folder with PersephoneShell, the command line will look similar to this:
for a in /data/Samples/Organism/cucumber-pangenome/*.ini; do psh add organism -c "$a" -t; done
Please note that the INI file names contain spaces, so the name must be enclosed in quotation marks when it is placed on the command line: "$a".
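The effect of the quotation marks can be seen in a short experiment. It assumes bash or another POSIX shell with the default word splitting; count_args is a helper defined only for this illustration.

```shell
# Helper for illustration: print how many arguments were received.
count_args() {
  echo "$#"
}

f='Cucumis sativus var. sativus.ini'

# Unquoted: the shell splits the name at each space into separate arguments.
unquoted=$(count_args $f)
# Quoted: the whole file name is passed as a single argument.
quoted=$(count_args "$f")

echo "unquoted: $unquoted arguments, quoted: $quoted argument"
# prints "unquoted: 4 arguments, quoted: 1 argument"
```

Without the quotation marks, the option -c would receive only the first word of the file name.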
Here, with just three organisms to add, we can monitor the success of the tests manually. When the command is repeated many times, the monitoring can be automated. When executed from the Linux prompt, PersephoneShell returns an exit code to the OS after each command: 0 in case of success, 1 otherwise. We can use this to record the status of each command in a separate file and then check for any non-zero return codes.
for a in /data/Samples/Organism/cucumber-pangenome/*.ini; do psh add organism -c "$a" -t; echo "$? $a" >> status; done
The command
echo "$? $a" >> status
appends one line per command to the file status, with the exit code in the first column followed by the INI file name:
0 /data/Data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. hardwickii.ini
0 /data/Data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. sativus.ini
0 /data/Data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. xishuangbannanesis.ini
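For a long status file, the check for non-zero codes can itself be scripted. The sketch below assumes the file layout shown above (exit code in the first column); report_failures is a hypothetical helper name, not a PersephoneShell command.

```shell
# Hypothetical helper: print every line whose first column is not 0
# and return a non-zero code if at least one such line was found.
report_failures() {
  awk '$1 != 0 { print; n++ } END { exit (n > 0) }' "$1"
}

# Run the check only when a status file exists in the current folder.
if [ -f status ]; then
  if report_failures status; then
    echo "all commands returned 0"
  else
    echo "some commands failed (listed above)"
  fi
fi
```

Because the loop appends to the file with >>, delete or rename an old status file before repeating the run, otherwise results from earlier runs are mixed in.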
Note
When using the Docker container, run the command loop from inside the container, starting in the directory /data/psh. Running PersephoneShell there requires mono: mono psh.exe -S persephone add organism... Put the INI files in a subfolder of the shared folder, which is visible from the container as /data/Data/....
After confirming that all tests have been successful, remove the flag -t and run the loading process.