Adding Organisms

Step 1: Adding Organisms

Build INI files for adding organisms

The metadata for all entries of the pangenome can be downloaded from the supplemental materials published in the paper https://www.nature.com/articles/s41588-026-02506-0:

ID	NAME	TAXONOMY	CATEGORY	COUNTRY	REGION
AM001	WI2757	Cucumis sativus var. sativus	Cultivar	USA	America
AM002	WI7012	Cucumis sativus var. sativus	Cultivar	USA	America
AM003	WI7037	Cucumis sativus var. sativus	Cultivar	USA	America
AM006	True Lemon	Cucumis sativus var. sativus	Cultivar	USA	America
AM011	WI7150	Cucumis sativus var. sativus	Cultivar	USA	America
AM014	WI7167	Cucumis sativus var. xishuangbannanesis	Xishuangbanna	China	East Asia
AM015	WI7204	Cucumis sativus var. sativus	Cultivar	Israel	Central/West Asia
AM016	Poinsett 76	Cucumis sativus var. sativus	Cultivar	USA	America

...

The assemblies in the cucumber pangenome belong to three varieties: Cucumis sativus var. hardwickii, Cucumis sativus var. xishuangbannanesis, and Cucumis sativus var. sativus.

The taxonomy ID of each of the three organisms can be looked up on the NCBI Taxonomy pages and placed into a tab-delimited text file with two columns: the scientific name and the taxonomy ID.

organisms.txt:

Cucumis sativus var. sativus	869827
Cucumis sativus var. xishuangbannanesis	2219226
Cucumis sativus var. hardwickii	319220
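
Because the columns of this file are later addressed by position, a space typed instead of a tab would silently shift the values. The following is a minimal sketch of a sanity check, not part of PersephoneShell. It assumes a standard awk is available, writes a sample copy of the data to a temporary file (the variable names are illustrative), and adds one deliberately broken line to show what the check reports:

```shell
# Sample data in a temporary file; in practice, point data_file at your own organisms.txt.
data_file=$(mktemp)
{
  printf '%s\t%s\n' 'Cucumis sativus var. sativus' 869827
  printf '%s\t%s\n' 'Cucumis sativus var. xishuangbannanesis' 2219226
  printf '%s\t%s\n' 'Cucumis sativus var. hardwickii' 319220
  # Deliberately broken line for the demonstration: a space instead of a tab.
  printf '%s\n' 'Cucumis sativus var. hardwickii 319220'
} > "$data_file"

# Report every line that does not have exactly two tab-separated fields
# or whose second field is not a plain integer.
bad_lines=$(awk -F'\t' 'NF != 2 || $2 !~ /^[0-9]+$/ { print NR ": " $0 }' "$data_file")

if [ -z "$bad_lines" ]; then
  echo "All lines are well-formed"
else
  echo "Malformed lines:"
  printf '%s\n' "$bad_lines"
fi
rm -f "$data_file"
```

An empty result means every line has a name and a numeric ID separated by a single tab.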

To build the INI files for the command add organism, we need a template INI file. We will place it in the cucumber data subfolder and name it templateOrganisms.ini:

templateOrganisms.ini:

[Organism]
; Organism ID (optional, normally the same as TaxonomyId. If not specified, it will be auto-generated)
OrganismId={1}
; Look up taxonomy information in http://www.ncbi.nlm.nih.gov/taxonomy
; Taxonomy ID (optional)
TaxonomyId={1}
; Alternative ID: user defined ID
;AlternativeId=""
; Scientific name (required)
ScientificName={0}
; Common name (optional)
CommonName="cucumber"
;PlantClassification: (optional). If plant, specify if the organism is monocot(0) or eudicot(1).
PlantClassification=1

Columns are addressed by a 0-based index, so the two columns in the file have indexes 0 and 1. The placeholder {0} is replaced with the value from the first column (the scientific name), and {1} with the value from the second column (the taxonomy ID), which here fills both OrganismId and TaxonomyId. For the first organism entry,

ScientificName={0}

will become:

ScientificName=Cucumis sativus var. sativus
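
To make the substitution rule concrete, here is a rough sketch that mimics it with awk for the first data row. It only illustrates the 0-based placeholder convention described above. It is not how build ini is implemented, and the variable names are made up for the example:

```shell
# Template lines taken from templateOrganisms.ini (comments omitted).
template='OrganismId={1}
TaxonomyId={1}
ScientificName={0}'

# First data row of organisms.txt: scientific name, a tab, taxonomy ID.
row=$(printf '%s\t%s' 'Cucumis sativus var. sativus' 869827)

rendered=$(printf '%s\n' "$template" | awk -v row="$row" '
  BEGIN { n = split(row, col, "\t") }
  {
    # split() fills col[1..n], while the placeholders count from 0
    for (i = 1; i <= n; i++)
      gsub("\\{" (i - 1) "\\}", col[i])
    print
  }')

printf '%s\n' "$rendered"
```

The sketch would need extra escaping for values containing & or a backslash, which do not occur in this data set.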

The command build ini has its own INI file, which specifies the output directory, the data file, and the template file:

build-organisms.ini:

[Build]
;FileNameIndex (0-based): define which column contains the base file name for generated INI files. For example, if the column contains 'Solins1',
; the file 'Solins1.ini' will be created in the folder pointed by OutputDir
FileNameIndex=0

;OutputDir: specify the output folder for the generated INI files
OutputDir=$DATA/Samples/Organism/cucumber-pangenome

;DataFile: the tab-delimited text file with lines containing the data for each genome.
DataFile=$DATA/cucumber/organisms.txt

;TemplateFile: the template INI file with placeholders in the form of {0},{1},... referencing the corresponding columns of the tab-delimited data file.
TemplateFile=$DATA/cucumber/templateOrganisms.ini

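FileNameIndex defines which column in the data file should be used for naming the generated INI files. With FileNameIndex=0, the scientific name becomes the base file name, which is why the generated file names contain spaces.

Now that all the necessary files are ready, we can run the command to generate the control files for adding organisms: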
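
As a quick way to preview the file names this setting should produce, the sketch below derives them from the data with cut. It is an illustration under the assumption that the base name is taken verbatim from the chosen column, and the variable file_name_index is hypothetical, standing in for FileNameIndex:

```shell
file_name_index=0   # stands in for FileNameIndex in the [Build] section

# Two rows of organisms.txt: scientific name, a tab, taxonomy ID.
data=$(printf '%s\t%s\n' \
  'Cucumis sativus var. sativus' 869827 \
  'Cucumis sativus var. hardwickii' 319220)

# cut numbers fields from 1, so the 0-based index is shifted by one.
names=$(printf '%s\n' "$data" | cut -f "$((file_name_index + 1))" | sed 's/$/.ini/')

printf '%s\n' "$names"
```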

PS> build ini -c $DATA/cucumber/build-organisms.ini
Output INI files will be placed in /data/Samples/Organism/cucumber-pangenome
Built file /shared/Samples/Organism/cucumber-pangenome/Cucumis sativus var. sativus.ini
Built file /shared/Samples/Organism/cucumber-pangenome/Cucumis sativus var. xishuangbannanesis.ini
Built file /shared/Samples/Organism/cucumber-pangenome/Cucumis sativus var. hardwickii.ini

Adding organisms in a batch

To load multiple organisms, we will run the command add organism in a loop over the generated INI files, using the bash for loop from the Linux command line. When the commands are run from the folder with PersephoneShell, the command line will look similar to this:

for a in /data/Samples/Organism/cucumber-pangenome/*.ini; do psh add organism -c "$a" -t; done

Please note that the INI file names contain spaces, so when a name is placed on the command line, it should be enclosed in quotation marks: "$a".
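
The effect of the quotation marks can be seen without running PersephoneShell. The sketch below uses a throwaway directory and a hypothetical helper, count_args, that simply reports how many arguments it received. It assumes the temporary directory path itself contains no spaces:

```shell
# Create a throwaway directory with one file whose name contains spaces.
demo_dir=$(mktemp -d)
touch "$demo_dir/Cucumis sativus var. sativus.ini"

# Hypothetical helper: prints the number of arguments it was called with.
count_args() { echo "$#"; }

for a in "$demo_dir"/*.ini; do
  quoted=$(count_args "$a")   # the whole path arrives as one argument
  unquoted=$(count_args $a)   # word splitting breaks the path at each space
  echo "quoted: $quoted argument(s), unquoted: $unquoted argument(s)"
done
rm -r "$demo_dir"
```

Without the quotes, the program after -c would receive only the first fragment of the path as the file name.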

With just three organisms to add, we can check the success of the tests manually. When the command is repeated many times, the monitoring can be automated. When executed from the Linux prompt, PersephoneShell reports an exit code to the OS after each command: 0 in case of success, 1 otherwise. We can capture the status of each command in a separate file and then look for non-zero return codes.

for a in /data/Samples/Organism/cucumber-pangenome/*.ini; do psh add organism -c "$a" -t; echo $? $a >> status; done

The command

echo $? $a >> status

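appends a line to the file status with the exit code in the first column, followed by the INI file path. Because >> appends, delete an old status file before repeating the loop. The resulting file: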

0 /data/Data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. hardwickii.ini
0 /data/Data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. sativus.ini
0 /data/Data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. xishuangbannanesis.ini
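
For a long list, the non-zero codes can be picked out with awk instead of reading the file by eye. The sketch below works on a made-up sample in which one entry is marked as failed purely for demonstration. In practice, point status_file at the real status file:

```shell
# Made-up sample: the second line carries exit code 1 only to demonstrate the filter.
status_file=$(mktemp)
cat > "$status_file" <<'EOF'
0 /data/Data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. hardwickii.ini
1 /data/Data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. sativus.ini
0 /data/Data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. xishuangbannanesis.ini
EOF

# Count the lines whose first field (the exit code) is not 0.
failed_count=$(awk '$1 != 0 { n++ } END { print n + 0 }' "$status_file")

if [ "$failed_count" -eq 0 ]; then
  echo "All commands succeeded"
else
  echo "$failed_count command(s) failed:"
  awk '$1 != 0' "$status_file"   # list the failing entries
fi
rm -f "$status_file"
```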

Note

When using the Docker container, run the command loop from inside the container, starting in the directory /data/psh. Running PersephoneShell there requires mono: mono psh.exe -S persephone add organism... Put the INI files in a subfolder of the shared folder, which is visible from the container as /data/Data/....

After confirming that all tests succeeded, remove the flag -t and repeat the loop to load the organisms.

Next: Step 2. Add genomic sequences