Step 1: Adding Organisms

Build INI files for adding organisms

The metadata for all entries of the pangenome can be downloaded from the supplemental materials published in the paper https://www.nature.com/articles/s41588-026-02506-0:

ID     NAME         TAXONOMY                                 CATEGORY       COUNTRY  REGION
AM001  WI2757       Cucumis sativus var. sativus             Cultivar       USA      America
AM002  WI7012       Cucumis sativus var. sativus             Cultivar       USA      America
AM003  WI7037       Cucumis sativus var. sativus             Cultivar       USA      America
AM006  True Lemon   Cucumis sativus var. sativus             Cultivar       USA      America
AM011  WI7150       Cucumis sativus var. sativus             Cultivar       USA      America
AM014  WI7167       Cucumis sativus var. xishuangbannanesis  Xishuangbanna  China    East Asia
AM015  WI7204       Cucumis sativus var. sativus             Cultivar       Israel   Central/West Asia
AM016  Poinsett 76  Cucumis sativus var. sativus             Cultivar       USA      America
...

The assemblies in the cucumber pangenome belong to three varieties: Cucumis sativus var. hardwickii, Cucumis sativus var. xishuangbannanesis, and Cucumis sativus var. sativus.

The taxonomy ID of each of the three organisms can be looked up on the NCBI taxonomy pages and placed into a tab-delimited text file with two columns: the scientific name and the taxonomy ID.

organisms.txt:

Cucumis sativus var. sativus	869827
Cucumis sativus var. xishuangbannanesis	2219226
Cucumis sativus var. hardwickii	319220
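One way to create such a file from the Linux command line is with printf, which writes a literal tab for \t. The following is only a sketch: it writes organisms.txt into the current directory (an assumption; copy it to the cucumber data subfolder afterwards), and the awk check at the end is an optional sanity test that is not part of PersephoneShell.

```shell
# Write organisms.txt with a literal tab between the name and the taxonomy ID.
# printf reuses the format string for each pair of arguments.
printf '%s\t%s\n' \
  'Cucumis sativus var. sativus' 869827 \
  'Cucumis sativus var. xishuangbannanesis' 2219226 \
  'Cucumis sativus var. hardwickii' 319220 > organisms.txt

# Sanity check: count the lines that do not have exactly two tab-separated fields.
bad_lines=$(awk -F'\t' 'NF != 2 { n++ } END { print n + 0 }' organisms.txt)
echo "lines with a wrong column count: $bad_lines"
```

A count other than 0 usually means that a tab was replaced by spaces when the file was edited.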

To build the INI files for the add organism command, we need a template INI file, which we will place in the cucumber data subfolder and name templateOrganisms.ini.

templateOrganisms.ini:


[Organism]
; Organism ID (optional, normally the same as TaxonomyId. If not specified, it will be auto-generated)
OrganismId={1}
; Look up taxonomy information in http://www.ncbi.nlm.nih.gov/taxonomy
; Taxonomy ID (optional)
TaxonomyId={1}
; Alternative ID: user defined ID
;AlternativeId=""
; Scientific name (required)
ScientificName={0}
; Common name (optional)
CommonName="cucumber"
;PlantClassification: (optional). If plant, specify if the organism is monocot(0) or eudicot(1). 
PlantClassification=1

Remember that the columns are addressed by a 0-based index, so the two columns in the file have indices 0 and 1. The placeholder {0} will be replaced with the value from the first column of the text file, and {1} with the value from the second column. For the first organism entry,

ScientificName={0}

will become:

ScientificName=Cucumis sativus var. sativus
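To make the substitution concrete, the sketch below mimics it for a single data line using cut and sed. It is an illustration only and says nothing about how the build ini command is implemented; the variable names line, name, taxid, and filled are made up for this example.

```shell
# Illustration only: mimic the placeholder substitution for one data line.
# This is not PersephoneShell code.
line=$(printf '%s\t%s' 'Cucumis sativus var. sativus' 869827)

name=$(printf '%s\n' "$line" | cut -f1)    # column 0
taxid=$(printf '%s\n' "$line" | cut -f2)   # column 1

# Replace {0} and {1} in two template lines with the column values.
filled=$(printf 'ScientificName={0}\nTaxonomyId={1}\n' |
  sed -e "s/{0}/$name/" -e "s/{1}/$taxid/")
echo "$filled"
```

The first line printed is the ScientificName entry shown above, and the second is the TaxonomyId entry with the value from column 1.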

The build ini command has its own INI file, which specifies the output directory, the data file, and the template file:

build-organisms.ini:


[Build]
;FileNameIndex (0-based): define which column contains the base file name for generated INI files. For example, if the column contains 'Solins1',
; the file 'Solins1.ini' will be created in the folder pointed by OutputDir
FileNameIndex=0

;OutputDir: specify the output folder for the generated INI files
OutputDir=$DATA/Samples/Organism/cucumber-pangenome

;DataFile: the tab-delimited text file with lines containing the data for each genome. 
DataFile=$DATA/cucumber/organisms.txt

;TemplateFile: the template INI file with placeholders in the form of {0},{1},... referencing the corresponding columns of the tab-delimited data file.
TemplateFile=$DATA/cucumber/templateOrganisms.ini

FileNameIndex defines which column in the data file should be used for naming the generated INI files.

Now that all the necessary files are ready, we can run the command to generate the control files for adding organisms:

PS> build ini -c $DATA/cucumber/build-organisms.ini
Output INI files will be placed in /data/Samples/Organism/cucumber-pangenome
  Built file /shared/Samples/Organism/cucumber-pangenome/Cucumis sativus var. sativus.ini
  Built file /shared/Samples/Organism/cucumber-pangenome/Cucumis sativus var. xishuangbannanesis.ini
  Built file /shared/Samples/Organism/cucumber-pangenome/Cucumis sativus var. hardwickii.ini
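As an optional check after the build, the number of generated INI files can be compared with the number of lines in the data file. The helper below is a hypothetical sketch and not a PersephoneShell feature; the function name check_ini_count and its two arguments (the data file and the output folder) are assumptions made for this example.

```shell
# Hypothetical helper (not part of PersephoneShell): compare the number of
# non-empty data lines with the number of .ini files in the output folder.
check_ini_count() {
  data_file=$1
  out_dir=$2
  expected=$(grep -c . "$data_file")
  actual=0
  for f in "$out_dir"/*.ini; do
    # The test skips the unexpanded pattern when no .ini file exists.
    [ -e "$f" ] && actual=$((actual + 1))
  done
  echo "data lines: $expected, INI files: $actual"
  # The function succeeds (status 0) only when the two counts agree.
  [ "$expected" -eq "$actual" ]
}
```

It could be called as check_ini_count /path/to/organisms.txt /path/to/output-folder, with the two paths replaced by the DataFile and OutputDir locations used above. A mismatch suggests that some lines of the data file did not produce an INI file.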


Adding organisms in a batch

To load multiple organisms, we will use the generated INI files and run the add organism command in a loop. Here is an example of executing multiple similar commands with the bash for loop from the Linux command line. When the commands are run from the folder containing PersephoneShell, the command line will look similar to this:

for a in /data/Samples/Organism/cucumber-pangenome/*.ini; do psh add organism -c "$a" -t; done

Please note that the INI file names contain spaces, so when a name is placed on the command line, it must be enclosed in quotation marks: "$a".
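The effect of the quotation marks can be seen without running PersephoneShell at all. The sketch below creates a throwaway file with spaces in its name and counts how many arguments a command would receive with and without the quotes; count_args and demo_dir are names invented for this demonstration, and the temporary folder is assumed to have no spaces in its own path.

```shell
# Demonstration with a throwaway file; no PersephoneShell call is made.
demo_dir=$(mktemp -d)
: > "$demo_dir/Cucumis sativus var. sativus.ini"

# Print the number of arguments received.
count_args() { echo $#; }

for a in "$demo_dir"/*.ini; do
  quoted=$(count_args "$a")   # the whole file name is passed as one argument
  unquoted=$(count_args $a)   # the name is split at each space
  echo "quoted: $quoted argument(s), unquoted: $unquoted argument(s)"
done
rm -rf "$demo_dir"
```

Without the quotes, the option -c would receive only the first fragment of the file name, and the remaining fragments would be passed as separate arguments.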

Here, with just three organisms to add, we can check the success of the tests manually. When the command is repeated many more times, the monitoring can be automated. When executed from the Linux prompt, PersephoneShell returns an exit code to the OS after each command: 0 in case of success, 1 otherwise. We can use this to record the status of each command in a separate file and then check whether there are any non-zero codes.

for a in /data/Samples/Organism/cucumber-pangenome/*.ini; do psh add organism -c "$a" -t; echo "$? $a" >> status; done

The command

echo "$? $a" >> status

appends the exit code and the INI file name to the file status, with the exit code in the first column:

0 /data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. hardwickii.ini
0 /data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. sativus.ini
0 /data/Samples/Organism/cucumber-pangenome/Cucumis sativus var. xishuangbannanesis.ini
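With many entries, the status file can be scanned automatically instead of being read by eye. The following is a minimal sketch that assumes the layout shown above (exit code in the first column); report_failures is a hypothetical helper name, not a PersephoneShell command.

```shell
# Hypothetical helper: print every entry of a status file whose first column
# (the exit code) is not 0, then a summary line. The function itself returns
# a non-zero status if at least one entry failed.
report_failures() {
  awk '$1 != 0 { failed++; print "FAILED: " $0 }
       END { printf "%d of %d commands failed\n", failed, NR; exit (failed > 0) }' "$1"
}
```

Running report_failures status after the loop lists the INI files that need attention. Because the loop appends to status, delete the file before repeating the loop so that old results are not counted again.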

Note

When using the Docker container, run the command loop from inside the container. Start in the directory /data/psh. The command to run PersephoneShell requires mono: mono psh.exe -S persephone add organism... Put the INI files in a sub-folder of the shared folder, visible from the container as /data/Data/....

After confirming that all tests have been successful, remove the flag -t and run the loading process.

Next: Step 2. Add genomic sequences