Step 2. Adding sequences

Building INI files for adding sequences in a batch

Our plan is to use the command build ini again, this time to generate the INI files for adding sequences. We will first prepare the template file with placeholders to be filled with the variable values from the data file. Let's look at the first lines of the data file again: 

metadata.txt

AM001   WI2757       Cucumis sativus var. sativus              Cultivar        USA     America     869827
AM002   WI7012       Cucumis sativus var. sativus              Cultivar        USA     America     869827
AM003   WI7037       Cucumis sativus var. sativus              Cultivar        USA     America     869827
AM006   True Lemon   Cucumis sativus var. sativus              Cultivar        USA     America     869827
AM011   WI7150       Cucumis sativus var. sativus              Cultivar        USA     America     869827
AM014   WI7167       Cucumis sativus var. xishuangbannanesis   Xishuangbanna   China   East Asia   2219226

In this version of the file, we have added OrganismId as the last column, matching the scientific name of the organism in the third column.
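Since the template refers to columns by position, a row with a missing or extra tab may not fill the placeholders as intended. The sketch below is one possible pre-check using standard awk; it is not a PersephoneShell feature. The two sample rows are copied from the listing above, and the temporary directory is only an assumption made for illustration.

```shell
# Hypothetical pre-check (plain awk, not part of PersephoneShell):
# count rows whose number of tab-separated columns is not 7.
tmp=$(mktemp -d)
printf 'AM001\tWI2757\tCucumis sativus var. sativus\tCultivar\tUSA\tAmerica\t869827\n' > "$tmp/metadata.txt"
printf 'AM014\tWI7167\tCucumis sativus var. xishuangbannanesis\tXishuangbanna\tChina\tEast Asia\t2219226\n' >> "$tmp/metadata.txt"

bad_rows=$(awk -F'\t' 'NF != 7 { n++ } END { print n + 0 }' "$tmp/metadata.txt")
echo "rows with a column count other than 7: $bad_rows"
```

Against the real data file, the same awk command would be pointed at $DATA/cucumber/metadata.txt.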

Before deciding on the rules for parsing out AccessionNo and MapName from the FASTA headers, let's look inside the sequence files by running the command analyze fasta:

PS> analyze fasta http://cucurbitgenomics.org/v2/ftp/pan-genome/cucumber/assembly/AM001_genome.fa.gz

         LENGTH  HEADER
----------------------------------------------------------------------------------------------------------------------
     36,879,732  chr1
     41,308,811  chr2
     42,096,956  chr3
     41,368,229  chr4
     34,517,769  chr5
     33,342,413  chr6
     26,698,637  chr7
      8,865,575  chr0

AM001_genome.fa Total   8 records       265,078,122 bp

As we can see, the FASTA headers in the sequence files are very simple: chr1, chr2, etc. We do not need to provide any rules for parsing them: by default, the entire FASTA header is copied into the map's AccessionNo and MapName. What we do need to specify is the criterion for separating chromosomes from non-chromosomes, because some of the assemblies contain scaffolds in addition to the chromosomes. We will nominate sequences longer than 5,000,000 bp as chromosomes:

ChromosomeCriteriaLength=">5000000"
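To see what this threshold does with the AM001 lengths reported by analyze fasta, the comparison can be reproduced with plain awk. This is only an illustration of the length rule, not how PersephoneShell evaluates it internally; the record named scaffold_example is hypothetical and was added to show the non-chromosome branch.

```shell
# Lengths from the `analyze fasta` output for AM001 (header, length in bp).
records='chr1 36879732
chr2 41308811
chr3 42096956
chr4 41368229
chr5 34517769
chr6 33342413
chr7 26698637
chr0 8865575'

# Hypothetical short scaffold, added only to show the other branch;
# it is not a record from the real assembly.
records="$records
scaffold_example 120000"

threshold=5000000

# Print each record with the class a ">5000000" length rule would assign.
printf '%s\n' "$records" | awk -v t="$threshold" '{ print $1, ($2 > t ? "chromosome" : "other") }'

# Count the records that pass the threshold.
n_chrom=$(printf '%s\n' "$records" | awk -v t="$threshold" '$2 > t { n++ } END { print n + 0 }')
echo "chromosomes: $n_chrom"
```

All eight AM001 records exceed the threshold, including chr0 at 8,865,575 bp, so each of them would be nominated as a chromosome under this rule.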

The template INI file for adding sequences (with comments removed for brevity) could look like this: 

templateSequence.ini

[ProcessRun]

[MapSet]
OrganismId={6}
DisplayName=Cucumber {1}
Description="{2}
{3} {1}
Region: {4}, {5}
https://www.nature.com/articles/s41588-026-02506-0

Downloaded from http://cucurbitgenomics.org/v2/ftp/pan-genome/cucumber/assembly/{0}_genome.fa.gz"
AccessionNo={0}
SourceId="cucurbitgenomics"

[MapSetTree]
RootNodeName="Cucumis sativus (pangenome)"
RootNodeOrderNo=0

[Sequence]
Sources="http://cucurbitgenomics.org/v2/ftp/pan-genome/cucumber/assembly/{0}_genome.fa.gz"
ChromosomeCriteriaLength=">5000000"


We will use the first column (index 0) as AccessionNo:

AccessionNo={0}

Note how we build the Description field combining multiple fields from the data file.

The source file is provided via a URL, also built by using a placeholder. 
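To make the placeholder mechanics concrete, the sketch below fills a three-line excerpt of the template with the first row of the data file. It uses plain awk and only imitates the substitution; the actual work is done by build ini, whose internal behavior is not shown here. The temporary files are assumptions made for illustration.

```shell
# Illustration only: imitate {N} placeholder filling with awk.
tmp=$(mktemp -d)

# First row of metadata.txt (tab-delimited, 7 columns).
printf 'AM001\tWI2757\tCucumis sativus var. sativus\tCultivar\tUSA\tAmerica\t869827\n' > "$tmp/row.txt"

# Three lines taken from templateSequence.ini.
cat > "$tmp/template.ini" <<'EOF'
OrganismId={6}
DisplayName=Cucumber {1}
AccessionNo={0}
EOF

filled=$(awk -F'\t' '
  # First file: remember the column values under 0-based indexes.
  NR == FNR { for (i = 1; i <= NF; i++) val[i - 1] = $i; n = NF; next }
  # Second file: replace every {i} in the line with column i.
  {
    out = $0
    for (i = 0; i < n; i++) {
      ph = "{" i "}"
      while ((p = index(out, ph)) > 0)
        out = substr(out, 1, p - 1) val[i] substr(out, p + length(ph))
    }
    print out
  }' "$tmp/row.txt" "$tmp/template.ini")

printf '%s\n' "$filled"
```

For this row the excerpt becomes OrganismId=869827, DisplayName=Cucumber WI2757, and AccessionNo=AM001.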

To generate the INI files for adding sequences, we will provide the following INI file to the command build ini:

build-seq.ini


[Build]
;FileNameIndex (0-based): define which column contains the base file name for generated INI files. For example, if the column contains 'Solins1',
; the file 'Solins1.ini' will be created in the folder pointed by OutputDir
FileNameIndex=0

;OutputDir: specify the output folder for the generated INI files
OutputDir=$DATA/Samples/Sequence/cucumber-pangenome

;DataFile: the tab-delimited text file with lines containing the data for each genome. 
DataFile=$DATA/cucumber/metadata.txt

;TemplateFile: the template INI file with placeholders in the form of {0},{1},... referencing the corresponding columns of the tab-delimited data file.
TemplateFile=$DATA/cucumber/templateSequence.ini

Now, we can run the command to build the INI files for all 36 assemblies:

build ini -c $DATA/cucumber/build-seq.ini

Adding sequences in a batch

As in the previous section, we can use a command loop to repeat the command add sequence for all assemblies:

for a in /data/Samples/Sequence/cucumber-pangenome/*.ini; do psh add sequence -c "$a" -t; echo $? $a >> status-seq; done

Note

When using the Docker container, run the command loop from inside the container. Start in the directory /data/psh. The command to run PersephoneShell requires mono: mono psh.exe -S persephone add sequence... Put the INI files in a sub-folder of the shared folder, visible from the container as /data/Data/....

Verify that the tests were successful (check the exit codes in the file, this time named status-seq) and, if there are no errors, load the sequences into the database:

for a in /data/Samples/Sequence/cucumber-pangenome/*.ini; do psh add sequence -c "$a" -v; echo $? $a >> status-seq; done
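Each loop writes one line per INI file to status-seq, in the form "exit-code path". With 36 assemblies it can be easier to filter the file than to read it. The sketch below lists the entries with a non-zero code, assuming the usual shell convention that a non-zero exit status means failure. The status lines are hypothetical sample content, not real results.

```shell
# Hypothetical status file in the "exit-code path" format written by the loops.
tmp=$(mktemp -d)
printf '0 /data/Samples/Sequence/cucumber-pangenome/AM001.ini\n' > "$tmp/status-seq"
printf '1 /data/Samples/Sequence/cucumber-pangenome/AM002.ini\n' >> "$tmp/status-seq"

# Keep only the INI files whose recorded exit code is not 0.
failed=$(awk '$1 != 0 { print $2 }' "$tmp/status-seq")
echo "failed: $failed"
```

Because both loops append with >>, status-seq accumulates the lines of the test run and of the load. Removing or renaming the file between the two runs keeps the results separate.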