Adding Sequences

Step 2. Adding sequences

Building INI files for adding sequences in a batch

We will use the command build ini again, this time to generate the INI files for adding sequences. First we prepare a template file whose placeholders will be filled with the values from the data file. Here are the first lines of the data file again:

metadata.txt

AM001	WI2757	Cucumis sativus var. sativus	Cultivar	USA	America	869827
AM002	WI7012	Cucumis sativus var. sativus	Cultivar	USA	America	869827
AM003	WI7037	Cucumis sativus var. sativus	Cultivar	USA	America	869827
AM006	True Lemon	Cucumis sativus var. sativus	Cultivar	USA	America	869827
AM011	WI7150	Cucumis sativus var. sativus	Cultivar	USA	America	869827
AM014	WI7167	Cucumis sativus var. xishuangbannanesis	Xishuangbanna	China	East Asia	2219226

In this version of the file, we have added OrganismId as the last column, matching the scientific name of the organism in the third column.
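Because the template refers to the columns by position, a row with a missing or extra tab would shift the values into the wrong placeholders. The following is a minimal sketch of a column-count check, assuming the file is tab-delimited with exactly seven columns as shown above. The temporary file and the variable names METADATA and bad_lines are illustrative only; in practice METADATA would point at your own metadata.txt.

```shell
# Illustrative copy of two metadata rows; replace with the path to your own file.
METADATA=$(mktemp)
printf 'AM001\tWI2757\tCucumis sativus var. sativus\tCultivar\tUSA\tAmerica\t869827\n' > "$METADATA"
printf 'AM014\tWI7167\tCucumis sativus var. xishuangbannanesis\tXishuangbanna\tChina\tEast Asia\t2219226\n' >> "$METADATA"

# Report every line whose number of tab-separated fields is not 7.
bad_lines=$(awk -F'\t' 'NF != 7 { print NR ": " NF " fields" }' "$METADATA")

if [ -z "$bad_lines" ]; then
    echo "all rows have 7 columns"
else
    echo "rows with an unexpected column count:"
    echo "$bad_lines"
fi

rm -f "$METADATA"
```

An empty result means every row supplies a value for each of the placeholders {0} to {6}.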

Before deciding on the rules for parsing out AccessionNo and MapName from the FASTA headers, let's look inside the sequence files by running the command analyze fasta:

PS> analyze fasta http://cucurbitgenomics.org/v2/ftp/pan-genome/cucumber/assembly/AM001_genome.fa.gz

LENGTH HEADER
----------------------------------------------------------------------------------------------------------------------
36,879,732 chr1
41,308,811 chr2
42,096,956 chr3
41,368,229 chr4
34,517,769 chr5
33,342,413 chr6
26,698,637 chr7
8,865,575 chr0

AM001_genome.fa Total 8 records 265,078,122 bp

As we can see, the FASTA headers in the sequence files are very simple: chr1, chr2, etc., so we do not need to provide any rules for parsing them. By default, the entire FASTA header is copied into the map's AccessionNo and MapName. What we do need to specify is the criterion for separating chromosomes from non-chromosomes, because some of the assemblies have scaffolds in addition to the chromosomes. We will treat sequences longer than 5,000,000 bp as chromosomes:

ChromosomeCriteriaLength=">5000000"
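To see what this threshold means for AM001, the lengths reported by analyze fasta can be compared against it by hand. The sketch below only mimics the length comparison with awk; how PersephoneShell evaluates the criterion internally is not covered here, and the variable names records, threshold, chromosomes and others are illustrative.

```shell
# Lengths and headers copied from the analyze fasta output for AM001
# (thousands separators removed).
records='36879732 chr1
41308811 chr2
42096956 chr3
41368229 chr4
34517769 chr5
33342413 chr6
26698637 chr7
8865575 chr0'

# Stand-in for the rule ChromosomeCriteriaLength=">5000000".
threshold=5000000

# Records strictly longer than the threshold, and the remaining ones.
chromosomes=$(printf '%s\n' "$records" | awk -v t="$threshold" '$1 > t { print $2 }')
others=$(printf '%s\n' "$records" | awk -v t="$threshold" '$1 <= t { print $2 }')

echo "chromosomes: $(echo $chromosomes)"
echo "other: ${others:-none}"
```

For this assembly all eight records, including chr0, are longer than the threshold. Assemblies that contain shorter scaffolds would list them under "other".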

The template INI file for adding sequences (with comments removed for brevity) could look like this:

templateSequence.ini

[ProcessRun]

[MapSet]
OrganismId={6}
DisplayName=Cucumber {1}
Description="{2}
{3} {1}
Region: {4}, {5}
https://www.nature.com/articles/s41588-026-02506-0

Downloaded from http://cucurbitgenomics.org/v2/ftp/pan-genome/cucumber/assembly/{0}_genome.fa.gz"
AccessionNo={0}
SourceId="cucurbitgenomics"

[MapSetTree]
RootNodeName="Cucumis sativus (pangenome)"
RootNodeOrderNo=0

[Sequence]
Sources="http://cucurbitgenomics.org/v2/ftp/pan-genome/cucumber/assembly/{0}_genome.fa.gz"
ChromosomeCriteriaLength=">5000000"

We will use the first column (index 0) as AccessionNo:

AccessionNo={0}

Note how we build the Description field by combining multiple fields from the data file.

The source file is provided via a URL, also built by using a placeholder.
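The placeholder mechanism can be illustrated outside the tool. The sketch below is not the implementation of build ini; it is a small awk imitation, assuming that each {N} is replaced by the N-th (0-based) tab-separated field of one data row. The three-line template excerpt and the variable names line, template and filled are chosen for the example.

```shell
# One row of metadata.txt (tab-separated).
line=$(printf 'AM001\tWI2757\tCucumis sativus var. sativus\tCultivar\tUSA\tAmerica\t869827')

# A short excerpt of the template with positional placeholders.
template='AccessionNo={0}
DisplayName=Cucumber {1}
OrganismId={6}'

# Replace each {N} with the N-th field of the row (N counts from 0).
filled=$(printf '%s\n' "$template" | awk -v row="$line" '
BEGIN { n = split(row, f, "\t") }
{
    for (i = 1; i <= n; i++) {
        # awk arrays from split() start at 1, placeholders start at 0
        gsub("[{]" (i - 1) "[}]", f[i])
    }
    print
}')

echo "$filled"
```

For the AM001 row this prints AccessionNo=AM001, DisplayName=Cucumber WI2757 and OrganismId=869827, each on its own line. The imitation would misbehave for field values containing & or backslashes, which have a special meaning in the gsub replacement string.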

To generate the INI files for adding sequences, we pass the following INI file to the command build ini:

build-seq.ini

[Build]
;FileNameIndex (0-based): defines which column contains the base file name for the generated INI files. For example, if the column contains 'Solins1',
; the file 'Solins1.ini' will be created in the folder pointed to by OutputDir
FileNameIndex=0

;OutputDir: specify the output folder for the generated INI files
OutputDir=$DATA/Samples/Sequence/cucumber-pangenome

;DataFile: the tab-delimited text file with lines containing the data for each genome.
DataFile=$DATA/cucumber/metadata.txt

;TemplateFile: the template INI file with placeholders in the form of {0},{1},... referencing the corresponding columns of the tab-delimited data file.
TemplateFile=$DATA/cucumber/templateSequence.ini

Now, we can run the command to build the INI files for all 36 assemblies:

build ini -c $DATA/cucumber/build-seq.ini
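After the build, it can be worth confirming that one INI file was produced per data row. The helper below is a hypothetical convenience function, not part of PersephoneShell. It assumes the data file has no header line and that the generated files carry the extension .ini, and it is demonstrated on a throwaway folder with placeholder file names.

```shell
# compare_counts DATA_FILE OUTPUT_DIR
# Succeeds when the number of non-empty data rows equals the number of
# .ini files found directly in OUTPUT_DIR.
compare_counts() {
    rows=$(grep -c . "$1")
    inis=$(( $(find "$2" -maxdepth 1 -name '*.ini' | wc -l) ))
    echo "data rows: $rows, INI files: $inis"
    [ "$rows" -eq "$inis" ]
}

# Demonstration on a temporary folder; the contents are placeholders.
demo=$(mktemp -d)
printf 'AM001\tWI2757\nAM002\tWI7012\n' > "$demo/metadata.txt"
mkdir "$demo/out"
touch "$demo/out/AM001.ini" "$demo/out/AM002.ini"

if compare_counts "$demo/metadata.txt" "$demo/out"; then
    counts_match=yes
else
    counts_match=no
fi
echo "counts match: $counts_match"

rm -rf "$demo"
```

With the paths used in this tutorial the call would be compare_counts "$DATA/cucumber/metadata.txt" "$DATA/Samples/Sequence/cucumber-pangenome", assuming a shell variable DATA is set to the same folder that $DATA refers to in the INI file.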

Adding sequences in a batch

As in the previous section, we can use the command loop to repeat the command add sequence for all assemblies:

for a in /data/Samples/Sequence/cucumber-pangenome/*.ini; do psh add sequence -c "$a" -t; echo "$? $a" >> status-seq; done

Note

When using the Docker container, run the command loop from inside the container, starting in the directory /data/psh. Running PersephoneShell there requires mono: mono psh.exe -S persephone add sequence... Put the INI files in a sub-folder of the shared folder, which is visible from the container as /data/Data/....

Verify that the tests were successful by checking the exit codes recorded in the file status-seq. If there are no errors, load the sequences into the database:

for a in /data/Samples/Sequence/cucumber-pangenome/*.ini; do psh add sequence -c "$a" -v; echo "$? $a" >> status-seq; done
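The status file written by the loops above holds one line per INI file, with the exit code first and the path second, so failures can be listed with a short awk filter. The sketch below runs on a made-up status file in that format; the exit code 1 for AM002 is invented for the demonstration, and the names status_file and failed are illustrative.

```shell
# Made-up status file in the "<exit code> <ini path>" format written by the loops.
status_file=$(mktemp)
printf '0 /data/Samples/Sequence/cucumber-pangenome/AM001.ini\n' > "$status_file"
printf '1 /data/Samples/Sequence/cucumber-pangenome/AM002.ini\n' >> "$status_file"

# List the INI files whose recorded exit code is not 0.
failed=$(awk '$1 != 0 { print $2 }' "$status_file")

if [ -z "$failed" ]; then
    echo "no failures recorded"
else
    echo "failed runs:"
    echo "$failed"
fi

rm -f "$status_file"
```

Both loops append to status-seq, so renaming or deleting the file between the test run and the load keeps the two sets of exit codes apart.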