Adding Annotation

Step 3: Adding gene annotation

Building INI files for adding annotation in a batch

Now that the sequences are loaded, let's add gene annotation from the provided GFF3 files.

The build-annot.ini file lists instructions for building the INI files:

build-annot.ini

[Build]
;FileNameIndex (0-based): defines which column contains the base file name for the generated INI files. For example, if the column contains 'Solins1',
; the file 'Solins1.ini' will be created in the folder pointed to by OutputDir
FileNameIndex=0

;OutputDir: specify the output folder for the generated INI files
OutputDir=$DATA/cucumber/annot-ini

;DataFile: the tab-delimited text file with lines containing the data for each genome.
DataFile=$DATA/cucumber/metadata.txt

;TemplateFile: the template INI file with placeholders in the form of {0},{1},... referencing the corresponding columns of the tab-delimited data file.
TemplateFile=$DATA/cucumber/templateAnnot.ini
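To make the four settings concrete, here is a minimal Python sketch that reads a [Build] section like the one above and resolves the $DATA prefix. It is not how PersephoneShell itself parses the file. The helper name read_build_settings is hypothetical, and treating $DATA as a simple variable that expands to /data/Data is an assumption based on the paths printed later in this section.

```python
import configparser
from string import Template

# Abbreviated copy of the [Build] section shown above.
BUILD_INI = """\
[Build]
; 0-based column that supplies the base name of each generated INI file
FileNameIndex=0
OutputDir=$DATA/cucumber/annot-ini
DataFile=$DATA/cucumber/metadata.txt
TemplateFile=$DATA/cucumber/templateAnnot.ini
"""

def read_build_settings(ini_text, variables):
    """Parse the [Build] section and expand $NAME variables (assumed behaviour)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(ini_text)
    build = parser["Build"]

    def expand(value):
        # safe_substitute leaves unknown $NAME references untouched
        return Template(value).safe_substitute(variables)

    return {
        "file_name_index": int(build["FileNameIndex"]),
        "output_dir": expand(build["OutputDir"]),
        "data_file": expand(build["DataFile"]),
        "template_file": expand(build["TemplateFile"]),
    }

settings = read_build_settings(BUILD_INI, {"DATA": "/data/Data"})
print(settings["output_dir"])
```

Lines starting with ; are skipped as comments by configparser, which matches the comment style used in the control file.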

Here is the template INI file for building the control files:

templateAnnot.ini

[ProcessRun]

[MapSet]
MapSetPath=/Cucumis sativus (pangenome)/Cucumber {1}

[Method]
EVM=120,220,250

[Annotation]
Sources=http://cucurbitgenomics.org/v2/ftp/pan-genome/cucumber/annotation/{0}.gff3.gz
CommitFrequency=1000
Method="EVM"
TrackName="EVM gene models"

TrackDescription="Gene annotation from http://cucurbitgenomics.org/
File: http://cucurbitgenomics.org/v2/ftp/pan-genome/cucumber/annotation/{0}.gff3.gz"

QualifierTypes=gene,mRNA

QualifierAttributeKey.gene:Name=geneName
QualifierAttributeKey.mRNA:Name=transcriptName
GroupNameQualifier=parent_gene_id

[QualifierLinks]

[AnnotationSearches]

GeneName=geneName,transcriptName

[MapMapping]

The placeholder {1} will be replaced with the values from the second column (index 1, since column numbering is zero-based), which was also used for naming the map sets. The values from the first column (index 0) will populate the placeholder {0} to form the GFF file names.
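The substitution described above can be sketched in a few lines of Python. This only illustrates the idea and is not PersephoneShell's implementation: the functions render and build_ini_texts are hypothetical, the template is abbreviated, and the second-column values in the sample metadata are made up (only the base names 9110gt and 9930_v3 appear in this tutorial).

```python
import csv
import io
import re

# Abbreviated template: {0} and {1} refer to 0-based metadata columns.
TEMPLATE = """\
[MapSet]
MapSetPath=/Cucumis sativus (pangenome)/Cucumber {1}

[Annotation]
Sources=http://cucurbitgenomics.org/v2/ftp/pan-genome/cucumber/annotation/{0}.gff3.gz
"""

# Hypothetical metadata rows: column 0 is the file base name,
# column 1 is the label used in the map set name.
METADATA = "9110gt\tExampleLineA\n9930_v3\tExampleLineB\n"

PLACEHOLDER = re.compile(r"\{(\d+)\}")

def render(template, columns):
    """Replace each {N} with columns[N], where N is a 0-based column index."""
    return PLACEHOLDER.sub(lambda m: columns[int(m.group(1))], template)

def build_ini_texts(template, metadata_text, file_name_index=0):
    """Return {file name: INI text} for every non-empty metadata row."""
    reader = csv.reader(io.StringIO(metadata_text), delimiter="\t")
    return {
        row[file_name_index] + ".ini": render(template, row)
        for row in reader
        if row
    }

built = build_ini_texts(TEMPLATE, METADATA)
for name, text in built.items():
    print(name)
    print(text)
```

Each metadata row yields one INI file named after the column selected by FileNameIndex, which is why a single template can drive the whole pangenome.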

Adding annotation in a batch

The GFF files we ingest contain very little gene‑level metadata in the form of attributes that can be captured as qualifiers. To preserve the essential identifiers, we extract the gene name from gene features (gene:Name) and store it as the qualifier geneName. Likewise, we take the transcript name from mRNA features (mRNA:Name) and store it as transcriptName.

For consistency across the entire database, we follow our internal convention:

• geneName holds the shared name for a gene, typically common to all splice variants.

• transcriptName holds the unique identifier for each gene model.

This ensures uniform handling of gene and transcript identifiers throughout the system.
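To illustrate the mapping set by the QualifierAttributeKey lines in the template, the following Python sketch pulls the Name attribute out of gene and mRNA records and stores it under geneName or transcriptName. It assumes standard nine-column GFF3 lines with key=value attributes separated by semicolons. The sample records and the function names are hypothetical, and the loader's real parsing logic may differ.

```python
from urllib.parse import unquote

# (feature type, attribute key) -> qualifier name, mirroring the template:
#   QualifierAttributeKey.gene:Name=geneName
#   QualifierAttributeKey.mRNA:Name=transcriptName
QUALIFIER_KEYS = {
    ("gene", "Name"): "geneName",
    ("mRNA", "Name"): "transcriptName",
}

def parse_attributes(column9):
    """Split the GFF3 attributes column into a dict, decoding %XX escapes."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = unquote(value.strip())
    return attrs

def extract_qualifiers(gff3_line):
    """Return the qualifiers captured from one GFF3 line (empty if none apply)."""
    fields = gff3_line.rstrip("\n").split("\t")
    if gff3_line.startswith("#") or len(fields) != 9:
        return {}
    feature_type = fields[2]
    attrs = parse_attributes(fields[8])
    return {
        qualifier: attrs[attr_key]
        for (ftype, attr_key), qualifier in QUALIFIER_KEYS.items()
        if ftype == feature_type and attr_key in attrs
    }

# Hypothetical records, not taken from the cucumber annotation files.
GENE = "chr1\tEVM\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=GeneA"
MRNA = "chr1\tEVM\tmRNA\t100\t900\t.\t+\t.\tID=gene1.1;Parent=gene1;Name=GeneA.1"

print(extract_qualifiers(GENE))
print(extract_qualifiers(MRNA))
```

Keying the lookup on both feature type and attribute name keeps the gene-level and transcript-level names apart even though both come from an attribute called Name.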

Before building the INI files, let's use a simple trick to save some typing. PersephoneShell has a cd command that changes the current directory. Let's switch to the directory that holds build-annot.ini, the file with the instructions for building the set of INI files used by the add annotation command.

cd $DATA/cucumber

Commands that reference control files in this directory or its subdirectories can now use shorter, relative paths:

PS> build ini -c build-annot.ini
Output INI files will be placed in /data/Data/cucumber/annot-ini
Built file /data/Data/cucumber/annot-ini/9110gt.ini
Built file /data/Data/cucumber/annot-ini/9930_v3.ini
...

First, run the command against all the newly generated pangenome control files in test mode (the -t flag):

add annotation -c annot-ini/*.ini -t

At the end of the run, a summary of the test is printed:

37 INI files were successful
/data/Data/cucumber/annot-ini/9110gt.ini
/data/Data/cucumber/annot-ini/9930_v3.ini
...

Now, we can safely execute the loading command:

add annotation -c annot-ini/*.ini -v

Step 4. Creating orthologs