Create Orthologs

Let's agree that we are loosely using the term ortholog: in our jargon it may also cover homologs and paralogs, as the program does not know in advance which protein data sets are being analyzed - they may come from different species or from different versions of annotation of the same genome. In our terms, by creating "orthologs" we mean searching for the pairs of best matching proteins in the two map sets. If a protein B is the best match for a given protein A and, reciprocally, protein A is the best match for protein B, such a pair will be considered orthologs.

There are two ways of running the process of finding orthologs:

Use the command line only

The simplest way to find orthologs between two map sets is using the command line:

create orthologs 111 222 -v

where 111 and 222 are MapSetIds of both map sets, used here as an example. They can be obtained from the command 'list mapset', which will show MapSetId and the map set name.

Note

We recommend using diamond as a faster alternative to blastp, which shows a similar sensitivity. The switch between the programs is done in the configuration file. To start using diamond, set OrthologFinder="DIAMOND". See the example here. The custom parameters for diamond can be provided via the PersephoneShell's variable DiamondParams.

The steps in finding the orthologs automatically performed by PersephoneShell are:

create a query file with all proteins from the first map set. They can be extracted from the BLAST index files or from the database. The program will try to see if there are multiple candidate tracks with genes, which can be translated into proteins, and if there is no ambiguity, the process will commence without asking additional questions. If there is a choice of the tracks, they will be listed, and the program will wait until the user makes the proper track selection.
run the OrthologFinder (BLASTP or DIAMOND) vs. the second map set and store the result into a temporary *.blastp file (or *.diamond).
analyze the OrthologFinder output, find the best matches between the orthologs and store the result into a *.ort file.
load the ortholog gene pairs into the database

A typical output of the process that runs DIAMOND would look like this:

PS> create ortholog 256 251 -v
Map set pair: 256-251
- Found map set '/Vaccinium darrowii/USDA_Vadar_1.0_pri-1'
- Found map set '/Punica granatum/Punica granatum ASM765513v2'
Looking for pre-existing orthologs...
Extracting protein sequences from compressed BLAST file /data/BlastDB/256_Genbank1_P.psq...
/data/NCBIBlast/blastdbcmd -db "/data/BlastDB/256_Genbank1_P" -entry all -out "/data/Temp/_blast/256_Genbank1_P"...
Exit code:0
- Protein sequences for map set /Vaccinium darrowii/USDA_Vadar_1.0_pri-1, track Genbank gene models-1 dumped to /data/Temp/_blast/256_Genbank1_P
Running DIAMOND...
/data/NCBIBlast/diamond blastp -q "/data/Temp/_blast/256_Genbank1_P" -d "/data/BlastDB/251_Genbank_P" --outfmt 6 --out "/data/Temp/_blast/256_Genbank1_P_251_Genbank_P.diamond" --mid-sensitive -e 1e-5 --max-target-seqs 5 --max-hsps 2...
diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 20
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory: /data/Temp/_blast

... (diamond output here)

- DIAMOND output saved at /data/Temp/_blast/256_Genbank1_P_251_Genbank_P.diamond
Analyzing DIAMOND results...
- Collected: Queries 30692, Subjects 29595
Writing orthologs to /data/Temp/Orthologs_256_251.ort...
- Created ORT file /data/Temp/Orthologs_256_251.ort
- Process Run (1381) has been successfully created.
4,070 lines processed
8,166 lines processed
12,246 lines processed
16,323 lines processed

DATA_VERSION updated

As always, first, test the command using the switch -t. PersephoneShell will check if BLAST (or DIAMOND) binaries are installed, if a pre-existing set of orthologs for this pair of map sets is already present in the database. At this stage, the FASTA file with the query proteins will be generated and, if needed, the BLAST (or DIAMOND) index file will also be prepared.

To calculate the orthologs for a set of genomes, we introduced a complex parameter that allows specifying lists or ranges or MapSetIds. For example, for newly loaded genomes with MapSetId between 250 and 255, you can pass them to the command like this:

create ortholog 250..255

This will create orthologs between all possible pairs of genomes in the given range, such as 250-250, 250-251, etc. Please note that this also includes calculating paralogs when the matches are found between the genes within the same map set.

The command also accepts two sets of IDs, e.g.:

create ortholog 1,3,5..10 100,120

This will analyze pairs between IDs in the the first group and the second: 1-100, 1-120, 3-100, 3-120,...

Use a command with INI file

Note

We recommend using diamond as a faster alternative to blastp that shows a similar sensitivity. The switch between the programs is done in the configuration file. To start using diamond, set OrthologFinder="DIAMOND". See the example here. The custom parameters for diamond are provided via the PersephoneShell's variable DiamondParams.

An alternative version of the command create orthologs accepts parameters specified in the INI file. It can be used if the analyzed map sets contain more than one annotation track. In such cases, the command that does not use the control file will ask to select the correct track. To avoid extra interaction, specify the track names in the INI file. Additionally, the control file can contain the aligner program and its parameters.

Another reason of using this version of the command is an option to see the intermetiate files before loading the orthologs into the database. You can set the flag LoadToDb to false and study the *.ort file.

Here are the steps performed by the command 'create orthologs':

Export all protein sequences from a map set 1 (a query) from a specific track. This will be done by first looking for a BLAST data file used for BLASTP searches that Persephone client runs in real time. If such file exists, the program will use the BLAST binary tools to extract the protein sequences from the compressed data files. If, for some reason, the BLASTP index is missing, PersephoneShell will generate the proteins using the exon coordinates and the underlying genomic sequence.
Accordingly, the corresponding BLAST or DIAMOND index with proteins from the second map set will be generated if needed. It will be used as the subject.
A BLASTP (or DIAMOND) search will be performed, which compares the query proteins to the proteins from the other genome (a subject, a target). The default set of BLASTP parameters (which can be replaced in the file psh.exe.config or in the INI file) is:

-evalue 1e-5 -max_target_seqs 5 -max_hsps 1 -word_size 4 -threshold 99

The default set of DIAMOND parameters is:

-e 1e-5 --max-target-seqs 5 --max-hsps 2

A routine finding the best reciprocal match between the proteins is engaged after BLASTP or DIAMOND has written the result to a temporary file.
The result of this analysis is stored into a file with extension .ort, which should be specified in the corresponding INI file.
At this stage the process can stop. In such case, after reviewing the .ort file, you can load this information into the database using the command 'add ortholog'. Alternatively, if LoadToDb in the INI file is set to true, the result of the establishing the orthologous links will be loaded into the database as a part of the command 'create orthologs'.

Here is the sample INI file for creating orthologs (please read the comments):

[ProcessRun]
; Run description: if specified, a custom description will be used
; otherwise, "Created orthologs between {MapSet1 accession} and {MapSet2 accession}" will be used.
;RunDescription="Create orthologs between A.thaliana and rice indica"

[Orthologs]
; Orthologs will be calculated by running reciprocal BLASTP or DIAMOND
; Two sets of proteins are provided by specifying their MapSetPath and a track name from which proteins can be generated
MapSetPath1="/Arabidopsis thaliana/TAIR10"
TrackName1="Ensembl"
MapSetPath2="/Oryza sativa indica/Indica ASM465v1"
TrackName2="BGI gene models"
; Temporary output file; one line per ortholog link
; File where the orthologs will be recorded. Please use .ort extension, psh will warn if it is not
OrtOutputFile=$DATA/ara-indi.ort
; OrtOutputFormat: Output format of the ort file with ortholog pairs. Should be one of
; AnnotId, MapsetName, MapsetAccession, OrthologUpdater
; AnnotId - 5 columns: First 2 columns are ANNOT_IDs of the proteins, then score, evalue, percent identity
; MapsetName - 4 columns: MapSetName1, TranscriptName1, MapSetName2, TranscriptName2. Optionally 3 more columns: score, evalue, percent identity
; MapsetAccession - 4 columns: MapSetAccession1, TranscriptName1, MapSetAccession2, TranscriptName2. Optionally 3 more columns: score, evalue, percent identity
; OrthologUpdater - special format accepted by OrthologUpdater tool for binary compression
OrtOutputFormat=AnnotId

; GeneNameQualifier - use this qualifier to uniquely find gene by its name (for the format MapsetName, MapsetAccession).
; If not provided, a record from ORGANISM_CONFIG will be used (see the command 'list annotation_search')

;GeneNameQualifier1=protein_id
;GeneNameQualifier2=transcript_id

; OrthologFinder setting in psh config defines which program, BLASTP or DIAMOND, to use for aligning the protein sequences.
; Use the set of parameters BlastParams or DiamondParams that correspond to the active program
BlastParams="-evalue 1e-5 -max_target_seqs 5 -max_hsps 2 -word_size 4 -threshold 99"
DiamondParams="-e 1e-5 --max_target_seqs 5 --max_hsps 2"

;NumberThreads=4
; LoadToDb: If set to true, the results of 'create ortholog' will be saved directly into the database. The intermediate .ort file will still be produced.
; If you want to delete the orthologs later, you can do it by the command 'delete ortholog'
;LoadToDb=true

Please pay attention to the last instruction with the flag LoadToDb. It controls if the results of calculations go straight to the database or will stay in a temporary .ort file. If you need to review the orthologs before loading them to the database, set LoadToDb to false and later, if everything is OK, use the command 'add ortholog' feeding the .ort file as the data source. Alternatively, if you are satisfied with the ortholog pairs, set the flag LoadToDb to true and repeat the process. The ortholog pairs will be loaded to the database at the end of the command execution.

If you need to delete the orthologs later, run the command delete ortholog and select the corresponding RecordId of the ortholog data set or provide the RecordId (111) as a parameter:

PS> delete ortholog 111