Let's agree that we are loosely using the term ortholog:  in our jargon it may also cover homologs and paralogs, as the program does not know in advance which protein data sets are being analyzed - they may come from different species or from different versions of annotation of the same genome. In our terms, by creating "orthologs" we mean searching for the pairs of best matching proteins in the two map sets. If a protein B is the best match for a given protein A and, reciprocally, protein A is the best match for protein B, such a pair will be considered orthologs.

There are two ways of running the process of finding orthologs:

Use the command line only

The simplest way to find orthologs between two map sets is using the command line:

create orthologs 111 222 -v

where 111 and 222 are MapSetIds of both map sets, used here as an example. They can be obtained from the command 'list mapset' that will show MapSetId and the map set name.

Important

In this scenario, the proteins are extracted from the database (or from the BLAST index file) under their internal IDs (ANNOT_ID). If you plan to reuse the calculated orthologous pairs in other databases, it would be better to go the second route described below - using the INI file. This way, you can save the pairs using the gene names that, unlike the internal database IDs, are usually the same in different databases. This will save you from running the lengthy BLASTP jobs for every database. To create the *.ort file with all ortholog pairs, run 'create orthologs' command saving the result into a specified file. You can load this file into the other databases using the command 'add orthologs'.


Note

We recommend using diamond as a faster alternative to blastp that shows a similar sensitivity. The switch between the programs is done in the configuration file. To start using diamond, set OrthologFinder="DIAMOND". See the example here. The custom parameters for diamond can be provided via the PersephoneShell's variable DiamondParams.

The steps in finding the orthologs automatically performed by PersephoneShell are:

  • create a query with all proteins from the first map set. They can be extracted from the BLAST index files or from the database. The program will try to see if there are multiple candidate tracks with genes that can be translated into proteins, and if there is no ambiguity, the process will commence without asking additional questions. If there is a choice of the tracks, they will be listed, and the program will wait when the user makes the proper track selection.
  • run the OrthologFinder (BLASTP or DIAMOND) vs. the second map set and store the result into a temporary *.blastp file (or *.diamond).
  • analyze the OrthologFinder output, find the best matches between the orthologs and store the result into a *.ort file.
  • load the ortholog gene pairs into the database

A typical output would look like this:


PS> create ortholog 1 2 -v
    - Found map set '/Arabidopsis thaliana/TAIR10'
    - Found map set '/Oryza sativa/IRGSP-1.0.31'
    Looking for pre-existing orthologs...
    Extracting protein sequences from compressed BLAST file /home/ec2-user/blastdata/microtest/1_MRNA_P.psq...
        /home/ec2-user/bin/blast/blastdbcmd -db "/home/ec2-user/blastdata/microtest/1_MRNA_P" -entry all -out "/home/ec2-user/tmp/_blast/1_MRNA_P"...
    Exit code:0
    - Protein sequences for map set /Arabidopsis thaliana/TAIR10, track Ensembl dumped to /home/ec2-user/tmp/_blast/1_MRNA_P
    Running blastp...
        /home/ec2-user/bin/blast/blastp -query "/home/ec2-user/tmp/_blast/1_MRNA_P" -db "/home/ec2-user/blastdata/microtest/2_IRGSP_P" -outfmt 6 -out "/home/ec2-user/tmp/_blast/1_MRNA_P_2_IRGSP_P.blastp" -evalue 1e-5 -max_target_seqs 5 -max_hsps 1 -word_size 4 -threshold 100 ...
    Exit code:0
        - BLAST output saved at /home/ec2-user/tmp/_blast/1_MRNA_P_2_IRGSP_P.blastp
    Analyzing BLAST results...
    - Collected: Queries 40189, Subjects 23387
    Writing orthologs to /home/ec2-user/tmp/Orthologs_1_2.ort...
    - Created ORT file /home/ec2-user/tmp/Orthologs_1_2.ort
    - Process Run (4) has been successfully created.
2,728 lines processed
5,455 lines processed
8,182 lines processed
10,909 lines processed

    DATA_VERSION updated

As always, first, test the command using the switch -t. PersephoneShell will check if BLAST (or DIAMOND) binaries are installed, if a pre-existing set of orthologs for this pair of map sets is already present in the database. At this stage, the FASTA file with the query proteins will be generated and, if needed, the BLAST (or DIAMOND) index file will also be prepared. 

Use a command with INI file

Note

We recommend using diamond as a faster alternative to blastp that shows a similar sensitivity. The switch between the programs is done in the configuration file. To start using diamond, set OrthologFinder="DIAMOND". See the example here. The custom parameters for diamond are provided via the PersephoneShell's variable DiamondParams.

Here are the steps performed by the command 'create orthologs':

  • Export all protein sequences from a map set 1 (a query) form a specific track. This will be done by first looking for a BLAST data file used for BLASTP searches that Persephone client runs in real time. If such file exists, the program will use the BLAST binary tools to extract the protein sequences from the compressed data files. If, for some reason, the BLASTP index is missing, PersephoneShell will generate the proteins using the exon coordinates and the underlying genomic sequence.
  • The corresponding BLAST or DIAMOND index with proteins from the second map set will be generated if needed. It will be used as the subject.
  • A BLASTP (or DIAMOND) search will be performed, that compares the query proteins to the proteins from the other genome (a subject, a target). The default set of BLASTP parameters (that can be replaced in the file psh.exe.config) is:

-evalue 1e-5 -max_target_seqs 5 -max_hsps 1 -word_size 4 -threshold 99

  • The default set of DIAMOND parameters is:

-e 1e-5 --max-target-seqs 5 --max-hsps 2

  • A routine, finding the best reciprocal match between the proteins is engaged after BLASTP or DIAMOND has written the result to a temporary file.
  • The result of this analysis is stored into a file with extension .ort that should be specified in the corresponding INI file. 
  • At this stage the process can stop. In such case, after reviewing the .ort file, you can load this information into the database using the command 'add ortholog'. Alternatively, if LoadToDb in the INI file is set to true, the result of the establishing the orthologous links will be loaded into the database as a part of the command 'create orthologs'.

Here is the sample INI file for "creating orthologs" (please read the comments):


[ProcessRun]
; Run description: if specified, a custom description will be used
;                  otherwise, "Created orthologs between {MapSet1 accession} and {MapSet2 accession}" will be used.
;RunDescription="Create orthologs between A.thaliana and rice indica"

[Orthologs]
; Orthologs will be calculated by running reciprocal BLASTP
; Two sets of proteins are provided by specifying their MapSetPath and a track name from which proteins can be generated
MapSetPath1="/Arabidopsis thaliana/TAIR10"
TrackName1="Ensembl" 
MapSetPath2="/Oryza sativa indica/Indica ASM465v1"
TrackName2="BGI gene models"
; Temporary output file; one line per ortholog link
; File where the orthologs will be recorded. Please use .ort extension, psh will warn if it is not
OrtOutputFile=$DATA/ara-indi.ort
; OrtOutputFormat: Output format of the ort file with ortholog pairs. Should be one of
; AnnotIds, MapsetName, MapsetAccession, OrthologUpdater
; AnnotIds - 5 columns: First 2 columns are ANNOT_IDs of the proteins, then score, evalue, percent identity
; MapsetName - 4 columns: MapSetName1, TranscriptName1, MapSetName2, TranscriptName2. Optionally 3 more columns: score, evalue, percent identity
; MapsetAccession - 4 columns: MapSetAccession1, TranscriptName1, MapSetAccession2, TranscriptName2. Optionally 3 more columns: score, evalue, percent identity
; OrthologUpdater - special format accepted by OrthologUpdater tool for binary compression
OrtOutputFormat=OrthologUpdater
; GeneNameQualifier - use this qualifier to uniquely find gene by its name.
; If not provided, a record from ORGANISM_CONFIG will be used
GeneNameQualifier1=protein_id
GeneNameQualifier2=transcript_id

; OrthologFinder setting in psh config defines which program, BLASTP or DIAMOND, to use for aligning the protein sequences.
; Use the set of parameters BlastParams or DiamondParams that correspond to the active program
BlastParams="-evalue 1e-5 -max_target_seqs 5 -max_hsps 2 -word_size 4 -threshold 99"
DiamondParams="-e 1e-5 --max_target_seqs 5 --max_hsps 2"

;NumberThreads=4
; LoadToDb: If set to true, the results of 'create ortholog' will be saved directly into the database. The intermediate .ort file will still be produced.
; If you want to delete the orthologs later, find the corresponding RUN_ID ('list run') and issue 'delete run'
;LoadToDb=true

If you already have BLASTP output files comparing two sets of proteins from two map sets, you can use our tool Blast2Orthologs. It will basically derive the "ortholog" pairs and save them into an .ort file. A couple of requirements for the format of the BLASTP output:

- BLASTP should use its tabular output, specified by -outfmt 6 option.

- the first two columns should use ANNOT_IDs (internal unique IDs of gene models) or gene names.