Orthologs are added with the command:

add orthologs -c <path to control file>


The links between genes stored in the database can be derived outside of Persephone by any method. The database only records that a pair of proteins is related, most commonly because they are highly similar. If protein B is the best match for protein A and, reciprocally, protein A is the best match for protein B, the pair is considered orthologs. The database can also store other matches with lower similarity scores, but only the best-matching proteins are called orthologs and shown linked by a connector line. Ortholog pairs are also used to generate the Synteny matrix.
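
The reciprocal-best-match rule described above can be sketched in Python. This is only an illustration of the idea, not Persephone's own code: it assumes the hits are already available as (query, subject, score) tuples, for example parsed from BLASTP tabular output, with one list per search direction. The function names best_hits, reciprocal_best_hits and to_annot_id_lines are hypothetical.

```python
def best_hits(hits):
    """Map each query ID to its highest-scoring (subject, score) hit.

    On a tie, the hit that appears first in the input is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return (a, b, score) for pairs where b is a's best hit and a is b's best hit.

    forward_hits: hits of proteome A searched against proteome B
    reverse_hits: hits of proteome B searched against proteome A
    The score reported is the one from the forward search.
    """
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    pairs = []
    for a, (b, score) in forward.items():
        if reverse.get(b, (None, None))[0] == a:
            pairs.append((a, b, score))
    return pairs


def to_annot_id_lines(pairs):
    """Format pairs as TAB-delimited lines: AnnotId1, AnnotId2, Score."""
    return ["\t".join((a, b, str(score))) for a, b, score in pairs]


# Made-up example hits; the IDs only mimic the AnnotId style.
forward = [("123456", "98765", 123.0), ("123456", "98766", 50.0),
           ("234566", "98766", 90.5)]
reverse = [("98765", "123456", 120.0), ("98766", "123456", 95.0),
           ("98766", "234566", 90.5)]

for line in to_annot_id_lines(reciprocal_best_hits(forward, reverse)):
    print(line)
```

In this example only the first pair is reciprocal: protein 98766 is the best match for 234566, but its own best match is 123456, so that pair is not reported.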

To load orthologs, prepare a TAB-delimited file. PersephoneShell recognizes three formats:

  • AnnotId - 5 columns: AnnotId1, AnnotId2, Score, Evalue, PercentIdentity
    The first two columns are required; the other three are optional. Each AnnotId identifies a protein's original gene model. A FASTA file with proteins identified by AnnotId can be produced with the BlastDbExporter tool.

;sample
123456        98765        123        0.0        99.4
234566        98766        90.5        1e-6        96.2
...


  • MapSetAccGeneName - 4 columns: MapSetAccession1, GeneName1, MapSetAccession2, GeneName2
    All 4 columns are required. A pair of MapSetAccession and GeneName should uniquely identify a gene model. GeneName is a value stored in one of the gene's qualifiers. The name of that qualifier is either specified in the INI file for loading orthologs (GeneNameQualifier=) or, if it is not specified there, looked up in the ORGANISM_CONFIG table (see Adding annotation, section [AnnotationSearches]).

;sample
TAIR10        AT1G07110.1        MSU_osa1r7        13105.m00781
TAIR10        AT1G07120.1        MSU_osa1r7        13103.m02191
...

  • MapSetNameGeneName - 4 columns: MapSetName1, GeneName1, MapSetName2, GeneName2
    All 4 columns are required. A pair of MapSetName and GeneName should uniquely identify a gene model. The logic is the same as for MapSetAccGeneName, except that the map set name is used instead of the map set accession.
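
Before loading, it can help to check the column counts of the input file against the chosen format. The Python sketch below is one possible pre-check, not part of PersephoneShell. The column rules come from the format descriptions above; the function name check_ortholog_lines is hypothetical, and skipping blank lines and rejecting extra columns are assumptions about how a loader would behave.

```python
# (required columns, maximum columns) per input format, as described above
COLUMN_RULES = {
    "AnnotId": (2, 5),
    "MapSetAccGeneName": (4, 4),
    "MapSetNameGeneName": (4, 4),
}


def check_ortholog_lines(lines, input_format):
    """Return a list of (line_number, message) problems in TAB-delimited lines."""
    required, maximum = COLUMN_RULES[input_format]
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if not required <= len(fields) <= maximum:
            problems.append(
                (number, f"expected {required}-{maximum} columns, found {len(fields)}")
            )
        elif any(not field.strip() for field in fields[:required]):
            problems.append((number, "empty value in a required column"))
    return problems


sample = [
    "TAIR10\tAT1G07110.1\tMSU_osa1r7\t13105.m00781\n",
    "TAIR10\tAT1G07120.1\tMSU_osa1r7\n",  # GeneName2 is missing
]
for number, message in check_ortholog_lines(sample, "MapSetAccGeneName"):
    print(f"line {number}: {message}")
```

For a real file, pass the open file object as the first argument, since iterating over it yields the lines one by one.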

The INI control file is very simple:

[ProcessRun]
RunDescription="Load orthologs between rice japonica and rice indica"

[Orthologs]
; TAB-delimited file; one line per ortholog link
Source="r:\orthologs_jap_ind.txt"

Analysis="BLASTP"

; Input format (required) can be of types: AnnotId, MapSetAccGeneName or MapSetNameGeneName
; AnnotId - the first two columns are required, 3 more are optional. Each AnnotId identifies a protein's original gene model
;   AnnotId1, AnnotId2, Score, Evalue, PercentIdentity
; MapSetAccGeneName - 4 columns are required:
;   MapSetAccession1, GeneName1, MapSetAccession2, GeneName2
; MapSetNameGeneName - 4 columns are required:
;   MapSetName1, GeneName1, MapSetName2, GeneName2
; If GeneNameQualifier is not specified (see below), make sure that ORGANISM_CONFIG contains an entry that specifies which qualifier stores the gene name,
; so that a gene name can be converted unambiguously to ANNOT_ID
InputFormat=MapSetAccGeneName
; GeneNameQualifier - use this qualifier to uniquely find a gene by its name. If the qualifier is not provided, ORGANISM_CONFIG entries will be used
GeneNameQualifier=locus_tag
; Instead of GeneNameQualifier, which specifies a common qualifier storing gene names, GeneNameQualifier1 and GeneNameQualifier2 can be used
; for the genes in the left and right columns, respectively.
;GeneNameQualifier1=transcriptName
;GeneNameQualifier2=Name
; TrackName1 and TrackName2 can be used in addition to GeneNameQualifier if the same gene name occurs in more than one track.
; For example, you can have two tracks "CDS" and "MRNA" that show genes with the same naming nomenclature, so there could be more than one
; gene model corresponding to a given name. Use TrackName to disambiguate.
;TrackName=CDS
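
When many ortholog files have to be loaded, the control file can be generated by a script. The Python sketch below writes only keys that appear in the sample above and copies its quoting. The function name render_ortholog_control_file is hypothetical, and whether PersephoneShell needs the quotes or accepts other layouts is not confirmed here.

```python
INPUT_FORMATS = ("AnnotId", "MapSetAccGeneName", "MapSetNameGeneName")


def render_ortholog_control_file(run_description, source, analysis,
                                 input_format, gene_name_qualifier=None):
    """Return the text of an INI control file for loading orthologs."""
    if input_format not in INPUT_FORMATS:
        raise ValueError(f"unknown InputFormat: {input_format}")
    lines = [
        "[ProcessRun]",
        f'RunDescription="{run_description}"',
        "",
        "[Orthologs]",
        f'Source="{source}"',
        f'Analysis="{analysis}"',
        f"InputFormat={input_format}",
    ]
    if gene_name_qualifier:
        # Applies to the gene-name based formats; ORGANISM_CONFIG is used otherwise
        lines.append(f"GeneNameQualifier={gene_name_qualifier}")
    return "\n".join(lines) + "\n"


text = render_ortholog_control_file(
    run_description="Load orthologs between rice japonica and rice indica",
    source=r"r:\orthologs_jap_ind.txt",
    analysis="BLASTP",
    input_format="MapSetAccGeneName",
    gene_name_qualifier="locus_tag",
)
print(text)
```

The resulting text can be saved to a file and passed to the add orthologs command through the -c option.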