Orthologs are added with the command:
add orthologs -c ...
The links between genes stored in the database can be derived by various methods outside of Persephone. The database simply records that a pair of proteins is related, most commonly because they are highly similar. If protein B is the best match for protein A and, reciprocally, protein A is the best match for protein B, the pair is considered orthologs. The database can also store other matches with lower similarity scores, but only the best-matching proteins are called orthologs and shown linked by a connector line. Ortholog pairs are also used to generate the Synteny matrix.
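The reciprocal-best-match rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Persephone's implementation: the function names and the (query, subject, score) hit tuples are hypothetical, a higher score is assumed to mean a better match, and ties are resolved by keeping the first hit seen.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples
    (a hypothetical layout chosen for this sketch).
    """
    best = {}
    for query, subject, score in hits:
        # Keep the first hit seen on ties (strict comparison).
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best match."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    a_vs_b = [("A1", "B1", 200), ("A1", "B2", 150), ("A2", "B2", 90)]
    b_vs_a = [("B1", "A1", 198), ("B2", "A1", 140), ("B2", "A2", 80)]
    for a, b in reciprocal_best_hits(a_vs_b, b_vs_a):
        print(f"{a}\t{b}")
```

In this toy input, A2's best match is B2, but B2's best match is A1, so only the pair A1 and B1 qualifies. The pairs printed this way correspond to the first two columns of the AnnotId format, provided the identifiers are AnnotIds.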
To load orthologs, prepare a TAB-delimited file. PersephoneShell recognizes the following formats:
- AnnotId - 5 columns: AnnotId1, AnnotId2, Score, Evalue, PercentIdentity
The first two columns are required; the other three are optional. Each AnnotId identifies the gene model from which the protein originates. A FASTA file with proteins identified by AnnotId can be produced by the BlastDbExporter tool.
123456 98765 123 0.0 99.4
234566 98766 90.5 1e-6 96.2
- MapSetAccGeneName - 4 columns: MapSetAccession1, GeneName1, MapSetAccession2, GeneName2
All 4 columns are required. Together, MapSetAccession and GeneName should uniquely identify a gene model. GeneName is a value stored as one of the gene's qualifiers. The name of that qualifier can be specified (GeneNameQualifier=) in the INI file for loading orthologs; if it is not provided, a record in the ORGANISM_CONFIG table is used to find the qualifier that stores the gene name (see Adding annotation, section [AnnotationSearches]).
TAIR10 AT1G07110.1 MSU_osa1r7 13105.m00781
TAIR10 AT1G07120.1 MSU_osa1r7 13103.m02191
- MapSetNameGeneName - 4 columns: MapSetName1, GeneName1, MapSetName2, GeneName2
All 4 columns are required. Together, MapSetName and GeneName should uniquely identify a gene model. The logic is the same as for the previous format, except that the map set name is used instead of the map set accession.
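Before loading, the column rules of the three formats listed above can be checked with a short script. The following is a minimal sketch in Python, not part of PersephoneShell, which may validate its input differently. The helper name check_line and the COLUMN_RANGE table are hypothetical, and the sketch assumes that the optional Score, Evalue and PercentIdentity columns hold numeric values.

```python
# Allowed column counts (min, max) per input format, as described above.
COLUMN_RANGE = {
    "AnnotId": (2, 5),
    "MapSetAccGeneName": (4, 4),
    "MapSetNameGeneName": (4, 4),
}


def check_line(line, fmt):
    """Return None if `line` fits format `fmt`, else a short error message."""
    fields = line.rstrip("\r\n").split("\t")
    lo, hi = COLUMN_RANGE[fmt]
    if not lo <= len(fields) <= hi:
        expected = str(lo) if lo == hi else f"{lo}-{hi}"
        return f"expected {expected} columns, got {len(fields)}"
    # Required columns must not be blank.
    if any(not field.strip() for field in fields[:lo]):
        return "empty required column"
    if fmt == "AnnotId":
        # Assumption: the optional columns are numeric.
        for value in fields[2:]:
            try:
                float(value)
            except ValueError:
                return f"non-numeric value: {value!r}"
    return None


if __name__ == "__main__":
    print(check_line("123456\t98765\t123\t0.0\t99.4", "AnnotId"))
    print(check_line("TAIR10\tAT1G07110.1\tMSU_osa1r7", "MapSetAccGeneName"))
```

The first call accepts a full five-column AnnotId line; the second reports a MapSetAccGeneName line that is missing its fourth column.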
The INI control file is very simple:
RunDescription="Load orthologs between rice japonica and rice indica"
; TAB-delimited file; one line per ortholog link
; Input format (required) can be of types: AnnotId, MapSetAccGeneName or MapSetNameGeneName
; AnnotId - the first two columns are required; 3 more are optional. Each AnnotId identifies the gene model from which the protein originates
; AnnotId1, AnnotId2, Score, Evalue, PercentIdentity
; MapSetAccGeneName - 4 columns are required:
; MapSetAccession1, GeneName1, MapSetAccession2, GeneName2
; MapSetNameGeneName - 4 columns are required:
; MapSetName1, GeneName1, MapSetName2, GeneName2
; If GeneNameQualifier is not specified (see below), make sure that ORGANISM_CONFIG contains an entry that specifies which qualifier stores the gene name,
; so that a gene name can be unambiguously converted to ANNOT_ID
; GeneNameQualifier - use this qualifier to uniquely find a gene by its name. If the qualifier is not provided, ORGANISM_CONFIG entries will be used
; Instead of GeneNameQualifier, which specifies a common qualifier storing gene names, GeneNameQualifier1 and GeneNameQualifier2 can be used
; for the genes in the left and right columns, respectively.
; TrackName1 and TrackName2 can be used in addition to GeneNameQualifier if the same gene name occurs in more than one track.
; For example, you can have two tracks "CDS" and "MRNA" whose genes follow the same naming scheme. So, there could be more than one
; gene model corresponding to a given name. Use TrackName to disambiguate