Adding new Annotation Qualifiers
Disambiguation
The command 'add annotation_qualifier' has two versions. One version adds qualifiers provided in a data file; the other version (described on this page) creates new qualifiers based on qualifiers already loaded into the database.
Sometimes it is convenient to create additional qualifiers by processing the values of qualifiers already loaded into the database. For example, to create a URL to a gene page at NCBI (e.g., https://www.ncbi.nlm.nih.gov/gene/?term=192670) we need a numeric ID such as 192670. The existing annotation may carry this information embedded in another qualifier (e.g., db_xref) in the form of "GeneID:192670". In that case, a regular expression can extract the numeric part of the ID and store it in a newly created qualifier, which can then be used to generate the URL.
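The idea can be sketched outside PersephoneShell in a few lines of Python. This is only an illustration: the function names extract_gene_id and gene_url are hypothetical, and Python's re module is assumed to behave like the shell's regular expression engine for this simple pattern.

```python
import re

# Pattern discussed on this page; Python's `re` is used purely for illustration
# and may differ from the regex engine used by PersephoneShell.
GENE_ID_PATTERN = re.compile(r"GeneID:(\d+)")
NCBI_GENE_URL = "https://www.ncbi.nlm.nih.gov/gene/?term={}"


def extract_gene_id(dbxref):
    """Return the numeric part of the GeneID entry, or None if it is absent."""
    match = GENE_ID_PATTERN.search(dbxref)
    return match.group(1) if match else None


def gene_url(dbxref):
    """Build the NCBI gene URL from a db_xref-style value (hypothetical helper)."""
    gene_id = extract_gene_id(dbxref)
    return NCBI_GENE_URL.format(gene_id) if gene_id is not None else None


print(gene_url("GeneID:192670"))
```

Values that contain no GeneID entry yield None here, so no URL is built for them.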
The syntax of the command that performs this task is:
add annotation_qualifier {mapSetId | path}
This command runs interactively from the PersephoneShell command prompt. Specify the map set for which the new qualifier will be created by its MapSetId or its path. For example, if MapSetId=10, the command is:
PS> add annotation_qualifier 10
Adding a new annotation qualifier for map set '/Homo sapiens/GRCh38'
Existing qualifiers:
Track 'Ensembl':
[0] Name ( 95,224 records) : A1BG
[1] description ( 94,434 records) : 1,4-alpha-glucan branching enzyme 1 [Source:HGNC Symbol;Acc:HGNC:4180]
[2] gene_id ( 95,224 records) : ENSG00000000003
Track 'GENSCAN':
[3] Name ( 46,548 records) : GENSCAN00000000001
Track 'Gnomon gene models':
[4] Dbxref (113,756 records) : GeneID:1,Genbank:NM_130786.3,HGNC:HGNC:5,MIM:138670
[5] Name (113,756 records) : NM_000014.5
[6] gene (113,756 records) : A1BG
[7] product (113,756 records) : 1,4-alpha-glucan branching enzyme 1
[8] transcript_id (113,756 records) : NM_000014.5
Which qualifier will be used as the source for the new qualifier? Type the line [number]:4
The goal is to extract the numeric part of the GeneID entry (for example, '1' from GeneID:1) stored in the qualifier Dbxref, so enter 4 to select the qualifier listed as [4].
Qualifier: Dbxref
Example values:
GeneID:389199,Genbank:NM_203423.2
GeneID:392490,Genbank:NM_207422.2
GeneID:55025,Genbank:XM_017012879.1
GeneID:149373,Genbank:NM_001256615.1
GeneID:729800,Genbank:XM_024448280.1
GeneID:403312,Genbank:NM_001301851.1
GeneID:645177,Genbank:NM_001321724.2
GeneID:647264,Genbank:XM_011535339.1
GeneID:645202,Genbank:NM_001365372.1
GeneID:645202,Genbank:XM_024450126.1
GeneID:653067,Genbank:XM_017029747.1,HGNC:HGNC:25400,MIM:300289,MIM:300744,MIM:300745
GeneID:653067,Genbank:XM_017029746.1,HGNC:HGNC:25400,MIM:300289,MIM:300744,MIM:300745
GeneID:653067,Genbank:NM_001097604.2,HGNC:HGNC:25400,MIM:300289,MIM:300744,MIM:300745
GeneID:653067,Genbank:NM_001097605.2,HGNC:HGNC:25400,MIM:300289,MIM:300744,MIM:300745
GeneID:100507436,Genbank:NM_001289152.2,HGNC:HGNC:7090,IMGT/GENE-DB:MICA,MIM:600169
GeneID:100507436,Genbank:NM_001289153.2,HGNC:HGNC:7090,IMGT/GENE-DB:MICA,MIM:600169
GeneID:100507436,Genbank:NM_001289154.2,HGNC:HGNC:7090,IMGT/GENE-DB:MICA,MIM:600169
GeneID:100507436,Genbank:NM_001177519.3,HGNC:HGNC:7090,IMGT/GENE-DB:MICA,MIM:600169
GeneID:7441,Genbank:NM_001303509.1,HGNC:HGNC:12709,IMGT/GENE-DB:VPREB1,MIM:605141
GeneID:29802,Genbank:NM_013378.3,HGNC:HGNC:12710,IMGT/GENE-DB:VPREB3,MIM:605017
Regular expression to extract values from Dbxref:
A simple regular expression that extracts the numeric part of GeneID:1 is GeneID:(\d+). The parentheses mark the part of the match (here, the digits) that is kept as the extracted value.
Regular expression to extract values from Dbxref: GeneID:(\d+)
389199
392490
55025
149373
729800
403312
645177
647264
645202
645202
653067
653067
653067
653067
100507436
100507436
100507436
100507436
7441
29802
Tested 113,756 qualifier values
Extracted 113,756 values, (19,908 distinct)
Print all 113,756 extracted values on screen? (Y/N)
The program displays the values extracted from the example lines, then reports how many qualifier values were tested, how many values were extracted, and how many of the extracted values are distinct. It also offers to print ALL the extracted values for review.
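To see how the tested, extracted, and distinct counts relate to each other, the following Python sketch computes them for three of the example values shown earlier. The helper summarize_extraction is hypothetical and is not part of PersephoneShell; it only mimics the kind of summary shown in the transcript.

```python
import re


def summarize_extraction(values, pattern):
    """Return (tested, extracted, distinct) counts for a capture-group pattern."""
    regex = re.compile(pattern)
    extracted = []
    for value in values:
        match = regex.search(value)
        if match:
            # Keep only the captured group, i.e. the digits after "GeneID:".
            extracted.append(match.group(1))
    return len(values), len(extracted), len(set(extracted))


# Three of the example Dbxref values from this page; two share the same GeneID.
sample = [
    "GeneID:645202,Genbank:NM_001365372.1",
    "GeneID:645202,Genbank:XM_024450126.1",
    "GeneID:7441,Genbank:NM_001303509.1,HGNC:HGNC:12709,IMGT/GENE-DB:VPREB1,MIM:605141",
]

tested, extracted, distinct = summarize_extraction(sample, r"GeneID:(\d+)")
print(f"Tested {tested} qualifier values")
print(f"Extracted {extracted} values ({distinct} distinct)")
```

Several transcripts of the same gene share one GeneID, which is why the distinct count can be much lower than the number of extracted values.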
Do you want to use this regular expression? (Y/N)Y
Name for the new qualifier? GeneID
Fetching source qualifiers...
Inserting new qualifiers...:
113756 qualifiers inserted
DATA_VERSION updated
To verify the results of inserting the new qualifier GeneID, issue the command 'list annotation_qualifier':
PS> list annotation_qualifier 10
Map set:GRCh38
Track 'Ensembl':
[0] Name ( 95,224 records) : A1BG
[1] description ( 94,434 records) : 1,4-alpha-glucan branching enzyme 1 [Source:HGNC Symbol;Acc:HGNC:4180]
[2] gene_id ( 95,224 records) : ENSG00000000003
Track 'GENSCAN':
[3] Name ( 46,548 records) : GENSCAN00000000001
Track 'Gnomon gene models':
[4] Dbxref (113,756 records) : GeneID:1,Genbank:NM_130786.3,HGNC:HGNC:5,MIM:138670
[5] GeneID (113,756 records) : 1
[6] Name (113,756 records) : NM_000014.5
[7] gene (113,756 records) : A1BG
[8] product (113,756 records) : 1,4-alpha-glucan branching enzyme 1
[9] transcript_id (113,756 records) : NM_000014.5
You can now use the new qualifier to construct an external URL with the command 'add qualifier_link'.