Adding SequenceDatabase

SequenceDatabase refers to sequences together with their annotations. Feature Table file formats such as GenBank (*.gb, *.gbs, *.gbk) and EMBL (*.dat) are widely used across public data repositories.

Note

See the DDBJ/ENA/GenBank Feature Table Definition for more information.

A SequenceDatabase of Brassica napus chromosome A1 (shown below) contains features such as source, gene, mRNA, CDS, and ncRNA. The source feature defines organism and chromosome metadata. A valid protein-coding gene may consist of gene, mRNA, and CDS features. RNA genes may be defined as single features such as ncRNA, misc_RNA, etc.

LOCUS       NW_013650291        46747018 bp    DNA     linear   CON 31-AUG-2015
DEFINITION  Brassica napus cultivar ZS11 chromosome A1 genomic scaffold,
            Brassica_napus_assembly_v1.0 A01, whole genome shotgun sequence.
ACCESSION   NW_013650291 GPS_010492900
VERSION     NW_013650291.1  GI:919453751
DBLINK      BioProject: PRJNA293435
            Assembly: GCF_000686985.1
            BioSample: SAMN02742820
KEYWORDS    WGS; RefSeq.
SOURCE      Brassica napus (rape)
  ORGANISM  Brassica napus
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
            Pentapetalae; rosids; malvids; Brassicales; Brassicaceae;
            Brassiceae; Brassica.
REFERENCE   1  (sites)
  AUTHORS   Lowe,T.M. and Eddy,S.R.
  TITLE     tRNAscan-SE: a program for improved detection of transfer RNA genes
            in genomic sequence
  JOURNAL   Nucleic Acids Res. 25 (5), 955-964 (1997)
   PUBMED   9023104
  REMARK    This is the methods paper for tRNAscan-SE.
COMMENT     REFSEQ INFORMATION: The reference sequence is identical to
            CM002759.1.
            Assembly name: Brassica_napus_assembly_v1.0
            The genomic sequence for this RefSeq record is from the
            whole-genome assembly released by the BGI-Shenzhen on 2014/05/15.
            The original whole-genome shotgun project has the accession
            JMKK00000000.1.
FEATURES             Location/Qualifiers
     source          1..46747018
                     /organism="Brassica napus"
                     /mol_type="genomic DNA"
                     /cultivar="ZS11"
                     /db_xref="taxon:3708"
                     /chromosome="A1"
     gene            complement(1834..2645)
                     /gene="LOC106353263"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon."
                     /db_xref="GeneID:106353263"
     mRNA            complement(join(1834..2065,2142..2274,2387..2645))
                     /gene="LOC106353263"
                     /product="protein WVD2-like 1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon. Supporting evidence
                     includes similarity to: 6 Proteins"
                     /transcript_id="XM_013792982.1"
                     /db_xref="GI:923516411"
                     /db_xref="GeneID:106353263"
     CDS             complement(join(1834..2065,2142..2274,2387..2645))
                     /gene="LOC106353263"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon."
                     /codon_start=1
                     /product="protein WVD2-like 1"
                     /protein_id="XP_013648436.1"
                     /db_xref="GI:923516412"
     ncRNA           join(478959..479017,479093..479180,479284..479320)
                     /ncRNA_class="lncRNA"
                     /gene="LOC106369563"
                     /product="uncharacterized LOC106369563"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon. Supporting evidence
                     includes similarity to: 100% coverage of the annotated
                     genomic feature by RNAseq alignments, including 2 samples
                     with support for all annotated introns"
                     /transcript_id="XR_001274166.1"
                     /db_xref="GI:923520022"
                     /db_xref="GeneID:106369563"
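
The location strings in the record above encode strand and exon structure. As a rough illustration of how they can be read, the following Python sketch extracts the strand and the exon spans from simple locations. The function names are hypothetical and are not part of PersephoneShell; the sketch only handles plain `start..end` spans with optional `join(...)` and `complement(...)`, not fuzzy (`<`, `>`) or single-base positions.

```python
import re

def parse_location(location):
    """Return (strand, [(start, end), ...]) for a simple GenBank location."""
    # complement(...) marks a feature on the reverse strand.
    strand = "-" if location.startswith("complement(") else "+"
    # Each start..end pair is one contiguous span (an exon inside join(...)).
    spans = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    return strand, spans

def spliced_length(spans):
    """Total length of all spans; GenBank coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in spans)

strand, exons = parse_location("complement(join(1834..2065,2142..2274,2387..2645))")
print(strand, exons, spliced_length(exons))
```

For the mRNA of LOC106353263 this gives three exons on the reverse strand with a combined length of 624 bp.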

ProcessRun Section

The ProcessRun section contains information about data loading. It is highly recommended to create a new process run each time you load the data. An example ProcessRun section is shown below.

[ProcessRun]
; Run description: if specified, a custom description will be used,
; otherwise, "Added sequences and annotation for {MapSet Accession No.} from {Sources}." will be used.
RunDescription="Added Brassica_napus_assembly_v1.0 genome from http://ftp.ncbi.nlm.nih.gov/genomes/Brassica_napus."

MapSet and MapSetTree Sections

The MapSet and MapSetTree sections are required. Sample MapSet and MapSetTree sections are shown below. Please see Add Map Set, Maps, and Sequences for more details.

[MapSet]
;------------------------------------------------------------------------------------
; 1. Using existing MapSet
;------------------------------------------------------------------------------------
; MapSet ID: if specified, it adds sequences to the existing MapSet.
; otherwise, a new MapSet should be specified below.
;MapSetId=12345
;------------------------------------------------------------------------------------
; 2. Adding new MapSet
;------------------------------------------------------------------------------------
; Organism ID (required): organism ID should exist.
OrganismId=138011
; Display name (required): a name shown in MapSetTree. Usually an assembly build name.
DisplayName="Brassica_napus_assembly_v1.0"
; Description: by default, organism name + display name.
Description="Brassica_napus assembly v1.0"
; AccessionNo (required): accession of the genome build. See http://ncbi.nlm.nih.gov/genome
AccessionNo="Brassica_napus_assembly_v1.0"
; SourceId (required)
SourceId="GenBank"

[MapSetTree]
;------------------------------------------------------------------------------------
; 1. Adding new MapSetTree node to a parent node
;------------------------------------------------------------------------------------
; Parent node ID: if specified, the MapSet with the new sequences will be placed under this parent node as a child.
; ParentNodeId=251067
;------------------------------------------------------------------------------------
; 2. Adding new MapSetTree node under a new root node
;------------------------------------------------------------------------------------
; Root node name(s): usually an organism name. Ignored if the root name already exists.
; A new node can be placed as a child of a root node
RootNodeName="/Brassica/Brassica napus"
; Root node order number: order of the root node in the MapSetTree. By default, 0.
;RootNodeOrderNo=0

Method Section

The Method section is used to add or update annotation methods. A sample Method section is shown below. Please see Add Gene Annotations for more details.

[Method]
; To add a new annotation method, specify a method name and a color to be shown in the annotation track.
; You don't need to provide this info if the method already exists in the database.
; Supported color names are listed at http://www.flounder.com/csharp_color_table.htm
; If a method already exists, it will be updated. Otherwise, a new method will be added.
;Name={NamedColor|HTML hex code|R,G,B}
MRNA=203,128,147
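
The three value forms accepted for a method color (named color, HTML hex code, or R,G,B) all describe the same kind of RGB triple. The Python sketch below shows one way such values could be normalized. It is only an illustration: `parse_color` and the two-entry `NAMED_COLORS` table are hypothetical, the optional leading `#` on hex codes is an assumption, and PersephoneShell's own parsing may differ.

```python
import re

# Hypothetical excerpt of a named-color table; the full list is in the linked color table.
NAMED_COLORS = {"Red": (255, 0, 0), "PaleVioletRed": (219, 112, 147)}

def parse_color(value):
    """Normalize a named color, a hex code, or an R,G,B string to an (r, g, b) tuple."""
    value = value.strip().strip('"')
    if value in NAMED_COLORS:
        return NAMED_COLORS[value]
    # Assumption: hex codes have six digits, with or without a leading '#'.
    if re.fullmatch(r"#?[0-9A-Fa-f]{6}", value):
        digits = value.lstrip("#")
        return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))
    if "," in value:
        parts = tuple(int(p) for p in value.split(","))
        if len(parts) == 3 and all(0 <= p <= 255 for p in parts):
            return parts
    raise ValueError(f"unrecognized color: {value!r}")

print(parse_color("203,128,147"))
print(parse_color("#CB8093"))
```

Both calls describe the color used for the MRNA method in the sample section.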

SequenceDatabase Section

The SequenceDatabase section defines file type, sources, track sections, and other annotations.

[SequenceDatabase]
; Format: GenBank or Embl
Format=GenBank
; Sources (required): GenBank file(s) of genomic DNA sequence located locally or remotely
; accessible via URL.
Sources="http://ftp.ncbi.nlm.nih.gov/genomes/Brassica_napus/CHR_A1/bna_ref_Brassica_napus_assembly_v1.0_chrA1.gbk.gz",
...
"http://ftp.ncbi.nlm.nih.gov/genomes/Brassica_napus/CHR_C9/bna_ref_Brassica_napus_assembly_v1.0_chrC9.gbk.gz"
; Tracks (required): comma delimited. Corresponding section(s) named the same must exist
; At least one track needs to be specified.
Tracks="GenBank"
; PopulateChromosome: indicates if the chromosome table is populated or not. (true by default)
;PopulateChromosome=false
; Commit frequency: indicates how often the process commits annotations. Every N annotations.
CommitFrequency=1000

Each track section listed in the Tracks parameter of the SequenceDatabase section describes which data will be loaded from the data sources.
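
Because every name in Tracks must have a section of the same name, a quick check before loading can catch a missing section early. The following Python sketch does this with the standard `configparser` module as a stand-in reader. It is not how PersephoneShell parses its configuration, `missing_track_sections` is a hypothetical helper, and the embedded configuration is a reduced version of the samples on this page.

```python
import configparser

SAMPLE = """
[SequenceDatabase]
Format=GenBank
Tracks="GenBank"

[GenBank]
Method="MRNA"
TrackName="GenBank"
"""

def missing_track_sections(text):
    """Return the track names listed in Tracks that have no matching section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(text)
    raw = parser["SequenceDatabase"]["Tracks"]
    # Tracks is comma delimited; strip whitespace and surrounding quotes.
    tracks = [t.strip().strip('"') for t in raw.split(",")]
    return [t for t in tracks if t and not parser.has_section(t)]

print(missing_track_sections(SAMPLE))
```

An empty list means every listed track has its own section.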

GenBank Section

The following shows an example [GenBank] section.

[GenBank]
; Method (required): annotation method. If new, it should be specified in the Method section.
Method="MRNA"
; Track name (required): name of track
TrackName="GenBank"
; TrackDescription: track description
;TrackDescription="GenBank"
; FeatureKey: Feature key of annotation items.
;FeatureKey="exon"
; Qualifier: qualifiers to be loaded
;Qualifier.key.qualifier="ID","description"
Qualifier.mRNA:gene="gene"
Qualifier.mRNA:product="product"
Qualifier.mRNA:note="note"
Qualifier.mRNA:transcript_id="transcript_id"
Qualifier.mRNA:db_xref="db_xref"
; IsSearchable: If true (default), the track data will be indexed for search. If false, the indexing will be skipped
;IsSearchable=false

You can optionally set FeatureKey to load one specific feature. This option is rarely needed: when FeatureKey is not specified, PersephoneShell automatically parses the gene structure from the standard gene, mRNA, CDS, and related records.

To load qualifiers from different features, specify the feature key and the qualifier as a pair. It is possible to rename the qualifiers and assign a display text for the key. For example, the /protein_id qualifier in the CDS feature can be specified as

Qualifier.CDS.protein_id="Protein ID"

The first part of this assignment tells which qualifier in the file to process (protein_id). The second part assigns the qualifier key ("Protein ID") that will be used to store the key-value pair in the database.
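
The effect of such a mapping can be pictured as a select-and-rename step over each feature's qualifiers. The Python sketch below illustrates the idea under two assumptions that are not confirmed by this page: qualifiers without a mapping are not stored, and the mapping is keyed by the (feature key, qualifier) pair. `QUALIFIER_MAP` and `select_qualifiers` are hypothetical names.

```python
# Hypothetical in-memory form of the mapping: (feature key, qualifier) -> stored key.
QUALIFIER_MAP = {
    ("CDS", "protein_id"): "Protein ID",
    ("mRNA", "product"): "product",
}

def select_qualifiers(feature_key, qualifiers, mapping=QUALIFIER_MAP):
    """Keep only the mapped qualifiers of one feature and store them under their new keys."""
    stored = {}
    for name, value in qualifiers.items():
        new_key = mapping.get((feature_key, name))
        if new_key is not None:  # assumption: unmapped qualifiers are skipped
            stored[new_key] = value
    return stored

cds = {"gene": "LOC106353263", "codon_start": "1", "protein_id": "XP_013648436.1"}
print(select_qualifiers("CDS", cds))
```

With the CDS feature of LOC106353263, only the protein ID is kept, stored under the key "Protein ID".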

QualifierLinks and AnnotationSearches Sections

Sample QualifierLinks and AnnotationSearches sections are shown below. Please see Add Gene Annotations for more information about these sections.

[QualifierLinks]
; Link qualifier name-value to external source.
;QUALIFIER_NAME=PLACEHOLD_URL
protein_id="http://www.ncbi.nlm.nih.gov/protein/%s"
transcript_id="http://www.ncbi.nlm.nih.gov/nuccore/%s"

[AnnotationSearches]
; Add qualifier name-value to search term [GeneName|GeneFunction]
;SEARCH_TERM=QUALIFIER_NAME
GeneName="gene"
GeneFunction="product"
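
The `%s` in each QualifierLinks entry marks where the qualifier value goes in the external URL. As a minimal sketch, assuming a plain substitution of the value for `%s` (the actual substitution and any escaping are done by the application and may differ), the links for the sample record could be built like this in Python. `build_link` is a hypothetical helper.

```python
from urllib.parse import quote

# URL templates copied from the sample [QualifierLinks] section.
QUALIFIER_LINKS = {
    "protein_id": "http://www.ncbi.nlm.nih.gov/protein/%s",
    "transcript_id": "http://www.ncbi.nlm.nih.gov/nuccore/%s",
}

def build_link(name, value, links=QUALIFIER_LINKS):
    """Return the external URL for a qualifier value, or None if the qualifier has no link."""
    template = links.get(name)
    if template is None:
        return None
    # Assumption: the value is URL-escaped before it replaces the %s placeholder.
    return template.replace("%s", quote(value, safe=""))

print(build_link("protein_id", "XP_013648436.1"))
print(build_link("transcript_id", "XM_013792982.1"))
```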