SequenceDatabase refers to sequences together with their annotations. Feature Table file formats such as GenBank (*.gb, *.gbs, *.gbk) and EMBL (*.dat) are widely used across public data repositories.

Note

See the DDBJ/ENA/GenBank Feature Table Definition for more information.

The SequenceDatabase example shown below, Brassica napus chromosome A1, contains features such as source, gene, mRNA, CDS, and ncRNA. The source feature defines organism and chromosome metadata. A valid protein-coding gene may consist of gene, mRNA, and CDS features. RNA genes may be defined as single features such as ncRNA, misc_RNA, etc.

LOCUS       NW_013650291        46747018 bp    DNA     linear   CON 31-AUG-2015
DEFINITION  Brassica napus cultivar ZS11 chromosome A1 genomic scaffold,
            Brassica_napus_assembly_v1.0 A01, whole genome shotgun sequence.
ACCESSION   NW_013650291 GPS_010492900
VERSION     NW_013650291.1  GI:919453751
DBLINK      BioProject: PRJNA293435
            Assembly: GCF_000686985.1
            BioSample: SAMN02742820
KEYWORDS    WGS; RefSeq.
SOURCE      Brassica napus (rape)
  ORGANISM  Brassica napus
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
            Pentapetalae; rosids; malvids; Brassicales; Brassicaceae;
            Brassiceae; Brassica.
REFERENCE   1  (sites)
  AUTHORS   Lowe,T.M. and Eddy,S.R.
  TITLE     tRNAscan-SE: a program for improved detection of transfer RNA genes
            in genomic sequence
  JOURNAL   Nucleic Acids Res. 25 (5), 955-964 (1997)
   PUBMED   9023104
  REMARK    This is the methods paper for tRNAscan-SE.
COMMENT     REFSEQ INFORMATION: The reference sequence is identical to
            CM002759.1.
            Assembly name: Brassica_napus_assembly_v1.0
            The genomic sequence for this RefSeq record is from the
            whole-genome assembly released by the BGI-Shenzhen on 2014/05/15.
            The original whole-genome shotgun project has the accession
            JMKK00000000.1.
FEATURES             Location/Qualifiers
     source          1..46747018
                     /organism="Brassica napus"
                     /mol_type="genomic DNA"
                     /cultivar="ZS11"
                     /db_xref="taxon:3708"
                     /chromosome="A1"
     gene            complement(1834..2645)
                     /gene="LOC106353263"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon."
                     /db_xref="GeneID:106353263"
     mRNA            complement(join(1834..2065,2142..2274,2387..2645))
                     /gene="LOC106353263"
                     /product="protein WVD2-like 1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon. Supporting evidence
                     includes similarity to: 6 Proteins"
                     /transcript_id="XM_013792982.1"
                     /db_xref="GI:923516411"
                     /db_xref="GeneID:106353263"
     CDS             complement(join(1834..2065,2142..2274,2387..2645))
                     /gene="LOC106353263"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon."
                     /codon_start=1
                     /product="protein WVD2-like 1"
                     /protein_id="XP_013648436.1"
                     /db_xref="GI:923516412"
     ncRNA           join(478959..479017,479093..479180,479284..479320)
                     /ncRNA_class="lncRNA"
                     /gene="LOC106369563"
                     /product="uncharacterized LOC106369563"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon. Supporting evidence
                     includes similarity to: 100% coverage of the annotated
                     genomic feature by RNAseq alignments, including 2 samples
                     with support for all annotated introns"
                     /transcript_id="XR_001274166.1"
                     /db_xref="GI:923520022"
                     /db_xref="GeneID:106369563"
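
The location strings in the record above use 1-based, inclusive coordinates, with join(...) listing the exons and complement(...) marking the reverse strand. The following is a minimal Python sketch, not part of PersephoneShell, that reads such a location and sums the exon lengths. The function names parse_location and spliced_length are hypothetical, and the sketch ignores partial-end markers (< and >), single-base locations, and nested operators.

```python
import re

def parse_location(location: str):
    """Return (strand, [(start, end), ...]) for a simple GenBank location string."""
    # A leading complement(...) wrapper means the feature is on the reverse strand.
    strand = "-" if location.startswith("complement(") else "+"
    # Collect every start..end pair; the join() wrapper itself is not needed.
    spans = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    return strand, spans

def spliced_length(spans):
    """Total length of all spans, using 1-based inclusive coordinates."""
    return sum(end - start + 1 for start, end in spans)

# CDS location of LOC106353263 from the record above.
cds = "complement(join(1834..2065,2142..2274,2387..2645))"
strand, spans = parse_location(cds)
print(strand, spans, spliced_length(spans))
```

For this CDS the three exons add up to 624 bp, a multiple of three, which is what /codon_start=1 on a complete coding sequence would lead one to expect.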

ProcessRun Section

The ProcessRun section contains information about data loading. It is highly recommended to create a new process run each time you load the data. An example ProcessRun section is shown below.

[ProcessRun]
; Run description: if specified, this custom description is used. It is ignored if a RunId is specified.
;                  Otherwise, "Added sequences and annotations for {MapSet Accession No.} from {Sources}." is used.
RunDescription="Added Brassica_napus_assembly_v1.0 genome from http://ftp.ncbi.nlm.nih.gov/genomes/Brassica_napus."


MapSet and MapSetTree Sections

The MapSet and MapSetTree sections are required. Sample MapSet and MapSetTree sections are shown below. Please see Add Map Set, Maps, and Sequences for more details.

[MapSet]
;------------------------------------------------------------------------------------
; 1. Using existing MapSet
;------------------------------------------------------------------------------------
; MapSet ID: if specified, it adds sequences to the existing MapSet.
;            otherwise, a new MapSet should be specified below.
;MapSetId=12345
;------------------------------------------------------------------------------------
; 2. Adding new MapSet
;------------------------------------------------------------------------------------
; Organism ID (required): organism ID should exist.
OrganismId=138011
; Display name (required): a name shown in MapSetTree. Usually an assembly build name.
DisplayName="Brassica_napus_assembly_v1.0"
; Description: by default, organism name + display name.
Description="Brassica_napus assembly v1.0"
; AccessionNo: accession of the genome build. See http://ncbi.nlm.nih.gov/genome
AccessionNo="Brassica_napus_assembly_v1.0"
; SourceId
SourceId="GenBank"
[MapSetTree]
;------------------------------------------------------------------------------------
; 1. Using existing MapSetTree node
;------------------------------------------------------------------------------------
; Node ID: if specified, the MapSet with the new sequences will be placed on this node.
;NodeId=12345
;------------------------------------------------------------------------------------
; 2. Adding new MapSetTree node to a parent node
;------------------------------------------------------------------------------------
; Parent node ID: if specified, the MapSet with the new sequences will be placed under this
;                 parent node as a child.
ParentNodeId=251067112
;------------------------------------------------------------------------------------
; 3. Adding new MapSetTree node under a new root node
;------------------------------------------------------------------------------------
; Root node name: usually an organism name. Ignored if the root name already exists.
;RootNodeName="Homo sapiens"
; Root node order number: order of the root node in the MapSetTree. By default, 0.
;RootNodeOrderNo=0

Method Section

The Method section is used to add or update annotation methods. A sample Method section is shown below. Please see Add Gene Annotations for more details.

[Method]
; To add a new annotation method, specify a method name, a CDS flag, and a color to be shown in the annotation track.
; The CDS flag is a boolean value that indicates whether the annotation method is for CDS.
; Supported color names are listed at http://www.flounder.com/csharp_color_table.htm
; If a method already exists, it will be updated. Otherwise, a new method will be added.
;Name=IsCds,{NamedColor|HTML hex code|R,G,B}
MRNA=false,203,128,147
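
To make the Name=IsCds,{NamedColor|HTML hex code|R,G,B} layout concrete, here is a minimal Python sketch that splits one such line into its three parts. It is an illustration only and not PersephoneShell's own parser; MethodSpec and parse_method_line are hypothetical names, and the sketch does not validate color names or hex codes.

```python
from typing import NamedTuple, Tuple, Union

class MethodSpec(NamedTuple):
    name: str                                # annotation method name, e.g. "MRNA"
    is_cds: bool                             # the CDS flag
    color: Union[str, Tuple[int, int, int]]  # named/hex color, or an (R, G, B) tuple

def parse_method_line(line: str) -> MethodSpec:
    """Split 'Name=IsCds,Color' into its parts (illustrative only)."""
    name, _, value = line.partition("=")
    flag, _, color_text = value.partition(",")
    parts = [p.strip() for p in color_text.split(",")]
    if len(parts) == 3 and all(p.isdigit() for p in parts):
        color = tuple(int(p) for p in parts)   # R,G,B form
    else:
        color = color_text.strip()             # named color or HTML hex code, kept as text
    return MethodSpec(name.strip(), flag.strip().lower() == "true", color)

print(parse_method_line("MRNA=false,203,128,147"))
```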

SequenceDatabase Section

The SequenceDatabase section defines file type, sources, track sections, and other annotations.

[SequenceDatabase]
; Format: GenBank or Embl
Format=GenBank
; Sources (required): GenBank file(s) of genomic DNA sequence, located locally or remotely
;                     accessible via URL.
Sources="http://ftp.ncbi.nlm.nih.gov/genomes/Brassica_napus/CHR_A1/bna_ref_Brassica_napus_assembly_v1.0_chrA1.gbk.gz",
...
"http://ftp.ncbi.nlm.nih.gov/genomes/Brassica_napus/CHR_C9/bna_ref_Brassica_napus_assembly_v1.0_chrC9.gbk.gz"
; Tracks (required): comma delimited. A section with the same name must exist for each track.
;                    At least one track needs to be specified.
Tracks="GenBank"
; PopulateChromosome: indicates if the chromosome table is populated or not. (true by default)
;PopulateChromosome=false
; IgnoreWrongAnnots: ignore wrongly translated annotations. True by default.
;IgnoreWrongAnnots=false
; Commit frequency: indicates how often the process commits annotations. Every N annotations.
CommitFrequency=1000

Each track listed in the Tracks parameter of the SequenceDatabase section has its own section, which describes the data to be loaded from the data sources.
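
Because every name in Tracks must have a section of the same name, a quick check before loading can catch a missing or misspelled section. The sketch below uses Python's standard configparser on a simplified control-file fragment. It is an assumption-laden illustration and not how PersephoneShell validates its input: the helper name missing_track_sections is hypothetical, and a full control file with multi-line Sources values may need extra handling.

```python
import configparser

# Simplified fragment of a control file (comments start with ';').
CONTROL_FILE = """
[SequenceDatabase]
; Tracks (required): comma delimited.
Tracks="GenBank"
CommitFrequency=1000

[GenBank]
Method="MRNA"
TrackName="GenBank"
"""

def missing_track_sections(text: str):
    """Return the track names listed in Tracks that have no matching section."""
    parser = configparser.ConfigParser(
        comment_prefixes=(";",), delimiters=("=",), interpolation=None
    )
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(text)
    raw = parser["SequenceDatabase"]["Tracks"]
    tracks = [t.strip().strip('"') for t in raw.split(",")]
    return [t for t in tracks if not parser.has_section(t)]

print(missing_track_sections(CONTROL_FILE))
```

An empty list means every listed track has a section.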

GenBank Section

The following shows an example [GenBank] section.

[GenBank]
; Method (required): annotation method. If new, it should be specified in the Method section.
Method="MRNA"
; Track name (required): name of track
TrackName="GenBank"
; TrackDescription: track description
;TrackDescription="GenBank"
; FeatureKey: Feature key of annotation items.
;FeatureKey="exon"
; Qualifier: qualifiers to be loaded
;Qualifier.key.qualifier="ID","description"
Qualifier.mRNA:gene="gene"
Qualifier.mRNA:product="product"
Qualifier.mRNA:note="note"
Qualifier.mRNA:transcript_id="transcript_id"
Qualifier.mRNA:db_xref="db_xref"

You can optionally set FeatureKey to load one specific feature. This option is rarely needed: when FeatureKey is not specified, PersephoneShell automatically parses the gene structure from the standard records gene, mRNA, CDS, and so on.

To load qualifiers from different features, specify the key and qualifier as a pair. It is possible to rename the qualifiers and assign a display text for the key. For example, the /protein_id qualifier in the CDS feature can be specified as:

Qualifier.CDS.protein_id="Protein ID","GenBank Protein ID"

The first part of this assignment identifies the qualifier in the file to process (protein_id). The second part assigns the qualifier key ("Protein ID") that will be used to store the key-value pair in the database, followed by the display text that replaces the key when shown in Persephone.
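
The renaming described above can be pictured as a lookup from (feature key, qualifier) to (stored key, display text). The Python sketch below models that idea with plain dictionaries. It is a conceptual illustration only: QUALIFIER_MAP and map_qualifiers are hypothetical names, and skipping unmapped qualifiers is an assumption of the sketch, not documented PersephoneShell behavior.

```python
# (feature key, qualifier in the file) -> (key stored in the database, display text)
QUALIFIER_MAP = {
    ("CDS", "protein_id"): ("Protein ID", "GenBank Protein ID"),
}

def map_qualifiers(feature_key, qualifiers, mapping=QUALIFIER_MAP):
    """Rename the qualifiers of one feature according to the mapping."""
    rows = []
    for name, value in qualifiers.items():
        target = mapping.get((feature_key, name))
        if target is None:
            continue  # assumption: qualifiers without a mapping are not loaded
        stored_key, display_text = target
        rows.append({"key": stored_key, "display": display_text, "value": value})
    return rows

# Qualifiers of the CDS feature from the Brassica napus record.
cds_qualifiers = {"protein_id": "XP_013648436.1", "codon_start": "1"}
print(map_qualifiers("CDS", cds_qualifiers))
```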

QualifierLinks and AnnotationSearches Sections

Sample QualifierLinks and AnnotationSearches sections are shown below. Please see Add Gene Annotations for more information about these sections.

[QualifierLinks]
; Link qualifier name-value to external source.
;QUALIFIER_NAME=PLACEHOLD_URL
protein_id="http://www.ncbi.nlm.nih.gov/protein/%s"
transcript_id="http://www.ncbi.nlm.nih.gov/nuccore/%s"
[AnnotationSearches]
; Add qualifier name-value to search term [GeneName|GeneFunction]
;SEARCH_TERM=QUALIFIER_NAME
GeneName="gene"
GeneFunction="product"
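
The %s in each QualifierLinks URL is a placeholder for the qualifier value. Assuming the value is simply substituted into the template, the resulting link can be sketched in Python as follows. The helper build_link is hypothetical, and the URL-escaping step is an added precaution of this sketch rather than documented behavior.

```python
from urllib.parse import quote

# Templates copied from the [QualifierLinks] sample above.
QUALIFIER_LINKS = {
    "protein_id": "http://www.ncbi.nlm.nih.gov/protein/%s",
    "transcript_id": "http://www.ncbi.nlm.nih.gov/nuccore/%s",
}

def build_link(qualifier, value, links=QUALIFIER_LINKS):
    """Return the external URL for a qualifier value, or None if no link is defined."""
    template = links.get(qualifier)
    if template is None:
        return None
    # Escape the value so it is safe inside a URL path, then fill the placeholder.
    return template.replace("%s", quote(value, safe=""))

print(build_link("protein_id", "XP_013648436.1"))
print(build_link("transcript_id", "XM_013792982.1"))
```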

The command line

The command for adding the sequence database (sequence with annotation tracks) is:

add sequencedatabase -c <path to control file>

As usual, run the command in test mode first using the -t switch. If the tests are successful, load the data with verbose output (-v).