Use Case
A typical data flow for loading a genome with multiple tracks that can be linked to other genomes starts with creating an Organism record. All maps should have a reference to an Organism. After adding the organism record, a Map Set with multiple Maps can be created. It can be based on the genomic sequences (physical maps) or genetic linkage groups (genetic maps). Tracks with gene annotation are usually added to the physical map sets first. Additionally, the maps can have tracks of other types, including quantitative tracks, such as RNA-seq data, marker tracks, QTLs, etc.
The maps can be linked to other maps by common markers, orthologous genes, or "ribbons" between syntenic regions. If you load multiple genomes with high sequence similarity to each other, we recommend creating artificial markers (Tags) on each of the genomes to be linked. Identical markers are automatically linked when the maps are shown on the screen. If you plan connecting distant genomes, the Tags will not work and you will need to create ortholog sets for each pair of genomes, which will require running multiple jobs for ortholog calculation.
This section provides a use case for adding Glycine max (soybean) to your Persephone database using PersephoneShell.
To add soybean genomic data let's perform the following tasks:
- Add an Organism
- Download the genomic information for Glycine max. You will need these files for Steps 3, 5, and 6 below. These files should be saved to a location where they can be accessed by PersephoneShell.
- Add a Map Set, Maps, and Sequences (add sequence)
- Verify Addition of Maps, Map Sets, and Map Set Trees (list)
- Add Gene Annotations (add annotation)
- Add functional annotation (create function)
- Calculate orthologs and paralogs (create ortholog).
- Add Markers (add marker)
Tip
If you need to delete any of the data you have loaded, perform the steps outlined in Deleting Loaded Data or to start with an empty database, use the 'init' command to wipe all the data from the database and refresh the schema.
Please also familiarize yourself with Persephone's data hierarchy, which is described below.
Persephone Data Hierarchy
The figure below shows some of the Persephone objects and their relationship.
Data Hierarchy
The highest-level model object is Organism which contains taxonomy information, such as scientific name and common name. The corresponding genomic data for Organism can be organized into multiple map sets, which usually represent different assembly builds. A map set consists of chromosomes, scaffolds, or genetic maps, also known as maps. A physical map, such as a chromosome or a scaffold, represents a genomic sequence with features located in base pairs (bp; 1-based), while a genetic map is represented as distances between genetic markers and gene loci in centiMorgan (cM; floating-point numbers).
The mapped features are displayed in a form of tracks. A map can contain multiple tracks of different kinds including
- annotation tracks with gene models,
- marker tracks with SNP markers or repeats,
- quantitative tracks with RNA-seq coverage or methylation profiles,
- variation tracks with multiple genotypes,
- QTL tracks,
- NGS read alignments,
- syntenic blocks,
- tracks with expression data, etc.