This section explains how to load expression data onto the existing gene annotations of a map set. Persephone supports transcript-level gene expression: the expression is stored as a set with one value per transcript. Expression values can come from different sources, with RNA-seq being one of the most popular methods.

To allow comparison of values from different experiments, the data should be normalized before loading, and the same normalization method should be used for every experiment. It is up to the researchers who load the data to decide which normalization technique to use.

For backward compatibility, the IsNormalized variable remains in the control file. By default (if IsNormalized is not specified) the value is true, which means that PersephoneShell assumes the data has already been normalized. Older INI files may contain IsNormalized set to false. If you try to use them, newer versions of PersephoneShell (after February 20, 2017) will refuse to run and will produce a warning that the data should be pre-normalized. Older versions of PersephoneShell normalized the data (when IsNormalized=false) by calculating the Z-score based on all values in the experiment. We found that offering only one normalization method was too restrictive, so scientists now choose their own method beforehand.
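
For readers who relied on the old behavior, the Z-score normalization mentioned above can be applied to the data before loading. The following is a minimal Python sketch, not PersephoneShell's actual code. It assumes the population standard deviation over all values in the experiment (the original implementation may have differed), and the function name zscore_all is hypothetical.

```python
from statistics import fmean, pstdev

def zscore_all(rows):
    """Z-score every value using the mean and standard deviation
    of the whole experiment.

    rows: list of lists of floats, one inner list per transcript.
    Assumption: population standard deviation (pstdev).
    """
    values = [v for row in rows for v in row]
    mean = fmean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical; the Z-score is undefined")
    return [[(v - mean) / sd for v in row] for row in rows]

# Two transcripts, two samples each.
normalized = zscore_all([[1.0, 2.0], [3.0, 4.0]])
```

After this transformation the values of the experiment have a mean of 0 and a standard deviation of 1. Any other method can be substituted, as long as the same one is used for all experiments that will be compared.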

Currently, only delimited text files are supported. They may contain sample names and additional information in the first several rows. The first few columns may contain the transcript name or coordinates.

 

                     | BV_A                          | BV_B                    | BV_D                               | BV_H             | BV_K             | BV_L      | ...
                     | DM                            | DM                      | DM                                 | DM               | DM               | DM        | ...
                     | Control for salt and mannitol | Salt (150mM NaCl, 24hr) | Control for IAA, GA3, BAP, and ABA | BAP (10uM, 24hr) | Heat (24hr, 35C) | Leaves    | ...
PGSC0003DMG400000001 | 0.2011                        | 0.6912                  | 0.3464                             | 0.46329          | 0.52498          | 0.7274    | ...
PGSC0003DMG400000002 | 0.48955                       | 0.10511                 | 0.17085                            | 0.80867          | 8.9386           | 0.24704   | ...
PGSC0003DMG400000003 | 0.499                         | 0.4404                  | 0.0588                             | 0.42734          | 0.8567           | 0.00842   | ...
PGSC0003DMG400000004 | 0.5521                        | 0.6303                  | 0.1432                             | 0.72328          | 0.0803           | 0.0372    | ...
PGSC0003DMG400000005 | 0.34473                       | 0.21822                 | 0.72429                            | 0.03762          | 0.55906          | 0.97425   | ...
PGSC0003DMG401000006 | 0.40259                       | 0.41331                 | 0.28136                            | 0.10187          | 0.05898          | 0.60103   | ...
PGSC0003DMG402000006 | 0.161589                      | 0.381256                | 0.445893                           | 0.135183         | 0.366729         | 0.0810092 | ...

Control File

A control file for expression data consists of 5 sections: ProcessRun, MapSet, Expression, MapMapping, and DbSequences.

ProcessRun

If you want to load data onto an existing run, specify a RunId. Otherwise, provide a custom description. If the description is empty, a default string is generated from the given process type and file source.

[ProcessRun]
;RunId=12345 
RunDescription="Added FPKM values of all the representative transcripts across 40 DM libraries from http://solanaceae.plantbiology.msu.edu/pgsc_download.shtml."

MapSet

The target MapSet for the loaded data can be specified by either a MapSetId or a MapSetPath. A MapSetId can be obtained with the 'list mapsettree -l' command. A MapSetPath is the path to the tree node from the root node, delimited by slashes ('/').

[MapSet]
;MapSetId=232287170
MapSetPath="/Solanum tuberosum (potato)/DM_v4.03"

Expression

The Expression section describes where the source file is located and how it should be parsed. Choose at least one delimiter for a delimited text file. CommitFrequency indicates how often the loading process commits to the database.

[Expression]
; Source (required): a TXT file or Excel file located locally or remotely accessible via URL.
Source="G:\Genome\Plants\Solanum tuberosum PGSC_DM_v4.03\DM_RH_RNA_Seq_FPKM_with_gene_coordinates.txt"
; FileType: {Text (delimited text file)|Excel}
FileType=Text
; Delimiters: specify delimiters among Colon(:), Comma(,), Period(.), Hyphen(-), SemiColon(;), Slash(/), Tab(\t), VerticalBar(|)
Delimiters=Tab
; Commit frequency: indicates how often the process commits expression data. Every N transcripts.
CommitFrequency=10000
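
To illustrate the Delimiters setting, the sketch below maps the delimiter names listed in the comment above to their characters and splits a line with them. It is an illustration only: how PersephoneShell parses this setting internally is not documented here. The comma-separated form for several names and the names DELIMITER_CHARS and split_line are assumptions.

```python
import re

# Delimiter names from the control file comment, mapped to their characters.
DELIMITER_CHARS = {
    "Colon": ":", "Comma": ",", "Period": ".", "Hyphen": "-",
    "SemiColon": ";", "Slash": "/", "Tab": "\t", "VerticalBar": "|",
}

def split_line(line, delimiters="Tab"):
    """Split one line of the source file on the named delimiter(s).

    Assumption: several names are given as a comma-separated list.
    """
    names = [n.strip() for n in delimiters.split(",")]
    # Build a character class such as [\t] or [:,] from the chosen names.
    chars = "".join(re.escape(DELIMITER_CHARS[n]) for n in names)
    return re.split("[" + chars + "]", line.rstrip("\r\n"))

fields = split_line("PGSC0003DMG400000001\t0.2011\t0.6912\n")
```

Splitting a few lines of your own file this way is a quick check that the chosen delimiter produces the expected number of columns.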

Once a target MapSet is chosen, PersephoneShell looks up the matching transcript either by its name or by its location. The transcript names in the file should match the values of one of the annotation qualifiers. You can find these qualifiers in the Annotation Details form on an annotation track. TranscriptNameIndex is the 0-based index of the column in the source file that contains the transcript name. AnnotationQualifierName is the name of the qualifier whose value matches the transcript name.


;-----------------------------------------------------------------------------
; Transcript (column indices)
;   : matching a gene annotation with a data row can be done by
;-------------------------
; 1) Matching exact name
; TranscriptNameIndex (either TranscriptName or Map coordinate is required): column index(0-based) for transcript names.
TranscriptNameIndex=0
; AnnotationQualifierName (required when TranscriptNameIndex is specified)
AnnotationQualifierName="ID" 


Considering the example above, if the text file contains a transcript name, for example, "Gene001", PersephoneShell will search for a qualifier "ID" with value "Gene001" in the given map set and will identify its internal transcript ID.
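
The lookup described above amounts to an exact-match search on qualifier values. The following minimal Python sketch uses made-up annotation records and internal IDs; the data structure and the name build_qualifier_index are hypothetical and do not reflect Persephone's database schema.

```python
def build_qualifier_index(annotations, qualifier_name):
    """Map each value of the chosen qualifier to the internal transcript ID."""
    index = {}
    for transcript_id, qualifiers in annotations.items():
        value = qualifiers.get(qualifier_name)
        if value is not None:
            index[value] = transcript_id
    return index

# Hypothetical annotations: internal transcript ID -> qualifiers.
annotations = {
    101: {"ID": "Gene001", "Name": "example gene 1"},
    102: {"ID": "Gene002"},
}

index = build_qualifier_index(annotations, "ID")
matched = index.get("Gene001")   # internal ID of the matching transcript
missing = index.get("Gene999")   # None: no annotation carries this ID
```

Because the match is exact, a name in the file that differs from the qualifier value in any way (for example by a version suffix) would not be found in this sketch.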

Sample name, tissue, stage, and additional information (qualifiers) can be captured by specifying the indices of the rows that hold them in the delimited text file. Each additional qualifier is specified as "SampleQualifierIndices.ROW_INDEX(0-based)=QUAL_NAME", where ROW_INDEX is the row to read and QUAL_NAME is the name of the qualifier to be created.


;-----------------------------------------------------------------------------
; Sample (row indices): sample name and other information
;-------------------------
; SampleNameIndex (required): row index(0-based) for sample names.
SampleNameIndex=0
; SampleTissueIndex: row index(0-based) for sample tissue information.
SampleTissueIndex=1
; SampleStageIndex: row index(0-based) for sample stage information.
SampleStageIndex=2
; SampleQualifiers: row indices(0-based) for additional information.
;SampleQualifierIndices.ROW_INDEX(0-based)=QUAL_NAME
;SampleQualifierIndices.1="QualifierName"


Next, specify the row and column indices where the expression data begin.

;-----------------------------------------------------------------------------
; Data (row/column index)
;-------------------------
; DataStartRowIndex (required): row index(0-based) at which expression data begin
DataStartRowIndex=3
; DataStartColumnIndex (required): column index(0-based) at which expression data begin
DataStartColumnIndex=1
; IsNormalized: indicates if expression level is normalized or not.
IsNormalized=true
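
To check that the indices match your file before loading, it can help to slice a few rows the same way. The Python sketch below is not PersephoneShell code. It applies the index values from the example above to a small tab-delimited excerpt, and parse_expression is a hypothetical helper.

```python
def parse_expression(lines, name_row=0, tissue_row=1, stage_row=2,
                     data_row=3, data_col=1, transcript_col=0):
    """Split a tab-delimited expression table into samples and values.

    The defaults mirror SampleNameIndex, SampleTissueIndex, SampleStageIndex,
    DataStartRowIndex, DataStartColumnIndex and TranscriptNameIndex above.
    """
    table = [line.rstrip("\n").split("\t") for line in lines]
    # One (name, tissue, stage) tuple per sample column.
    samples = list(zip(table[name_row][data_col:],
                       table[tissue_row][data_col:],
                       table[stage_row][data_col:]))
    # One list of expression values per transcript row.
    values = {row[transcript_col]: [float(v) for v in row[data_col:]]
              for row in table[data_row:]}
    return samples, values

lines = [
    "\tBV_A\tBV_B\n",
    "\tDM\tDM\n",
    "\tControl for salt and mannitol\tSalt (150mM NaCl, 24hr)\n",
    "PGSC0003DMG400000001\t0.2011\t0.6912\n",
    "PGSC0003DMG400000002\t0.48955\t0.10511\n",
]
samples, values = parse_expression(lines)
```

If a header row ends up among the values (a float conversion error) or a sample has the wrong tissue, the row or column indices in the control file are probably off by one.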


MapMapping

The MapMapping section is used to match map names in the file with those in the database.

[MapMapping]
; If no mapping is found in this section, it assumes that each MAP_NAME in file exactly matches a MAP_NAME in DB.
; If map names in file are different from those in DB, map each MAP_NAME in file to its MAP_ID or ACCESSION_NO in DB.
; Otherwise, marker will be created without mapping.
;MAP_NAME in file=MAP_ID or ACCESSION_NO in DB
;chr01=ST4.03ch01
;chr02=ST4.03ch02
;chr03=ST4.03ch03
;chr04=ST4.03ch04
;chr05=ST4.03ch05
;chr06=ST4.03ch06
;chr07=ST4.03ch07
;chr08=ST4.03ch08
;chr09=ST4.03ch09
;chr10=ST4.03ch10
;chr11=ST4.03ch11
;chr12=ST4.03ch12
;chr00=ST4.03ch00
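
The fallback rule in the comments above can be sketched as a dictionary lookup that defaults to the name itself. The Python code below is an illustration under that reading, with a hypothetical resolve_map_name helper; it is not taken from PersephoneShell.

```python
def resolve_map_name(name_in_file, map_mapping):
    """Return the DB map identifier for a map name found in the file.

    If [MapMapping] has no entry for the name, the name is assumed to
    match a map name in the database exactly.
    """
    return map_mapping.get(name_in_file, name_in_file)

# Entries as they would appear (uncommented) in [MapMapping].
map_mapping = {"chr01": "ST4.03ch01", "chr02": "ST4.03ch02"}

mapped = resolve_map_name("chr01", map_mapping)
unmapped = resolve_map_name("ST4.03ch05", map_mapping)
```

Here "chr01" is translated through the mapping, while "ST4.03ch05" has no entry and is passed through unchanged.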

DbSequences

The DbSequences section can be used to manually specify an Oracle sequence for incrementing numbers. By default, all of the ID columns are auto-generated unless you block them.

[DbSequences]
; The ID columns below are used in variant loading
; If there is no sequence/trigger assigned to these columns, you must specify a sequence for them.
;TABLE_NAME.COLUMN_NAME=SEQUENCE_NAME