Adding Expression Data

Persephone supports transcript-level gene expression data: the expression is stored as a set with one value per transcript. The expression values can come from different sources, with RNA-seq being one of the most popular methods. To allow comparison of values from different experiments, the data should be normalized before loading, using the same method for every experiment. It is up to the researchers who load the data to decide which normalization technique to use.

Note

For backward compatibility, the IsNormalized variable is still accepted in the control file. By default (if IsNormalized is not specified) the value is true, which means that PersephoneShell assumes the data were normalized before loading. Older INI files may contain IsNormalized set to false. If you try to use them, newer versions of PersephoneShell (after February 20, 2017) will refuse to proceed and will produce a warning that the data should be pre-normalized. In older versions, PersephoneShell normalized the data itself (when IsNormalized=false) by calculating the Z-score based on all values in the experiment. We found that supporting only one normalization method was too restrictive, so scientists now choose their own method beforehand.
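If you relied on the old behavior, the Z-score step can be reproduced before loading. The following is a minimal sketch, not PersephoneShell's original code: the function name zscore_all is hypothetical, None stands for a missing value, and the use of the population standard deviation is an assumption, since the text above does not say which form was used.

```python
import math


def zscore_all(matrix):
    """Z-score every value against the mean and standard deviation of
    ALL non-missing values in the experiment (None marks missing data)."""
    present = [v for row in matrix for v in row if v is not None]
    if not present:
        raise ValueError("no expression values to normalize")
    mean = sum(present) / len(present)
    # Assumption: population standard deviation (divide by N, not N - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in present) / len(present))
    if std == 0:
        raise ValueError("all values are identical; the Z-score is undefined")
    return [
        [None if v is None else (v - mean) / std for v in row]
        for row in matrix
    ]


if __name__ == "__main__":
    # Rows are transcripts, columns are samples.
    print(zscore_all([[1.0, 2.0], [3.0, None]]))
```

Any other normalization method can be substituted, as long as the same method is applied to every experiment that will be compared.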

Like the other control files, the control INI file for expression has common sections such as ProcessRun, MapSet, etc. Please see more details here.

Currently, only delimited text files are supported. They may contain the sample name and additional information in the first several rows. The first three rows of each column in the file are reserved for:

Line 0. Sample name
Line 1. Tissue
Line 2. Developmental stage

The lines that follow can contain additional qualifiers (see line 3, treatment, in the example below).

The first column (index:0) should contain the transcript names:

sample (required) -->	BV_A	BV_B	BV_D	BV_H	BV_K	BV_L	...
tissue (required) -->	leaves	leaves	leaves	leaves	leaves	leaves	...
stage (required) -->	unknown	unknown	unknown	unknown	unknown	unknown
treatment	Control for salt and mannitol	Salt (150mM NaCl, 24hr)	Control for IAA, GA3, BAP, and ABA	BAP (10uM, 24hr)	Heat (24hr, 35C)		...
PGSC0003DMG400000001	0.2011	0.6912	0.3464	0.46329	0.52498	0.7274	...
PGSC0003DMG400000002	0.48955	0.10511	0.17085	0.80867	8.9386	0.24704	...
PGSC0003DMG400000003	0.499	0.4404	0.0588	0.42734	0.8567	0.00842	...
PGSC0003DMG400000004	0.5521	0.6303	0.1432	.	0.0803	0.0372	...
PGSC0003DMG400000005	0.34473	0.21822	0.72429	0.03762	0.55906	0.97425	...
PGSC0003DMG401000006	0.40259	0.41331	0.28136	0.10187	0.05898	0.60103	...
PGSC0003DMG402000006	0.161589	0.381256	0.445893	0.135183	0.366729	0.0810092	...

Note: the values in the light yellow cells above (the first-column labels of the three required header rows) will be ignored. Here, they contain the text just for reference.
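To check a file before loading, the layout described above can be read with a few lines of Python. This is only an illustrative sketch under stated assumptions: read_expression_table and its parameters are hypothetical names, the file is tab-delimited, and it does not reflect how PersephoneShell parses the file internally.

```python
def read_expression_table(text, data_start_row=3, delimiter="\t"):
    """Split a delimited expression table into per-sample metadata and
    per-transcript values. Rows 0-2 hold sample name, tissue and stage;
    rows 3 .. data_start_row-1 hold optional qualifiers."""
    rows = [line.split(delimiter) for line in text.splitlines() if line.strip()]
    header, data = rows[:data_start_row], rows[data_start_row:]

    samples = []
    for col in range(1, len(header[0])):
        sample = {
            "name": header[0][col],
            "tissue": header[1][col],
            "stage": header[2][col],
        }
        # Extra header rows: the first cell is the qualifier name.
        for extra in header[3:]:
            sample[extra[0]] = extra[col] if col < len(extra) else ""
        samples.append(sample)

    values = {}
    for row in data:
        # Column 0 is the transcript name; "." marks a missing value.
        values[row[0]] = [None if cell == "." else float(cell) for cell in row[1:]]
    return samples, values


if __name__ == "__main__":
    example = (
        "sample\tBV_A\tBV_B\n"
        "tissue\tleaves\tleaves\n"
        "stage\tunknown\tunknown\n"
        "treatment\tControl\tSalt\n"
        "PGSC0003DMG400000004\t0.5521\t.\n"
    )
    samples, values = read_expression_table(example, data_start_row=4)
    print(samples)
    print(values)
```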

Expression section

The Expression section describes where the source file is located and how it should be parsed. Choose at least one delimiter for a delimited text file. CommitFrequency indicates how often the loading process commits to the database.

[Expression]
; Source (required): a TXT file or Excel file located locally or remotely accessible via URL.
Source="$DATA/DM_RH_RNA_Seq_FPKM.txt"
; The floating point numbers should use decimal point in the US format, like 12345.99
; FileType: {Text (delimited text file)|Excel}
FileType=Text
; Commit frequency: indicates how often the process commits expression data. Every N transcripts.
CommitFrequency=10000
; Delimiters: specify delimiters among Colon(:), Comma(,), Period(.), Hyphen(-), SemiColon(;), Slash(/), Tab(\t), VerticalBar(|)
Delimiters=Tab
; AnnotationQualifierName (required). Which qualifier contains the transcript name
AnnotationQualifierName="Id"

Please note that floating point numbers should use the decimal point in the US format, like 1234.99. Missing data should be designated by a dot (.).
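When the input file is generated by a script, the number format is easy to enforce at write time. The sketch below is one possible approach; format_value is a hypothetical helper, and the choice of fixed-point output with six decimals is an assumption rather than a PersephoneShell requirement.

```python
import math


def format_value(value, decimals=6):
    """Render one expression value for the input file.

    Fixed-point formatting always uses "." as the decimal separator,
    never inserts thousands separators, and ignores the system locale.
    Missing values (None or NaN) are written as a single dot.
    """
    if value is None or math.isnan(value):
        return "."
    return f"{value:.{decimals}f}"


if __name__ == "__main__":
    row = ["PGSC0003DMG400000004", 0.5521, 0.6303, None, 0.0803]
    cells = [row[0]] + [format_value(v, decimals=4) for v in row[1:]]
    print("\t".join(cells))
```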

The transcripts should be identified by a unique transcript name. AnnotationQualifierName specifies which qualifier is used to store the transcript name for the given map set.

Considering the code above (the line with AnnotationQualifierName), if the text file contains a transcript name, for example, "Gene001", PersephoneShell will search for a qualifier "Id" with value "Gene001" in the given map set and will identify its internal transcript ID. Please make sure that you are using the qualifier that contains unique transcript names.

Note

If the value stored under the selected qualifier is found in more than one gene model, a warning will be displayed. In this case, all gene models that share this name will be assigned the same expression value.
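The lookup and the duplicate-name behavior can be pictured with the following sketch. It is a simplified model, not PersephoneShell's implementation: the dictionary layout of a gene model and the function names index_by_qualifier and assign_expression are assumptions made for illustration.

```python
from collections import defaultdict


def index_by_qualifier(gene_models, qualifier_name="Id"):
    """Map each qualifier value to the internal IDs of all gene models
    in the map set that carry it."""
    index = defaultdict(list)
    for model in gene_models:
        value = model["qualifiers"].get(qualifier_name)
        if value is not None:
            index[value].append(model["transcript_id"])
    return index


def assign_expression(index, transcript_name, values):
    """Return ({internal_id: values}, warnings) for one row of the file."""
    ids = index.get(transcript_name, [])
    warnings = []
    if len(ids) > 1:
        warnings.append(
            f"'{transcript_name}' matches {len(ids)} gene models; "
            "all of them receive the same expression values"
        )
    return {tid: values for tid in ids}, warnings


if __name__ == "__main__":
    models = [
        {"transcript_id": 1, "qualifiers": {"Id": "Gene001"}},
        {"transcript_id": 2, "qualifiers": {"Id": "Gene001"}},
        {"transcript_id": 3, "qualifiers": {"Id": "Gene002"}},
    ]
    index = index_by_qualifier(models, "Id")
    print(assign_expression(index, "Gene001", [0.2011, 0.6912]))
```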

Remember that the sample name, tissue, and stage are required and should be specified in the first three rows. The program tries to detect suspicious tissue values: if all of them are numbers, the row most likely contains expression values rather than tissue names. Any additional information (qualifiers and values) can be provided in the data file in the rows that follow the first three. In the example above, the qualifier "treatment" (sky-blue cell) will be created for each sample.
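The same kind of sanity check can be run on a file before loading. This is a rough sketch of the idea only; the function names are hypothetical, and treating the missing-data dot as a number-like cell is an assumption, not documented PersephoneShell behavior.

```python
def looks_like_value(cell):
    """True if the cell parses as a number or is the missing-data dot."""
    cell = cell.strip()
    if cell == ".":  # assumption: count the missing marker as a value
        return True
    try:
        float(cell)
        return True
    except ValueError:
        return False


def tissue_row_is_suspicious(tissue_cells):
    """Flag a tissue row whose non-empty cells all look like expression
    values, which suggests that a header row is missing."""
    cells = [c for c in tissue_cells if c.strip()]
    return bool(cells) and all(looks_like_value(c) for c in cells)


if __name__ == "__main__":
    print(tissue_row_is_suspicious(["leaves", "leaves", "leaves"]))
    print(tissue_row_is_suspicious(["0.2011", "0.6912", "0.3464"]))
```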

; DataStartRowIndex (required): 0-based index of the row where the expression data begin
DataStartRowIndex=3

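DataStartRowIndex gives the 0-based position of the first row that contains expression values. With only the three required header rows, the value is 3; each additional qualifier row moves the first data row down by one, so in the example file above, where the treatment row occupies line 3, the expression values begin at row index 4.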
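If you are unsure which value to use for a given file, a small script can suggest one. The heuristic below is an assumption made for illustration (the function names are hypothetical): it returns the first row after the three required header rows whose cells are all numbers or missing-data dots. A qualifier row that happens to hold only numbers would fool it, so check the result by eye.

```python
def is_value_cell(cell):
    """True if the cell is a number or the missing-data dot."""
    cell = cell.strip()
    if cell == ".":
        return True
    try:
        float(cell)
        return True
    except ValueError:
        return False


def guess_data_start_row(lines, delimiter="\t", required_rows=3):
    """Return the 0-based index of the first row that looks like
    expression data, or None if no such row is found."""
    for index, line in enumerate(lines):
        if index < required_rows:
            continue  # sample name, tissue and stage are never data
        cells = line.rstrip("\n").split(delimiter)[1:]  # skip transcript name
        if cells and all(is_value_cell(c) for c in cells):
            return index
    return None


if __name__ == "__main__":
    lines = [
        "sample\tBV_A\tBV_B",
        "tissue\tleaves\tleaves",
        "stage\tunknown\tunknown",
        "treatment\tControl\tSalt",
        "PGSC0003DMG400000004\t0.5521\t.",
    ]
    print("DataStartRowIndex =", guess_data_start_row(lines))
```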

Load the expression data using the PersephoneShell command 'add expression':

add expression -c controlFile.ini -v