This section describes how to add variant calls to the database with the add command (see Add). Currently, Persephone supports single nucleotide polymorphisms (SNPs) and insertions/deletions (indels), and it can read VCF files with tetraploid variants. The steps below show how to use the add command with the control file (see Control Files) "add_GRCh37.p13_1000genomes.ini" to load 1000 Genomes Variant Call Format (VCF) files.

  1. Review the "add_GRCh37.p13_1000genomes.ini" control file, which is included in the "Samples/Variant" folder of the PersephoneShell file archive and is shown below.

[ProcessRun]
; RunDescription: if specified, a custom description will be used
;                 otherwise, "Added variants for {MapSet Accession No.} from {Sources}." will be used.
RunDescription="Added 1000 genomes VCFs from http://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502."
 
[MapSet]
; Either MapSetId or MapSetPath is required.
; MapSetId: id of a target map set.
MapSetId=240044500
; MapSetPath: path of a target map set.
; MapSetPath="/Homo sapiens/GRCh37.p13"
 
[Variant]
; Sources (required): VCF files only. Each source can be a local file or a URL.
Sources=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 

; IncludedSamples: Comma-delimited sample names in the VCF source to be included.
;                  if not specified, all the samples in the file will be included.
;                  To list sample names in VCF, run 'bcftools query -l VCF_FILE'
;                  SampleNamesToFilter is obsolete.
IncludedSamples= HG00096,HG00097,HG00099,HG00100,HG00101,HG00102,HG00103,HG00104,HG00105,HG00106,HG00108 
; ExcludedSamples: Comma-delimited sample names in the VCF source to be excluded.
;                  if not specified, no sample in the file will be excluded.
;ExcludedSamples=HG00096

; SkipOverlappingPhaseSets: when false (default), overlapping phase sets stop the program execution.
; When true, overlapping phase sets may result in inconsistent phase set assignments.
; An overlap of phase sets (PS) is detected when one phase set is placed inside another.
;SkipOverlappingPhaseSets=true

; LineCountInTile: number of lines from the VCF file that are stored together in one tile.
; A higher number means higher memory consumption during parsing and later during decompression, especially if the VCF file has many samples
; and a user needs to see all of them at once.
; A lower number means less memory, but index data will take up a larger share of the compressed data files, and parsing and compression will take longer.
; Default value is 16384.
;LineCountInTile=16384

[MapMapping]
; If no mapping is found in this section, psh assumes that each MAP_NAME in file exactly matches a MAP_NAME in DB. 
; If map names in file are different from those in DB, map each MAP_NAME in the file to a MAP_NAME in DB.
; The manual mapping is below:
;MAP_NAME_IN_FILE=MAP_NAME in DB
;Chr1=Chr.1
;Chr2=Chr.2
;Chr3=Chr.3
; The 'printmapping' command may help generate the name mapping tables.
; Alternatively, use MapsIdentifiedBy.
; MapsIdentifiedBy: if all maps in the file are identified not by map name but by an alternative ID such as MAP_ID, ACCESSION_NO or GENOME_DNA_ID,
; provide the mapping with a single line. Allowed values: MapId, AccessionNo, GenomeDnaId, MapName (default). For example:
;MapsIdentifiedBy=AccessionNo

; LoadListedMapsOnly: if true, only data for the maps listed in this section will be added.
; If false, PersephoneShell will still try to match names from the file to maps in the database
; using MAP_NAME, and if the map is not found, the data line will be skipped
;LoadListedMapsOnly=true

;MAP_NAME_IN_FILE=MAP_NAME_IN_DB
1=Chr.1
2=Chr.2
3=Chr.3
4=Chr.4
...
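
The numeric mapping lines at the end of the [MapMapping] section can be generated rather than typed by hand. The following Python sketch is only an illustration: the helper name build_map_mapping is hypothetical, and it assumes that the VCF file uses bare chromosome names (1, 2, ..., X, Y) and that the database maps follow the "Chr.N" pattern shown in the sample. Including X and Y is also an assumption. Check the actual map names (for example, with the printmapping command) before pasting the output into a control file.

```python
def build_map_mapping(names, db_prefix="Chr."):
    """Return 'MAP_NAME_IN_FILE=MAP_NAME_IN_DB' lines for a [MapMapping] section.

    Hypothetical helper: assumes each database map name is db_prefix
    followed by the name used in the VCF file.
    """
    return [f"{name}={db_prefix}{name}" for name in names]


if __name__ == "__main__":
    # Autosomes 1-22 plus X and Y (an assumption about the target map set).
    chromosomes = [str(n) for n in range(1, 23)] + ["X", "Y"]
    print("[MapMapping]")
    print("\n".join(build_map_mapping(chromosomes)))
```

If the database uses a different naming scheme, pass another db_prefix or replace the helper with an explicit list of pairs.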

  2. This example assumes that the control file "add_GRCh37.p13_1000genomes.ini" is located in the directory where you installed PersephoneShell (e.g., C:\PersephoneShell).
  3. You can add the variants in interactive or command line mode (see Running PersephoneShell). In interactive mode, enter:

PS> add variants -c add_GRCh37.p13_1000genomes.ini

In command line mode, the command is as follows (use the proper connection name after -s):

C:\PersephoneShell> psh -s ********** add variants -c add_GRCh37.p13_1000genomes.ini

A verification message will be displayed.

Note

Adding variants from a VCF file is usually a lengthy procedure. It requires the test mode (-t) to be executed first, during which the data from the VCF file is analyzed and compressed on a local disk. The subsequent loading command uses the binary blocks assembled during the test phase. Running the add command without a prior test run will trigger the test anyway.
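
Because the procedure is lengthy, it can be worth catching obvious control file mistakes before starting it. The Python sketch below is not part of PersephoneShell, and the function name check_control_file is hypothetical. It checks only the two rules stated in the comments of the sample control file (either MapSetId or MapSetPath is required, and Sources is required), and it assumes the file can be read with Python's standard configparser module.

```python
import configparser


def check_control_file(text):
    """Return a list of problems found in the control file text (empty if none).

    Only checks the rules stated in the sample's comments; it is not a
    substitute for PersephoneShell's own validation.
    """
    # interpolation=None keeps '%' characters in URLs from being interpreted.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.optionxform = str  # keep key case (MapSetId, Sources, ...)
    parser.read_string(text)

    problems = []
    map_set = parser["MapSet"] if parser.has_section("MapSet") else {}
    if not (map_set.get("MapSetId") or map_set.get("MapSetPath")):
        problems.append("[MapSet] needs either MapSetId or MapSetPath")

    variant = parser["Variant"] if parser.has_section("Variant") else {}
    if not variant.get("Sources"):
        problems.append("[Variant] needs Sources (VCF file path or URL)")
    return problems
```

To use it, read the control file into a string (for example, with open("add_GRCh37.p13_1000genomes.ini").read()) and print whatever the function returns. An empty list means only that these two checks passed.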