This page lists the latest changes to the functionality of PersephoneShell.

June, 2024

  • The command list orthologs now prints RUN_ID. Use this value when deleting orthologs with the command 'delete run'.
  • The command edit mapset can be executed in a batch mode by using the control INI file. This allows editing multiple map sets at once. Specify the modified values in the INI and apply the command to multiple map sets.
  • The command create ortholog accepts a range of MapSetIds, which is useful for interlinking multiple genomes at once.

May, 2024

  • Further improved the storage of variant data. The source VCF is now partitioned vertically and horizontally, which gives excellent performance for queries that show all alleles of a single variant site or a single genotyping sample. Use the command update variant to switch to the new format.

March, 2024

  • Allow using the $DATA variable on the command line when specifying file paths. This is useful for analyzing FASTA files located in the data directory. Auto-completion with the TAB key also works.
  • Preserve PSH (and WebPersephone) configuration by using the file custom.config. This is useful for the Docker solution, where the configuration files get overwritten during software upgrades.

February, 2024

December, 2023

  • The command searchindex is decoupled from other commands so that rebuilding an index does not block the data modification commands, such as add or edit. To eliminate possible data inconsistency caused by concurrent loading and indexing, a new command was introduced: searchindex deepsync. It compares the data sets in the database and in the Solr index, taking into account not only the presence or absence of tracks but also the number of features in each track.
  • Significantly improved the performance and the way variants are stored. The command update variants converts the old format to the new one, which stores variants outside of the database.
  • Moved the setting for the BLAST data folder to psh.exe.config or custom.config.

November, 2023

  • Introduced a flag to force building histograms for BAM files loaded from remote URLs.
  • Report if an annotation qualifier is indexed in the test mode of the command 'add annotation'.
  • Improved performance of deleting index for given tracks.
  • New command list config to see the internal variables and active folders.

June, 2023

  • Genetic maps will be automatically sorted during loading based on the instruction MapOrder.
  • Introduced PersephoneShell's internal help system. Just type help with the command and its parameters.

May, 2023

  • Adjust interface colors using the command color.
  • New command edit map_order. Useful for reordering genetic maps.
  • Introduced a new command create function. For each gene, try to find a best matching protein from SwissProt and save the match as a qualifier suggesting the gene function.
  • A couple of new list commands: list stats and list qualifier_filter

March, 2023

  • Introduced the new command analyze. The first implemented sub-commands are 'db' and 'fasta'. The command 'analyze db' will check if all database parameters are optimal. The other sub-command 'analyze fasta <fastaFile>' will list all entries in the FASTA file with the size of each sequence.
  • The command 'add bam' will work with the CSI index.
  • The base modification values can now be loaded as a quantitative track from a bedmethyl (bed9+2) file.
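
To give a feel for the kind of report that 'analyze fasta' is described as producing (each entry with the size of its sequence), here is a minimal Python sketch. It is not PersephoneShell's implementation: the function name fasta_lengths, the choice of the first header word as the entry name, and the sample records are all assumptions made for the illustration.

```python
def fasta_lengths(text):
    """Return (entry_name, sequence_length) pairs for FASTA-formatted text."""
    entries = []
    name, length = None, 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                entries.append((name, length))
            # Assumption: the entry name is the first word of the header.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        entries.append((name, length))
    return entries


# Made-up sample data with two entries.
sample = ">chr1 demo\nACGTAC\nGT\n>chr2\nACG\n"
for entry, size in fasta_lengths(sample):
    print(entry, size)
```

Running a check like this on a file before loading it is a quick way to spot empty or unexpectedly short entries.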

February, 2023

  • The command edit map_name can now use a conversion table provided as a text file. Just specify the old and the new name for each map in a file with two columns.
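
The two-column conversion table can be pictured as a simple lookup from old names to new names. The Python sketch below only illustrates that idea and is not how PersephoneShell reads the file. The tab delimiter, the function names load_name_table and rename_maps, and the map names are assumptions.

```python
def load_name_table(text):
    """Parse a two-column table: old map name, then new map name."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumption: columns are tab-separated; the real file format may differ.
        old, new = line.split("\t")[:2]
        table[old.strip()] = new.strip()
    return table


def rename_maps(names, table):
    """Apply the table; names without an entry are left unchanged."""
    return [table.get(name, name) for name in names]


# Hypothetical map names used only for the demonstration.
table = load_name_table("scaffold_1\tChr1\nscaffold_2\tChr2\n")
print(rename_maps(["scaffold_1", "scaffold_2", "scaffold_9"], table))
```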

December, 2022

  • Introduced new commands to back up and restore the entire data set stored in the database and in the file system. This facilitates migration between back-end systems.
  • Added a set of commands to add/edit/delete track qualifiers. The new interface in WebPersephone will allow filtering large lists of tracks using different criteria, such as track names, data type or track qualifiers.

October, 2022

  • Limit the search indexing to certain qualifiers. Specify the rules to filter only those qualifiers that you want to index. This should significantly reduce the size (and improve the performance) of the Solr index.

August, 2022

  • New command 'edit translation_code'. Change the translation table to correctly generate protein sequences for the organisms or organelles that use non-standard translation code.

July, 2022

  • New command 'delete annotation_search'. Some qualifiers can have an extra role like gene name or function. This role can be removed by this command.

June, 2022

  • Use regular expressions applied to the entire FASTA header when filtering the sequences for the command 'add sequence'.
  • Reduce the size of re-indexed data by specifying the tracks to be processed during rebuilding the Solr index for a map set. See the command 'searchindex rebuild <mapset> <--tracks tracknames>'
  • Significantly improved the performance of the Solr search indexer. Rewrote the logic of synchronization to avoid memory overflow and speed up the entire process.
  • Introduced a new command 'add bam'. It allows adding tracks with NGS alignments (bam files) to the database.
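
The header filtering for 'add sequence' mentioned above applies a regular expression to the whole FASTA header rather than to the sequence name only. The following Python sketch shows what that means in practice. It is an illustration under assumptions, not the PersephoneShell code: the function name filter_fasta, the pattern, and the sample records are hypothetical.

```python
import re


def filter_fasta(text, pattern):
    """Keep only records whose full header line (without '>') matches pattern."""
    regex = re.compile(pattern)
    kept, keep = [], False
    for line in text.splitlines():
        if line.startswith(">"):
            # The pattern is tested against the entire header text.
            keep = regex.search(line[1:]) is not None
        if keep:
            kept.append(line)
    return "\n".join(kept)


# Made-up records: only the first header ends with "chromosome <number>".
sample = ">chr1 Zea mays chromosome 1\nACGT\n>scaf7 unplaced scaffold\nGGCC\n"
print(filter_fasta(sample, r"chromosome \d+$"))
```

Because the match covers the description as well as the name, a pattern can select sequences by words that appear anywhere in the header.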

May, 2022

  • New command list variant. It will list the VCF files loaded into the system and will count the number of variants in each of its genotyping samples.
  • Create new annotation qualifiers on the fly, when loading gene models from gff. The INI file will have an instruction New. It allows creating new qualifiers by extracting substrings from the text of given gff attributes.
  • Extract sub-strings from longer qualifier values and convert them to hyperlinks.
  • Added QualifierFilter to the SearchIndex node in the configuration of PersephoneShell. This will allow indexing of selected qualifiers only, thus reducing the size of the Solr index.
  • The commands list marker_qualifier and list annotation_qualifier help decide whether a qualifier is worth indexing for the Solr search.
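
The idea of extracting a sub-string from a longer qualifier value and turning it into a hyperlink can be sketched in Python as below. This only illustrates the regular-expression step; it does not reproduce the syntax of the INI instruction New. The function name extract_qualifier, the sample note text, and the example.org URL are hypothetical.

```python
import re


def extract_qualifier(value, pattern):
    """Return the first capture group of pattern found in value, or None."""
    match = re.search(pattern, value)
    return match.group(1) if match else None


# Made-up attribute text; the pattern captures the accession after "UniProt:".
note = "Similar to protein kinase; UniProt:P12345; evidence=inferred"
accession = extract_qualifier(note, r"UniProt:([A-Z0-9]+)")

# Hypothetical URL template; a real link would point to an actual resource.
link = f"https://example.org/entry/{accession}"
print(accession, link)
```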

April, 2022

  • Handle abutting exons. Some annotation files contain adjacent (abutting) exons with an intron of size 0. NCBI told us that this convention indicates that the aligned NGS reads contain an insertion at this position. Such exons can be parsed with one of three options: throw an error, merge the exons, or load both.
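
The three options for abutting exons can be illustrated with a small Python sketch. It assumes 1-based inclusive coordinates, so that two exons abut when one starts at the position right after the previous one ends. The function name resolve_abutting and the mode names 'error', 'merge' and 'both' are made up here and are not the actual PersephoneShell setting values.

```python
def resolve_abutting(exons, mode):
    """exons: list of (start, end), 1-based inclusive.
    mode: 'error', 'merge' or 'both' (hypothetical names)."""
    if mode not in ("error", "merge", "both"):
        raise ValueError(f"unknown mode: {mode}")
    result = []
    for start, end in sorted(exons):
        # Abutting means an intron of size 0 between consecutive exons.
        abuts = bool(result) and start == result[-1][1] + 1
        if abuts and mode == "error":
            raise ValueError(f"abutting exons at position {start}")
        if abuts and mode == "merge":
            result[-1] = (result[-1][0], end)
        else:
            result.append((start, end))
    return result


# Made-up coordinates: the first two exons abut, the third does not.
exons = [(100, 200), (201, 300), (400, 500)]
print(resolve_abutting(exons, "merge"))
print(resolve_abutting(exons, "both"))
```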

March, 2022

  • Improved performance of loading VCF data. The new format for storing variants (SNPs and indels) allows fast genetic distance calculation between genotypes.
  • Introduced the new interactive command 'delete variant'. It offers selection of the original VCF file name and optional genotyping samples.
  • Allow specifying OrderNo for loading sequences. You can force alphabetic sorting of map sets under a parent node by specifying for them the same OrderNo.
  • The command edit tracktreenode can now modify the element's color.
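
The effect of giving several map sets the same OrderNo can be pictured as a sort on two keys: OrderNo first, then the name. The Python sketch below illustrates that reading. It assumes that a lower OrderNo is shown first and that ties are broken alphabetically without regard to case. The function name display_order and the map set names are hypothetical.

```python
def display_order(map_sets):
    """map_sets: list of (name, order_no) pairs; returns names in display order."""
    ordered = sorted(map_sets, key=lambda m: (m[1], m[0].lower()))
    return [name for name, _ in ordered]


# Same OrderNo for all three: the result is alphabetical.
print(display_order([("Sorghum", 1), ("Maize", 1), ("Rice", 1)]))
# Different OrderNo values take precedence over the names.
print(display_order([("Wheat", 1), ("Barley", 2)]))
```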

January, 2022

November, 2021

  • Parsing gff: ParentType (usually "mRNA") is no longer required. If it is not provided in the INI file, the parent type is auto-detected by taking the direct parents of exons, CDS, etc. This also works when the gff file contains a mixture of parent-child relationships: some exon records can have "mRNA" as the parent and others "gene", and the program will figure out the right parent type for each gene group.
  • Support ribbon color for synteny ribbons loaded from text files.
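
The parent type auto-detection described above, taking the direct parents of exon and CDS records, can be sketched in Python as follows. This is a simplified illustration and not the PersephoneShell parser: it reports which GFF types act as direct parents in a file rather than resolving each gene group. The function name detect_parent_types and the sample records are assumptions.

```python
def detect_parent_types(gff_text, child_types=("exon", "CDS")):
    """Return the set of GFF types that are direct parents of exon/CDS records."""
    type_by_id = {}
    parent_ids = set()
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue
        # Column 9 holds key=value attributes separated by semicolons.
        attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair)
        if "ID" in attrs:
            type_by_id[attrs["ID"]] = cols[2]
        if cols[2] in child_types and "Parent" in attrs:
            # A record may list several parents separated by commas.
            parent_ids.update(attrs["Parent"].split(","))
    return {type_by_id[p] for p in parent_ids if p in type_by_id}


# Made-up records: one exon hangs off an mRNA, another directly off a gene.
rows = [
    ("chr1", "demo", "gene", "1", "900", ".", "+", ".", "ID=g1"),
    ("chr1", "demo", "mRNA", "1", "900", ".", "+", ".", "ID=m1;Parent=g1"),
    ("chr1", "demo", "exon", "1", "400", ".", "+", ".", "ID=e1;Parent=m1"),
    ("chr1", "demo", "gene", "2000", "2500", ".", "-", ".", "ID=g2"),
    ("chr1", "demo", "exon", "2000", "2500", ".", "-", ".", "ID=e2;Parent=g2"),
]
gff = "\n".join("\t".join(row) for row in rows)
print(sorted(detect_parent_types(gff)))
```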

October, 2021

  • Introduced DIAMOND as a fast alternative to BLASTP for calculating the gene ortholog pairs. Use the command 'install diamond' and modify the PersephoneShell configuration file to start using DIAMOND. See the comparison of the results produced by BLASTP and DIAMOND.
  • Added ability to filter gff file by the value in its second column. It usually contains the method or source of the annotation. Add the instruction

Column2=maker

to force the gff parser to consider only lines that have 'maker' in the second column. See the sample file for more details.
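
The effect of the Column2 instruction can be previewed outside PersephoneShell with a few lines of Python. The sketch assumes an exact match on the second column, which may differ from how the gff parser compares the value. The function name filter_by_source and the sample lines are made up.

```python
def filter_by_source(gff_text, source):
    """Return the GFF lines whose second column equals source."""
    kept = []
    for line in gff_text.splitlines():
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        # Assumption: exact, case-sensitive comparison of column 2.
        if len(cols) > 1 and cols[1] == source:
            kept.append(line)
    return kept


# Made-up input with two annotation sources.
gff = (
    "##gff-version 3\n"
    "chr1\tmaker\tgene\t1\t900\t.\t+\t.\tID=g1\n"
    "chr1\trepeatmasker\tmatch\t50\t80\t.\t+\t.\tID=r1\n"
)
for line in filter_by_source(gff, "maker"):
    print(line)
```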

September, 2021

  • Tracks now can be "protected" from search indexing. This will save you time and the disk space. If a track contains features not suitable for search, mark it with the flag IsSearchable=false.

August, 2021

  • Qualifier values can be displayed with multiple hyper-links extracted from the text using regular expressions
  • Loading/deleting/indexing of annotation tracks is dramatically faster. This will be especially obvious when working with gff files that use multiple parents for their items (e.g., exon).
  • A concurrent writing session check (and lock) is added to avoid adding/updating/deleting the data in simultaneous sessions. An error message will be displayed if you try to modify the data and another writing session is detected.

June, 2021

May, 2021

  • Added a new command export protein.
  • Added validation for duplicate keys in INI files. For example, either Source= or Sources= is allowed, but not both at the same time.
  • Loading ribbons in test mode will produce a set of histograms showing how many ribbons will be loaded with different filter settings (IgnoreChainsSmallerThan, IgnoreGapsSmallerThan).

April, 2021

  • It is now possible to create one-exon features from gff file by providing Type= in the INI files for the command add annotation.

; Type: GFF type (column 3) of annotation items. If not specified, both exon (SO:0000147) and CDS (SO:0000316) will be parsed.
; This line is typically left commented out
; The types of one-exon features can be defined here. To load multiple types, e.g., lncRNA, tRNA into one track,
; list them separated by comma. These items will be loaded as they are - no child records will be analyzed.
; ParentItem should be commented out
Type=lnc_RNA,tRNA

March, 2021

  • The Solr index update is now called after each command that modifies the data (add, edit, delete). Because the update is automatic, most of the time you do not need to worry about keeping the Solr index in sync. Sometimes, however, a standalone command to update the index manually is still needed, for example, after manually editing the data in the database or after changing certain properties. The new command is searchindex.
  • The command edit sample now allows editing the genotyping samples in bulk by using a tab-delimited file.
  • Improved the speed of loading synteny ribbons by 300x by using the database bulk insert. The difference is especially noticeable on MariaDB.
  • The output of the command list run can be filtered by the process type. Use the new switch -T:

PS> list run -T annotation
3: Load annotations for B73 RefGen_v5
1 run

The process type after -T should be one of the targets of the add command: sequence, annotation, synteny, ortholog, marker, map, variant, quantitative, qtl.

  • The files in Amazon S3 storage can be used as the source of the data. In addition to the local files and http: and ftp: URLs, you can now use s3://bucket/filename format. The credentials to access the S3 bucket will be read from the instance profile (e.g., EC2 with the role) or can be provided in the PersephoneShell configuration file.
  • When PersephoneShell runs in the batch mode, no header is printed on screen.
  • If there are no appropriate values to use as marker names, names can be generated for each marker using a specified prefix. For example, if the feature names are not important, they can be named automatically, like TSS1, TSS2... See the sample for the add marker command (look for GeneratedMarkerNamePrefix).
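
The generated marker names follow the pattern prefix plus running number, as in TSS1, TSS2. The short Python sketch below illustrates the naming scheme only. It assumes that numbering starts at 1, and the function name generate_marker_names is hypothetical.

```python
def generate_marker_names(count, prefix):
    """Return names such as TSS1, TSS2, ... for the given number of markers."""
    # Assumption: numbering starts at 1, matching the TSS1, TSS2 example.
    return [f"{prefix}{number}" for number in range(1, count + 1)]


print(generate_marker_names(3, "TSS"))
```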

February, 2021

  • The parser for bigWig (add quantitative) properly handles files with unsorted values. It sorts the file data blocks at run time. The file records for one chromosome should still all come together, though.
  • edit qualifier_tab - the command to add or edit qualifier tabs. Some qualifiers can be displayed in separate tabs.

October, 2020

  • Allow missing values for the expression data files. The missing data should be designated as a dot (.).
  • The Oracle DB sequences needed for generating new IDs in databases where the triggers are not provided can now be specified in a separate file, common for all operations. If you work with Oracle, there is no need to specify the DB sequences in each INI file.
  • Upgraded to the latest Oracle driver (Oracle.ManagedDataAccess.dll) version 4.122.19.1. This should help improve the performance of the database operations.
  • The INI files for loading tracks can have a new instruction IsShownFirst

; IsShownFirst: if false, the track will not be shown by default when the map is opened for the first time
IsShownFirst=false

  • list orthologs - the command to list all map sets "linked" by orthologous genes.
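
The missing-value convention for expression data files mentioned above, a dot in place of a number, can be handled when reading such a file as in this Python sketch. It assumes tab-delimited rows with the feature name in the first column, which may not match the actual file layout. The function name parse_expression_row and the sample row are made up.

```python
def parse_expression_row(line):
    """Split a row into a name and numeric values; '.' becomes None."""
    # Assumption: tab-delimited row, name first, then one value per sample.
    name, *values = line.rstrip("\n").split("\t")
    return name, [None if value == "." else float(value) for value in values]


# Made-up row with a missing value in the second sample.
print(parse_expression_row("gene1\t12.5\t.\t3"))
```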

September, 2020

  • deleting a map set or an annotation track will also delete the corresponding BLAST index files - no need for the cleanup blast_folder command in the future.
  • edit map_names - rename the maps in bulk using regular expressions and a common prefix.
  • edit chromosomes - designate some maps as chromosomes after the maps have been loaded. Added an ability to assign the order of the chromosomes.
  • cleanup blast_folder - a new command to delete BLAST index files that are associated with map sets or annotation tracks that are not in use anymore.
  • cleanup temp_blast_folder - a command that removes files from the temporary folder used for BLAST results during ortholog pair discovery and marker mapping.
  • When the synteny ribbons are loaded from chain files, a few parameters can filter out small chains and help ignore small gaps: IgnoreChainsSmallerThan, IgnoreGapsSmallerThan, LoadChainFineStructure.
  • We rewrote the way the synteny ribbons are stored. In the latest implementation, a special track of type SYNTENY_TRACK is added to the maps that will have the synteny ribbons attached. A command 'update ribbons' will convert the data already in the database to the new format.

August, 2020

We provide a few new commands to add new values from the command prompt without using the control file. The records are added interactively: just type the answers to a couple of questions or make a selection from multiple choices.

  • add new annotation_qualifier - add new qualifiers by extracting text from the existing qualifiers using regular expressions. This is particularly useful when creating URLs to jump to external web pages.
  • add qualifier_link - create a hyper-link using a qualifier value
  • add annotation_search - create a new record to nominate a qualifier representing a gene name or function
  • list annotation_qualifier - list annotation qualifiers for a map set with example values and counts. It is possible to list all values for a selected qualifier.
  • list qualifier_link - list all qualifier links that will be used for a given map set
  • list annotation_search - list the qualifiers that are used to represent a gene name or function
  • add bed - create tracks by loading BED files
  • Use ParentGroupName to insert new tracks under a parent group. This helps reduce the number of track nodes at the top level, simplifying the graphics for each map. For now, 4 track types are supported: gene annotation, markers, quantitative and BED tracks. This is an example of what you need to add to the INI file:

; ParentGroupName: the new track will be placed under a parent node with this name. 
; To reduce the number of track nodes on the top level, group the tracks of similar type.
ParentGroupName=gene models

Please check the sample files to see more detailed instructions for particular track types.

  • Handle GFF files that have gene records split into several lines, which happens when dealing with trans-splicing.

June, 2020

New commands starting with 'create' have been added. In general, while the command 'add' adds the data from the data files, the command 'create' generates new data, optionally loading it to the database.

  • create tags - cut out short sequences (~100 bp, adjustable) from a reference genome sequence, that later can be mapped onto sequences of closely-related organisms
  • create markermapping - use marker sequences (or the sequence tags extracted by 'create tags') to create new marker tracks. The common markers mapped on different maps will be automatically linked.
  • Create BLAST files on the fly - when loading FASTA files with genomic sequences or when loading the gene annotation.