Fast search is provided by Apache Solr. The Solr server, which runs on Java, must be installed separately (unless you are using the pre-configured Docker container).

This page describes how to populate the Solr index using PersephoneShell.

To enable the Solr functions in PersephoneShell, a few adjustments to the configuration file psh.exe.config (Docker: /data/Data/psh.exe.config) may be necessary. Here is a typical XML node responsible for setting up Solr:


  <SearchIndex>
    <Connection Name="test" Url="http://localhost:8983/solr/" CoreName="test" />
    <Connection Name="prod" Url="http://localhost:8983/solr/" CoreName="prodCore" />
  </SearchIndex>

The code above defines two connections, named "test" and "prod", each with its own settings.

Each connection should have its own Solr core. For example, the Connection named "prod" references the CoreName "prodCore".

QualifierFilter

Please be aware that the Solr text index has a limit of 2 billion items. When PersephoneShell sends the data to Solr, the items include gene models (annotation), markers, QTLs, maps, map sets and tracks. Each of them can have qualifiers and, with millions of gene models or markers, the number of qualifiers in the index can be quite considerable. 
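To get a feeling for how close a data set might come to the limit, a back-of-the-envelope estimate can help. The sketch below is only an illustration: it assumes that every feature and every indexed qualifier counts as one Solr item, which may not match exactly how PersephoneShell builds the index, and the numbers in the example are hypothetical.

```python
SOLR_ITEM_LIMIT = 2_000_000_000  # the approximate limit quoted above


def estimate_index_items(feature_count, avg_qualifiers_per_feature):
    """Rough item count, ASSUMING one item per feature plus one per qualifier."""
    return feature_count * (1 + avg_qualifiers_per_feature)


def fraction_of_limit(item_count, limit=SOLR_ITEM_LIMIT):
    """Share of the limit that the estimated item count would use."""
    return item_count / limit


# Hypothetical data set: 40 million markers with 9 qualifiers each.
items = estimate_index_items(40_000_000, 9)
print(f"{items:,} items, {fraction_of_limit(items):.0%} of the limit")
```

If the estimate comes anywhere near the limit, the filtering methods described below are worth considering.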

For small data sets, which are common when using our Docker solution, the number of Solr items is unlikely to approach the 2-billion-record limit. In most cases, reducing the number of indexed items is unnecessary, and full indexing of all possible qualifiers is feasible without issue.

Note

By default, annotation and marker qualifiers that can be parsed as a number are not indexed for search.

If your data set contains hundreds of map sets and the features have multiple qualifiers, it makes sense to limit the indexed objects. To minimize the index size, you can use two methods:

  • mask some of the tracks from indexing.
  • specify qualifiers that must be ignored or indexed by providing QualifierFilters.

The QualifierFilter node defines the rules for including qualifiers in the index or excluding them from it. The order of the rules is important: later rules take priority over earlier ones. To index just a few designated qualifiers and skip the rest, first disable indexing of all qualifiers for a specific feature type. For example, to stop indexing marker qualifiers, use this:

<QualifierFilter Type="MARKER" Indexing="false" />

After this, explicitly list the qualifiers that need to be indexed, for example:

<QualifierFilter Type="MARKER" Name="CLNDN" Indexing="true"/>

The instruction above turns the indexing on for the marker qualifier "CLNDN".

Conversely, if you want to index all qualifiers except a few, first allow indexing for all qualifiers and then set Indexing to "false" for the selected ones, for example:

<QualifierFilter Type="MARKER" Name="alleles" Indexing="false"/>

A general format for the QualifierFilter is: 

<QualifierFilter Type="{ANNOT/MARKER/empty}" Name="{qualname/empty}" Indexing="{true/false}" FloatingPoint="{true/empty}" Integer="{true/empty}" />
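The "later rules take priority" behavior can be pictured as a last-match-wins evaluation. The following Python sketch is a conceptual model only, not the PersephoneShell implementation. It assumes that an empty Type or Name matches everything, that names are compared case-sensitively, and that a qualifier matched by no rule is indexed; the class and function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QualifierFilter:
    """Hypothetical in-memory form of one <QualifierFilter> rule."""
    indexing: bool
    type: Optional[str] = None  # "MARKER", "ANNOT", or None for any type
    name: Optional[str] = None  # qualifier name, or None for any name


def is_indexed(filters, feature_type, qualifier_name, default=True):
    """Return the decision of the LAST rule that matches (assumed semantics)."""
    decision = default
    for rule in filters:
        if rule.type is not None and rule.type != feature_type:
            continue  # rule is for another feature type
        if rule.name is not None and rule.name != qualifier_name:
            continue  # rule is for another qualifier (case-sensitive)
        decision = rule.indexing  # a later match overrides an earlier one
    return decision


rules = [
    QualifierFilter(indexing=False, type="MARKER"),               # block all marker qualifiers
    QualifierFilter(indexing=True, type="MARKER", name="CLNDN"),  # then re-enable one
]
print(is_indexed(rules, "MARKER", "CLNDN"))    # re-enabled by the second rule
print(is_indexed(rules, "MARKER", "alleles"))  # blocked by the first rule
```

Under these assumptions, swapping the two rules would block "CLNDN" as well, because the blanket rule would then come last.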

Use the command 'list marker_qualifier' or 'list annotation_qualifier' to see the values stored under particular qualifiers. This will help you decide whether the indexing of some qualifiers can be skipped. The command 'list qualifier_filter' will show the filtering rules.

Here is an example of the section SearchIndex with common qualifier filters. Note that the qualifier names are case-sensitive:

<SearchIndex>
    <Connection Name="prod" Url="http://localhost:8983/solr/" CoreName="prodCore" >
          <QualifierFilter Type="MARKER" Indexing="false"/>
          <QualifierFilter Type="MARKER" Name="CLNDN" Indexing="true"/>
          <QualifierFilter Type="MARKER" Name="rsId" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Indexing="false"/>
          <QualifierFilter Type="ANNOT" Name="transcript_id" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="transcriptName" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="transcriptId" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="transcriptID" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="product" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="old_locus_tag" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="note" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="locus_tag" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="iwgsc_id" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="gene_id" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="gene_synonym" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="geneName" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="geneId" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="gene" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="description" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="definition" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="alias" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Synonym" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Parent" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Name" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Note" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Info" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Function description" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Description" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Alias" Indexing="true"/>
          <QualifierFilter Type="ANNOTNAME" Indexing="true"/>
          <QualifierFilter Type="ANNOTFUNC" Indexing="true"/>
    </Connection>
  </SearchIndex>
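Because qualifier names are case-sensitive, the example above lists several spellings of the same name (for instance "note" and "Note"). A small script can help review which names appear in more than one case variant. This is a sketch that reads a SearchIndex fragment with Python's standard xml.etree.ElementTree module; the shortened CONFIG text and the function name case_variants are illustrative, and the layout of the rest of psh.exe.config is not assumed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened, illustrative copy of a SearchIndex section.
CONFIG = """
<SearchIndex>
  <Connection Name="prod" Url="http://localhost:8983/solr/" CoreName="prodCore">
    <QualifierFilter Type="ANNOT" Indexing="false"/>
    <QualifierFilter Type="ANNOT" Name="note" Indexing="true"/>
    <QualifierFilter Type="ANNOT" Name="Note" Indexing="true"/>
    <QualifierFilter Type="ANNOT" Name="geneName" Indexing="true"/>
  </Connection>
</SearchIndex>
"""


def case_variants(xml_text, connection_name):
    """Group filter names that differ only by letter case, per feature type."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for conn in root.iter("Connection"):
        if conn.get("Name") != connection_name:
            continue
        for rule in conn.iter("QualifierFilter"):
            name = rule.get("Name")
            if name:  # skip blanket rules that have no Name attribute
                groups[(rule.get("Type"), name.lower())].append(name)
    # Keep only names that occur in more than one spelling.
    return {key: names for key, names in groups.items() if len(names) > 1}


print(case_variants(CONFIG, "prod"))
```

Listing each spelling separately is intentional in the example above; the script only makes the variants easier to spot.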

As a best practice, use standard qualifier names that carry the gene name and gene function. For example, geneName can always store the gene name. The INI file for loading gene annotation can have an instruction to save a GFF attribute to a qualifier with a different name. Here is how you can convert the GFF attribute 'ID' to 'geneName':


QualifierAttributeKey.gene:ID=geneName

This allows you to use a simple qualifier filter like this:

<QualifierFilter Type="ANNOT" Name="geneName" Indexing="true"/>           


Another way to simplify the filters is to use qualifier filters of the types ANNOTNAME and ANNOTFUNC. For example, a line like this:

<QualifierFilter Type="ANNOTNAME" Indexing="true"/>

will force indexing of any qualifier whose role is set to GeneName. Please see the section [AnnotationSearch] in the INI files for loading the gene models.


Masking entire tracks from indexing

Another way to reduce the size of the Solr index is to mask certain tracks, either with the instruction IsSearchable=false in the INI file when loading the data, or later by running the command 'searchindex skip'.

Sync vs. rebuild

The command 'searchindex sync' finds discrepancies between the objects in the database and the Solr index and synchronizes them by adding or removing items. It is important to understand that the comparison is done at the track level. This means that if qualifiers are added or deleted after the track has been created, the 'sync' command will not detect the difference, because the track itself is still present. To update the index after changing the qualifiers (by adding or removing QualifierFilter rules, or by adding or removing the qualifiers themselves), run the 'searchindex rebuild' command, which may take considerably longer. For this reason, plan the QualifierFilter rules carefully before putting them to use. If you modify the rules and do not rebuild the corresponding index, the new rules are applied only to newly added items.

Sync vs. deepsync

Indexing can run in parallel with data loading. This concurrency introduces a risk that the data being loaded are indexed only partially: indexing of a track can finish before the track is completely loaded. The Solr index will contain a record for the track, so a regular 'sync' command will not recognize that the index is incomplete. To overcome this limitation, the command 'deepsync' not only checks that a track exists but also compares the feature counts and, in case of a mismatch, triggers rebuilding of the index.
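The difference between the two commands can be illustrated with a toy model. The Python sketch below is not how PersephoneShell is implemented. It simply assumes that the database and the index can each be summarized as a mapping from track ID to feature count, and the function names and track IDs are hypothetical.

```python
def sync_plan(db_tracks, index_tracks):
    """Track-level comparison: only the presence of a track ID matters."""
    to_add = sorted(set(db_tracks) - set(index_tracks))
    to_remove = sorted(set(index_tracks) - set(db_tracks))
    return to_add, to_remove


def deepsync_plan(db_tracks, index_tracks):
    """Like sync_plan, but also flags tracks whose feature counts differ."""
    to_add, to_remove = sync_plan(db_tracks, index_tracks)
    to_rebuild = sorted(
        track
        for track in set(db_tracks) & set(index_tracks)
        if db_tracks[track] != index_tracks[track]
    )
    return to_add, to_remove, to_rebuild


# Hypothetical state: "genes_v1" was indexed before loading had finished.
db = {"genes_v1": 1000, "markers": 500, "qtl": 20}
index = {"genes_v1": 640, "markers": 500, "old_track": 7}

print(sync_plan(db, index))      # "genes_v1" is not reported
print(deepsync_plan(db, index))  # "genes_v1" is flagged for rebuilding
```

In this model the track-level comparison reports only the missing and the obsolete track, while the count comparison additionally catches the partially indexed one.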