Fast search is provided by Apache Solr. The Solr server, which runs on Java, must be installed separately (unless you are using the pre-configured Docker container).

This page describes how to populate the Solr index using PersephoneShell.

To enable the Solr functions in PersephoneShell, you may need to adjust the configuration file psh.exe.config. Here is a typical XML node that sets up Solr:


  <SearchIndex>
    <Connection Name="test" Url="http://localhost:8983/solr/" CoreName="test" />
    <Connection Name="prod" Url="http://localhost:8983/solr/" CoreName="prodCore" >
          <QualifierFilter Floats="true" Indexing="false" />
          <QualifierFilter Integers="true" Indexing="false" />
          <QualifierFilter Type="MARKER" Indexing="false"/>
          <QualifierFilter Type="MARKER" Name="CLNDN" Indexing="true"/>
          <QualifierFilter Type="MARKER" Name="rsId" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Indexing="false"/>
          <QualifierFilter Type="ANNOT" Name="transcript_id" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="transcriptName" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="transcriptId" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="transcriptID" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="product" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="old_locus_tag" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="note" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="locus_tag" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="iwgsc_id" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="gene_id" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="gene_synonym" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="geneName" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="geneId" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="gene" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="description" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="definition" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="alias" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Synonym" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Parent" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Name" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Note" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Info" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Function description" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Description" Indexing="true"/>
          <QualifierFilter Type="ANNOT" Name="Alias" Indexing="true"/>
    </Connection>
  </SearchIndex>

The configuration above defines two connections, "test" and "prod", each with its own rules: "test" has no QualifierFilter rules, while "prod" lists many.

Each connection should have its own Solr core. For example, the Connection named "prod" references the CoreName "prodCore".

QualifierFilter

Please be aware that the Solr text index has a limit of 2 billion items. The items that PersephoneShell sends to Solr include gene models (annotation), markers, QTLs, maps, map sets and tracks. Each of them can have qualifiers, and with millions of gene models or markers the number of qualifiers in the index can become very large. To minimize the index size, mask tracks and qualifiers from indexing whenever possible.
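
To see how close a core is to this limit, you can ask Solr for its document count. The sketch below is only an illustration: it uses the standard Solr select handler with rows=0 and reads numFound from the JSON response. It assumes that each indexed item is stored as one Solr document, and it takes the URL and core name from the "prod" connection shown above. The function names are hypothetical and are not part of PersephoneShell.

```python
import json
import urllib.parse
import urllib.request

# The approximate limit quoted in this page (assumption: one item = one Solr document).
ITEM_LIMIT = 2_000_000_000


def count_url(base_url: str, core: str) -> str:
    """Build a select URL that matches every document but returns no rows."""
    query = urllib.parse.urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{query}"


def parse_num_found(body: str) -> int:
    """Extract the total number of matching documents from a Solr JSON response."""
    return int(json.loads(body)["response"]["numFound"])


def index_fill_ratio(num_found: int, limit: int = ITEM_LIMIT) -> float:
    """Return the fraction of the limit that is already used."""
    return num_found / limit


def fetch_num_found(base_url: str, core: str, timeout: float = 10.0) -> int:
    """Query a running Solr server for the number of documents in a core."""
    with urllib.request.urlopen(count_url(base_url, core), timeout=timeout) as resp:
        return parse_num_found(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a Solr server reachable at this address.
    n = fetch_num_found("http://localhost:8983/solr/", "prodCore")
    print(f"{n} documents, {index_fill_ratio(n):.1%} of the limit")
```

If the ratio is already high before all data are loaded, that is a good reason to tighten the QualifierFilter rules described below.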

The node QualifierFilter defines the rules that include qualifiers in the index or exclude them from it. The order of the rules matters: later rules take priority over earlier ones. To index just a few designated qualifiers and skip the rest, first disable indexing for all qualifiers. For example, to stop indexing marker qualifiers, use this:

      <QualifierFilter Type="MARKER" Indexing="false" />

After this, list the qualifiers that need to be indexed, for example:

      <QualifierFilter Type="MARKER" Name="CLNDN" Indexing="true"/>

The rule above turns indexing on for the marker qualifier "CLNDN".

Conversely, if you want to index all qualifiers except a few, first allow indexing for all qualifiers and then turn Indexing off for the selected ones:

      <QualifierFilter Type="MARKER" Indexing="true"/>
      <QualifierFilter Type="MARKER" Name="alleles" Indexing="false"/>
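
The "later rules have priority" behavior can be pictured with a small model. This is a toy sketch, not PersephoneShell's actual code: the Rule class and the is_indexed function are hypothetical, name matching is assumed to be case-sensitive, and a qualifier that matches no rule is assumed to be indexed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    """One QualifierFilter rule; None means the attribute is left empty."""
    indexing: bool
    qual_type: Optional[str] = None  # "ANNOT", "MARKER", or None for any type
    name: Optional[str] = None       # qualifier name, or None for any name

    def matches(self, qual_type: str, name: str) -> bool:
        type_ok = self.qual_type is None or self.qual_type == qual_type
        name_ok = self.name is None or self.name == name
        return type_ok and name_ok


def is_indexed(rules: Sequence[Rule], qual_type: str, name: str,
               default: bool = True) -> bool:
    """Walk the rules in order; the last matching rule decides."""
    decision = default  # assumption: index when no rule matches
    for rule in rules:
        if rule.matches(qual_type, name):
            decision = rule.indexing
    return decision


# Index only CLNDN among the marker qualifiers.
ONLY_CLNDN = [
    Rule(False, qual_type="MARKER"),
    Rule(True, qual_type="MARKER", name="CLNDN"),
]

# Index every marker qualifier except "alleles".
ALL_BUT_ALLELES = [
    Rule(True, qual_type="MARKER"),
    Rule(False, qual_type="MARKER", name="alleles"),
]

if __name__ == "__main__":
    print(is_indexed(ONLY_CLNDN, "MARKER", "CLNDN"))
    print(is_indexed(ONLY_CLNDN, "MARKER", "alleles"))
    print(is_indexed(ALL_BUT_ALLELES, "MARKER", "alleles"))
```

In this model, swapping the two rules in ONLY_CLNDN would switch CLNDN off again, because the general rule would then come last. This is why the general rule goes first and the exceptions follow.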

A rule can also refer to the type of the qualifier value. Some values can be read as a pure floating-point number or as an integer, with no additional text. Such values are unlikely to be searched for literally, so it makes sense to mask them from indexing. This is now the default behavior, so the following rules no longer need to be added explicitly:

      <QualifierFilter Floats="true" Indexing="false" />
      <QualifierFilter Integers="true" Indexing="false" />

The general format of a QualifierFilter rule is:

      <QualifierFilter Type="{ANNOT/MARKER/empty}" Name="{qualname/empty}" Indexing="{true/false}" FloatingPoint="{true/empty}" Integer="{true/empty}" />

Use the command 'list marker_qualifier' or 'list annotation_qualifier' to see the values stored under particular qualifiers. This will help you decide whether the indexing of some qualifiers can be skipped.

Masking entire tracks from indexing

Another way to reduce the size of the Solr index is to mask certain tracks, either by setting IsSearchable=false in the INI file when loading the data or, later, by running the command 'searchindex skip'.

Sync vs. rebuild

The command 'searchindex sync' finds discrepancies between the objects in the database and the Solr index and synchronizes them by adding or removing items. It is important to understand that the comparison is done at the track level. This means that if qualifiers are added or deleted after the track has been created, 'sync' will not see the difference, because the track itself is still there. To update the index after changing qualifiers (by editing the QualifierFilter rules or by adding or removing the qualifiers themselves), you have to run the command 'searchindex rebuild', which may take considerably longer. So please think carefully before changing the QualifierFilter rules. If you modify the rules and do not rebuild the corresponding index, the new rules will apply only to newly added items.

Sync vs. deepsync

Indexing can run in parallel with data loading. This concurrency introduces a risk that the data being loaded are only partially indexed: the indexing of a track can finish before the track is completely loaded. The Solr index will then contain a record for the track, so a regular 'sync' command will not recognize that the index is incomplete. To overcome this limitation, we introduced the command 'deepsync', which not only checks that a track exists in the index but also compares the feature counts and, in case of a mismatch, triggers rebuilding the index.
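
The difference between the two commands can be pictured with a toy model. The sketch below is only an illustration of the behavior described on this page, not PersephoneShell's implementation: the database and the index are each represented as a hypothetical dictionary that maps a track name to its number of features, and the function names are made up.

```python
from typing import Dict, List, Tuple


def sync_plan(db: Dict[str, int], index: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Track-level comparison: only the presence of a track is checked."""
    to_add = sorted(db.keys() - index.keys())     # in the database, not indexed
    to_remove = sorted(index.keys() - db.keys())  # indexed, no longer in the database
    return to_add, to_remove


def deepsync_plan(db: Dict[str, int],
                  index: Dict[str, int]) -> Tuple[List[str], List[str], List[str]]:
    """Like sync_plan, but also compare feature counts of tracks present in both."""
    to_add, to_remove = sync_plan(db, index)
    to_rebuild = sorted(t for t in db.keys() & index.keys() if db[t] != index[t])
    return to_add, to_remove, to_rebuild


if __name__ == "__main__":
    # "markers" was indexed while it was still loading, so its count is short.
    db = {"genes": 1000, "markers": 500, "qtl": 20}
    index = {"genes": 1000, "markers": 320, "old": 7}
    print(sync_plan(db, index))
    print(deepsync_plan(db, index))
```

In this example the track-level comparison reports only "qtl" as missing and "old" as stale, while the count comparison additionally flags "markers" as incomplete.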