Overview
While using the Predictive Search and Solr Search features, you notice that you are seeing some unexpected search results. You want to better understand the interaction and nuances of the two search engines.
Information
Because Predictive Search and the standard Quick Search function use different search methodologies when Solr is active, you may encounter edge cases where unexpected results appear after executing a Quick Search. This can sometimes be attributed to the gram size settings in your Solr setup: Predictive Search does a simple text match against the filename, starting from the beginning of the filename, whereas Quick Search leverages Solr's n-gram search logic.
Depending on the language used and the Solr settings, these n-gram tokens will be interpreted by one of the tokenizers to enhance search results.
minGramSize and maxGramSize are the values used by the NGram Tokenizer in Solr and are defined in the following XML file: /usr/etc/venture/solr/install/schema.xml. The settings for both values are language-specific (as some languages generally use longer words than others) and depend on the tokenizer used for that language.
There are two different types of tokenizers used in the Solr schema.xml file:
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="18"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
Only the NGram Tokenizer uses the minGramSize and maxGramSize settings. The default settings for most languages using this tokenizer are:
minGramSize="3" maxGramSize="18"
The exception is Japanese, which uses:
minGramSize="2" maxGramSize="6"
Some languages, such as Hindi or Finnish, do not use the NGram Tokenizer.
Changing how Solr interprets search strings affects the results of Quick Searches (when Solr is enabled), increases the disk space used by the Solr internal database, and increases the time needed to recreate it (if Solr was disabled before). To understand the effects of changing the minGramSize and maxGramSize settings, consider the following example:
Suppose the text to be indexed is "Enterprise". With a minGramSize of 3 and a maxGramSize of 5, it yields all of the following indexed terms:
- 3-gram: ent nte ter erp rpr pri ris ise
- 4-gram: ente nter terp erpr rpri pris rise
- 5-gram: enter nterp terpr erpri rpris prise
As you can see, a total of 21 indexed terms are generated for the word "Enterprise". Now, assume we increase the maxGramSize value to 10. The following indexed terms are generated:
- 3-gram: ent nte ter erp rpr pri ris ise
- 4-gram: ente nter terp erpr rpri pris rise
- 5-gram: enter nterp terpr erpri rpris prise
- 6-gram: enterp nterpr terpri erpris rprise
- 7-gram: enterpr nterpri terpris erprise
- 8-gram: enterpri nterpris terprise
- 9-gram: enterpris nterprise
- 10-gram: enterprise
So, increasing the maxGramSize from 5 to 10 increases the number of indexed terms from 21 to 36, an increase of roughly 71%. For strings that consist of multiple words, the NGram calculation also includes the white space between words when generating terms like those above. So, the overhead, in terms of indexing time and space utilization, of increasing the maxGramSize from 18 to 60 can be enormous.
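The counts in the example above can be reproduced with a short script. This is only a minimal sketch of character n-gram generation, not Solr's own implementation: the function name ngrams is hypothetical, and lowercasing the input is an assumption made to mirror the lowercase terms listed above.

```python
def ngrams(text, min_gram, max_gram):
    """Return every character n-gram of text, shortest sizes first.

    Illustrative only: this mimics the lists in the article and is not
    Solr's NGram Tokenizer. Lowercasing is an assumption.
    """
    text = text.lower()
    grams = []
    for size in range(min_gram, max_gram + 1):
        # A string of length n has n - size + 1 grams of a given size
        # (none at all once size exceeds n).
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


word = "Enterprise"
small = ngrams(word, 3, 5)    # minGramSize=3, maxGramSize=5
large = ngrams(word, 3, 10)   # minGramSize=3, maxGramSize=10

print(len(small), len(large))                            # 21 36
print(f"{(len(large) - len(small)) / len(small):.0%}")   # 71%
```

Passing a multi-word string shows that the spaces between words become part of the generated terms as well, which is why longer strings combined with a larger maxGramSize grow the index so quickly.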
It is important to understand these search limitations in the portal because they can cause some known inconsistencies, depending on the version of Xinet you are using. Before Xinet 2021.6, filenames that exceed the maxGramSize may appear in Predictive Search but may not always appear in Quick Search results. Similarly, words that are shorter than the minGramSize may be filtered out of Quick Search results. As of Xinet 2021.6, this behavior is significantly reduced and is most likely to occur when a filename selected in Predictive Search contains characters identified as special search operators (e.g., " ", AND, OR, +, -).
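To see why the two searches can disagree, here is a deliberately simplified model. It assumes that a Quick Search query matches only if the whole query string is itself one of the indexed n-grams, which is not how Solr actually evaluates queries; the function names and the sample filename are hypothetical.

```python
def ngrams(text, min_gram, max_gram):
    """Set of character n-grams of text (illustrative, not Solr's code)."""
    text = text.lower()
    return {
        text[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(text) - size + 1)
    }


def predictive_match(filename, typed):
    """Simple text match from the beginning of the filename."""
    return filename.lower().startswith(typed.lower())


def quick_match(filename, typed, min_gram=3, max_gram=18):
    """Toy model (assumption): the query must equal an indexed n-gram."""
    return typed.lower() in ngrams(filename, min_gram, max_gram)


name = "enterprise_annual_report_2021.pdf"  # hypothetical filename

for typed in ("enterprise_annual_rep", "en", "annual"):
    print(typed, predictive_match(name, typed), quick_match(name, typed))
```

In this model, the 21-character query is longer than maxGramSize and "en" is shorter than minGramSize, so both match in the predictive function but not in the quick one, while "annual" matches only in the quick one because it does not start the filename.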