Number-Unit Rewriter¶
What it does¶
The Number-Unit Rewriter takes term combinations comprising a number and a unit and rewrites these
combinations to filter and boost queries. The precondition for configuring this rewriting for a
certain unit is a numeric field in the index containing standardized values for the respective unit.
Let’s assume there is an index containing data for televisions. Furthermore, there is a field for
the screen size screen_size
of the notebooks. This rewriter is now able to rewrite the query
notebook 15 inch
in a way that the regular matching is only done for the term notebook
,
whereas the terms 15 inch
are rewritten to range queries in order to limit the results
to a reasonable subset of notebooks with a screen size between 12" - 18"
)
and to boost the notebooks the more the closer their price is to the searched price.
Setup¶
This rewriter is configured using a JSON definition. The minimum required configuration is one
entry in numberUnitDefinitions
with at least one unit and one field. This sparse configuration
already works as various defaults are applied.
1{
2 "numberUnitDefinitions": [
3 {
4 "units": [ { "term": "inch" } ],
5 "fields": [ { "fieldName": "screen_size" } ]
6 }
7 ]
8}
For Solr, the JSON file is put into the ZooKeeper; for Elasticsearch/OpenSearch,
the JSON is put into a string value for the property config
.
PUT /_querqy/rewriter/numberunit
1{
2 "class": "querqy.elasticsearch.rewriter.NumberUnitRewriterFactory",
3 "config": {
4 "config": "{ \"numberUnitDefinitions\": [ ... ] }"
5 }
6}
Note
OpenSearch users: Simply replace package name elasticsearch
with opensearch
in rewriter configurations.
Querqy 5
POST /solr/mycollection/querqy/rewriter/number_unit?action=save
Content-Type: application/json
1{
2 "class": "querqy.solr.rewriter.numberunit.NumberUnitRewriterFactory",
3 "config": {
4 "config": "{ \"numberUnitDefinitions\": [ ... ] }"
5 }
6}
Querqy 4
1<lst name="rewriter">
2 <str name="class">querqy.solr.contrib.NumberUnitRewriterFactory</str>
3 <str name="config">number-unit-config.json</str>
4</lst>
Configuring filter queries¶
The range for the notebooks included in the results depends on the number-unit term combination
in the query as well as on configured percentages used to calculate range filter boundaries at
query time. For this definition, the JSON object filter
is added to the
numberUnitDefinition
object. The filter
object expects two properties,
percentageLowerBoundary
defining the percentage that is subtracted from the searched
number to define the lower filter range boundary and percentageUpperBoundary
defining the
percentage that is added to the searched number to define the upper filter range boundary.
The default for both properties is 20
.
1{
2 "numberUnitDefinitions": [
3 {
4 "units": [ { "term": "inch" } ],
5 "fields": [ { "fieldName": "screen_size" } ],
6 "filter": {
7 "percentageLowerBoundary": 20,
8 "percentageUpperBoundary": 10
9 }
10 }
11 ]
12}
By using this definition, number-unit part of the query notebook 15 inch
is rewritten to a
filter query with a lower bound of 15 - 3 = 12
and an upper bound of
15 + 1.5 = 16.5
. Consequently, all notebooks with a screen size between 12 - 16.5 inch
are included in the results (boundaries are inclusive).
Configuring boost queries¶
From a business perspective, it might be reasonable not to be too strict with filter boundaries in
order to avoid zero results for more specific queries. This is why this rewriter additionally
creates boost queries in order to help the best fitting products
to appear at the top. The boost properties are all defined in the JSON object boost
.
1{
2 "numberUnitDefinitions": [
3 {
4 "units": [ { "term": "inch" } ],
5 "fields": [ { "fieldName": "screen_size" } ],
6 "boost": {
7 "percentageLowerBoundary": 10,
8 "percentageUpperBoundary": 10,
9
10 "minScoreAtLowerBoundary": 20,
11 "minScoreAtUpperBoundary": 20,
12
13 "maxScoreForExactMatch": 40,
14 "additionalScoreForExactMatch": 15
15 },
16 "filter": {
17 "percentageLowerBoundary": 20,
18 "percentageUpperBoundary": 10
19 }
20 }
21 ]
22}
The properties percentageLowerBoundary
and percentageUpperBoundary
are calculated in the
same way like the boundaries for the filter query. The properties
maxScoreForExactMatch
and additionalScoreForExactMatch
define
the additional score a product gets when its value exactly matches the searched value.
Furthermore, the properties minScoreAtLowerBoundary
and minScoreAtUpperBoundary
define
the additional score a product gets when its value matches the defined boundary.
For instance, for the query notebook 15 inch
the value for percentageLowerBoundary
is
13.5
and the value for percentageUpperBoundary
is 16.5
. This means that all products
with a screen size between these two values are boosted by a certain additional score. Products
with a value of 13.5
or 16.5
are boosted by an additional score of 20
, whereas
products with a value of 15
are boosted by 40 + 15 = 55
.
For the products being somewhere in between, linear functions are calculated using the searched
value, the boundaries and the respective score. They are applied in a way that the closer products
are to the searched value, the closer the additional score is to maxScoreForExactMatch
. For
this example, a linear function is created that crosses the points (13.5, 20)
and
(15, 40)
. For instance, a product with a value of 14.0
would be boosted approximately
by an additional score of 27
, a product with a value of 14.5
by 33
. For products between
15
and 16.5
the same is done with a linear function crossing the points (15, 40)
and
(16.5, 20)
.
The property additionalScoreForExactMatch
is not used for the calculations of linear functions.
The main purpose for this property is to ensure that perfectly matching product will actually appear
at the top (there might be other boosting factors in addition).
Perfect match ranges¶
For several contexts, it makes sense to consider a perfect match rather as a range than a comparison
of equality. For instance, a notebook with 16 inch
has more or less the same screen size like a
notebook with 15.6 inch
. Consequently, a bag for notebooks with 16 inch
is well suitable also
for notebooks with 15.6 inch
. This is why the rewriter supports configuring perfect matches as a
range. Products that are within this range will also benefit from the boost that is configurable via
additionalScoreForExactMatch
.
1{
2 "numberUnitDefinitions": [
3 {
4 "units": [ { "term": "inch" } ],
5 "fields": [ { "fieldName": "screen_size" } ],
6 "boost": {
7 "percentageLowerBoundary": 10,
8 "percentageUpperBoundary": 10,
9
10 "minScoreAtLowerBoundary": 20,
11 "minScoreAtUpperBoundary": 20,
12
13 "percentageLowerBoundaryExactMatch": 5,
14 "percentageUpperBoundaryExactMatch": 5,
15
16 "maxScoreForExactMatch": 40,
17 "additionalScoreForExactMatch": 15
18 },
19 "filter": {
20 "percentageLowerBoundary": 20,
21 "percentageUpperBoundary": 10
22 }
23 }
24 ]
25}
The boundaries for these properties are calculated in the same way like the other boundaries.
Configuring multiple units¶
The rewriter supports searching for different units. For instance, users might not only use the unit
inch
for searching for screen sizes, but also use the unit cm
. However, the value in the index
might be standardized for the unit inch
, so the value in the query must be adjusted in order to
be suitable for the respective field. The adjusted value will be used for all filter and boost queries.
1{
2 "numberUnitDefinitions": [
3 {
4 "units": [
5 { "term": "inch", "multiplier": 1.0 },
6 { "term": "cm", "multiplier": 0.393701 }
7 ],
8 "fields": [ { "fieldName": "screen_size" } ]
9 }
10 ]
11}
Configuring the same unit multiple times¶
For several units, it might be unclear, which field they are related to. Users searching for cm
might
either searching for a screen size or the height of something. This is why the rewriter supports
configuring the same unit multiple times. As it is not clear in query time, for what exactly a user is
searching for, the fields are combined in the filter queries using boolean OR
, which means the searched
number must match in one of the configured fields. For the boosting, the different boost functions are
wrapped by a max() function, which means that the highest boost is applied for the scoring.
1{
2 "numberUnitDefinitions": [
3 {
4 "units": [
5 { "term": "inch", "multiplier": 1.0 },
6 { "term": "cm", "multiplier": 0.393701 }
7 ],
8 "fields": [ { "fieldName": "screen_size" } ]
9 },
10 {
11 "units": [
12 { "term": "cm", "multiplier": 1.0 }
13 ],
14 "fields": [ { "fieldName": "height" } ]
15 }
16 ]
17}
Multiple number-unit inputs in a query¶
In some edge cases, a query even might contain more than one number-unit combination, e. g.
notebook 15 inch 1 tb
. In this case, one filter query is applied for each number-unit combination
(which means they are logically combined via boolean AND
) and one boost is added to the score per
number-unit combination.