More about queries
In ‘Getting started with Querqy’ we showed how to build a minimal query with Querqy:
POST /myindex/_search
1{
2 "query": {
3 "querqy": {
4 "matching_query": {
5 "query": "notebook"
6 },
7 "query_fields": [ "title^3.0", "brand^2.1", "shortSummary"]
8 }
9 }
10}
All we had to do was use a querqy query (line #3), define a query string for matching (#5) and specify which fields to query (#7).
/solr/mycollection/select?q=notebook&defType=querqy&qf=title^3.0 brand^2.1 shortSummary
All we had to do was use the Querqy query parser (defType=querqy), define a query string for matching (q=...) and specify which fields to query (qf=...).
Querqy has many more query parameters. We will introduce a few underlying concepts before we explain them in the Reference section.
The matching query is the query that defines the members of the search result set. Only documents that match this query will make it into the search result.
Its counterpart is the boosting query. A boosting query has no influence on search result membership, but it does influence search result scoring: the score of documents that match a boosting query will be changed. Depending on the purpose of the boosting, the matching documents will be moved either further towards the top or further towards the bottom of the search result list. There can be more than one boosting query in a single search request.
Query rewriting can manipulate queries by adding or removing query terms or entire subqueries. For example, if a rewriter adds a synonym it will add one or more terms to the matching query. If it adds an UP or DOWN boost, it will add boosting queries. We will say that these additional terms are generated.
Manipulating the matching query influences not only which documents are included in the search results but also how they are scored, regardless of boosting queries. The result set can be narrowed down further by filter queries that are generated by a rewriter. These filters do not influence scoring.
Solr Ranking Queries and Querqy Boosting
Solr allows for query result re-ranking using Ranking Queries. A RankQuery can be passed using Solr’s rq parameter. Solr will then re-score the top N results of the main query by the RankQuery’s weights and re-rank the results accordingly. An example of a RankQuery is the one produced by Solr’s Learning to Rank plugin.
Some RankQueries don’t consider boosts or scores in the main query and completely re-rank the result by their own criteria. If you are using re-ranking (e.g. via the Solr LTR plugin) but want to retain control over your Querqy-boosted queries you might want to skip re-ranking for these queries.
We have added a request parameter, querqy.rq, as a replacement for Solr’s rq parameter to facilitate this behaviour. It is equivalent to Solr’s rq but only applies re-ranking if there are no boosts on the query generated by Querqy.
For example, if you want to re-rank the first 500 docs using LTR, but only when there are no boost queries on the Querqy query, use the query parameter querqy.rq={!ltr model=myModel reRankDocs=500}.
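As a rough illustration, the request could be assembled in Python as follows. This is only a sketch: the host, port and the helper name build_select_url are assumptions made for the example, while the parameter names and values are the ones used in the text above.

```python
from urllib.parse import urlencode


def build_select_url(base_url, user_query, rank_query):
    """Build a Solr select URL that hands the rank query to Querqy."""
    params = {
        "q": user_query,
        "defType": "querqy",
        "qf": "title^3.0 brand^2.1 shortSummary",
        # querqy.rq is used instead of Solr's rq so that re-ranking is
        # only applied when Querqy has not generated any boosts.
        "querqy.rq": rank_query,
    }
    # urlencode percent-encodes the braces, '!' and spaces of the local params.
    return base_url + "?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr/mycollection/select",  # assumed host and port
    "notebook",
    "{!ltr model=myModel reRankDocs=500}",
)
print(url)
```

Note that Solr’s own rq parameter is deliberately not set in this sketch.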
Reference
POST /myindex/_search
1{
2 "query": {
3
4 "querqy": {
5
6 "matching_query": {
7 "query": "notebook",
8 "similarity_scoring": "dfc",
9 "weight": 0.75
10 },
11
12 "query_fields": [
13 "title^3.0", "brand^2.1", "shortSummary"
14 ],
15
16 "minimum_should_match": "100%",
17 "tie_breaker": 0.01,
18 "field_boost_model": "prms",
19
20 "rewriters": [
21 "word_break",
22 {
23 "name": "common_rules",
24 "params": {
25 "criteria": {
26 "filter": "$[?(!@.prio || @.prio == 1)]"
27 }
28 }
29 }
30 ],
31
32 "boosting_queries": {
33 "rewritten_queries": {
34 "use_field_boost": false,
35 "similarity_scoring": "off",
36 "positive_query_weight": 1.2,
37 "negative_query_weight": 2.0
38 },
39 "phrase_boosts": {
40 "full": {
41 "fields": ["title", "brand^4"],
42 "slop": 2
43 },
44 "bigram": {
45 "fields": ["title"],
46 "slop": 3
47 },
48 "trigram": {
49 "fields": ["title", "brand", "shortSummary"],
50 "slop": 6
51 },
52 "tie_breaker": 0.5
53 }
54 },
55
56 "generated" : {
57 "query_fields": [
58 "title^2.0", "brand^1.5", "shortSummary^0.0007"
59 ],
60 "field_boost_factor": 0.8
61 }
62
63 }
64 }
65}
Global parameters and matching query
- query_fields
The list of fields in which to search for query terms. A field weight can be appended to the field name using the ^ symbol. Field weights are positive integer or decimal numbers. The default field weight is 1.0.
Required
- minimum_should_match
The minimum number of query clauses that must match for a document to be returned. (This definition is copied from Elasticsearch’s match query documentation, which also describes the valid parameter values.)
The minimum number of query clauses is counted across fields. For example, if the query a b is searched in "query_fields":["f1", "f2"] with "minimum_should_match":"100%", the two terms need not match in the same field, so a document matching f1:a and f2:b will be included in the result set.
Default: 1
- tie_breaker
When a query term a is searched across fields (f1, f2 and f3), the query is expanded into term queries (f1:a, f2:a, f3:a). The expanded query will use as its own score the score of the highest-scoring term query plus the sum of the scores of the other term queries multiplied by tie_breaker. Assuming that f2:a produces the highest score, the resulting score will be score(f2:a) + tie_breaker * (score(f1:a) + score(f3:a)).
Default: 0.0
- field_boost_model
Values: fixed, prms
Querqy lets you choose between two approaches to field boosting in scoring:
fixed: field boosts are specified with the field names in ‘query_fields’. The same field weight will be used across all query terms for a given query field.
prms: field boosts are derived from the distribution of the query terms in the index. More specifically, they are derived from the probability that a given query term occurs in a given field in the index. For example, given the query ‘apple iphone black’ with query fields ‘brand’, ‘category’ and ‘color’, the term ‘apple’ will in most data sets have a greater probability and weight for the ‘brand’ field compared to ‘category’ and ‘color’, whereas ‘black’ will have the greatest probability in the ‘color’ field. [1]
Field weights specified in ‘query_fields’ will be ignored if ‘field_boost_model’ is set to ‘prms’.
Default: fixed
- matching_query.similarity_scoring
Values: dfc, on, off
Controls how Lucene’s scoring implementation (= similarity) is used when an input query term is expanded across fields and when it is expanded during query rewriting:
dfc: ‘document frequency correction’ - use the same document frequency value for all terms that are derived from the same input query term. For example, let ‘a b’ be the input query and let it be rewritten to ‘(f1:a | f2:a | ((f1:x | f2:x) | (f1:y | f2:y))) (f1:b | f2:b)’ by synonym and field expansion. Then all terms in ‘(f1:a | f2:a | ((f1:x | f2:x) | (f1:y | f2:y)))’ (all derived from ‘a’) will use the same document frequency value. More specifically, Querqy will use the maximum document frequency of these terms as the document frequency value for all of them. Similarly, the maximum document frequency of ‘(f1:b | f2:b)’ will be used for these two terms.
off: Ignore the output of Lucene’s similarity scoring. Only field boosts will be used for scoring.
on: Use Lucene’s similarity scoring output. Note that field boosting (normally part of Lucene similarity scoring) is handled outside the similarity in Querqy and that it can be configured using the ‘field_boost_model’ parameter.
Default: dfc
- matching_query.weight
A weight by which the score produced by the matching query is multiplied before the score of the boosting queries is added.
Default: 1.0
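The tie_breaker formula described above can be sketched in a few lines of Python. The function name combine_field_scores and the example scores are made up for illustration; this is not Querqy’s implementation, only the arithmetic from the description.

```python
def combine_field_scores(scores_by_term_query, tie_breaker=0.0):
    """Highest field score plus tie_breaker times the sum of the other scores."""
    scores = list(scores_by_term_query.values())
    if not scores:
        return 0.0
    best = max(scores)
    # sum(scores) - best is the sum of all scores except the highest one
    return best + tie_breaker * (sum(scores) - best)


# Hypothetical scores of the term queries produced for query term 'a'
scores = {"f1:a": 1.0, "f2:a": 3.0, "f3:a": 0.5}

print(combine_field_scores(scores, tie_breaker=0.01))
print(combine_field_scores(scores))  # default 0.0: only the best field counts
```

With a tie_breaker of 0.0 only the best-matching field contributes; with 1.0 the scores of all fields are simply summed.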
- qf (query fields)
The list of fields in which to search for query terms. A field weight can be appended to the field name using the ^ symbol. Field weights are positive integer or decimal numbers. The default field weight is 1.0. See Solr Documentation for parameter value syntax. [2]
Example: qf=title^3 brand^2.1 shortDescription^0.2
Required
- mm (minimum should match)
The minimum number of optional query clauses that must match for a document to be returned.
The minimum number of query clauses is counted across fields. For example, if the query a b is searched in qf=f1 f2 with mm=100%, the two terms need not match in the same field, so a document matching f1:a and f2:b will be included in the result set. See Solr Documentation for value syntax. [2]
Example: mm=100% 2<-1
Default: 1
- tie (tie breaker)
When a query term a is searched across fields (f1, f2 and f3), the query is expanded into term queries (f1:a, f2:a, f3:a). The expanded query will use as its own score the score of the highest-scoring term query plus the sum of the scores of the other term queries multiplied by tie. Assuming that f2:a produces the highest score, the resulting score will be score(f2:a) + tie * (score(f1:a) + score(f3:a)). [2]
Default: 0.0
- fbm (field boost model)
Values: fixed, prms
Querqy lets you choose between two approaches to field boosting in scoring:
fixed: field boosts are specified with the field names in ‘qf’. The same field weight will be used across all query terms for a given query field.
prms: field boosts are derived from the distribution of the query terms in the index. More specifically, they are derived from the probability that a given query term occurs in a given field in the index. For example, given the query ‘apple iphone black’ with query fields ‘brand’, ‘category’ and ‘color’, the term ‘apple’ will in most data sets have a greater probability and weight for the ‘brand’ field compared to ‘category’ and ‘color’, whereas ‘black’ will have the greatest probability in the ‘color’ field. [1]
Field weights specified in ‘qf’ will be ignored if ‘fbm’ is set to ‘prms’.
Default: fixed
- uq.similarityScore
Values: dfc, on, off
Controls how Lucene’s scoring implementation (= similarity) is used when an input query term is expanded across fields and when it is expanded during query rewriting:
dfc: ‘document frequency correction’ - use the same document frequency value for all terms that are derived from the same input query term. For example, let ‘a b’ be the input query and let it be rewritten to ‘(f1:a | f2:a | ((f1:x | f2:x) | (f1:y | f2:y))) (f1:b | f2:b)’ by synonym and field expansion. Then all terms in ‘(f1:a | f2:a | ((f1:x | f2:x) | (f1:y | f2:y)))’ (all derived from ‘a’) will use the same document frequency value. More specifically, Querqy will use the maximum document frequency of these terms as the document frequency value for all of them. Similarly, the maximum document frequency of ‘(f1:b | f2:b)’ will be used for these two terms.
off: Ignore the output of Lucene’s similarity scoring. Only field boosts will be used for scoring.
on: Use Lucene’s similarity scoring output. Note that field boosting (normally part of Lucene similarity scoring) is handled outside the similarity in Querqy and that it can be configured using the ‘fbm’ parameter.
Default: dfc
- uq.boost
A weight by which the score produced by the matching query is multiplied before the score of the boosting queries is added.
Default: 1.0
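To make the ‘dfc’ mode more concrete, the following Python sketch applies the rule described above: all term queries derived from the same input term receive the maximum document frequency found in their group. The function name apply_dfc and the document frequency values are invented for illustration and are not part of Querqy’s API.

```python
def apply_dfc(doc_freqs_by_input_term):
    """Give every derived term query the maximum df of its input-term group."""
    corrected = {}
    for input_term, doc_freqs in doc_freqs_by_input_term.items():
        max_df = max(doc_freqs.values())
        corrected[input_term] = {term_query: max_df for term_query in doc_freqs}
    return corrected


# Hypothetical document frequencies after synonym and field expansion of 'a b',
# where 'x' and 'y' are synonyms generated for 'a'
doc_freqs = {
    "a": {"f1:a": 120, "f2:a": 15, "f1:x": 300, "f2:x": 40, "f1:y": 7, "f2:y": 2},
    "b": {"f1:b": 50, "f2:b": 80},
}

corrected = apply_dfc(doc_freqs)
print(corrected["a"])
print(corrected["b"])
```

Because a rare synonym such as f2:y no longer has a lower document frequency than the term the user typed, it cannot receive a higher idf-style weight than that term.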
Boosting queries
- boosting_queries
Controls sub-queries that do not influence the matching of documents but contribute to the score of documents that are retrieved by the ‘matching_query’. A ‘querqy’ query lets you control two main types of boosting queries:
rewritten_queries - boost queries that are produced as part of query rewriting
phrase_boosts - (partial) phrases that are derived from the query string for boosting documents that contain corresponding phrase matches
Scores from both types of boosting queries will be added to the score of the ‘matching_query’.
- boosting_queries.rewritten_queries.use_field_boost
If true, the scores of the boost queries will include field weights. A field boost of 1.0 will be used otherwise.
Default: true
- boosting_queries.rewritten_queries.similarity_scoring
Values: dfc, on, off
Controls how Lucene’s scoring implementation (= similarity) is used when the boosting query is expanded across fields.
dfc: ‘document frequency correction’ - use the same document frequency (df) value for all term queries that are produced from the same boost term. Querqy will use the maximum document frequency of the produced terms as the df value for all of them. If the ‘matching_query’ also uses ‘similarity_scoring=dfc’, the maximum df of the matching query will be added to the df of the boosting query terms in order to put the dfs of the two query parts on a comparable scale and to avoid giving extremely high weight to very sparse boost terms.
off: Ignore the output of Lucene’s similarity scoring.
on: Use Lucene’s similarity scoring output.
Default: dfc
- boosting_queries.rewritten_queries.positive_query_weight / .negative_query_weight
Query rewriting in Querqy can produce boost queries that either promote matching documents to the top of the search result (positive boost) or push matching documents to the bottom of the search result list (negative boost).
Scores of positive boost queries are multiplied by ‘positive_query_weight’. Scores of negative boost queries are multiplied by ‘negative_query_weight’. Both weights must be positive decimal numbers. Note that increasing the value of ‘negative_query_weight’ demotes matching documents more strongly.
Default: 1.0
- boosting_queries.phrase_boosts.full / .bigram / .trigram / .tie_breaker
Unlike ‘rewritten_queries’, phrase_boosts can be applied regardless of query rewriting. If enabled, a boost query will be created from phrases that are derived from the query string. Documents matching this boost query will be promoted towards the top of the search result.
The parameter objects full, bigram and trigram control how phrase boost queries will be formed:
full: boosts documents that contain the entire input query as a phrase
bigram: creates phrase queries for boosting from pairs of adjacent query tokens
trigram: creates phrase queries for boosting from triples of adjacent query tokens
The fields lists under each of these parameters define the fields, and their weights, in which the phrases will be looked up. The slop defines the number of positions the phrase tokens are allowed to shift while still counting as a phrase. A ‘slop’ of two or greater allows for token transposition (compare Elasticsearch’s Match phrase query). The default ‘slop’ is 0.
Depending on the number of query tokens, a matching ‘full’ phrase query can imply one or more ‘bigram’ and ‘trigram’ matches. The scores of these matches will be summed up, which can quickly result in a very large score for documents that match a long full query phrase. Setting tie_breaker for ‘phrase_boosts’ to a low value will reduce this aggregation effect. Querqy will use the highest score produced by ‘full’, ‘bigram’ and ‘trigram’ matches and multiply the score of the remaining phrase matches by the ‘tie_breaker’ value. A ‘tie_breaker’ of 0.0 - which is the default value - will only use the highest score.
The concept of phrase boosting is very similar to the pf/pf2/pf3/ps/ps2/ps3 parameters of Solr’s Extended DisMax / DisMax query parsers. However, Querqy adds control over the aggregation of the scores from the different phrase boost types using the ‘tie_breaker’.
The score produced by ‘phrase_boosts’ is added to the score of the ‘matching_query’.
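The following Python sketch illustrates how ‘full’, ‘bigram’ and ‘trigram’ phrases could be derived from a query string and how a ‘tie_breaker’ dampens the aggregation of their scores. It is a simplified model under stated assumptions: tokens are split on whitespace, each phrase type contributes a single, made-up score, and the function names are hypothetical rather than taken from Querqy.

```python
def ngrams(tokens, n):
    """Return all phrases made of n adjacent tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_candidates(query):
    """Derive full, bigram and trigram phrases from a whitespace-split query."""
    tokens = query.split()
    return {
        "full": [" ".join(tokens)],
        "bigram": ngrams(tokens, 2),
        "trigram": ngrams(tokens, 3),
    }


def combine_phrase_scores(scores_by_type, tie_breaker=0.0):
    """Highest phrase score plus tie_breaker times the remaining scores."""
    scores = list(scores_by_type.values())
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)


candidates = phrase_candidates("apple iphone black case")
print(candidates["bigram"])
print(candidates["trigram"])

# Assumed scores for a document that matches all three phrase types
phrase_scores = {"full": 4.0, "bigram": 2.0, "trigram": 3.0}
print(combine_phrase_scores(phrase_scores, tie_breaker=0.5))
print(combine_phrase_scores(phrase_scores))  # default: highest score only
```

A query with fewer than three tokens simply yields an empty trigram list in this sketch.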
- qboost.fieldBoost
Values: on, off
If on, the scores of the boost queries that are produced by query rewriting will include field weights. A field boost of 1.0 will be used otherwise.
Default: on
- qboost.similarityScore
Values: dfc, on, off
Controls how Lucene’s scoring implementation (= similarity) is used when the boosting query is expanded across fields.
dfc: ‘document frequency correction’ - use the same document frequency (df) value for all term queries that are produced from the same boost term. Querqy will use the maximum document frequency of the produced terms as the df value for all of them. If ‘uq.similarityScore’ is also set to ‘dfc’, the maximum df of the matching query will be added to the df of the boosting query terms in order to put the dfs of the two query parts on a comparable scale and to avoid giving extremely high weight to very sparse boost terms.
off: Ignore the output of Lucene’s similarity scoring.
on: Use Lucene’s similarity scoring output.
Default: dfc
- qboost.weight / .negWeight
Query rewriting in Querqy can produce boost queries that either promote matching documents to the top of the search result (positive boost) or push matching documents to the bottom of the search result list (negative boost).
Scores of positive boost queries are multiplied by ‘qboost.weight’. Scores of negative boost queries are multiplied by ‘qboost.negWeight’. Both weights must be positive decimal numbers. Note that increasing the value of ‘qboost.negWeight’ demotes matching documents more strongly.
Default: 1.0
- pf/pf2/pf3/ps/ps2/ps3/qpf.tie (phrase boosts)
Phrase boosts can be applied regardless of query rewriting. If enabled, a boost query will be created from phrases that are derived from the query string, either using the entire query as a phrase for boosting (pf/ps), or using bigrams (pf2/ps2) or trigrams (pf3/ps3) as phrases.
This works very similarly to the same parameters in Solr (see Solr’s DisMax and eDismax Query Parsers), but Querqy adds another parameter, qpf.tie, to control how the scores from ‘pf’, ‘pf2’ and ‘pf3’ are combined: a long query that matches as a phrase will boost the entire query as a phrase and many bigram and trigram sub-query phrases at the same time, producing a very high boost.
Setting qpf.tie to a low value will reduce this aggregation effect. Querqy will use the highest score produced by ‘pf’, ‘pf2’ and ‘pf3’ matches and multiply the score of the remaining phrase matches by the ‘qpf.tie’ value. A ‘qpf.tie’ of 0.0 will only use the highest score.
Example: pf=name^0.8 brand&pf2=brand&ps=2&ps2=0&qpf.tie=0.01
Defaults:
pf/pf2/pf3: (empty, no phrase boosting)
ps: 0.0
ps2/ps3: value copied from ps
qpf.tie: 0.0
- bf/bq/boost
Additive boost function (bf), additive boost query (bq) and multiplicative boost query (boost). Same as in Solr’s DisMax and eDismax Query Parsers.
- querqy.rq
Same as Solr’s rq parameter but only applies the RankQuery when the Querqy query does not contain any boosts.
Generated query parts
- generated.query_fields
The list of fields and their weights for matching generated query terms like synonyms or boost queries. If no ‘query_fields’ are specified for the generated query parts, the global ‘query_fields’ will be used.
Default: copy from global ‘query_fields’
- generated.field_boost_factor
A factor by which the field weights of the generated query terms are multiplied. This factor can be used to apply a penalty to all terms that were not entered by the user but inserted as part of the query rewriting, for example, to give synonyms a smaller weight compared to the original term.
This factor is applied regardless of where the ‘query_fields’ for generated terms are defined, i.e. in the ‘query_fields’ of the ‘generated’ object or globally.
Default: 1.0
- gqf (generated query fields)
The list of fields and their weights for matching generated query terms like synonyms or boost queries. If no ‘generated query fields’ are specified, the global value from qf will be used.
Example: qf=name^3 brand^1.2 ean&gqf=name^2.4 brand^0.9
Default: copy from global qf
- gbf (generated boost factor)
A factor by which the field weights of the generated query terms are multiplied. This factor can be used to apply a penalty to all terms that were not entered by the user but inserted as part of the query rewriting, for example, to give synonyms a smaller weight compared to the original term.
This factor is applied regardless of where the query fields for generated terms are defined, i.e. in gqf (generated query fields) or qf (globally).
Default: 1.0
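The interplay between the generated query fields and the boost factor can be sketched as follows. The helper names are hypothetical and the code only mirrors the rules stated above: use the generated fields if given, otherwise fall back to the global fields, then multiply every field weight by the factor.

```python
def parse_field(spec):
    """Split 'title^2.0' into ('title', 2.0); the weight defaults to 1.0."""
    name, sep, weight = spec.partition("^")
    return name, float(weight) if sep else 1.0


def generated_field_weights(global_fields, factor=1.0, generated_fields=None):
    """Effective field weights that apply to generated query terms."""
    # Fall back to the global query fields if no generated fields are given.
    fields = generated_fields if generated_fields else global_fields
    return {name: weight * factor for name, weight in map(parse_field, fields)}


global_fields = ["title^3.0", "brand^2.1", "shortSummary"]
generated_fields = ["title^2.0", "brand^1.5", "shortSummary^0.0007"]

# Generated fields are defined: the factor is applied to their weights
with_generated = generated_field_weights(global_fields, 0.8, generated_fields)
print(with_generated)

# No generated fields: the factor is applied to the global field weights
fallback = generated_field_weights(global_fields, 0.5)
print(fallback)
```

In this sketch a factor below 1.0 lowers the weight of every generated term, whichever field list the weights come from.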