Logging and debugging rewriters¶
It can become difficult to understand the output of query pre-processing. For example, if multiple rewriters interact as part of the rewrite chain, or if a Common Rules Rewriter is configured with tens of thousands of rules, it can be hard to figure out how a given query was produced. But even with a very simple setup of only a few rules, you will probably want to know whether a given rewriter was applied at all, for example, to analyse which queries and which fraction of the search traffic were manipulated by Querqy in your production environment.
Querqy offers two main approaches to understanding how it rewrites a query. The first approach uses the standard facilities that your search engine provides - debug=true in Solr and the /_validate/query endpoint in Elasticsearch/OpenSearch. We will discuss this further in the section Debugging below. The second approach provides information about query rewriting that is available in the non-debug mode of your search engine and that can be used for analysis and tracking. We call this approach Info Logging below.
Debugging¶
Standard tools¶
The Querqy plugin ultimately produces a Lucene query. This query can be inspected using the standard means that your search engine offers for inspecting the Lucene queries it derives from its query DSL.
In Elasticsearch/OpenSearch this means that you can call
https://<MYHOST>:<PORT>/<INDEX>/_validate/query?explain=true and submit the
usual Querqy search query in the request body. For example:
GET https://<MYHOST>:<PORT>/<INDEX>/_validate/query?explain=true
Content-Type: application/json

{
"query":{
"querqy":{
"matching_query":{
"query":"laptop"
},
"query_fields":[
"title^23",
"name",
"shortSummary"
],
"rewriters":[
"common_rules"
]
}
}
}
If, in the above example, a Common Rules Rewriter named common_rules was defined with the rules
laptop =>
SYNONYM: notebook
UP(100): AMD
DOWN(50): sleeve
the output of _validate/query?explain=true will look like this:
1{
2 "_shards":{
3 "total":1,
4 "successful":1,
5 "failed":0
6 },
7 "valid":true,
8 "explanations":[
9 {
10 "index":"myindex",
11 "valid":true,
12 "explanation":"+(+(name:notebook | shortSummary:laptop | title:laptop^23.0 | title:notebook^23.0 | shortSummary:notebook | name:laptop) AdditiveBoostFunction(100.0,query(+(name:amd | shortSummary:amd | title:amd^23.0),def=0.0)) AdditiveBoostFunction(-50.0,query(+(shortSummary:sleeve name:sleeve title:sleeve^23.0),def=0.0)))"
13 }
14 ]
15}
Line 12 contains the string representation of the parsed Lucene query, and you
will probably recognise the notebook / laptop synonyms. It also shows
AdditiveBoostFunction sub-queries. AdditiveBoostFunction is a custom Lucene
query provided by Querqy to deal with UP/DOWN boosting. In particular, it
avoids producing negative document scores, which are not allowed by Lucene, and
it guarantees that documents that match both UP(100) and DOWN(100) yield the
same score as documents that match neither UP(100) nor DOWN(100).
In Solr this means that you can see the Lucene query in the parsedquery section
of the debug output that you will get if you append debug=true or
debugQuery=true to a search request (see Solr documentation).
For example, the following debug output would be produced for the search request
/select?defType=querqy&q=notebook&querqy.rewriters=common&qf=name%20cat&debug=true
"debug": {
"parsedquery":"+DisjunctionMaxQuery((cat:notebook | name:notebook | name:laptop | cat:laptop)) FunctionQuery(AdditiveBoostFunction(100.0,query(+(cat:AMD | name:amd),def=0.0))) FunctionQuery(AdditiveBoostFunction(-100.0,query(+(cat:sleeve name:sleeve),def=0.0)))",
"parsedquery_toString":"+(cat:notebook | name:notebook | name:laptop | cat:laptop) AdditiveBoostFunction(100.0,query(+(cat:AMD | name:amd),def=0.0)) AdditiveBoostFunction(-100.0,query(+(cat:sleeve name:sleeve),def=0.0))"
}
provided that a Common Rules Rewriter was defined under the name common with
the following rules:
notebook =>
SYNONYM: laptop
UP(100): AMD
DOWN(100): sleeve
As in the Elasticsearch/OpenSearch example above, you will recognise the
notebook / laptop synonyms in the parsed query representation of the debug
output, as well as the AdditiveBoostFunction sub-queries that Querqy uses for
UP/DOWN boosting.
Querqy details in Solr debug mode
Warning
Note that the additional details that Querqy provides to Solr’s debug output have changed in structure and content with the release of ‘Querqy for Solr’ version 5.5.lucene900.0.
Calling a Solr SearchHandler with debugQuery=true will add
an additional section querqy to the debug section in Solr’s response,
for example:
1"debug": {
2 "querqy": {
3 "parser":"querqy.parser.WhiteSpaceQuerqyParser",
4 "rewrite":{
5 "rewriteChain":[
6 {
7 "rewriterId":"common",
8 "actions":[
9 {
10 "message":"notebook#0",
11 "match":{
12 "term":"notebook",
13 "type":"exact"
14 },
15 "instructions":[
16 {
17 "type":"down",
18 "param":"100",
19 "value":"sleeve"
20 },
21 {
22 "type":"up",
23 "param":"100",
24 "value":"AMD"
25 },
26 {
27 "type":"synonym",
28 "value":"laptop"
29 }
30 ]
31 }
32 ]
33 }
34 ]
35 }
36 }
37}
The output tells you under parser which
query string parser Querqy used for processing
the user’s query string.
The section under rewrite contains information about how the query was
processed in the rewrite chain. The content of this section is the same as the
output produced by Querqy's Info Logging with the parameter
querqy.rewriteLogging=details (described below), except that here it is added
to the debug section.
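For reference, a debug section like the one above can be produced by repeating the earlier example request for the rewriter common, replacing debug=true with debugQuery=true:
/select?defType=querqy&q=notebook&querqy.rewriters=common&qf=name%20cat&debugQuery=true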
Info Logging¶
Info logging is a powerful tool when you want to understand how a specific rewriter manipulated a query.
Concepts¶
Querqy rewriters can produce information about whether they were triggered at all and about how they changed the query. Within the framework of Info Logging, we call this information the info logging output.
Depending on the required granularity of the logging output, we distinguish between two types: one that contains all details that a rewriter can possibly produce as logging output, and one that contains only the rewriter ID.
The info logging output needs to be sent to a place where we can pick it up and analyse it, or collect it for later analysis. We call the place to which we send the output a sink. The same output could, at least in theory, be sent to more than one sink.
Setting up Info Logging¶
Setting up Info Logging requires two steps: Configuring sinks and enabling the logging per request.
Warning
With Solr, note that the configuration of info logging and the required
parameters have changed in an incompatible way with the release of ‘Querqy
for Solr’ version 5.5.lucene900.0. The documentation for info logging in
older Querqy versions can be found here.
When migrating to a newer Querqy for Solr version, you might also consider
using the more detailed output that Querqy provides for rewriters when you
call Solr with debugQuery=true instead of Info Logging, depending on your
use case.
Sinks¶
To use Info Logging we need a mapping between each rewriter and the sink(s) to which this rewriter should send its log output. This mapping is configured within the rewriter definition:
PUT /_querqy/rewriter/common_rules
1{
2 "class": "querqy.elasticsearch.rewriter.SimpleCommonRulesRewriterFactory",
3 "config": {
4 "rules" : "notebook =>\nSYNONYM: laptop"
5 },
6 "info_logging": {
7 "sinks": ["log4j"]
8 }
9}
Note
OpenSearch users: Simply replace package name elasticsearch
with opensearch in rewriter configurations.
As you will probably recognise at this stage, the example shows the configuration
for a Common Rules Rewriter. Lines 6-8 are new: they contain the configuration
for Info Logging. The sinks property is a list of named sinks to which this
rewriter should send its log output.
In this case, the list contains only one element, log4j, a predefined sink that
routes the output to the Log4j logging framework, which is used by Elasticsearch
and OpenSearch and which can be configured further. Currently, log4j is the only
available sink for Info Logging under Elasticsearch/OpenSearch, and it is not
(yet) possible to provide a custom sink implementation.
The output in Log4j will look like this (using a file appender):
1[2021-03-26T13:23:43,006][INFO ][q.e.i.Log4jSink ] [node_s_0]DETAIL[ QUERQY ] {"id":"id-1001","msg":{"common_rules1":[{"APPLIED_RULES":["msg1"]}],"common_rules2":[{"APPLIED_RULES":["msg2"]}]}}
2[2021-03-26T13:28:47,454][INFO ][q.e.i.Log4jSink ] [node_s_0]REWRITER_ID[ QUERQY ] {"id":"id-1002","msg":["common_rules"]}
Let’s decompose this. DETAIL[ QUERQY ] (line 1) and REWRITER_ID[ QUERQY ]
(line 2) are Log4j markers that Querqy
provides and that you can use to filter log messages.
The DETAIL and REWRITER_ID markers correspond to the output types that you
can set per request and that are described below. They are both children of a
common parent marker QUERQY.
The log message itself is a small JSON document. The msg element contains
the messages as they were produced by the rewriters with the rewriter IDs
(such as common_rules1) as keys and further rewriter-specific information as
values.
The id element is an ID that can be passed per request to help trace
requests across nodes (see below).
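If you want to route these messages to their own log file, you can configure a dedicated logger and appender in Elasticsearch's config/log4j2.properties (or the OpenSearch equivalent). The following is a minimal sketch, not an official configuration: the logger name querqy.elasticsearch.infologging.Log4jSink is an assumption derived from the abbreviated logger name q.e.i.Log4jSink in the sample output above, and the appender name and file name are placeholders to adapt to your setup.
# Hypothetical file appender for Querqy Info Logging output
appender.querqy_info.type = File
appender.querqy_info.name = querqy_info
appender.querqy_info.fileName = ${sys:es.logs.base_path}/querqy-info.log
appender.querqy_info.layout.type = PatternLayout
appender.querqy_info.layout.pattern = [%d{ISO8601}][%marker] %m%n

# Send the sink's messages to that appender only (additivity = false keeps
# them out of the main server log)
logger.querqy_sink.name = querqy.elasticsearch.infologging.Log4jSink
logger.querqy_sink.level = info
logger.querqy_sink.additivity = false
logger.querqy_sink.appenderRef.querqy_info.ref = querqy_info

# Optional: keep only messages carrying the QUERQY marker or its child
# markers DETAIL and REWRITER_ID
logger.querqy_sink.filter.querqy_marker.type = MarkerFilter
logger.querqy_sink.filter.querqy_marker.marker = QUERQY
logger.querqy_sink.filter.querqy_marker.onMatch = ACCEPT
logger.querqy_sink.filter.querqy_marker.onMismatch = DENY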
In Solr, the sink mapping is likewise configured within the rewriter definition:
POST /solr/mycollection/querqy/rewriter/common_rules?action=save
Content-Type: application/json

1{
2 "class": "querqy.solr.rewriter.commonrules.CommonRulesRewriterFactory",
3 "config": {
4 "rules" : "notebook =>\nSYNONYM: laptop"
5 },
6 "info_logging": {
7 "sinks": ["response"]
8 }
9}
As before, the example shows the configuration for a Common Rules Rewriter, and
lines 6-8 are new: they contain the configuration for Info Logging. The sinks
property is a list of named sinks to which this rewriter sends its log messages.
In this case, the list contains only one element, response, which is a
predefined sink that adds the Info Logging output to the search response
returned by Solr.
Expert: Predefined and custom sinks in Solr
The response sink is currently the only predefined sink that comes with
Querqy for Solr. However, you can use your own sink by implementing the
querqy.infologging.Sink interface and making it available by adding the
following configuration to the QuerqyRewriterRequestHandler in
solrconfig.xml:
<requestHandler name="/querqy/rewriter" class="querqy.solr.QuerqyRewriterRequestHandler">
<lst name="infoLogging">
<lst name="sink">
<str name="id">customSink1</str>
<str name="class">my.name.CustomSinkOne</str>
</lst>
<lst name="sink">
<str name="id">customSink2</str>
<str name="class">my.name.CustomSinkTwo</str>
</lst>
</lst>
</requestHandler>
and then add the mappings to the sink(s) in the rewriter configurations:
POST /solr/mycollection/querqy/rewriter/common_rules?action=save
Content-Type: application/json

{
  "class": "querqy.solr.rewriter.commonrules.CommonRulesRewriterFactory",
  "config": {
    "rules" : "notebook =>\nSYNONYM: laptop"
  },
  "info_logging": {
    "sinks": ["response", "customSink1", "customSink2"]
  }
}
As the sink mappings are configured per rewriter, you can decide per rewriter to which sink(s) you want to send its Info Logging output, and you can even have one sink per rewriter.
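For illustration, a minimal sketch of such a per-rewriter mapping: while the common_rules rewriter above sends its output to the response sink and the custom sinks, a second rewriter could write to customSink1 only. The rewriter name common_rules_typos and its rule are made up for this example:
POST /solr/mycollection/querqy/rewriter/common_rules_typos?action=save
Content-Type: application/json

{
  "class": "querqy.solr.rewriter.commonrules.CommonRulesRewriterFactory",
  "config": {
    "rules" : "notbook =>\nSYNONYM: notebook"
  },
  "info_logging": {
    "sinks": ["customSink1"]
  }
}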
Enabling info logging per request¶
Once you have mapped rewriters to sinks, you can start using Info Logging. To trigger the rewriters to send logging output to the sinks, you need to enable Info Logging in your search requests:
POST /myindex/_search
1{
2 "query": {
3 "querqy": {
4 "matching_query": {
5 "query": "notebook"
6 },
7 "query_fields": [ "title^3.0", "brand^2.1", "shortSummary"],
8 "rewriters": [
9 "word_break",
10 "common_rules"
11 ],
12 "info_logging": {
13 "id":"REQ-ID-0043",
14 "type": "DETAIL"
15 }
16 }
17 }
18}
Info Logging is controlled by the properties specified under info_logging (lines 12-15). You can set the properties as follows:
- type
  Values: DETAIL, REWRITER_ID, NONE
  Controls whether a logging output is generated at all, together with the format of the output. It can take the values:
  - DETAIL - Logs all details that the rewriter produces as logging output.
  - REWRITER_ID - Only logs the IDs of the rewriters.
  - NONE - Logs nothing at all.
  Default: NONE
- id
  An identifier. This can be used for identifying search requests. For example, when you use more than one shard, the same search request will be executed on more than one shard and create a log message on each shard. You can use this ID to trace and aggregate the messages across shards. It is up to the client that makes the search request to supply the ID.
  Default: not set
For examples of the output format for the types DETAIL and REWRITER_ID, see
the Log4j sink output above. It is up to the individual rewriter what log
messages it emits for type DETAIL.
In Solr, once you have set up your sinks and mapped rewriters to them, you can start using Info Logging. To trigger the rewriters to send logging output to the sinks, you need to pass the following request parameters to enable the logging per request:
- querqy.rewriteLogging.rewriters
  A comma-separated list of rewriter IDs for which info logging should be enabled. Use querqy.rewriteLogging.rewriters=* if you want to enable it for all rewriters in the rewrite chain. Note that not all rewriters have implemented info logging. They are also expected to remain 'silent' if they did not modify the query.
- querqy.rewriteLogging
  Values: details, rewriter_id, off
  Defines the type of the output:
  - details - Gives you all details that the rewriter produces as logging output. For example, the CommonRulesRewriter will return information about the rules that were applied and the log message that was configured.
  - rewriter_id - Only returns the IDs of the rewriters.
  - off - Returns nothing at all.
  Default: off
Examples:
GET https://<MYHOST>:<PORT>/solr/<collection>/select?q=notebook&querqy.rewriters=common&querqy.rewriteLogging.rewriters=common&querqy.rewriteLogging=rewriter_id
returns
{
  "responseHeader":{ },
  "response":{ },
  "querqyRewriteLogging":{
    "rewriteChainLogging":[
      {
        "rewriterId":"common"
      }
    ]
  }
}
provided that rewriter ‘common’ changed the query.
The same query with detailed output (setting querqy.rewriteLogging=details):
GET https://<MYHOST>:<PORT>/solr/<collection>/select?q=notebook&querqy.rewriters=common&querqy.rewriteLogging.rewriters=common&querqy.rewriteLogging=details

{
  "responseHeader":{},
  "response":{},
  "querqyRewriteLogging":{
    "rewriteChainLogging":[
      {
        "actions":[
          {
            "message":"notebook#0",
            "match":{
              "term":"notebook",
              "type":"exact"
            },
            "instructions":[
              {
                "type":"down",
                "param":"100",
                "value":"sleeve"
              },
              {
                "type":"up",
                "param":"100",
                "value":"AMD"
              },
              {
                "type":"synonym",
                "value":"laptop"
              }
            ]
          }
        ],
        "rewriterId":"common"
      }
    ]
  }
}
The actions element is specific to the CommonRulesRewriter. It reflects that
the following block in the rule definitions was applied, and it should be easy
to map the instructions output to this rule definition:
notebook =>
SYNONYM: laptop
UP(100): AMD
DOWN(100): sleeve
Should more than one such block of rules be applied to a query, each block would
occur as its own object in the actions list of the Info Logging output (see the
example for the query 'samusng notebook 32g' further below).
Besides the instructions output, we also get a match and a message element.
match tells us what input triggered the application of the rules and how it was
matched. In this case, notebook was matched exactly.
Had we used a wildcard in the rule, the logging output would still tell us the full matching term, but it would also show that the match type is affix. For example, the rule:
note* =>
SYNONYM: laptop
...
…would thus produce the following output for query notebook:
{
"match":{
"term":"notebook",
"type":"affix"
}
}
The message element of the action was auto-generated above:
{
  "querqyRewriteLogging":{
    "rewriteChainLogging":[
      {
        "actions":[
          {
            "message":"notebook#0",
            "match": {},
            "instructions": []
          }
        ]
      }
    ]
  }
}
"notebook#0" is a generated from the input notebook and a count of rule
definition blocks. In this case it is the first block in our rule definitions
(the count starts at 0).
This default log message can be overridden in the rule definitions using the
_log and _id properties:
notebook =>
SYNONYM: laptop
DELETE: cheap
@_id: "ID1"
@_log: "Log message for notebook"
samusng =>
SYNONYM: samsung
@{
"_id": "ID2",
"_log": "Log message for samusng typo",
}
32g =>
SYNONYM: 32gb
@_id: "ID3"
The query ‘samusng notebook 32g’ will now produce the following log messages (we’re skipping the instructions details):
{
"querqyRewriteLogging":{
"rewriteChainLogging":[
{
"actions":[
{
"message":"Log message for samusng typo",
"match":{
"term":"samusng",
"type":"exact"
},
"instructions":[ ]
},
{
"message":"Log message for notebook",
"match":{
"term":"notebook",
"type":"exact"
},
"instructions":[ ]
},
{
"message":"ID3",
"match":{
"term":"32g",
"type":"exact"
},
"instructions":[ ]
}
],
"rewriterId":"common"
}
]
}
}
As the third block doesn’t have a ‘_log’ property, the _id property (ID3) will be
used as the message, and if that didn’t exist, we’d fall back to the
<input>#<rule number> scheme that we saw above.