Logging and debugging rewriters

It can become difficult to understand the output of query pre-processing, for example, if multiple rewriters interact as part of the rewrite chain, or if a Common Rules Rewriter is configured with tens of thousands of rules. In such cases it can be a challenge to figure out how a given query was produced. But even with a very simple setup of only a few rules, you will probably be interested in whether a given rewriter was applied at all, for example, to analyse which queries and which fraction of the search traffic were manipulated by Querqy in your production environment.

There are two main approaches to understanding how Querqy rewrites a query. The first approach uses the standard facilities that your search engine offers - debug=true in Solr and the /_validate/query endpoint in Elasticsearch/OpenSearch. We discuss this further in the section Debugging below. The second approach provides information about query rewriting in the normal, non-debug mode of your search engine, so it can also be used for analysis and tracking. We call this approach Info Logging below.

Debugging

Standard tools

The Querqy plugin ultimately produces a Lucene query. This query can be inspected using the standard means that your search engine provides for inspecting the Lucene queries it derives from its query DSL.

In Elasticsearch/OpenSearch this means that you can call https://<MYHOST>:<PORT>/<INDEX>/_validate/query?explain=true and submit the usual Querqy search query in the request body. For example:

GET https://<MYHOST>:<PORT>/<INDEX>/_validate/query?explain=true
Content-Type: application/json
{
  "query":{
    "querqy":{
      "matching_query":{
        "query":"laptop"
      },
      "query_fields":[
        "title^23",
        "name",
        "shortSummary"
      ],
      "rewriters":[
        "common_rules"
      ]
    }
  }
}

If, in the above case, a Common Rules Rewriter common_rules was defined with the rules

laptop =>
  SYNONYM: notebook
  UP(100): AMD
  DOWN(50): sleeve

the output of _validate/query?explain=true will look like this:

 1{
 2  "_shards":{
 3    "total":1,
 4    "successful":1,
 5    "failed":0
 6  },
 7  "valid":true,
 8  "explanations":[
 9    {
10      "index":"myindex",
11      "valid":true,
12      "explanation":"+(+(name:notebook | shortSummary:laptop | title:laptop^23.0 | title:notebook^23.0 | shortSummary:notebook | name:laptop) AdditiveBoostFunction(100.0,query(+(name:amd | shortSummary:amd | title:amd^23.0),def=0.0)) AdditiveBoostFunction(-50.0,query(+(shortSummary:sleeve name:sleeve title:sleeve^23.0),def=0.0)))"
13    }
14  ]
15}

Line 12 contains the string representation of the parsed Lucene query, and you will probably recognise the notebook / laptop synonyms. It also shows AdditiveBoostFunction sub-queries. AdditiveBoostFunction is a custom Lucene query that Querqy provides to deal with UP/DOWN boosting. In particular, it avoids producing negative document scores, which are not allowed by Lucene, and it guarantees that documents that match both UP(100) and DOWN(100) yield the same score as documents that match neither UP(100) nor DOWN(100).

In Solr this means that you can see the Lucene query in the parsedquery section of the debug output that you will get if you append debug=true or debugQuery=true to a search request (see Solr documentation).

For example, the following debug output would be produced for the search request /select?defType=querqy&q=notebook&querqy.rewriters=common&qf=name%20cat&debug=true

"debug": {
  "parsedquery":"+DisjunctionMaxQuery((cat:notebook | name:notebook | name:laptop | cat:laptop)) FunctionQuery(AdditiveBoostFunction(100.0,query(+(cat:AMD | name:amd),def=0.0))) FunctionQuery(AdditiveBoostFunction(-100.0,query(+(cat:sleeve name:sleeve),def=0.0)))",
  "parsedquery_toString":"+(cat:notebook | name:notebook | name:laptop | cat:laptop) AdditiveBoostFunction(100.0,query(+(cat:AMD | name:amd),def=0.0)) AdditiveBoostFunction(-100.0,query(+(cat:sleeve name:sleeve),def=0.0))"
}

provided that a Common Rules Rewriter was defined under the name common with the following rules:

notebook =>
  SYNONYM: laptop
  UP(100): AMD
  DOWN(100): sleeve

As in the Elasticsearch/OpenSearch example above, you will probably recognise the notebook / laptop synonyms in the parsed query representation, as well as the AdditiveBoostFunction sub-queries that Querqy uses for UP/DOWN boosting (see the explanation above).

Querqy details in Solr debug mode

Warning

Note that the additional details that Querqy provides to Solr’s debug output have changed in structure and content with the release of ‘Querqy for Solr’ version 5.5.lucene900.0.

Calling a Solr SearchHandler with debugQuery=true will add an additional section querqy to the debug section in Solr’s response, for example:

 1"debug":  {
 2  "querqy": {
 3    "parser":"querqy.parser.WhiteSpaceQuerqyParser",
 4    "rewrite":{
 5      "rewriteChain":[
 6        {
 7          "rewriterId":"common",
 8          "actions":[
 9            {
10              "message":"notebook#0",
11              "match":{
12                "term":"notebook",
13                "type":"exact"
14              },
15              "instructions":[
16                {
17                  "type":"down",
18                  "param":"100",
19                  "value":"sleeve"
20                },
21                {
22                  "type":"up",
23                  "param":"100",
24                  "value":"AMD"
25                },
26                {
27                  "type":"synonym",
28                  "value":"laptop"
29                }
30              ]
31            }
32          ]
33        }
34      ]
35    }
36  }
37}

Under parser, the output tells you which query string parser Querqy used to process the user's query string.

The section under rewrite contains information about how the query was processed in the rewrite chain. The content of this section is the same as the output produced by Querqy's Info Logging with the parameter querqy.rewriteLogging=details, with the difference that here it is added to the debug section.
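For reference, a request along the following lines will produce such a querqy debug section. It reuses the common rewriter and the qf fields from the earlier example; host, port and collection name are placeholders:

GET https://<MYHOST>:<PORT>/solr/<collection>/select?defType=querqy&q=notebook&querqy.rewriters=common&qf=name%20cat&debugQuery=true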

Info Logging

Info logging is a powerful tool when you want to understand how a specific rewriter manipulated a query.

Concepts

Querqy rewriters can produce information about whether they were triggered at all and about how they changed the query. Within the framework of Info Logging, we call this information the info logging output.

Depending on the required granularity of the logging output, we distinguish between two types: one that contains all details that a rewriter can possibly produce as logging output, and one that contains only the rewriter ID.

The info logging output needs to be sent somewhere where we can pick it up and analyse it, or collect it for later analysis. We call the place to which we send the output a sink. The same output could, at least in theory, be sent to more than one sink.

Setting up Info Logging

Setting up Info Logging requires two steps: Configuring sinks and enabling the logging per request.

Warning

Note that the configuration of info logging and the required parameters have changed in an incompatible way with the release of ‘Querqy for Solr’ version 5.5.lucene900.0. The documentation for info logging in older Querqy versions can be found here. When migrating to a newer Querqy for Solr version, you might also consider using the more detailed output that Querqy provides for rewriters when you call Solr with debugQuery=true instead of Info Logging, depending on your use case.

Sinks

To use Info Logging we need a mapping between each rewriter and the sink(s) to which it should send its log output. In Elasticsearch/OpenSearch, this mapping is configured within the rewriter definition:

PUT  /_querqy/rewriter/common_rules

1{
2    "class": "querqy.elasticsearch.rewriter.SimpleCommonRulesRewriterFactory",
3    "config": {
4        "rules" : "notebook =>\nSYNONYM: laptop"
5    },
6    "info_logging": {
7      "sinks": ["log4j"]
8    }
9}

Note

OpenSearch users: Simply replace package name elasticsearch with opensearch in rewriter configurations.

As you will probably recognise at this stage, the example shows the configuration of a Common Rules Rewriter. Lines 6-8 are new: they contain the configuration for Info Logging. The sinks property is a list of named sinks to which this rewriter should send its log messages.

In this case, the list contains only one element, log4j, which is a predefined sink that routes the output to the Log4j logging framework used by Elasticsearch and OpenSearch, where it can be configured further. At the current stage, log4j is the only available sink for Info Logging under Elasticsearch/OpenSearch, and it is not (yet) possible to provide a custom sink implementation.

The output in Log4j will look like this (using a file appender):

1[2021-03-26T13:23:43,006][INFO ][q.e.i.Log4jSink ] [node_s_0]DETAIL[ QUERQY ] {"id":"id-1001","msg":{"common_rules1":[{"APPLIED_RULES":["msg1"]}],"common_rules2":[{"APPLIED_RULES":["msg2"]}]}}
2[2021-03-26T13:28:47,454][INFO ][q.e.i.Log4jSink ] [node_s_0]REWRITER_ID[ QUERQY ] {"id":"id-1002","msg":["common_rules"]}

Let’s decompose this. DETAIL[ QUERQY ] (line 1) and REWRITER_ID[ QUERQY ] (line 2) are Log4j markers that Querqy provides and that you can use to filter log messages. The DETAIL and REWRITER_ID markers correspond to the output types that you can set per request and that are described below. They are both children of a common parent marker QUERQY.

The log message itself is a small JSON document. The msg element contains the messages as they were produced by the rewriters with the rewriter IDs (such as common_rules1) as keys and further rewriter-specific information as values.

The id element is an ID that can be passed per request to help trace requests across nodes (see below).
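If you want to separate Querqy's info logging from the rest of the server log, you can make use of these markers in your Log4j configuration. The following log4j2.properties fragment is only a sketch - the appender name, file path and layout pattern are illustrative and not part of Querqy - that uses a Log4j MarkerFilter to accept only events carrying the QUERQY marker (including its child markers DETAIL and REWRITER_ID) into a dedicated file appender:

# Illustrative appender for Querqy info logging (adjust path and pattern to your setup)
appender.querqy_info.type = File
appender.querqy_info.name = querqy_info
appender.querqy_info.fileName = /var/log/elasticsearch/querqy-info.log
appender.querqy_info.layout.type = PatternLayout
appender.querqy_info.layout.pattern = [%d{ISO8601}][%marker] %m%n
# Let only events with the QUERQY marker (or one of its child markers) pass
appender.querqy_info.filter.querqy.type = MarkerFilter
appender.querqy_info.filter.querqy.marker = QUERQY
appender.querqy_info.filter.querqy.onMatch = ACCEPT
appender.querqy_info.filter.querqy.onMismatch = DENY

# Attach the appender to the root logger so that it sees all log events
rootLogger.appenderRef.querqy_info.ref = querqy_info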

In Solr, the sink mapping is configured in the same way within the rewriter definition:

POST /solr/mycollection/querqy/rewriter/common_rules?action=save
Content-Type: application/json
1{
2    "class": "querqy.solr.rewriter.commonrules.CommonRulesRewriterFactory",
3    "config": {
4        "rules" : "notebook =>\nSYNONYM: laptop"
5    },
6    "info_logging": {
7      "sinks": ["response"]
8    }
9}

Again, the example shows the configuration of a Common Rules Rewriter, and lines 6-8 are new: they contain the configuration for Info Logging. The sinks property is a list of named sinks to which this rewriter sends its log messages.

In this case, the list contains only one element, response, which is a predefined sink that adds the Info Logging output to the search response returned by Solr.

Expert: Predefined and custom sinks in Solr

The response sink is currently the only predefined sink that comes with Querqy for Solr. However, you can use your own sink by implementing the querqy.infologging.Sink interface and making it available by adding the following configuration to the QuerqyRewriterRequestHandler in solrconfig.xml:

<requestHandler name="/querqy/rewriter" class="querqy.solr.QuerqyRewriterRequestHandler">
  <lst name="infoLogging">
    <lst name="sink">
      <str name="id">customSink1</str>
      <str name="class">my.name.CustomSinkOne</str>
    </lst>
    <lst name="sink">
      <str name="id">customSink2</str>
      <str name="class">my.name.CustomSinkTwo</str>
    </lst>
  </lst>
</requestHandler>

and then add the mappings to the sink(s) in the rewriter configurations:

POST /solr/mycollection/querqy/rewriter/common_rules?action=save
Content-Type: application/json
{
    "class": "querqy.solr.rewriter.commonrules.CommonRulesRewriterFactory",
    "config": {
        "rules" : "notebook =>\nSYNONYM: laptop"
    },
    "info_logging": {
      "sinks": ["response", "customSink1", "customSink2"]
    }
}

As the sink mappings are configured per rewriter, you can decide for each rewriter to which sinks it should send its Info Logging output, and you can even use a dedicated sink per rewriter.
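For example, the following two rewriter definitions route their Info Logging output to different sinks. The rewriter names, rules and the custom sink ID are purely illustrative and assume that a sink customSink1 has been registered as shown above:

POST /solr/mycollection/querqy/rewriter/common_rules?action=save
Content-Type: application/json
{
    "class": "querqy.solr.rewriter.commonrules.CommonRulesRewriterFactory",
    "config": {
        "rules" : "notebook =>\nSYNONYM: laptop"
    },
    "info_logging": {
      "sinks": ["response"]
    }
}

POST /solr/mycollection/querqy/rewriter/typo_rules?action=save
Content-Type: application/json
{
    "class": "querqy.solr.rewriter.commonrules.CommonRulesRewriterFactory",
    "config": {
        "rules" : "samusng =>\nSYNONYM: samsung"
    },
    "info_logging": {
      "sinks": ["customSink1"]
    }
}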

Enabling info logging per request

Once you have set up your sinks and mapped rewriters to them, you can start using Info Logging. In Solr, you trigger the rewriters to send logging output to the sinks by passing the following request parameters, which enable the logging per request:

querqy.rewriteLogging.rewriters

A comma-separated list of rewriter IDs for which info logging should be enabled. Use querqy.rewriteLogging.rewriters=* if you want to enable it for all rewriters in the rewrite chain. Note that not all rewriters have implemented info logging. Rewriters are also expected to remain 'silent' if they did not modify the query.

querqy.rewriteLogging

Values: details, rewriter_id, off

Defines the type of the output.

  • details: Gives you all details that the rewriter produces as logging output. For example, the CommonRulesRewriter will return information about the rules that were applied and the log message that was configured.

  • rewriter_id: Only returns the IDs of the rewriters.

  • off: Returns nothing at all.

Default: off

Examples:

GET https://<MYHOST>:<PORT>/solr/<collection>/select?q=notebook&querqy.rewriters=common&querqy.rewriteLogging.rewriters=common&querqy.rewriteLogging=rewriter_id

returns

{
  "responseHeader":{ },
  "response":{ },
  "querqyRewriteLogging":{
    "rewriteChainLogging":[
      {
        "rewriterId":"common"
      }
    ]
  }
}

provided that rewriter ‘common’ changed the query.

The same query with detailed output (setting querqy.rewriteLogging=details):

GET https://<MYHOST>:<PORT>/solr/<collection>/select?q=notebook&querqy.rewriters=common&querqy.rewriteLogging.rewriters=common&querqy.rewriteLogging=details

{
  "responseHeader":{},
  "response":{},
  "querqyRewriteLogging":{
    "rewriteChainLogging":[
      {
        "actions":[
          {
            "message":"notebook#0",
            "match":{
              "term":"notebook",
              "type":"exact"
            },
            "instructions":[
              {
                "type":"down",
                "param":"100",
                "value":"sleeve"
              },
              {
                "type":"up",
                "param":"100",
                "value":"AMD"
              },
              {
                "type":"synonym",
                "value":"laptop"
              }
            ]
          }
        ],
        "rewriterId":"common"
      }
    ]
  }
}

The actions element is specific to the CommonRulesRewriter. It reflects that the following block of rules was applied, and it should be easy to map the instructions in the output to this rule definition:

notebook =>
  SYNONYM: laptop
  UP(100): AMD
  DOWN(100): sleeve

Should more than one such block of rules be applied to a query, each of them would occur as its own object in the actions list of the Info Logging output (the example further below shows an output with three action objects).

Besides the instructions output, we also get a match and a message element. match tells us which input triggered the application of the rules and how it was matched. In this case, notebook was matched exactly.

Had we used a wildcard in the rule, the logging output would still tell us the full matching term, but the match type would be reported as affix. For example, the rule

note* =>
  SYNONYM: laptop
  ...

would produce the following output for the query notebook:

{
    "match":{
      "term":"notebook",
      "type":"affix"
    }
}

In the example above, the message element of the action was auto-generated:

{
  "querqyRewriteLogging":{
    "rewriteChainLogging":[
      {
        "actions":[
          {
            "message":"notebook#0",
            "match": {},
            "instructions": []
          }
        ]
      }
    ]
  }
}

"notebook#0" is a generated from the input notebook and a count of rule definition blocks. In this case it is the first block in our rule definitions (the count starts at 0).

In Elasticsearch/OpenSearch, once you have mapped rewriters to sinks, you can start using Info Logging. To trigger the rewriters to send logging output to the sinks, you enable Info Logging in your search requests:

POST /myindex/_search

 1{
 2   "query": {
 3       "querqy": {
 4           "matching_query": {
 5               "query": "notebook"
 6           },
 7           "query_fields": [ "title^3.0", "brand^2.1", "shortSummary"],
 8           "rewriters": [
 9             "word_break",
10             "common_rules"
11           ],
12           "info_logging": {
13             "id":"REQ-ID-0043",
14             "type": "DETAIL"
15           }
16       }
17   }
18}

Info Logging is controlled by the properties specified under info_logging (lines 12-15). You can set the properties as follows:

type

Values: DETAIL, REWRITER_ID, NONE

Controls whether logging output is generated at all, as well as the format of the output. It can take the following values:

  • DETAIL - Logs all details that the rewriter produces as logging output.

  • REWRITER_ID - Only logs the IDs of the rewriters.

  • NONE - Logs nothing at all.

Default: NONE

id

An identifier that can be used for identifying search requests. For example, when your index has more than one shard, the same search request will be executed on each shard and create a log message on each of them. You can use this ID to trace and aggregate the messages across shards. It is up to the client that makes the search request to supply the ID.

Default: not set

For examples of the output format for the types DETAIL and REWRITER_ID, see the Log4j sink output above. It is up to the individual rewriter what log message it emits for type DETAIL.

For the Common Rules Rewriter, the default log message (the auto-generated <input>#<rule number> value that we saw above) can be overridden in the rule definitions using the _log and _id properties:

notebook =>
  SYNONYM: laptop
  DELETE: cheap
  @_id: "ID1"
  @_log: "Log message for notebook"

samusng =>
  SYNONYM: samsung
  @{
    "_id": "ID2",
    "_log": "Log message for samusng typo"
  }

32g =>
  SYNONYM: 32gb
  @_id: "ID3"

The query ‘samusng notebook 32g’ will now produce the following log messages (we’re skipping the instructions details):

{
  "querqyRewriteLogging":{
    "rewriteChainLogging":[
      {
        "actions":[
          {
            "message":"Log message for samusng typo",
            "match":{
              "term":"samusng",
              "type":"exact"
            },
            "instructions":[ ]
          },
          {
            "message":"Log message for notebook",
            "match":{
              "term":"notebook",
              "type":"exact"
            },
            "instructions":[ ]
          },
          {
            "message":"ID3",
            "match":{
              "term":"32g",
              "type":"exact"
            },
            "instructions":[ ]
          }
        ],
        "rewriterId":"common"
      }
    ]
  }
}

As the third block doesn’t have a ‘_log’ property, the _id property (ID3) will be used as the message, and if that didn’t exist, we’d fall back to the <input>#<rule number> scheme that we saw above.