Regex Replace Rewriter

What it does

The Regex Replace Rewriter is a query preprocessing step that rewrites query tokens using regular expression rules. It is similar in purpose to the Replace Rewriter but uses a custom regex engine instead of simple prefix/suffix wildcards, giving you more expressive matching power.

Typical use cases include:

  • Normalising measurement notations (8 x 128x12, 32 gb ram32gb)

  • Removing optional noise words from structured queries

  • Standardising punctuation or whitespace variants of product identifiers

Rules are applied token-by-token across the query. When multiple rules match different parts of the input, all of them are applied. When multiple rules match the same part of the input, the later rule in the rule file wins.

Note

The Regex Replace Rewriter is still experimental. It is recommended to use it for up to a few hundred patterns.

Setup

PUT  /_querqy/rewriter/regex_replace

1{
2    "class": "querqy.elasticsearch.rewriter.RegexReplaceRewriterFactory",
3    "config": {
4        "rules": "(\\d+) ?x ?(\\d+) => ${1}x${2}",
5        "ignoreCase": true
6    }
7}

The class property (line #2) references the Elasticsearch factory implementation.

The config object (line #3) accepts the following properties:

rules (required)

A string containing the rewriting rules, one rule per line. Remember to JSON-escape the value (e.g. escape backslashes as \\).

ignoreCase (optional, default: true)

When true, pattern matching is case-insensitive.

Configuring rules

Each non-empty line that does not start with # defines one rule. The format is:

<regex> => <replacement>

Characters following # on any line are treated as a comment and are ignored.

# This is a comment
abc => def                           # replace "abc" with "def"
a?bc => def                          # matches "abc" and "bc"
(\d{2,3}) gb( ram)? => ${1}gb        # "32 gb ram" → "32gb", "128 gb" → "128gb"
(\d+) ?x ?(\d+) => ${1}x${2}        # "8 x 12" → "8x12", "10 x16" → "10x16"

Supported regex syntax

The Regex Replace Rewriter implements a custom regex engine that supports the following subset of standard regex syntax.

Literals

x     matches the literal character 'x'

Character classes

\d    matches any digit (0–9)
.     matches any single character

Ranges

[def]                  matches 'd', 'e', or 'f'
[a-zA-Z0-9&&[^bc]]    matches a letter or digit that is not 'b' or 'c'

Quantifiers

?         zero or one occurrence
+         one or more occurrences
*         zero or more occurrences (Kleene star)
{min,max} between min and max occurrences, e.g. \d{2,4}
{min,}    min or more occurrences

Grouping and backreferences

Parentheses create numbered capture groups. Captured text can be referenced in the replacement using ${n}, where n is the group index (following the same numbering convention as Java Regex — see java.util.regex.Pattern):

(height) (\d+) ?cm => h: ${2}0mm

For nested groups the numbering follows left-to-right opening-parenthesis order:

Pattern: ((A)(B(C)))

Group 1: ((A)(B(C)))
Group 2: (A)
Group 3: (B(C))
Group 4: (C)

Alternatives

a(b|c)d   matches "abd" and "acd"

Rule ordering

When two rules match the same part of the input, the later rule in the rule file wins. Given:

a(b|c)d => xyz
abc => klm

the input abc is replaced with klm because the second rule appears later.

When rules match different parts of the input, all rules are applied:

abc => klm
def => xyz

The input abc def produces the output klm xyz.

Current limitations

  • It is not possible to anchor a rule to the complete input — a rule for def will also match abc def, not only the input def in isolation.

  • The rewriter is intended for up to a few hundred patterns. For larger rule sets, consider the Replace Rewriter or the Common Rules Rewriter.