Regex Replace Rewriter
What it does
The Regex Replace Rewriter is a query preprocessing step that rewrites query tokens using regular expression rules. It is similar in purpose to the Replace Rewriter but uses a custom regex engine instead of simple prefix/suffix wildcards, giving you more expressive matching power.
Typical use cases include:
Normalising measurement notations (8 x 12 → 8x12, 32 gb ram → 32gb)
Removing optional noise words from structured queries
Standardising punctuation or whitespace variants of product identifiers
Rules are applied token-by-token across the query. When multiple rules match different parts of the input, all of them are applied. When multiple rules match the same part of the input, the later rule in the rule file wins.
Note
The Regex Replace Rewriter is still experimental. It is recommended for rule sets of up to a few hundred patterns.
Setup
PUT /_querqy/rewriter/regex_replace
1 {
2   "class": "querqy.elasticsearch.rewriter.RegexReplaceRewriterFactory",
3   "config": {
4     "rules": "(\\d+) ?x ?(\\d+) => ${1}x${2}",
5     "ignoreCase": true
6   }
7 }
The class property (line #2) references the Elasticsearch factory
implementation.
The config object (line #3) accepts the following properties:
rules (required)
A string containing the rewriting rules, one rule per line. Remember to JSON-escape the value (e.g. escape backslashes as \\).
ignoreCase (optional, default: true)
When true, pattern matching is case-insensitive.
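Escaping a multi-line rule set by hand is error-prone, so it can help to let a JSON library do it. The following is a minimal sketch that builds the request body with Python's standard json module. The second rule is taken from the examples in this document, and sending the body to the endpoint is left to whatever HTTP client you already use.

```python
import json

# One rule per line; raw strings keep the regex backslashes readable.
rules = "\n".join([
    r"(\d+) ?x ?(\d+) => ${1}x${2}",
    r"(\d{2,3}) gb( ram)? => ${1}gb",
])

# json.dumps escapes backslashes and newlines inside the rules string.
body = json.dumps(
    {
        "class": "querqy.elasticsearch.rewriter.RegexReplaceRewriterFactory",
        "config": {"rules": rules, "ignoreCase": True},
    },
    indent=2,
)

print(body)
```

The printed body contains `\\d` for every `\d` and `\n` between the rules, which is the form the rules property expects inside a JSON document.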
Configuring rules
Each non-empty line that does not start with # defines one rule. The format is:
<regex> => <replacement>
Characters following # on any line are treated as a comment and are ignored.
# This is a comment
abc => def # replace "abc" with "def"
a?bc => def # matches "abc" and "bc"
(\d{2,3}) gb( ram)? => ${1}gb # "32 gb ram" → "32gb", "128 gb" → "128gb"
(\d+) ?x ?(\d+) => ${1}x${2} # "8 x 12" → "8x12", "10 x16" → "10x16"
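To check a rule file before uploading it, the line format described above can be approximated in a few lines of Python. This is only a sketch and not the rewriter's own parser: the function name parse_rules is hypothetical, and it assumes that every # starts a comment and that whitespace around the pattern and the replacement is not significant.

```python
def parse_rules(text):
    """Split rule text into (pattern, replacement) pairs.

    Assumption: '#' always starts a comment and surrounding
    whitespace is not part of the pattern or the replacement.
    """
    rules = []
    for line in text.splitlines():
        # Drop the comment part, then skip lines that are now empty.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        pattern, separator, replacement = line.partition("=>")
        if not separator:
            raise ValueError(f"missing '=>' in rule: {line!r}")
        rules.append((pattern.strip(), replacement.strip()))
    return rules


EXAMPLE = r"""
# This is a comment
abc => def # replace "abc" with "def"
(\d+) ?x ?(\d+) => ${1}x${2} # "8 x 12" becomes "8x12"
"""

for pattern, replacement in parse_rules(EXAMPLE):
    print(pattern, "=>", replacement)
```

A check like this catches lines that lack the => separator before the rules reach the cluster.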
Supported regex syntax
The Regex Replace Rewriter implements a custom regex engine that supports the following subset of standard regex syntax.
Literals
x matches the literal character 'x'
Character classes
\d matches any digit (0–9)
. matches any single character
Ranges
[def] matches 'd', 'e', or 'f'
[a-zA-Z0-9&&[^bc]] matches a letter or digit that is not 'b' or 'c'
Quantifiers
? zero or one occurrence
+ one or more occurrences
* zero or more occurrences (Kleene star)
{min,max} between min and max occurrences, e.g. \d{2,4}
{min,} min or more occurrences
Grouping and backreferences
Parentheses create numbered capture groups. Captured text can be referenced in
the replacement using ${n}, where n is the group index. Group numbering follows
the same convention as Java regular expressions (see
java.util.regex.Pattern):
(height) (\d+) ?cm => h: ${2}0mm
For nested groups the numbering follows left-to-right opening-parenthesis order:
Pattern: ((A)(B(C)))
Group 1: ((A)(B(C)))
Group 2: (A)
Group 3: (B(C))
Group 4: (C)
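The numbering convention can be tried out with any regex engine that counts groups by opening parenthesis. The sketch below uses Python's re module as a stand-in, which is an assumption: the rewriter has its own engine, and Python writes group references as \g<n> rather than ${n}. The helper to_python_template is hypothetical and only translates between the two notations.

```python
import re


def to_python_template(replacement):
    """Translate ${n} group references into Python's \\g<n> form."""
    return re.sub(
        r"\$\{(\d+)\}",
        lambda m: "\\g<" + m.group(1) + ">",
        replacement,
    )


# Groups are numbered by the position of their opening parenthesis.
nested = re.fullmatch(r"((A)(B(C)))", "ABC")
print(nested.groups())

# The braces keep the group index apart from the literal "0" after it.
template = to_python_template("h: ${2}0mm")
print(re.sub(r"(height) (\d+) ?cm", template, "height 12 cm"))
```

The delimited form matters in the second example: without braces or angle brackets, a reference followed by the digit 0 could be read as group 20 instead of group 2 followed by a literal 0.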
Alternatives
a(b|c)d matches "abd" and "acd"
Rule ordering
When two rules match the same part of the input, the later rule in the rule file wins. Given:
a(b|c)d => xyz
abd => klm
the input abd is replaced with klm because the second rule appears later.
When rules match different parts of the input, all rules are applied:
abc => klm
def => xyz
The input abc def produces the output klm xyz.
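The two ordering behaviours can be modelled with a small simulation. This is a sketch under stated assumptions, not the rewriter's actual algorithm: it uses Python's re module instead of the custom engine, the function apply_rules is hypothetical, replacements use Python template syntax, and an overlap is resolved by letting the rule with the higher position in the list claim the span first.

```python
import re


def apply_rules(rules, text, ignore_case=True):
    """Apply (pattern, replacement) rules; on overlap the later rule wins."""
    flags = re.IGNORECASE if ignore_case else 0

    # Collect every match of every rule as (rule index, start, end, output).
    candidates = []
    for index, (pattern, replacement) in enumerate(rules):
        for m in re.finditer(pattern, text, flags):
            candidates.append((index, m.start(), m.end(), m.expand(replacement)))

    # Later rules get first pick of the spans they match.
    candidates.sort(key=lambda c: c[0], reverse=True)
    accepted = []
    for cand in candidates:
        _, start, end, _ = cand
        if all(end <= s or start >= e for _, s, e, _ in accepted):
            accepted.append(cand)

    # Rebuild the text from left to right with the accepted replacements.
    pieces, position = [], 0
    for _, start, end, output in sorted(accepted, key=lambda c: c[1]):
        pieces.append(text[position:start])
        pieces.append(output)
        position = end
    pieces.append(text[position:])
    return "".join(pieces)


print(apply_rules([(r"a(b|c)d", "xyz"), (r"abd", "klm")], "abd"))
print(apply_rules([(r"abc", "klm"), (r"def", "xyz")], "abc def"))
```

In the first call both rules match the same span, so only the later one is applied. In the second call the matches do not overlap and both replacements appear in the result.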
Current limitations
It is not possible to anchor a rule to the complete input: a rule for def will also match abc def, not only the input def in isolation.
The rewriter is intended for up to a few hundred patterns. For larger rule sets, consider the Replace Rewriter or the Common Rules Rewriter.