Evabalilk.com

What is a Dismax?

The term “dismax” appears frequently on the Solr mailing lists, which can be quite confusing for new users. It originated as a shorthand name for DisMaxRequestHandler (which was named after DisjunctionMaxQueryParser, which in turn was named after the DisjunctionMaxQuery class that it uses heavily). In recent years, DisMaxRequestHandler and StandardRequestHandler have been refactored into a single SearchHandler class, and the term “dismax” now generally refers to DisMaxQParser.

Clear as mud, right?

Regardless of whether you use DisMaxRequestHandler via the qt=dismax parameter, or use SearchHandler with DisMaxQParser via defType=dismax, the end result is that your q parameter is parsed by DisjunctionMaxQueryParser.

The original goals of dismax (whatever meaning you might infer) have never changed:

It supports a simplified version of the Lucene QueryParser syntax. Quotes can be used to group phrases, and +/- can be used to denote required and prohibited clauses, but all other Lucene query parser special characters are escaped to simplify the user experience. The handler takes responsibility for building a good query from the user’s input, using BooleanQueries that contain DisjunctionMaxQueries across the fields and boosts that you specify. It also lets you provide additional boost queries, boost functions, and filter queries to artificially affect the outcome of all searches. All of these options can be specified as default parameters for the handler in your solrconfig.xml or overridden in the Solr query URL.

In short: you decide which fields and boosts to use when you set it up, and your users just give you words without worrying too much about syntax.
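As a rough sketch of what such a request could look like, the snippet below builds a dismax query URL. The parameter names (defType, q, qf, mm, bq, fq) are standard dismax request parameters; the host, core name, field names, boosts, and the boost and filter query values are assumptions made up for illustration.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host and core for a real installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_dismax_url(user_input, overrides=None):
    """Build a select URL that hands raw user input to the dismax parser."""
    params = {
        "defType": "dismax",            # parse q with DisMaxQParser
        "q": user_input,                # the words typed by the user
        "qf": "features^1.0 name^2.0",  # assumed query fields and boosts
        "mm": "50%",                    # minimum number of optional clauses to match
        "bq": "inStock:true",           # assumed boost query
        "fq": "cat:software",           # assumed filter query
    }
    # Anything passed in overrides wins, like parameters on the query URL
    # overriding the defaults configured in solrconfig.xml.
    params.update(overrides or {})
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_dismax_url('+"apache solr" search server')
print(url)
```

In a real deployment most of these values would normally live in the handler defaults, so the client would only need to send q.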

The magic of dismax (in my opinion) comes from the query structure it produces. Basically, it boils down to matrix multiplication: a one-column matrix of each “chunk” of your user input, multiplied by a one-row matrix of the qf fields, to produce one big matrix of every field:chunk permutation. The matrix is then converted to a BooleanQuery consisting of one DisjunctionMaxQuery for each row of the matrix.
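The field-by-chunk matrix can be sketched in a few lines. This is only an illustration of the structure, not Solr's implementation: the chunks, field names, and boosts are assumed values, and the rendered string is a simplified notation rather than Lucene's actual query output.

```python
def build_query_matrix(chunks, qf):
    """Return one row per chunk; each row holds a (field, chunk, boost) cell per qf field."""
    return [[(field, chunk, boost) for field, boost in qf.items()] for chunk in chunks]


def render(matrix):
    """Render each row as one DisjunctionMaxQuery inside a top-level BooleanQuery."""
    rows = []
    for row in matrix:
        cells = " | ".join(f'{field}:"{chunk}"^{boost}' for field, chunk, boost in row)
        rows.append(f"DisjunctionMaxQuery({cells})")
    return "BooleanQuery(\n  " + "\n  ".join(rows) + "\n)"


# Assumed chunks and qf settings, for illustration only.
matrix = build_query_matrix(
    ["apache solr", "search", "server"],
    {"features": 1.0, "name": 2.0},
)
print(render(matrix))
```

Three chunks and two fields give a 3x2 matrix, so the top-level BooleanQuery ends up with three DisjunctionMaxQuery clauses of two field queries each.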

DisjunctionMaxQuery is used because its score is determined by the maximum score of its subclauses, rather than the sum as in a BooleanQuery, so no single word from the user input dominates the final score. The best way to explain this is with an example, so let’s consider a sample input.

First, consider the parser “markup” characters that can appear in the q string:
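The difference between the two scoring rules can be shown with a toy calculation. The per-field scores below are invented numbers, and the tie argument mirrors DisjunctionMaxQuery's tie-breaker multiplier, which the text above does not cover; with the default of 0 the score is the pure maximum.

```python
def boolean_score(subscores):
    """BooleanQuery-style scoring: sum the scores of all matching subclauses."""
    return sum(subscores)


def dismax_score(subscores, tie=0.0):
    """DisjunctionMaxQuery-style scoring: best subclause plus tie * the rest."""
    if not subscores:
        return 0.0
    best = max(subscores)
    return best + tie * (sum(subscores) - best)


# Invented scores for one word that matches in two fields.
field_scores = [0.5, 1.0]
print(boolean_score(field_scores))          # both field matches are added together
print(dismax_score(field_scores))           # only the best field counts
print(dismax_score(field_scores, tie=0.1))  # the weaker field adds a small bonus
```

With summing, a word that happens to match in many fields would be counted once per field; taking the maximum counts it once.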
• whitespace: splits the input string into chunks
• quotation marks: make a single phrase chunk
• +: makes a chunk mandatory
So we have 3 “chunks” of user input:
• “apache solr” (must match)
• “search” (should match)
• “server” (should match)
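The chunking rules can be approximated with a small regular expression. This is a simplified sketch, not the actual DisMax parsing code; the sample input is a made-up string, and the labels "must", "must not", and "should" are names chosen here for mandatory, prohibited, and optional chunks.

```python
import re

# An optional +/- prefix, then either a quoted phrase or a run of non-space characters.
CHUNK_RE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def split_chunks(q):
    """Split a user query into (text, occur) chunks."""
    chunks = []
    for prefix, phrase, word in CHUNK_RE.findall(q):
        text = phrase or word
        if not text:
            continue  # skip empty quoted phrases such as ""
        occur = {"+": "must", "-": "must not"}.get(prefix, "should")
        chunks.append((text, occur))
    return chunks


# Hypothetical input: one mandatory phrase followed by two optional words.
for chunk in split_chunks('+"lucene index" fast search'):
    print(chunk)
```

A chunk with no prefix is treated as optional, which is what makes it count toward the minimum "should match" calculation.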

With me so far, right?

Where people tend to get confused is when thinking about how Solr’s per-field analysis configuration (in schema.xml) impacts all of this. Our previous example was pretty straightforward, but let’s consider for a moment what might happen if:
• The name field uses the WordDelimiterFilter at query time, but the features field does not.
• The features field is configured so that “the” is a stop word, but the name field is not.

Now let’s see what we get when our input parameters are structurally similar to what we had before, but different enough that WordDelimiterFilter and StopFilter come into play…

Using the WordDelimiterFilter hasn’t changed much: the features field treats “search-server” as a single term, while in the name field we search for the phrase “search server”. That should not surprise anyone given the use of WordDelimiterFilter for the name field (presumably that’s why it’s used). This DisjunctionMaxQuery still “makes sense”, but other fields with unusual analysis that produce fewer or more Tokens than a “typical” field for the same chunk might lead to queries that are not as easy to understand. In particular, consider what happened in our example with the word “the”: because “the” is a stop word in the features field, no query object is produced for that field/chunk combination. But a query is produced for the name field, which means that the total number of “Should Match” clauses in our top-level query is still 2, so our minimum match number is still 1 (50% of 2 == 1).

This kind of situation tends to confuse a lot of people: since “the” is a stop word in one field, they don’t expect it to matter in the final query. But as long as at least one qf field produces a Token for it (name in our example), it will be included in the final query and will contribute to the count of “Should Match” clauses.
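The clause counting described above can be simulated with toy analyzers. Everything here is an assumption for illustration: the field names features and name, the one-word stop list, the hyphen split standing in for WordDelimiterFilter, and the input chunks. The percentage is rounded down, as Solr does for mm.

```python
STOP_WORDS = {"the"}  # assumed stop word list for the features field


def analyze_features(chunk):
    """Assumed features analysis: drop stop words, keep everything else whole."""
    return [] if chunk.lower() in STOP_WORDS else [chunk]


def analyze_name(chunk):
    """Assumed name analysis: split on hyphens, a crude stand-in for WordDelimiterFilter."""
    return [part for part in chunk.split("-") if part]


ANALYZERS = {"features": analyze_features, "name": analyze_name}


def build_clauses(chunks):
    """Keep a chunk as a clause if at least one field produces a token for it."""
    clauses = []
    for text, occur in chunks:
        per_field = {}
        for field, analyze in ANALYZERS.items():
            tokens = analyze(text)
            if tokens:
                per_field[field] = tokens
        if per_field:
            clauses.append((occur, per_field))
    return clauses


def min_should_match(clauses, percent):
    """Number of optional clauses that must match, rounded down."""
    optional = sum(1 for occur, _ in clauses if occur == "should")
    return optional * percent // 100


# Assumed chunks: one mandatory phrase, then two optional chunks.
chunks = [("apache solr", "must"), ("the", "should"), ("search-server", "should")]
clauses = build_clauses(chunks)
for occur, per_field in clauses:
    print(occur, per_field)
print("mm =", min_should_match(clauses, 50))
```

The clause for "the" survives with only the name field in it, so it still counts as one of the two optional clauses when the percentage is applied.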

So what is the conclusion of all this?

DisMax is a tricky creature. When using it, you should consider all of your options carefully and watch the output of debugQuery=true while experimenting with different query strings and different analysis settings to ensure that you really understand how your users’ queries will be parsed.
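One way to make that experimentation less tedious is a small helper that pulls the parsed query out of the debug section of a JSON response. This is a minimal sketch: it assumes the request URL already carries debugQuery=true and wt=json, and the function names are made up for this example.

```python
import json
from urllib.request import urlopen


def extract_parsed_query(body):
    """Return the parsedquery entry from a decoded Solr JSON response, or None."""
    return body.get("debug", {}).get("parsedquery")


def fetch_parsed_query(select_url):
    """Fetch select_url (expected to include debugQuery=true and wt=json)
    and return the parsed query that Solr reports for it."""
    with urlopen(select_url) as response:
        return extract_parsed_query(json.load(response))
```

Calling fetch_parsed_query for the same q with different qf or mm values gives a quick side-by-side view of how each setting changes the query structure.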
