## Monday, August 15, 2011

### Implementing a custom Lucene query to obtain flexible score customization

In our R&D department, we have been successfully using Lucene (and Solr too) since version 2.2 to implement information retrieval solutions in internal projects (our own products) and external ones (customers' products). Recently, we have been developing a search engine on top of Solr for one of our customers. This customer owns a web portal where users can search for different types of companies (from ice cream parlors to car sellers). Companies that want their ads displayed on this website must pay a certain amount of money. Each ad is made of a name, additional information and a series of keywords that the company can buy in order to appear in response to certain queries. For example, a car seller that sells Ford cars can buy the keyword “Ford” in order to appear as a result when a user searches for “Ford”.

The challenge here was the ranking formula that the customer wanted us to implement: the score must be the product of the length norm of the field that best matches the query (the number of tokens that match the query divided by the total number of tokens in the field) and the investment of each company. To better understand the proposed formula, let's take a look at the following example, where each company is represented as a single Lucene/Solr document in the index:

Company 1

- Name: John Doe Car Seller
- Info: The best cars in the middlewest
- Keyword: Cars And Bikes
- Investment: 2000

Company 2

- Name: Uncle Sam Car Seller
- Info: The best cars in the east
- Keyword: Cars
- Investment: 1500

For this example, if the query is “cars”, the scores should be calculated as follows:

- Company 1: (1/2) * 2000 = 1000. The factor (1/2) is used because the shortest matching field is the Keyword field, which has two terms: “cars” and “bikes” (let's assume “and” is a stopword).
- Company 2: (1/1) * 1500 = 1500. The factor (1/1) is used because the shortest matching field is the Keyword field, which has one term: “cars”, which matches the query exactly.

So, using the proposed formula, company 2 should appear first and company 1 second in the result list. Implementing the formula raises three separate questions:
1. How to include the investment as part of the scoring formula.

2. How to obtain the field that is the best match for the query.

3. How to multiply 1) and 2) in order to obtain the score.
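
Before tackling these three questions, the target formula itself can be sketched in plain Java, without any Lucene dependency. This is only an illustration under our own assumptions: the class name `RankingFormulaSketch`, the whitespace tokenizer and the tiny stopword list are hypothetical and are not part of the production code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class RankingFormulaSketch {

    // Assumed stopword list, just large enough for the two example ads.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("and", "the", "in"));

    // Naive analysis: lowercase, split on whitespace, drop stopwords.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\s+"))
                .filter(t -> !t.isEmpty() && !STOPWORDS.contains(t))
                .collect(Collectors.toList());
    }

    // Length norm as described above: matching tokens / total tokens in the field.
    static double lengthNorm(String fieldText, String queryTerm) {
        List<String> tokens = tokenize(fieldText);
        if (tokens.isEmpty()) {
            return 0.0;
        }
        long matches = tokens.stream().filter(queryTerm::equals).count();
        return (double) matches / tokens.size();
    }

    // Score = length norm of the best matching field * investment.
    static double score(String queryTerm, double investment, String... fields) {
        double best = 0.0;
        for (String field : fields) {
            best = Math.max(best, lengthNorm(field, queryTerm));
        }
        return best * investment;
    }

    public static void main(String[] args) {
        double company1 = score("cars", 2000,
                "John Doe Car Seller", "The best cars in the middlewest", "Cars And Bikes");
        double company2 = score("cars", 1500,
                "Uncle Sam Car Seller", "The best cars in the east", "Cars");
        System.out.println("Company 1: " + company1);
        System.out.println("Company 2: " + company2);
    }
}
```

With the two example companies, this sketch yields 1000 for company 1 and 1500 for company 2, which matches the calculations above.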

We started by analyzing 1). Our first idea was to use Lucene's document boosting (1) to represent the investment of each company. At first glance it looked like a good idea, because it solves the problem of multiplying the investment by something else almost automatically. Shortly after adopting this solution, however, we started having problems. We had not accounted for the fact that Lucene encodes the document boost in a single byte, together with the field boosts and the length norm. As a result, the values lose resolution when that byte is decoded at scoring time, because only 256 fixed values are available to represent the boosts. This was not a problem for big differences in investment (for example, 1000 vs. 10000), but it was for small differences (like 1000 vs. 1050) that had to be reflected in the ranking.
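
The loss of resolution can be illustrated with a simplified model. The sketch below is not Lucene's encoder: it assumes a float format that keeps only 3 mantissa bits, and the class name `BoostPrecisionSketch` and the investment values are made up for the illustration. Under that assumption, 1050 and 1100 collapse to the same stored value (1024), while 10000 remains clearly distinct.

```java
class BoostPrecisionSketch {

    // Keeps the sign, the exponent and only the top 3 mantissa bits of a float
    // (assumed stand-in for a low-precision one-byte encoding).
    static float quantize(float value) {
        int bits = Float.floatToRawIntBits(value);
        return Float.intBitsToFloat(bits & 0xFFF00000);
    }

    public static void main(String[] args) {
        float[] investments = {1050f, 1100f, 10000f}; // hypothetical values
        for (float investment : investments) {
            System.out.println(investment + " -> " + quantize(investment));
        }
    }
}
```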
Instead, we found the class CustomScoreQuery (2)(3), a special type of query that sets the document score as a programmatic function of several (sub)scores. This was extremely useful, because we could use it to multiply the score of one “regular” Lucene query (the one with the length norm of the best matching field, which we still had to solve) by a ValueSourceQuery associated with the “investment” field of each company document.

As we said previously, we still had to solve the problem of building a query that returns as its score the length norm of the best matching field. After a lot of searching we did not find any “standard” solution for this issue, so we decided to build our own custom Lucene query. We designed and implemented a custom Query subclass named MaxQuery. Its constructor expects two parameters: a boolean query, whose clauses are split and applied individually to find the one with the highest length norm, and another query, typically a FieldScoreQuery, that is associated with the investment field. The boolean query is used because it fits the original query (and legacy code) that was used to retrieve search results, but it could also be a collection of term queries, for example. We also had to implement the related Weight and Scorer subclasses, which, in a moment of incredible inspiration, we called MaxWeight and MaxScorer. The latter is the “heart” of this implementation, and we will explain its code.

Before starting, note that the implemented classes are fully compatible with the latest version of Lucene at the time of writing (3.3.0). We have used deprecated code, like the Searcher class, because as of version 3.3.0 it is still in use in Solr core classes. Our classes will have to be modified when Lucene 4.0 becomes the stable version.

Now, let's take a look at the code of the MaxScorer class, and then we will explain each part (some code has been omitted for clarity).

```java
import java.io.IOException;

import org.apache.lucene.search.Scorer;

class MaxScorer extends Scorer {

    /**
     * The scorers of the queries of the clauses of the boolean query.
     * These scorers are used to get the maximum score of these queries,
     * which is used in the final score.
     */
    protected Scorer[] maxScorers;

    /** The scorer of the query related to the investment field. */
    protected Scorer investmentScorer;

    /**
     * Array of next document numbers that match the {@link #maxScorers}.
     * The indexes of this array match the indexes of {@link #maxScorers}.
     * This array it is initialized in
     */
    protected int[] maxScorersNextDocs;

    /**
     * Document number which corresponds to the minimum document number
     * in {@link #maxScorersNextDocs}.
     * This document number is a candidate to match the query.
     */
    protected int maxScorersMinOfNextDocs = -1;

    /**
     * Array that contains for each position:
     * <ul>
     * <li><code>true</code> if the scorer in {@link #maxScorers} in that
     * position matches {@link #doc}</li>
     * <li><code>false</code> otherwise</li>
     * </ul>
     * This array is used in the {@link #score()} method to determine which
     * max scorers should be used for scoring {@link #doc} and to get the
     * maximum score. All positions are initialized to <code>false</code>.
     */
    protected boolean[] maxScorersForCurrDoc;

    // The constructor and the doc and qWeight fields are part of the omitted code.

    protected void advanceMaxScorers() throws IOException {
        // Advance only the scorers positioned at or before the last candidate document
        for (int idx = 0; idx < this.maxScorersNextDocs.length; idx++) {
            if (this.maxScorersNextDocs[idx] != NO_MORE_DOCS
                    && this.maxScorersNextDocs[idx] <= this.maxScorersMinOfNextDocs) {
                this.maxScorersNextDocs[idx] = this.maxScorers[idx].nextDoc();
            }
        }
        // The new candidate is the minimum of the next documents of all scorers
        int minDoc = NO_MORE_DOCS;
        for (int idx = 0; idx < this.maxScorersNextDocs.length; idx++) {
            if (this.maxScorersNextDocs[idx] < minDoc) {
                minDoc = this.maxScorersNextDocs[idx];
            }
        }
        this.maxScorersMinOfNextDocs = minDoc;
    }

    @Override
    public float score() throws IOException {
        /*
         * Calculate the score as the maximum of the max scorers' scores
         * multiplied by the investment.
         * Float.MIN_VALUE is the smallest positive float, so it is not a
         * safe lower bound; start from negative infinity instead.
         */
        float max = Float.NEGATIVE_INFINITY;
        for (int idx = 0; idx < this.maxScorersForCurrDoc.length; idx++) {
            if (this.maxScorersForCurrDoc[idx]) {
                float score = this.maxScorers[idx].score();
                if (score > max) {
                    max = score;
                }
            }
        }
        float investment = this.investmentScorer.score();
        return this.qWeight * max * investment;
    }

    @Override
    public int nextDoc() throws IOException {
        // Advance the max scorers
        this.advanceMaxScorers();
        if (this.maxScorersMinOfNextDocs == NO_MORE_DOCS) {
            this.doc = NO_MORE_DOCS;
        } else {
            /* Assign the current document that matches this query and scorer */
            this.doc = this.maxScorersMinOfNextDocs;
            /* Advance the investment scorer to the current matching document
               (all documents have an investment) */
            this.investmentScorer.advance(this.doc);
            /* Determine which of the max scorers match the current matching
               document, in order to use them for the score calculation */
            for (int idx = 0; idx < this.maxScorersNextDocs.length; idx++) {
                this.maxScorersForCurrDoc[idx] = this.maxScorersNextDocs[idx] == this.doc;
            }
        }
        /* Return the current matching document */
        return this.doc;
    }
}
```

The `nextDoc()` method must always return a doc id that matches at least one of the clauses of the boolean query. For this reason, the overridden `nextDoc()` method calls the `advanceMaxScorers()` method. On its first call, this method calls the `nextDoc()` method of the scorer of each clause, stores the returned doc ids in an array of length N (`maxScorersNextDocs`), where N is the number of clauses, and stores the minimum doc id in the variable `maxScorersMinOfNextDocs`. In subsequent calls, the `nextDoc()` method of a clause's scorer is called only if that scorer's document number in `maxScorersNextDocs` is less than or equal to `maxScorersMinOfNextDocs`. The aim is to advance the scorers in such a way that we can determine whether one or more of them match the same document number. For example:
MaxScorerBDocs = [18,20]
In this example, the aim is to advance only MaxScorerA and not MaxScorerB, so that at some point document number 18 is the current document in `maxScorersNextDocs` for both scorers.
The `score()` method reads `maxScorersForCurrDoc` to determine which of the scorers match the current document set by `nextDoc()`. It then takes the maximum score among the matching `maxScorers` and multiplies it by the score of the `investmentScorer`.
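
The interplay between `nextDoc()` and `score()` can be tried out without Lucene. The following sketch is a simplified model under our own assumptions, not the real scorer: the `UnionMaxSketch` and `Postings` classes are hypothetical stand-ins for the clause scorers, and the doc ids, scores and investments in `main` are invented. It walks the union of the clauses in doc id order and, for each document, multiplies the maximum score of the matching clauses by the investment.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class UnionMaxSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Hypothetical stand-in for a clause scorer: sorted doc ids, one score per doc.
    static final class Postings {
        final int[] docs;
        final float[] scores;
        int pos = -1;

        Postings(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }

        int nextDoc() {
            pos++;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        float score() {
            return scores[pos];
        }
    }

    // Returns doc id -> (max clause score * investment), in doc id order.
    static Map<Integer, Float> run(Postings[] clauses, float[] investmentByDoc) {
        int[] nextDocs = new int[clauses.length];
        Arrays.fill(nextDocs, -1);
        int minOfNextDocs = -1;
        Map<Integer, Float> result = new LinkedHashMap<>();
        while (true) {
            // Advance only the clauses positioned at or before the last returned doc.
            for (int i = 0; i < clauses.length; i++) {
                if (nextDocs[i] != NO_MORE_DOCS && nextDocs[i] <= minOfNextDocs) {
                    nextDocs[i] = clauses[i].nextDoc();
                }
            }
            // The next matching doc is the minimum over all clauses.
            int minDoc = NO_MORE_DOCS;
            for (int doc : nextDocs) {
                minDoc = Math.min(minDoc, doc);
            }
            minOfNextDocs = minDoc;
            if (minDoc == NO_MORE_DOCS) {
                break;
            }
            // Take the maximum score among the clauses that match this doc.
            float max = Float.NEGATIVE_INFINITY;
            for (int i = 0; i < clauses.length; i++) {
                if (nextDocs[i] == minDoc) {
                    max = Math.max(max, clauses[i].score());
                }
            }
            result.put(minDoc, max * investmentByDoc[minDoc]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented data: clause A matches docs 3 and 18, clause B matches docs 18 and 20.
        Postings a = new Postings(new int[] {3, 18}, new float[] {0.5f, 0.25f});
        Postings b = new Postings(new int[] {18, 20}, new float[] {1.0f, 0.5f});
        float[] investmentByDoc = new float[21];
        investmentByDoc[3] = 2000f;
        investmentByDoc[18] = 1500f;
        investmentByDoc[20] = 1000f;
        System.out.println(run(new Postings[] {a, b}, investmentByDoc));
    }
}
```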
This way, we can customize the score of each document to implement the proposed formula. The `MaxQuery` class could be generalized for other similar problems that need a customized formula, and a generalized version of it could even be included in a future release of Lucene.
There are a number of tickets related to this:
References: