Weighted Random Selection with Elasticsearch

I recently found an interesting Stack Overflow question that I answered without giving it the thought it deserved. The poster was storing documents in Elasticsearch and using function_score with its random_score option to randomly select a document from a set. This seemed fairly straightforward. The poster also wanted to boost certain documents, so my first response suggested using a query-time boost to accomplish the task. However, what the poster really wanted was weighted random sampling, and upon further exploration I discovered it was not the simple matter of constructing the right query that I had initially thought.

I also realized the value of an answer to a question like this. Suppose you wanted to feature a randomly selected product to each visitor. Because it’s summer, you might also want some products to have a higher probability of being selected relative to others. Out of the box, there is no way to accomplish this with Elasticsearch and guarantee that a set of seasonal products is twice as likely to be selected as the others.

Weighted Random Sampling (without Elasticsearch)

Let’s start with a contrived example to use for this post:

  • A very fashionable umbrella
  • Socks
  • Dress shoes
  • Shovel (weight: 2)
  • Swim wear (weight: 3)
  • Sunscreen (weight: 3)

Since it is summer, I assigned higher weights to items like swim wear and sunscreen. The intention is for the sunscreen to be selected as the featured product with three times the likelihood of the umbrella.

The quick solution that comes to mind is to construct an artificial set in which the shovel appears twice, the swim wear appears three times, and the sunscreen appears three times. Selecting randomly from this set gives the sunscreen three times the umbrella’s probability of being selected.

Of course, that does not scale well. Just as easily, we can take the sum of the weights (the size of that artificial set) and select a random number in the range 0 to sum - 1. Map this number back to an index within the original set and we have our featured product!

products = ['umbrella', 'socks', 'shoes', 'shovel', 'swimwear', 'sunscreen']
weightSum = 11
randSel = Math.floor(Math.random() * weightSum) // within [0-10]
mapping = {
  0: 0,
  1: 1,
  2: 2,
  3: 3, 4: 3,
  5: 4, 6: 4, 7: 4,
  8: 5, 9: 5, 10: 5
}
featured = products[mapping[randSel]]
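
The hard-coded mapping does not need to be built by hand. Below is a minimal sketch of the same idea in JavaScript, walking cumulative weights until the random number falls inside an item's range (the product names and weights are the ones from the example above):

// Each item carries its weight; items without an explicit weight default to 1.
const products = [
  { name: 'umbrella',  weight: 1 },
  { name: 'socks',     weight: 1 },
  { name: 'shoes',     weight: 1 },
  { name: 'shovel',    weight: 2 },
  { name: 'swimwear',  weight: 3 },
  { name: 'sunscreen', weight: 3 }
];

function weightedPick(items) {
  const weightSum = items.reduce((sum, item) => sum + item.weight, 0);
  let remaining = Math.floor(Math.random() * weightSum); // within [0, weightSum - 1]
  for (const item of items) {
    if (remaining < item.weight) {
      return item; // the random number landed in this item's slice of the range
    }
    remaining -= item.weight;
  }
}

const featured = weightedPick(products);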

I took Elasticsearch out of the equation because, without it, it is easier to see that weighted selection depends on two things: the total number of documents and the sum of the weights assigned to them.

Now With Elasticsearch

With Elasticsearch, I was unable to find a way to assign scores based on a sum of weights such that the top hit could act as the featured product.

The function_score query is a flexible option that I have seen recommended for this purpose, but a combination of random_score and weight is not enough, as this gist demonstrates.
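
A function_score query combining the two might look roughly like the following sketch; the score_mode/boost_mode choices and the weight value 3 are assumptions here, standing in for the FEATURED_FLAG_WEIGHT described below, and the exact query in the gist may differ:

{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "functions": [
        { "filter": { "term": { "featured_flag": true } }, "weight": 3 },
        { "random_score": {} }
      ],
      "score_mode": "multiply",
      "boost_mode": "replace"
    }
  },
  "size": 1
}

Multiplying a uniform random score by a constant weight does skew how often the flagged document ranks first, but not in proportion to the weight, which is what the stats collected by the gist demonstrate.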

Some notes on the script:

  • the script creates one document with featured_flag enabled and one document with undesired_flag enabled that should never be chosen
  • TOTAL_DOCUMENTS can be used to adjust how many documents are created in total (including the first two)
  • FEATURED_FLAG_WEIGHT is the weight applied at query time via function_score
  • the script reruns the same query 1000 times and outputs stats on how many times each of the created documents was returned as the first result

Leveraging Aggregations

One possible solution is to make an initial request to Elasticsearch leveraging aggregations and a second request with a random offset selected outside of Elasticsearch.

What we need is to figure out how many documents there are for each weight. We can do this with a simple terms aggregation.

{
  "query": ...,
  "aggs": {
    "weights": {
      "terms": {
        "field": "weight",
        "size": 0
      }
    }
  },
  "size": 0
}
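
For completeness, here is one way the request could be issued from JavaScript; the index name products and the localhost host are assumptions for the sake of the sketch:

// Fetch the per-weight document counts (Node 18+ global fetch).
async function fetchWeightBuckets() {
  const response = await fetch('http://localhost:9200/products/_search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      aggs: { weights: { terms: { field: 'weight', size: 0 } } },
      size: 0
    })
  });
  const { aggregations } = await response.json();
  return aggregations.weights.buckets;
}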

Using the initial example, a response might look like:

{
    ...,
    "aggregations" : {
        "weights" : {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets" : [
                {
                    "key" : "1",
                    "doc_count" : 3
                },
                {
                    "key" : "3",
                    "doc_count" : 2
                },
                {
                    "key" : "2",
                    "doc_count" : 1
                }
            ]
        }
    }
}
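
From these buckets, the weight sum is 1 × 3 + 3 × 2 + 2 × 1 = 11, the same size as the artificial set from the earlier example, so the random number is again drawn from [0-10].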

Random Offset

The aggregation results provide the key pieces we need:

  • sum of weights to define a range for the initial random number
  • count for each of the weights to map the random number to a valid offset

With the random offset selected outside of Elasticsearch, the second request looks like:

{
  "query": ...,
  "from": <RANDOM OFFSET>,
  "size": 1,
  "sort": [
    "weight",
    "id"
  ]
}

I explicitly set the sort so that the random offset applies to the same ordering of documents each time.
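
Putting the two requests together, the offset calculation done outside of Elasticsearch might look like this sketch; the bucket values are the ones from the sample response above, and the variable names are only illustrative:

// Buckets from the terms aggregation: key = weight, doc_count = documents with that weight.
const buckets = [
  { key: 1, doc_count: 3 },
  { key: 3, doc_count: 2 },
  { key: 2, doc_count: 1 }
];

// Sort ascending by weight so offsets line up with the "sort": ["weight", "id"] of the second query.
const byWeight = [...buckets].sort((a, b) => a.key - b.key);

// The sum of weights defines the range for the random number.
const weightSum = byWeight.reduce((sum, b) => sum + b.key * b.doc_count, 0);
let remaining = Math.floor(Math.random() * weightSum); // within [0, weightSum - 1]

// Map the random number back to a document offset.
let offset = 0;
for (const bucket of byWeight) {
  const bucketSpan = bucket.key * bucket.doc_count; // random values covered by this weight
  if (remaining < bucketSpan) {
    offset += Math.floor(remaining / bucket.key);   // position within this bucket
    break;
  }
  remaining -= bucketSpan;
  offset += bucket.doc_count;
}

// offset now feeds the "from" of the second query (with "size": 1).

Each document with weight w covers exactly w of the possible random values, which reproduces the hand-built mapping from the first example.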

Where the Solution Falls Short

This solution falls short if you need to extend it to weighted reservoir sampling (selecting more than one document). You might be able to fake it via Elasticsearch’s multi search API or by making multiple requests, with each subsequent request filtering out the results already obtained. Another concern is a large number of unique weight values, which increases the work required to select the random offset.
