Indexing HTML content in Elasticsearch

Notes on Elasticsearch’s html_strip filter

LAST UPDATED on 22 Nov 2019 – Elasticsearch allows you to create analyzers that give you control over how your content is indexed. What if that content is HTML? Perhaps you need to index content submitted via rich text editors in WordPress forms. Or, you might be improving the search experience for your intranet. Elasticsearch's html_strip character filter is handy in these types of scenarios.

If you start out with the default standard analyzer, you might find that Elasticsearch does fine returning results for simple queries. Upon closer inspection, however, you will notice that phrase queries miss results they should match. Alternatively, Elasticsearch returns odd results you did not expect because HTML tags and attribute values are indexed, or because HTML entities have not been decoded.
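To see why, you can run an HTML snippet through the standard analyzer with the Analyze API and inspect the tokens it produces:

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "<p>The quick brown fox jumps over <strong>the lazy dog.</strong></p>"
}
```

The tag names p and strong show up in the token stream alongside the real words, so a search for "strong" would match this document even though that word never appears in the visible text.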

Elasticsearch’s html_strip filter

Incorporating Elasticsearch’s html_strip character filter is straightforward. Here is a sample analyzer, named content, that leverages html_strip.

"content": {
  "char_filter": [
    "html_strip"
  ],
  "filter": [
    "lowercase",
    "stop"
  ],
  "tokenizer": "standard"
}

The analyzer strips HTML elements and decodes HTML entities prior to piping the content through the lowercase and stop filters.
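For reference, here is a sketch of an index-creation request that registers the analyzer under analysis settings and applies it to a text field. The index name my_index and the field name body are placeholders:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "content": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "filter": ["lowercase", "stop"],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "content"
      }
    }
  }
}
```

Any document indexed into the body field will now have its HTML stripped at analysis time.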

Review of Analysis with html_strip

Elasticsearch's Analyze API endpoint allows you to review the results of the analysis process.

Let’s use the following sample text:

<p>The quick brown fox jumps over <strong>the lazy dog.</strong></p>

Then, a request against the analyze endpoint

GET /_analyze
{
  "char_filter" : ["html_strip"],
  "filter" : ["lowercase", "stop"],
  "tokenizer" : "standard",
  "text" : "<p>The quick brown fox jumps over <strong>the lazy dog.</strong></p>"
}

will yield:

{
  "tokens":[
    {"token":"the","start_offset":3,"end_offset":6,"type":"<ALPHANUM>","position":0},
    {"token":"quick","start_offset":7,"end_offset":12,"type":"<ALPHANUM>","position":1},
    {"token":"brown","start_offset":13,"end_offset":18,"type":"<ALPHANUM>","position":2},
    {"token":"fox","start_offset":19,"end_offset":22,"type":"<ALPHANUM>","position":3},
    {"token":"jumps","start_offset":23,"end_offset":28,"type":"<ALPHANUM>","position":4},
    {"token":"over","start_offset":29,"end_offset":33,"type":"<ALPHANUM>","position":5},
    {"token":"the","start_offset":42,"end_offset":45,"type":"<ALPHANUM>","position":6},
    {"token":"lazy","start_offset":46,"end_offset":50,"type":"<ALPHANUM>","position":7},
    {"token":"dog","start_offset":51,"end_offset":54,"type":"<ALPHANUM>","position":8}
  ]
}

As expected, the p and strong markup tags have been dropped, leaving only the key tokens from the sample text.

Check out a visual comparison of the results of the html_strip character filter against the standard analyzer in my elasticsearch-analysis-inspector app.

When html_strip falls short

There are a few situations that make leveraging the built-in filter tricky.

  • The filter falls short when working with invalid HTML. In these instances, you might have to find a way to clean up the content of HTML tags before submitting the content to Elasticsearch.

  • Headings, body content, etc. need to be parsed out of the markup if you want to differentiate, target, and boost particular pieces of the content.
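Both problems can be addressed by pre-processing the HTML before it reaches Elasticsearch. As a minimal sketch, Python's standard-library HTMLParser is lenient with malformed markup and can be used to split heading text from body text so each can be indexed into its own (boosted) field. The class and field names here are illustrative, not part of any Elasticsearch API:

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collect heading text separately from body text so that
    headings can be indexed into their own (boosted) field."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes HTML entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.headings = []
        self.body = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        (self.headings if self._in_heading else self.body).append(text)


html = "<h1>Search tips</h1><p>The quick &amp; lazy way.</p>"
extractor = SectionExtractor()
extractor.feed(html)
print(" ".join(extractor.headings))  # Search tips
print(" ".join(extractor.body))      # The quick & lazy way.
```

The extracted strings can then be sent to Elasticsearch as separate fields, with a higher boost applied to the headings field at query time.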
