Indexing HTML content in Elasticsearch

Notes on Elasticsearch’s html_strip filter

Elasticsearch lets you define custom analyzers that control how your content is indexed. What if that content is HTML? Perhaps you need to index content submitted through rich text editors in WordPress forms, or you are improving the search experience for your intranet. Elasticsearch’s html_strip character filter is handy in these scenarios.

If you start out with the default standard analyzer, Elasticsearch may appear to return reasonable results for your queries. On closer inspection, you will notice that phrase queries miss documents they should match. You may also see odd, unexpected results because HTML tags and attribute values are indexed, or because HTML entities have not been decoded.
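To make the symptoms concrete, here is a small Python sketch. It does not call Elasticsearch; `rough_tokens` is a hypothetical helper that only approximates an analyzer by lowercasing runs of letters and digits, so treat the output as an illustration of the problem rather than the exact tokens the standard analyzer would emit.

```python
import re


def rough_tokens(text):
    """Rough stand-in for an analyzer: lowercase runs of letters and digits.

    This is an approximation for illustration, not the standard tokenizer.
    """
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def phrase_matches(tokens, phrase):
    """Return True when the phrase's tokens appear consecutively in tokens."""
    wanted = rough_tokens(phrase)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


raw = 'A <b>rich</b> text editor for the caf&eacute; <a href="/menu">menu</a>'
tokens = rough_tokens(raw)

# Tag names, attribute names and values, and entity fragments all become tokens.
print(tokens)

# The closing </b> tag leaves a stray "b" token between "rich" and "text",
# so a phrase search for "rich text" no longer finds adjacent tokens.
print(phrase_matches(tokens, "rich text"))
```

The stray `b`, `href`, and `eacute` tokens are the kind of noise that stripping markup and decoding entities before tokenization is meant to remove.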

Elasticsearch’s html_strip filter

Incorporating Elasticsearch’s dedicated character filter is straightforward. Here is a sample analyzer named content that uses html_strip.

"content": {
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "stop",
    "synonym",
    "snowball"
  ]
}

The analyzer strips HTML elements and decodes HTML entities before the standard tokenizer splits the text into tokens, which then pass through the lowercase, stop, synonym, and snowball filters.
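For context, the analyzer above has to live under settings.analysis.analyzer when the index is created. The Python sketch below only builds and prints the request bodies; it does not talk to a cluster. It assumes that "synonym" refers to a custom token filter defined in the same index settings (the synonym list here is made up), and the index name my-index in the comments is hypothetical.

```python
import json

# Body for creating an index, e.g. PUT /my-index (index name is hypothetical).
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Assumption: "synonym" is a custom filter defined by the author.
                # The synonym list below is invented for illustration.
                "synonym": {
                    "type": "synonym",
                    "synonyms": ["laptop, notebook"],
                }
            },
            "analyzer": {
                "content": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "synonym", "snowball"],
                }
            },
        }
    }
}

# Body for the analyze API on that index, e.g. GET /my-index/_analyze,
# which returns the tokens the "content" analyzer produces for the text.
analyze_body = {
    "analyzer": "content",
    "text": "<p>Caf&eacute; <b>menus</b></p>",
}

print(json.dumps(index_body, indent=2))
print(json.dumps(analyze_body, indent=2))
```

Sending the second body to the analyze API is a quick way to check that tags are gone and entities are decoded before committing to a mapping.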

When html_strip falls short

There are a few situations that make the built-in filter tricky to use.

  • The filter falls short when working with invalid HTML. In these cases, you might have to clean up or strip the markup yourself before submitting the content to Elasticsearch.

  • Headings, body content, etc. need to be parsed out of the markup before indexing if you need to differentiate, target, and boost particular pieces of the content.
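Both situations can be handled before the content reaches Elasticsearch. The sketch below is one possible approach, not a prescribed one: it uses Python's standard html.parser, which tolerates unclosed tags, to split a document into separate headings and body strings. The class name FieldExtractor, the helper to_document, and the field names are all hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}


class FieldExtractor(HTMLParser):
    """Collect heading text and body text separately, ignoring all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute; into text.
        super().__init__(convert_charrefs=True)
        self.headings = []
        self.body = []
        self._in_heading = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
        elif tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False
        elif tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or self._skip_depth:
            return
        (self.headings if self._in_heading else self.body).append(text)


def to_document(markup):
    """Return a dict with plain-text 'headings' and 'body' fields."""
    parser = FieldExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of input
    return {
        "headings": " ".join(parser.headings),
        "body": " ".join(parser.body),
    }


# The <b> and <p> tags are never closed, yet the text is still recovered.
doc = to_document("<h1>Caf&eacute; hours</h1><p>Open <b>daily<p>Closed on holidays")
print(doc)
```

With the text split this way, each field can be mapped separately and heading matches can be weighted more heavily at query time. Note that this sketch joins text fragments with spaces, so a word split by inline tags would be indexed as two pieces.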

© 2019. All rights reserved.