Indexing HTML content in Elasticsearch
Last updated on 22 Nov 2019
Elasticsearch allows you to create analyzers, giving you control over how your content is indexed. But what if that content is HTML? Perhaps you need to index content submitted via rich text editors in WordPress forms, or you might be improving the search experience for your intranet. Elasticsearch's html_strip character filter is handy in these types of scenarios.
If you start out with the default standard analyzer, you might find that Elasticsearch returns reasonable results for your queries. Upon closer inspection, though, you will notice that phrase queries miss results they should match. Alternatively, Elasticsearch returns odd results you did not expect, because HTML tags and the values of tag attributes are indexed, or because HTML entities have not been decoded.
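You can see this for yourself with the Analyze API. For instance, running a link through the default standard analyzer (example.com here is just a placeholder URL):

GET /_analyze
{
  "analyzer" : "standard",
  "text" : "<a href=\"https://example.com\">Read more</a>"
}

The resulting token list includes stray tokens such as a, href, and https alongside the visible text you actually care about.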
Elasticsearch's html_strip character filter
Incorporating Elasticsearch's html_strip character filter is straightforward. Here is a sample analyzer named content that leverages html_strip.
"content": {
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"stop"
],
"tokenizer": "standard"
}
The analyzer strips HTML elements and decodes HTML entities before the text is split into tokens by the standard tokenizer and piped through the lowercase and stop token filters.
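To put the analyzer to use, register it under the index's analysis settings and assign it to a field in the mapping. Here is a minimal sketch, assuming a hypothetical index named articles with a single text field named body:

PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "content": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "filter": ["lowercase", "stop"],
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "content"
      }
    }
  }
}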
Review of Analysis with html_strip
Elasticsearch's Analyze API endpoint allows you to review the results of the analysis process.
Let’s use the following sample text:
<p>The quick brown fox jumps over <strong>the lazy dog.</strong></p>
Then, a request against the analyze endpoint

GET /_analyze
{
  "char_filter" : ["html_strip"],
  "filter" : ["lowercase", "stop"],
  "tokenizer" : "standard",
  "text" : "<p>The quick brown fox jumps over <strong>the lazy dog.</strong></p>"
}
will yield:
{
  "tokens": [
    {"token":"the","start_offset":3,"end_offset":6,"type":"<ALPHANUM>","position":0},
    {"token":"quick","start_offset":7,"end_offset":12,"type":"<ALPHANUM>","position":1},
    {"token":"brown","start_offset":13,"end_offset":18,"type":"<ALPHANUM>","position":2},
    {"token":"fox","start_offset":19,"end_offset":22,"type":"<ALPHANUM>","position":3},
    {"token":"jumps","start_offset":23,"end_offset":28,"type":"<ALPHANUM>","position":4},
    {"token":"over","start_offset":29,"end_offset":33,"type":"<ALPHANUM>","position":5},
    {"token":"the","start_offset":42,"end_offset":45,"type":"<ALPHANUM>","position":6},
    {"token":"lazy","start_offset":46,"end_offset":50,"type":"<ALPHANUM>","position":7},
    {"token":"dog","start_offset":51,"end_offset":54,"type":"<ALPHANUM>","position":8}
  ]
}
As expected, the p and strong markup tags have been dropped, and only the key tokens from the sample text remain.
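With the analyzer in place, phrase queries behave as you would expect. Continuing with the hypothetical articles index sketched above:

PUT /articles/_doc/1
{
  "body": "<p>The quick brown fox jumps over <strong>the lazy dog.</strong></p>"
}

GET /articles/_search
{
  "query": {
    "match_phrase": {
      "body": "quick brown fox"
    }
  }
}

The match_phrase query finds the document because the indexed tokens quick, brown, and fox sit in consecutive positions, with no stray markup tokens in between.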
Check out a visual comparison of the results of the html_strip character filter against the standard analyzer in my elasticsearch-analysis-inspector app.
When html_strip falls short
There are a few situations that make leveraging the built-in filter tricky.
- The filter falls short when working with invalid HTML. In these instances, you might have to find a way to clean up the HTML content before submitting it to Elasticsearch.
- Headings, body content, etc. need to be parsed out of the markup if you want to differentiate, target, and boost particular pieces of the content (a query-time sketch follows below).
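If an upstream parsing step splits a page into separate fields, boosting becomes a query-time concern. A minimal sketch, assuming hypothetical title and body fields populated by your own HTML parsing:

GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "brown fox",
      "fields": ["title^2", "body"]
    }
  }
}

Here the ^2 syntax doubles the weight of matches in the title field relative to the body.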