You Know, for Search…
I have heard a simplistic description of Elasticsearch that goes something like:
Elasticsearch is an index of inverted indices.
What is an inverted index? An inverted index stores documents in a way that lets us quickly find which documents a term appears in. For example, think about indexing all of Wikipedia and then finding every Wikipedia article that contains the word “drone” without having to wait. How did we get the “drone” articles?
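To make the idea concrete, here is a minimal sketch of an inverted index in Python. It is a toy and not how Elasticsearch stores data internally; the function name `build_inverted_index` and the three sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Hypothetical sample documents standing in for articles.
docs = {
    1: "the drone flew over the field",
    2: "bees include workers and a drone",
    3: "the field was empty",
}

index = build_inverted_index(docs)
drone_docs = sorted(index["drone"])
print(drone_docs)  # [1, 2]
```

Finding the “drone” articles is a single dictionary lookup; no document has to be scanned at search time.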
Long before we needed the articles, we pushed the articles as documents into Elasticsearch. Elasticsearch documentation refers to these moments as “index time”. Sure, the articles are stored in an inverted index or in most cases many inverted indices. However, before that happened, we could have directed Elasticsearch to “analyze” the articles.
One way to think of this step is as a chance to normalize the data. There is not really any difference between “Drones” and “drone” when we are searching for articles with references to “drones”. In the inverted index, all variations might be represented by a single form, “drone”. This might also be the time to take other ideas into consideration, like synonyms.
When we actually ask for the articles with the word “drone”, this gets referred to as “search time”. At search time, your query might get analyzed in a similar fashion so that you can search “Drones” or “a drone” and get back the articles containing “Drones” or “drone”.
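A rough sketch of what analysis could look like, assuming a tiny stop word list and a deliberately crude “strip the trailing s” rule. Real analyzers are far more careful; the name `analyze` and the `STOP_WORDS` set are illustrative only.

```python
# A tiny, made-up stop word list for illustration.
STOP_WORDS = {"a", "an", "the"}


def analyze(text):
    """Lowercase, strip punctuation, drop stop words, crudely singularize."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip('.,;:!?"\'')
        if not token or token in STOP_WORDS:
            continue
        # Crude stand-in for stemming: drop a trailing "s" on longer words.
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        tokens.append(token)
    return tokens


# The same function runs on documents at index time and on queries at search time.
print(analyze("Drones"))   # ['drone']
print(analyze("a drone"))  # ['drone']
```

Because both sides pass through the same normalization, a search for “Drones” and a search for “a drone” end up looking for the same term.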
When you look for “a drone”, the results you get back might conveniently be ordered so that the results containing the most “drone” terms appear before the results with fewer. This is possible because Elasticsearch scores results using term frequency and inverse document frequency.
The results with more instances of “drone” are likely to be more relevant. This is the idea behind term frequency.
A good setup likely ignored the “a” term because it is a “stop” word. Almost every article will contain it, so searching on such an extremely common term is not useful. But even if the “a” term had not been ignored, the top results are still likely to be related to “drones”. The more documents a term appears in, the less rare it is, and the less weight it carries. This is the idea behind inverse document frequency.
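Here is a small sketch of the two ideas working together, using the textbook formula score = tf × log(N / df). This is an assumption made for illustration; Elasticsearch's real scoring formulas are more involved. The function name `tf_idf_scores` and the sample documents are made up.

```python
import math


def tf_idf_scores(query_terms, docs):
    """Score each document as the sum over query terms of tf * idf."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    scores = {}
    for doc_id, tokens in tokenized.items():
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)  # term frequency in this document
            df = sum(1 for t in tokenized.values() if term in t)  # document frequency
            if tf == 0 or df == 0:
                continue
            idf = math.log(n_docs / df)  # rarer terms get a larger weight
            score += tf * idf
        scores[doc_id] = score
    return scores


# Hypothetical documents; "a" appears in all of them, "drone" in only two.
docs = {
    "a": "a drone is a drone",
    "b": "a drone flew past",
    "c": "a quiet field",
}

scores = tf_idf_scores(["a", "drone"], docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['a', 'b', 'c']
```

The “a” term appears in every document, so its weight is log(3 / 3) = 0 and it contributes nothing, even though it was never removed as a stop word. The ordering is decided entirely by how often “drone” appears.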
Elasticsearch can be distributed. One document might be stored in multiple inverted indices on multiple machines. At search time, your results might come from many machines in an Elasticsearch “cluster”. Why?
- One of these machines might go down, but your applications are uninterrupted because copies of documents were replicated onto multiple machines
- You get back your results faster because work is spread across multiple machines
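The two benefits above can be sketched with a toy model: documents are split across shards, each shard is copied to more than one machine, and a search asks one live copy of every shard. Everything here (the node names, `NUM_SHARDS`, the use of `zlib.crc32` for routing) is an assumption for illustration and not Elasticsearch's actual implementation.

```python
import zlib

NUM_SHARDS = 2

# Hypothetical layout: every shard has a copy on two different nodes.
shard_copies = {0: ["node-1", "node-2"], 1: ["node-2", "node-3"]}


def shard_for(doc_id):
    """Pick a shard from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def index_documents(docs):
    """Return {node: {shard: {doc_id: text}}} with every shard copy filled in."""
    nodes = {}
    for doc_id, text in docs.items():
        shard = shard_for(doc_id)
        for node in shard_copies[shard]:
            nodes.setdefault(node, {}).setdefault(shard, {})[doc_id] = text
    return nodes


def search(nodes, term, down=()):
    """Ask one live copy of every shard and merge the matching doc ids."""
    hits = set()
    for shard, copies in shard_copies.items():
        live = [n for n in copies if n not in down]
        if not live:
            continue  # every copy of this shard is unavailable
        shard_docs = nodes.get(live[0], {}).get(shard, {})
        hits.update(d for d, text in shard_docs.items()
                    if term in text.lower().split())
    return sorted(hits)


docs = {"doc-1": "a drone", "doc-2": "drone racing", "doc-3": "quiet field"}
nodes = index_documents(docs)

all_up = search(nodes, "drone")
one_down = search(nodes, "drone", down=("node-2",))
print(all_up == one_down)  # True
```

With `node-2` down, each shard still has one surviving copy, so the search returns the same documents. In a real cluster the per-shard lookups would also run in parallel, which is where the speedup comes from.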
What we control when working with the data and queries out of the box:
- whether data gets indexed
- whether and how data is analyzed before getting indexed
- how users’ queries are analyzed at search time
- how queries are built using Elasticsearch’s query language, which can filter, query, and run aggregations against the data
- which of several scoring algorithms is used to score results from term frequencies and inverse document frequencies
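As a rough illustration of where these controls live, here is a sketch of an index mapping and a search request body written as Python dictionaries. The field names (`body`, `revision_id`, `language`, `category`) are made up for this example, and the exact options available depend on your Elasticsearch version, so treat this as a starting point and not a reference.

```python
import json

# Hypothetical mapping for an index of articles.
mapping = {
    "mappings": {
        "properties": {
            # Stored with the document but not indexed, so not searchable.
            "revision_id": {"type": "keyword", "index": False},
            "body": {
                "type": "text",
                "analyzer": "english",         # analysis at index time
                "search_analyzer": "english",  # analysis at search time
                "similarity": "BM25",          # scoring algorithm for this field
            },
            "language": {"type": "keyword"},
            "category": {"type": "keyword"},
        }
    }
}

# Hypothetical search body: a scored query, an unscored filter, and an aggregation.
search_body = {
    "query": {
        "bool": {
            "must": [{"match": {"body": "a drone"}}],
            "filter": [{"term": {"language": "en"}}],
        }
    },
    "aggs": {
        "by_category": {"terms": {"field": "category"}}
    },
}

print(json.dumps(search_body, indent=2))
```

The `match` clause contributes to the relevance score, the `filter` clause only narrows the result set, and the `aggs` section counts matching articles per category.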