Elasticsearch – Analysis

When text is indexed, or a query is processed during a search operation, the content is analyzed by the Elasticsearch analysis module. This module consists of analyzers, tokenizers, token filters, and char filters. If no analyzer is defined, then by default the built-in analyzers, tokenizers, token filters, and char filters get registered with the analysis module.
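
These building blocks can also be combined into a custom analyzer of our own. The following sketch (the index name my_analysis_demo and the analyzer name my_custom_analyzer are made up for illustration) strips HTML with the html_strip char filter, tokenizes with the standard tokenizer, and then applies the lowercase and asciifolding token filters −

PUT my_analysis_demo
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_custom_analyzer": {
               "type": "custom",
               "char_filter": [ "html_strip" ],
               "tokenizer": "standard",
               "filter": [ "lowercase", "asciifolding" ]
            }
         }
      }
   }
}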

In the following example, we use the standard analyzer, which is the default analyzer when no other analyzer is specified. It analyzes the sentence based on grammar and produces the words used in the sentence.

POST _analyze
{
   "analyzer": "standard",
   "text": "Today's weather is beautiful"
}

On running the above code, we get the response as shown below −

{
   "tokens" : [
      {
         "token" : "today's",
         "start_offset" : 0,
         "end_offset" : 7,
         "type" : "",
         "position" : 0
      },
      {
         "token" : "weather",
         "start_offset" : 8,
         "end_offset" : 15,
         "type" : "",
         "position" : 1
      },
      {
         "token" : "is",
         "start_offset" : 16,
         "end_offset" : 18,
         "type" : "",
         "position" : 2
      },
      {
         "token" : "beautiful",
         "start_offset" : 19,
         "end_offset" : 28,
         "type" : "",
         "position" : 3
      }
   ]
}

Configuring the Standard Analyzer

We can configure the standard analyzer with various parameters to suit our custom requirements.

In the following example, we configure the standard analyzer to have a max_token_length of 5.

For this, we first create an index with an analyzer that has the max_token_length parameter.

PUT index_4_analysis
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_english_analyzer": {
               "type": "standard",
               "max_token_length": 5,
               "stopwords": "_english_"
            }
         }
      }
   }
}

Next, we apply the analyzer to a text as shown below. Note that the word "is" does not appear in the output at all: the analyzer is configured with the "_english_" stop word list, so English stop words are removed, which is also why position 4 is skipped in the response. Note also that tokens longer than five characters are split into chunks of at most five characters, so "weather" becomes "weath" and "er".

POST index_4_analysis/_analyze
{
   "analyzer": "my_english_analyzer",
   "text": "Today's weather is beautiful"
}

On running the above code, we get the response as shown below −

{
   "tokens" : [
      {
         "token" : "today",
         "start_offset" : 0,
         "end_offset" : 5,
         "type" : "",
         "position" : 0
      },
      {
         "token" : "s",
         "start_offset" : 6,
         "end_offset" : 7,
         "type" : "",
         "position" : 1
      },
      {
         "token" : "weath",
         "start_offset" : 8,
         "end_offset" : 13,
         "type" : "",
         "position" : 2
      },
      {
         "token" : "er",
         "start_offset" : 13,
         "end_offset" : 15,
         "type" : "",
         "position" : 3
      },
      {
         "token" : "beaut",
         "start_offset" : 19,
         "end_offset" : 24,
         "type" : "",
         "position" : 5
      },
      {
         "token" : "iful",
         "start_offset" : 24,
         "end_offset" : 28,
         "type" : "",
         "position" : 6
      }
   ]
}
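
Once the index exists, the configured analyzer can also be attached to a text field in the index mapping, so that documents indexed into that field are analyzed with it. This is only a sketch; the field name title is chosen as an example −

PUT index_4_analysis/_mapping
{
   "properties": {
      "title": {
         "type": "text",
         "analyzer": "my_english_analyzer"
      }
   }
}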

The list of various analyzers and their descriptions is given in the table shown below −

S.No   Analyzer & Description

1   Standard analyzer (standard)
    The stopwords and max_token_length settings can be set for this analyzer. By default, the stopwords list is empty and max_token_length is 255.

2   Simple analyzer (simple)
    This analyzer is composed of a lowercase tokenizer.

3   Whitespace analyzer (whitespace)
    This analyzer is composed of a whitespace tokenizer.

4   Stop analyzer (stop)
    The stopwords and stopwords_path settings can be configured. By default, stopwords are initialized to English stop words, and stopwords_path contains the path to a text file with stop words.
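
The built-in analyzers listed above can be tried out directly with the _analyze API. As a quick sketch, the following request runs the whitespace analyzer on the same sample sentence; since it splits only on whitespace, "Today's" is kept as a single token and nothing is lowercased −

POST _analyze
{
   "analyzer": "whitespace",
   "text": "Today's weather is beautiful"
}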

Tokenizers

Tokenizers are used for generating tokens from text in Elasticsearch. Text can be broken down into tokens by taking whitespace or other punctuation into account. Elasticsearch has plenty of built-in tokenizers, which can be used in custom analyzers.

The following example uses the lowercase tokenizer, which breaks text into terms whenever it encounters a character that is not a letter and also lowercases all terms −

POST _analyze
{
   "tokenizer": "lowercase",
   "text": "It Was a Beautiful Weather 5 Days ago."
}

On running the above code, we get the response as shown below −

{
   "tokens" : [
      {
         "token" : "it",
         "start_offset" : 0,
         "end_offset" : 2,
         "type" : "word",
         "position" : 0
      },
      {
         "token" : "was",
         "start_offset" : 3,
         "end_offset" : 6,
         "type" : "word",
         "position" : 1
      },
      {
         "token" : "a",
         "start_offset" : 7,
         "end_offset" : 8,
         "type" : "word",
         "position" : 2
      },
      {
         "token" : "beautiful",
         "start_offset" : 9,
         "end_offset" : 18,
         "type" : "word",
         "position" : 3
      },
      {
         "token" : "weather",
         "start_offset" : 19,
         "end_offset" : 26,
         "type" : "word",
         "position" : 4
      },
      {
         "token" : "days",
         "start_offset" : 29,
         "end_offset" : 33,
         "type" : "word",
         "position" : 5
      },
      {
         "token" : "ago",
         "start_offset" : 34,
         "end_offset" : 37,
         "type" : "word",
         "position" : 6
      }
   ]
}

A list of tokenizers and their descriptions is given in the table shown below −

S.No   Tokenizer & Description

1   Standard tokenizer (standard)
    This is built on a grammar-based tokenizer, and max_token_length can be configured for it.

2   Edge NGram tokenizer (edge_ngram)
    Settings like min_gram, max_gram and token_chars can be set for this tokenizer.

3   Keyword tokenizer (keyword)
    This generates the entire input as a single output token, and buffer_size can be set for it.

4   Letter tokenizer (letter)
    This captures the whole word until a non-letter character is encountered.
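
As with the analyzers above, a tokenizer from this table can be configured inside an index of our own. The sketch below is illustrative only (the index name index_5_analysis and the names my_edge_analyzer and my_edge_tokenizer are made up); it defines an edge n-gram tokenizer that emits word prefixes of 2 to 5 letters −

PUT index_5_analysis
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_edge_analyzer": {
               "type": "custom",
               "tokenizer": "my_edge_tokenizer"
            }
         },
         "tokenizer": {
            "my_edge_tokenizer": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 5,
               "token_chars": [ "letter" ]
            }
         }
      }
   }
}

Analyzing the word "Weather" with my_edge_analyzer would then produce the prefix tokens "We", "Wea", "Weat" and "Weath".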
