-
Overview
All content comes from the Elasticsearch Reference 7.0, collected here for study.
Text analysis is the process of converting unstructured text, like the body of an email or a product description, into a structured format that's optimized for search.
-
Tokenization
Analysis makes full-text search possible through tokenization: breaking a text down into small chunks, called tokens. In most cases, these tokens are individual words.
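As a minimal sketch of the idea (not Elasticsearch's actual tokenizer), text can be split into word tokens like this:

```python
import re

def tokenize(text):
    # Split on runs of non-word characters and drop empty chunks,
    # so each remaining token is an individual word.
    return [t for t in re.split(r"\W+", text) if t]

print(tokenize("The QUICK brown fox!"))  # ['The', 'QUICK', 'brown', 'fox']
```

Note that the tokens still carry their original capitalization; normalization (below) is a separate step.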
-
Normalization
Tokenization enables matching on individual terms, but each token is still matched literally. To make search handle synonyms, words with similar meanings, or different forms of the same root word, the tokens are normalized.
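A toy illustration of normalization, assuming a hypothetical one-entry synonym table (real analyzers use dedicated lowercase and synonym token filters):

```python
SYNONYMS = {"quick": "fast"}  # hypothetical synonym table

def normalize(token):
    # Lowercase so "Quick" and "quick" match literally, then map
    # synonyms so "quick" and "fast" reduce to the same term.
    token = token.lower()
    return SYNONYMS.get(token, token)

print([normalize(t) for t in ["Quick", "FAST"]])  # ['fast', 'fast']
```

After normalization, a query for "Quick" and a document containing "fast" share the same indexed term.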
-
Analyzer
Text analysis is performed by an analyzer, a set of rules that govern the entire process.
A custom analyzer gives you control over each step of the analysis process, including:
- Changes to the text before tokenization
- How text is converted to tokens
- Normalization changes made to tokens before indexing or search
An analyzer, whether built-in or custom, is just a package containing three lower-level building blocks: character filters, tokenizers, and token filters.
-
Character Filters
A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.
An analyzer may have zero or more character filters.
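For instance, a character filter might replace symbols with words before tokenization ever happens. A rough sketch of that idea (the mapping table here is hypothetical):

```python
def mapping_char_filter(text, mappings):
    # Replace each mapped character sequence in the raw stream,
    # similar in spirit to a mapping-style character filter.
    for old, new in mappings.items():
        text = text.replace(old, new)
    return text

print(mapping_char_filter("rock & roll", {"&": "and"}))  # 'rock and roll'
```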
-
Tokenizer
A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. The tokenizer is also responsible for recording the order or position of each term, and the start and end character offsets of the original word which the term represents.
An analyzer must have exactly one tokenizer.
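The positions and offsets a tokenizer records can be sketched like this (a simplified model of the metadata, not Elasticsearch's internal representation):

```python
import re

def tokenize_with_offsets(text):
    # Emit each token together with its position in the stream and
    # its start/end character offsets in the original text.
    return [
        {"token": m.group(), "position": i,
         "start_offset": m.start(), "end_offset": m.end()}
        for i, m in enumerate(re.finditer(r"\w+", text))
    ]

for t in tokenize_with_offsets("quick fox"):
    print(t)
```

Positions make phrase queries possible; offsets make highlighting possible.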
-
Token filters
A token filter receives the token stream and may add, remove, or change tokens. An analyzer may have zero or more token filters, which are applied in order.
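Because token filters run in order, a later filter sees the output of an earlier one. A sketch with two example filters (the stop-word list is hypothetical):

```python
STOPWORDS = {"the", "a", "an"}  # hypothetical stop-word list

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    # Remove stop words; running after lowercasing means "The"
    # has already become "the" and is caught here.
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["The", "Quick", "Fox"]
for f in (lowercase_filter, stop_filter):  # filters apply in order
    tokens = f(tokens)
print(tokens)  # ['quick', 'fox']
```

Reversing the order would let "The" slip past the stop filter, which is why filter order matters.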
-
Index and search analysis
Text analysis occurs at two times: when a document is indexed, any text field values are analyzed (index time); and when running a full-text search on a text field, the query string is analyzed (search time, also called query time). The analyzer, or set of analysis rules, used at each time is called the index analyzer or search analyzer, respectively.
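Applying the same rules at both times is what makes document terms and query terms comparable. A minimal sketch of that symmetry:

```python
def analyze(text):
    # The same analysis rules at index time and search time keep
    # document terms and query terms in the same normalized form.
    return [t.lower() for t in text.split()]

index_terms = set(analyze("Quick Brown Fox"))   # index time
query_terms = analyze("quick fox")              # search time
print(all(t in index_terms for t in query_terms))  # True
```

If the index analyzer lowercased but the search analyzer did not, a query for "Quick" would fail to match the indexed term "quick".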
-
Stemming
Stemming is the process of reducing a word to its root form.
For example, walking and walked can be stemmed to the same root word: walk. In some cases, the root form of a stemmed word may not be a real word.
Stemming is handled by stemmer token filters. These token filters can be categorized based on how they stem words: algorithmic stemmers and dictionary stemmers.
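A toy algorithmic stemmer gives the flavor of suffix stripping; real algorithmic stemmers such as Porter apply far more rules, and this sketch is not Elasticsearch's implementation:

```python
SUFFIXES = ("ing", "ed", "s")  # a few common English suffixes

def stem(word):
    # Toy algorithmic stemmer: strip the first matching suffix,
    # but only if a reasonable-length stem would remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("walking"), stem("walked"))  # walk walk
```

A dictionary stemmer, by contrast, looks each word up in a dictionary of known word-to-root mappings instead of applying rules.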
Reposted from blog.csdn.net/The_Time_Runner/article/details/111709150