Analyzers
When a document is indexed, the analyzer extracts tokens from it to support index storage and search. An analyzer is made up of a single tokenizer and zero or more token filters.
Before the tokenizer runs, the text may go through some preprocessing, such as stripping HTML markup. These preprocessing steps are called character filters, and an analyzer can have zero or more of them; they run before the tokenizer. The tokenizer then breaks the string into a stream of tokens.
Token filters further process the tokens produced by the tokenizer, for example converting them to lowercase. The processed results are called terms, and the number of times a given term occurs in a document is its term frequency. The engine builds an inverted index from terms to the original documents, so a document can be found quickly by the terms it contains.
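As a rough sketch of this pipeline, the `_analyze` API can be used to see which terms an analyzer would produce for a piece of text; the analyzer choice and sample sentence below are just illustrative:

```
GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown Foxes!"
}
```

The standard analyzer splits on word boundaries, drops the punctuation, and lowercases the tokens, so the response should contain the terms `the`, `quick`, `brown`, and `foxes`.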
Elasticsearch ships with the following built-in analyzers:
- Standard Analyzer
- Simple Analyzer
- Whitespace Analyzer
- Stop Analyzer
- Keyword Analyzer
- Pattern Analyzer
- Language Analyzers
- Fingerprint Analyzer
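The same `_analyze` call works with any of the names above (in their lowercase form, e.g. `whitespace`, `stop`), which is a quick way to compare their behaviour; the sample sentence here is arbitrary:

```
GET /_analyze
{
  "analyzer": "whitespace",
  "text": "The Quick brown-Fox"
}

GET /_analyze
{
  "analyzer": "stop",
  "text": "The Quick brown-Fox"
}
```

The whitespace analyzer should keep the original case and split only on whitespace (`The`, `Quick`, `brown-Fox`), while the stop analyzer lowercases, splits on non-letters, and drops common English stop words, leaving roughly `quick`, `brown`, `fox`.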
Elasticsearch ships with the following built-in tokenizers:
Word Oriented Tokenizers
- Standard Tokenizer
- Letter Tokenizer
- Lowercase Tokenizer
- Whitespace Tokenizer
- UAX URL Email Tokenizer
- Classic Tokenizer
- Thai Tokenizer
Partial Word Tokenizers
- N-Gram Tokenizer
- Edge N-Gram Tokenizer
Structured Text Tokenizers
- Keyword Tokenizer
- Pattern Tokenizer
- Simple Pattern Tokenizer
- Char Group Tokenizer
- Simple Pattern Split Tokenizer
- Path Hierarchy Tokenizer
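A tokenizer can also be exercised on its own by passing `tokenizer` instead of `analyzer` to `_analyze`; as an illustrative example, the `ngram` tokenizer listed above chops the input into overlapping character grams (1 to 2 characters long by default):

```
GET /_analyze
{
  "tokenizer": "ngram",
  "text": "fox"
}
```

With the defaults this should return `f`, `fo`, `o`, `ox`, `x`; the word-oriented tokenizers such as `standard` or `whitespace` can be tested the same way.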
The built-in token filters include:
- Standard Token Filter
- ASCII Folding Token Filter
- Flatten Graph Token Filter
- Length Token Filter
- Lowercase Token Filter
- Uppercase Token Filter
- NGram Token Filter
- Edge NGram Token Filter
- Porter Stem Token Filter
- Shingle Token Filter
- Stop Token Filter
- Word Delimiter Token Filter
- Word Delimiter Graph Token Filter
- Multiplexer Token Filter
- Conditional Token Filter
- Predicate Token Filter Script
- Stemmer Token Filter
- Stemmer Override Token Filter
- Keyword Marker Token Filter
- Keyword Repeat Token Filter
- KStem Token Filter
- Snowball Token Filter
- Phonetic Token Filter
- Synonym Token Filter
- Synonym Graph Token Filter
- Compound Word Token Filters
- Reverse Token Filter
- Elision Token Filter
- Truncate Token Filter
- Unique Token Filter
- Pattern Capture Token Filter
- Pattern Replace Token Filter
- Trim Token Filter
- Limit Token Count Token Filter
- Hunspell Token Filter
- Common Grams Token Filter
- Normalization Token Filter
- CJK Width Token Filter
- CJK Bigram Token Filter
- Delimited Payload Token Filter
- Keep Words Token Filter
- Keep Types Token Filter
- Classic Token Filter
- Apostrophe Token Filter
- Decimal Digit Token Filter
- Fingerprint Token Filter
- MinHash Token Filter
- Remove Duplicates Token Filter
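Token filters are likewise easy to try out through `_analyze` by combining a tokenizer with a `filter` list; in this illustrative request the `lowercase` and `stop` filters post-process the tokens produced by the `standard` tokenizer:

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Brown Foxes"
}
```

The response should contain `quick`, `brown`, and `foxes`: after lowercasing, the stop filter removes "the" using its default English stop word list.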
The built-in character filters are:
- HTML Strip Character Filter
- Mapping Character Filter
- Pattern Replace Character Filter
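Character filters can be tested the same way by adding a `char_filter` list to `_analyze`; here `html_strip` removes the markup and decodes the entity before the text reaches the tokenizer (the sample markup is arbitrary):

```
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
```

The tokenizer therefore sees roughly `I'm so happy!` and should emit the tokens `I'm`, `so`, and `happy`.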
Custom analyzers
A custom analyzer can be configured with:
- tokenizer: the tokenizer to use
- char_filter: the character filters to apply
- filter: the token filters to apply
- position_increment_gap: the position gap inserted between consecutive values of a multi-valued field that uses this analyzer; defaults to 100
Example
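A minimal sketch of such a definition, assuming an index named `my_index` and an analyzer named `my_custom_analyzer` (both names, and the particular char filter, tokenizer, and token filters, are placeholders chosen for illustration):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"],
          "position_increment_gap": 100
        }
      }
    }
  }
}
```

Once the index exists, the analyzer can be inspected with `GET /my_index/_analyze` and `"analyzer": "my_custom_analyzer"`.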
Chinese analyzers
The analyzers built into Elasticsearch do not handle Chinese text very well; the IK Analyzer plugin can be used instead.
```
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "张三丰打太极"
}
```
IK Analyzer provides two analyzers: ik_max_word and ik_smart.
- ik_max_word: splits the text at the finest granularity, e.g. "张三丰打太极" becomes "张三丰 张三 三 丰 打 太极", exhausting the possible combinations; well suited to term queries.
- ik_smart: splits the text at the coarsest granularity, e.g. "张三丰打太极" becomes "张三丰 打 太极"; well suited to phrase queries.
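Assuming the IK plugin is installed, the two modes can be compared directly against the sample sentence from the bullets above; this request should return the coarser split 张三丰, 打, 太极:

```
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "张三丰打太极"
}
```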