
Analyzers

When a document is indexed, the analyzer extracts tokens from it to support index storage and search. An analyzer is made up of a single tokenizer and zero or more token filters.

Before the tokenizer runs, the input may go through some preprocessing, such as stripping HTML markup; these preprocessing steps are called character filters, and an analyzer can have zero or more of them. The tokenizer then breaks the character stream into a sequence of tokens.

Token filters further process the tokens produced by the tokenizer, for example converting them to lowercase. The processed results are called terms, and the number of times a term occurs in a document is its term frequency. The engine builds an inverted index from terms to the original documents, so a document can be located quickly by any of its terms.
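
All three stages can be exercised directly with the _analyze API by naming a character filter, a tokenizer, and token filters in one request. A minimal sketch (the sample text is made up for illustration):

GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>The QUICK Brown Foxes</p>"
}

The html_strip character filter removes the <p> tags, the standard tokenizer splits the remaining text into words, and the lowercase token filter turns them into the terms the, quick, brown, foxes.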

Elasticsearch ships with the following built-in analyzers (a quick test with the _analyze API follows the list):

  • Standard Analyzer
  • Simple Analyzer
  • Whitespace Analyzer
  • Stop Analyzer
  • Keyword Analyzer
  • Pattern Analyzer
  • Language Analyzers
  • Fingerprint Analyzer
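
Any of them can be tried out by name; for example, the Standard Analyzer (the sample text is illustrative):

GET /_analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes."
}

This should return the terms the, 2, quick, brown, foxes.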

Elasticsearch ships with the following built-in tokenizers (see the sketch after this list):

  • Word Oriented Tokenizers

    • Standard Tokenizer
    • Letter Tokenizer
    • Lowercase Tokenizer
    • Whitespace Tokenizer
    • UAX URL Email Tokenizer
    • Classic Tokenizer
    • Thai Tokenizer

  • Partial Word Tokenizers

    • N-Gram Tokenizer
    • Edge N-Gram Tokenizer

  • Structured Text Tokenizers

    • Keyword Tokenizer
    • Pattern Tokenizer
    • Simple Pattern Tokenizer
    • Char Group Tokenizer
    • Simple Pattern Split Tokenizer
    • Path Hierarchy Tokenizer
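
A tokenizer can be tested on its own, either by name or as an inline definition, again through the _analyze API. A minimal sketch with an inline Edge N-Gram tokenizer (the parameters are chosen purely for illustration):

GET /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 4,
    "token_chars": ["letter"]
  },
  "text": "Quick"
}

This should produce the tokens Qu, Qui, Quic.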

The built-in token filters include (an example follows the list):

  • Standard Token Filter
  • ASCII Folding Token Filter
  • Flatten Graph Token Filter
  • Length Token Filter
  • Lowercase Token Filter
  • Uppercase Token Filter
  • NGram Token Filter
  • Edge NGram Token Filter
  • Porter Stem Token Filter
  • Shingle Token Filter
  • Stop Token Filter
  • Word Delimiter Token Filter
  • Word Delimiter Graph Token Filter
  • Multiplexer Token Filter
  • Conditional Token Filter
  • Predicate Token Filter Script
  • Stemmer Token Filter
  • Stemmer Override Token Filter
  • Keyword Marker Token Filter
  • Keyword Repeat Token Filter
  • KStem Token Filter
  • Snowball Token Filter
  • Phonetic Token Filter
  • Synonym Token Filter
  • Synonym Graph Token Filter
  • Compound Word Token Filters
  • Reverse Token Filter
  • Elision Token Filter
  • Truncate Token Filter
  • Unique Token Filter
  • Pattern Capture Token Filter
  • Pattern Replace Token Filter
  • Trim Token Filter
  • Limit Token Count Token Filter
  • Hunspell Token Filter
  • Common Grams Token Filter
  • Normalization Token Filter
  • CJK Width Token Filter
  • CJK Bigram Token Filter
  • Delimited Payload Token Filter
  • Keep Words Token Filter
  • Keep Types Token Filter
  • Classic Token Filter
  • Apostrophe Token Filter
  • Decimal Digit Token Filter
  • Fingerprint Token Filter
  • MinHash Token Filter
  • Remove Duplicates Token Filter
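
Token filters are chained after a tokenizer, and the _analyze API also accepts inline filter definitions. A small sketch combining lowercase with an inline Stop filter (the stopword list is illustrative):

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    { "type": "stop", "stopwords": ["a", "the", "is"] }
  ],
  "text": "The Fox is Quick"
}

The lowercase filter runs first, so "The" and "is" are removed by the stop filter and the remaining terms are fox and quick.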

The built-in character filters are:

  • HTML Strip Character Filter
  • Mapping Character Filter
  • Pattern Replace Character Filter

Custom analyzers

A custom analyzer can configure:

  • tokenizer: the tokenizer to use
  • char_filter: the character filters to apply
  • filter: the token filters to apply
  • position_increment_gap: the number of positions inserted between the values of a multi-valued field that uses this analyzer; defaults to 100

Example:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",             // a custom analyzer
          "char_filter": [              // character filters, defined below
            "emoticons"
          ],
          "tokenizer": "punctuation",   // tokenizer, defined below
          "filter": [                   // token filters, defined below
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {                // the custom tokenizer
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {                  // the custom character filter
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {               // the custom token filter
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
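
Assuming these settings are supplied when the index is created (for example with PUT /my_index, where my_index is just a placeholder name), the analyzer can then be tested against that index:

GET /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}

With the definitions above this should yield the terms i'm, _happy_, person, you: the mapping character filter rewrites :) first, the pattern tokenizer splits on spaces and punctuation, and the lowercase and english_stop filters lowercase the tokens and drop the English stopwords a and and.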

Chinese analyzers

The analyzers built into Elasticsearch do not handle Chinese very well; the IK Analyzer plugin can be used instead:

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "张三丰打太极"
}

IK Analyzer provides two analyzers, ik_max_word and ik_smart:

  • ik_max_word: splits the text at the finest granularity; for example, "张三丰打太极" is broken into "张三丰 / 张三 / 三 / 丰 / 打 / 太极", exhausting the possible combinations. Suitable for term queries.
  • ik_smart: splits the text at the coarsest granularity; for example, "张三丰打太极" is broken into "张三丰 / 打 / 太极". Suitable for phrase queries.
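
A common arrangement that follows from this (assuming the IK plugin is installed; the index and field names below are placeholders) is to index a field with ik_max_word and analyze queries with ik_smart, so the index stores the fine-grained terms while searches are split coarsely:

PUT /news
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}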