Elasticsearch.Nest Tutorial Series 6-2 Analysis: Testing analyzers

  • This series is a "pseudo" translation of the official documentation (made more localized): not a literal rendering, but a "quasi" translation produced after reading and testing the original docs and digesting them into my own understanding. Where you disagree, or where the explanation falls short, comments and corrections are welcome :)

  • The official documentation is here: https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/introduction.html

  • Version environment for this series: ElasticSearch@7.3.1, NEST@7.3.1; the IDE and platform default to VS2019 and .NET Core 2.1


With the Analyze API, you can conveniently test both built-in and custom analyzers.

Testing built-in analyzers

The Analyze API lets you see how a built-in analyzer breaks a piece of text into tokens.

  • Using the standard analyzer

var analyzeResponse = _client.Indices.Analyze(a => a
    .Analyzer("standard")
    .Text("F# is THE SUPERIOR language :)")
);

The actual request sent is as follows:

POST /_analyze
{
  "analyzer": "standard",
  "text": [
    "F# is THE SUPERIOR language :)"
  ]
}

The response is as follows:

{
  "tokens": [
    {
      "token": "f",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 3,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "the",
      "start_offset": 6,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "superior",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "language",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
  • "F#" is tokenized down to "f".

  • The stop words is and the are still included.

When the request is sent through the NEST client, the response stream is deserialized into an AnalyzeResponse instance.

You can use it as follows:

if (analyzeResponse.IsValid)
{
    foreach (var analyzeToken in analyzeResponse.Tokens)
    {
        Console.WriteLine($"{analyzeToken.Token}");
    }
}
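
If you need more than the token text, the response exposes each token's offsets, position, and type as well. A minimal sketch (assuming the AnalyzeToken property names StartOffset, EndOffset, Position, and Type, which is how NEST surfaces the fields shown in the JSON above):

if (analyzeResponse.IsValid)
{
    foreach (var t in analyzeResponse.Tokens)
    {
        // e.g. "superior [10-18] pos=3 type=<ALPHANUM>"
        Console.WriteLine($"{t.Token} [{t.StartOffset}-{t.EndOffset}] pos={t.Position} type={t.Type}");
    }
}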

Testing with additional analyzer components

Add token filters to process the tokens further: lowercase them and remove stop words.

var analyzeResponse = _client.Indices.Analyze(a => a
    .Tokenizer("standard")
    .Filter("lowercase", "stop")
    .Text("F# is THE SUPERIOR language :)")
);

The request sent is as follows:

POST /_analyze
{
  "filter": [
    "lowercase",
    "stop"
  ],
  "text": [
    "F# is THE SUPERIOR language :)"
  ],
  "tokenizer": "standard"
}

The response is as follows:

{
  "tokens": [
    {
      "token": "f",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "superior",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "language",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
  • Character filters and token filters are applied in the order in which they are specified (see the sketch below).
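
To see that ordering in action, you can pass an ad-hoc character filter in the same request: character filters run before the tokenizer, token filters run after it, in the order listed. A minimal sketch (assuming AnalyzeDescriptor exposes a CharFilter method that takes filter names, mirroring the Filter method used above):

var analyzeResponse = _client.Indices.Analyze(a => a
    .CharFilter("html_strip") // character filters run first, before tokenization
    .Tokenizer("standard")
    .Filter("lowercase", "stop") // token filters run afterwards, in this order
    .Text("<b>F# is THE SUPERIOR language</b> :)")
);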

Testing custom analyzers

A custom analyzer can be created on an index either when the index is created, or by updating the settings of an existing index.

  • When adding one to an existing index, the index needs to be closed first.

Close the index:

_client.Indices.Close("analysis-index");

Update the settings to add the analyzer:

_client.Indices.UpdateSettings("analysis-index", i => i
    .IndexSettings(s => s
        .Analysis(a => a
            .CharFilters(cf => cf
                .Mapping("my_char_filter", m => m
                    .Mappings("F# => FSharp")
                )
            )
            .TokenFilters(tf => tf
                .Synonym("my_synonym", sf => sf
                    .Synonyms("superior, great") // use synonyms to enrich the tokenized output
                )
            )
            .Analyzers(an => an
                .Custom("my_analyzer", ca => ca
                    .Tokenizer("standard") // use the standard tokenizer
                    .CharFilters("my_char_filter") // use the custom character filter
                    .Filters("lowercase", "stop", "my_synonym")
                )
            )
        )
    )
);

Reopen the index and wait for it to become available:

_client.Indices.Open("analysis-index");
_client.Cluster.Health("analysis-index", h => h
    .WaitForStatus(WaitForStatus.Green)
    .Timeout(TimeSpan.FromSeconds(5))
);
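
Because close → update settings → reopen is a fixed sequence, it is worth verifying each response before moving on. A minimal sketch of the round trip with validity checks (the exception handling is illustrative; IsValid and DebugInformation are standard on NEST responses):

var closeResponse = _client.Indices.Close("analysis-index");
if (!closeResponse.IsValid)
    throw new Exception(closeResponse.DebugInformation);

// ... update the settings as shown above ...

var openResponse = _client.Indices.Open("analysis-index");
if (!openResponse.IsValid)
    throw new Exception(openResponse.DebugInformation);

var healthResponse = _client.Cluster.Health("analysis-index", h => h
    .WaitForStatus(WaitForStatus.Green)
    .Timeout(TimeSpan.FromSeconds(5))
);
if (!healthResponse.IsValid)
    throw new Exception(healthResponse.DebugInformation);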

Test the analyzer:

var analyzeResponse = _client.Indices.Analyze(a => a
    .Index("analysis-index")
    .Analyzer("my_analyzer")
    .Text("F# is THE SUPERIOR language :)")
);

The result is as follows:

{
  "tokens": [
    {
      "token": "fsharp",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "superior",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "great",
      "start_offset": 10,
      "end_offset": 18,
      "type": "SYNONYM",
      "position": 3
    },
    {
      "token": "language",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
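
  • "F#" was first rewritten to "FSharp" by my_char_filter and then lowercased, yielding the token "fsharp".

  • "great" appears at the same position as "superior" with type SYNONYM, injected by the my_synonym token filter.

  • The stop words is and the were removed by the stop filter.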

Testing an analyzer on a field

With the Analyze API, you can also test the analyzer applied to a particular field.

Assume the following index:

_client.Indices.Create("project-index", i => i
    .Settings(s => s
        .Analysis(a => a
            .CharFilters(cf => cf
                .Mapping("my_char_filter", m => m
                    .Mappings("F# => FSharp")
                )
            )
            .TokenFilters(tf => tf
                .Synonym("my_synonym", sf => sf
                    .Synonyms("superior, great")
                )
            )
            .Analyzers(an => an
                .Custom("my_analyzer", ca => ca
                    .Tokenizer("standard")
                    .CharFilters("my_char_filter")
                    .Filters("lowercase", "stop", "my_synonym")
                )
            )
        )
    )
    .Map<Project>(mm => mm
        .Properties(p => p
            .Text(t => t
                .Name(n => n.Name)
                .Analyzer("my_analyzer") // set the analyzer on the Name field
            )
        )
    )
);

The request sent is as follows:

PUT /project-index
{
  "mappings": {
    "properties": {
      "name": {
        "analyzer": "my_analyzer",
        "type": "text"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase",
            "stop",
            "my_synonym"
          ],
          "tokenizer": "standard",
          "type": "custom"
        }
      },
      "char_filter": {
        "my_char_filter": {
          "mappings": [
            "F# => FSharp"
          ],
          "type": "mapping"
        }
      },
      "filter": {
        "my_synonym": {
          "synonyms": [
            "superior, great"
          ],
          "type": "synonym"
        }
      }
    }
  }
}
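
The mapping above only requires a Project type with a Name property. A minimal sketch of such a POCO (the real Project class in the NEST documentation carries more properties):

public class Project
{
    public string Name { get; set; }
}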

Test the analyzer applied to the Name field:

var analyzeResponse = _client.Indices.Analyze(a => a
    .Index("project-index")
    .Field<Project, string>(f => f.Name)
    .Text("F# is THE SUPERIOR language :)")
);
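
The strongly typed Field<Project, string>(f => f.Name) resolves to "name" through NEST's field inference. Passing the field name as a plain string should work just as well, relying on Field's implicit conversion from string (a sketch, not from the original docs):

var analyzeResponse = _client.Indices.Analyze(a => a
    .Index("project-index")
    .Field("name") // same field, referenced by name instead of by lambda expression
    .Text("F# is THE SUPERIOR language :)")
);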

The request sent is as follows:

POST /project-index/_analyze
{
  "field": "name",
  "text": [
    "F# is THE SUPERIOR language :)"
  ]
}

The result is as follows:

{
  "tokens": [
    {
      "token": "fsharp",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "superior",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "great",
      "start_offset": 10,
      "end_offset": 18,
      "type": "SYNONYM",
      "position": 3
    },
    {
      "token": "language",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
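
Since the field-level test runs the same my_analyzer, it produces the same tokens as the index-level test, which makes it easy to close the loop with a programmatic check. A minimal sketch using LINQ (the expected values are taken straight from the response above):

var tokens = analyzeResponse.Tokens.Select(t => t.Token).ToList();

Console.WriteLine(tokens.Contains("fsharp")); // True: my_char_filter mapped F# => FSharp
Console.WriteLine(tokens.Contains("great"));  // True: injected by the my_synonym filter
Console.WriteLine(tokens.Contains("is"));     // False: removed by the stop filter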