Elasticsearch.Nest Series 6-2 Analysis: Testing Analyzers


With the Analyze API, you can easily test both built-in and custom analyzers.

Testing built-in analyzers

The Analyze API shows how a built-in analyzer breaks a piece of text into tokens.

  • Using the standard analyzer

    var analyzeResponse = _client.Indices.Analyze(a => a
      .Analyzer("standard") 
      .Text("F# is THE SUPERIOR language :)")
    );
    

The actual request sent is:

POST /_analyze
{
    "analyzer": "standard",
    "text": [
        "F# is THE SUPERIOR language :)"
    ]
}

The response is:

{
    "tokens": [
        {
            "token": "f",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "is",
            "start_offset": 3,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "the",
            "start_offset": 6,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "superior",
            "start_offset": 10,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "language",
            "start_offset": 19,
            "end_offset": 27,
            "type": "<ALPHANUM>",
            "position": 4
        }
    ]
}
  • “F#” is tokenized down to “f”.
  • The stop words “is” and “the” are still present.

When the request is issued through the NEST client, the response stream is deserialized into an AnalyzeResponse instance, which you can use as follows:

if (analyzeResponse.IsValid)
{
    foreach (var analyzeToken in analyzeResponse.Tokens)
    {
        Console.WriteLine($"{analyzeToken.Token}");
    }
}

Testing with additional analyzer components

Add token filters to further process the tokens: lowercase them and remove stop words.

var analyzeResponse = client.Indices.Analyze(a => a
    .Tokenizer("standard")
    .Filter("lowercase", "stop")
    .Text("F# is THE SUPERIOR language :)")
);

The request sent is:

POST /_analyze
{
    "filter": [
        "lowercase",
        "stop"
    ],
    "text": [
        "F# is THE SUPERIOR language :)"
    ],
    "tokenizer": "standard"
}

The response is:

{
    "tokens": [
        {
            "token": "f",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "superior",
            "start_offset": 10,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "language",
            "start_offset": 19,
            "end_offset": 27,
            "type": "<ALPHANUM>",
            "position": 4
        }
    ]
}
  • Character filters and token filters are applied in the order specified.
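Because the order matters, swapping the two token filters changes the result. A minimal sketch (same client and text as above; the described outcome assumes the default, case-sensitive English stop list):

```csharp
// Sketch: run "stop" before "lowercase" instead.
var reorderedResponse = client.Indices.Analyze(a => a
    .Tokenizer("standard")
    .Filter("stop", "lowercase") // order deliberately reversed
    .Text("F# is THE SUPERIOR language :)")
);
// "is" is dropped (already lowercase, so it matches the stop list),
// but "THE" passes the stop filter unmatched and is only lowercased
// afterwards, so "the" would remain in the output.
```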

Testing a custom analyzer

A custom analyzer can be created on an index either at index-creation time or by updating the settings of an existing index.

  • When adding an analyzer to an existing index, the index must be closed first.

Close the index:

_client.Indices.Close("analysis-index");

Update the settings to add the analyzer:

_client.Indices.UpdateSettings("analysis-index", i => i
    .IndexSettings(s => s
        .Analysis(a => a
            .CharFilters(cf => cf
                .Mapping("my_char_filter", m => m
                    .Mappings("F# => FSharp")
                )
            )
            .TokenFilters(tf => tf
                .Synonym("my_synonym", sf => sf
                    .Synonyms("superior, great") // use synonyms to enrich the token stream
                )
            )
            .Analyzers(an => an
                .Custom("my_analyzer", ca => ca
                    .Tokenizer("standard") // use the standard tokenizer
                    .CharFilters("my_char_filter") // use the custom character filter
                    .Filters("lowercase", "stop", "my_synonym")
                )
            )

        )
    )
);

Reopen the index:

_client.Indices.Open("analysis-index");
_client.Cluster.Health("analysis-index", h => h
    .WaitForStatus(WaitForStatus.Green)
    .Timeout(TimeSpan.FromSeconds(5))
);
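Each of these administrative calls can fail (for example, if the index does not exist), so it is worth checking the responses. A minimal sketch using NEST's IsValid flag and DebugInformation:

```csharp
// Sketch: fail fast if closing or reopening the index did not succeed.
var closeResponse = _client.Indices.Close("analysis-index");
if (!closeResponse.IsValid)
    throw new Exception(closeResponse.DebugInformation);

// ... update the settings here ...

var openResponse = _client.Indices.Open("analysis-index");
if (!openResponse.IsValid)
    throw new Exception(openResponse.DebugInformation);
```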

Test the analyzer:

var analyzeResponse = _client.Indices.Analyze(a => a
    .Index("analysis-index") 
    .Analyzer("my_analyzer")
    .Text("F# is THE SUPERIOR language :)")
);

The result is:

{
  "tokens": [
    {
      "token": "fsharp",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "superior",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "great",
      "start_offset": 10,
      "end_offset": 18,
      "type": "SYNONYM",
      "position": 3
    },
    {
      "token": "language",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
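Note that the injected token "great" has type SYNONYM and shares its position and offsets with "superior"; this is what allows queries to match either word. The response makes this easy to inspect, for example (continuing from the analyzeResponse above):

```csharp
// Print position, type, and token; the synonym appears at the same
// position as the original token it was derived from.
foreach (var token in analyzeResponse.Tokens)
{
    Console.WriteLine($"{token.Position,2}  {token.Type,-10}  {token.Token}");
}
```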

Testing an analyzer applied to a field

The Analyze API can also be used to test the analyzer applied to a specific field.

Assume the following index:

_client.Indices.Create("project-index", i => i
    .Settings(s => s
        .Analysis(a => a
            .CharFilters(cf => cf
                .Mapping("my_char_filter", m => m
                    .Mappings("F# => FSharp")
                )
            )
            .TokenFilters(tf => tf
                .Synonym("my_synonym", sf => sf
                    .Synonyms("superior, great")
                )
            )
            .Analyzers(an => an
                .Custom("my_analyzer", ca => ca
                    .Tokenizer("standard")
                    .CharFilters("my_char_filter")
                    .Filters("lowercase", "stop", "my_synonym")
                )
            )

        )
    )
    .Map<Project>(mm => mm
        .Properties(p => p
            .Text(t => t
                .Name(n => n.Name)
                .Analyzer("my_analyzer") // set the analyzer on the Name field
            )
        )
    )
);

The request sent is:

PUT /project-index
{
    "mappings": {
        "properties": {
            "name": {
                "analyzer": "my_analyzer",
                "type": "text"
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "char_filter": [
                        "my_char_filter"
                    ],
                    "filter": [
                        "lowercase",
                        "stop",
                        "my_synonym"
                    ],
                    "tokenizer": "standard",
                    "type": "custom"
                }
            },
            "char_filter": {
                "my_char_filter": {
                    "mappings": [
                        "F# => FSharp"
                    ],
                    "type": "mapping"
                }
            },
            "filter": {
                "my_synonym": {
                    "synonyms": [
                        "superior, great"
                    ],
                    "type": "synonym"
                }
            }
        }
    }
}

Test the analyzer applied to the Name field:

var analyzeResponse = _client.Indices.Analyze(a => a
    .Index("project-index")
    .Field<Project, string>(f => f.Name)
    .Text("F# is THE SUPERIOR language :)")
);

The request sent is:

POST /project-index/_analyze
{
    "field": "name",
    "text": [
        "F# is THE SUPERIOR language :)"
    ]
}

The result is:

{
    "tokens": [
        {
            "token": "fsharp",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "superior",
            "start_offset": 10,
            "end_offset": 18,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "great",
            "start_offset": 10,
            "end_offset": 18,
            "type": "SYNONYM",
            "position": 3
        },
        {
            "token": "language",
            "start_offset": 19,
            "end_offset": 27,
            "type": "<ALPHANUM>",
            "position": 4
        }
    ]
}

As the saying goes: by learning, we discover how little we know; by teaching, we discover where we are stuck.

I know the anxiety you feel; let's keep moving forward together :P

Follow or not as you like; depending on how hectic life gets, I'll periodically share development experience, the daily grind, pain points, and reflections.

You're welcome to follow my subscription account :P