Issue with sorting names that start with گ چ پ ژ #20

Open
AylinNaebzadeh opened this issue Sep 16, 2023 · 2 comments

Comments

@AylinNaebzadeh

When I sort my documents by name, documents whose names start with one of the characters گ چ پ ژ are not sorted correctly.
This is the index I created:

PUT index_persian_names_test_with_nariman_analyzer
{
  "mappings": {
    "properties": {
      "name": {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        },
        "analyzer": "persian_custom_analyzer"
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 5,
      "max_result_window": 5000,
      "analysis": {
        "analyzer": {
          "english_custom_analyzer": {
            "filter": [
              "lowercase",
              "decimal_digit"
            ],
            "tokenizer": "classic"
          },
          "persian_custom_analyzer": {
            "filter": [
              "lowercase",
              "decimal_digit",
              "parsi_normalizer"
            ],
            "char_filter": [
              "zero_width_spaces"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        },
        "char_filter": {
          "zero_width_spaces": {
            "type": "mapping",
            "mappings": [
              """\u200C => \u0020""",
              """\u200B => \u0020""",
              """\u200D => \u0020""",
              """\u200E => \u0020""",
              """\u200F => \u0020""",
              """\u001F => \u0020""",
              """\u00AC => \u0020"""
            ]
          }
        }
      },
      "number_of_replicas": 0
    }
  }
}

I added these documents:

POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "کرگدن"
}


POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "فیل"
}


POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "پاندا"
}


POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "قناری"
}


POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "گراز وحشی"
}


POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "ژیان"
}

POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "یوزپلنگ"
}

Finally, when I request the sorted results, the document whose name starts with پ should come first, but it does not.

GET index_persian_names_test_with_nariman_analyzer/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "name.keyword": {
        "order": "asc"
      }
    }
  ]
}

Here is the result:

{
  "took" : 624,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "F0qtn4oBMhBe8matcKHy",
        "_score" : null,
        "_source" : {
          "name" : "فیل"
        },
        "sort" : [
          "فیل"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "30qtn4oBMhBe8matgqHw",
        "_score" : null,
        "_source" : {
          "name" : "قناری"
        },
        "sort" : [
          "قناری"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "3Eqtn4oBMhBe8mateKHt",
        "_score" : null,
        "_source" : {
          "name" : "پاندا"
        },
        "sort" : [
          "پاندا"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "pkqtn4oBMhBe8matn6Jp",
        "_score" : null,
        "_source" : {
          "name" : "ژیان"
        },
        "sort" : [
          "ژیان"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "FUqtn4oBMhBe8matXqGp",
        "_score" : null,
        "_source" : {
          "name" : "کرگدن"
        },
        "sort" : [
          "کرگدن"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "4Eqtn4oBMhBe8mati6HX",
        "_score" : null,
        "_source" : {
          "name" : "گراز وحشی"
        },
        "sort" : [
          "گراز وحشی"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "qUqtn4oBMhBe8matpqKq",
        "_score" : null,
        "_source" : {
          "name" : "یوزپلنگ"
        },
        "sort" : [
          "یوزپلنگ"
        ]
      }
    ]
  }
}

I would be grateful for your help.
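
For what it's worth, the order in the result above matches plain Unicode code-point order. A `keyword` field sorts by the bytes of the stored term, and پ, چ, ژ and گ are encoded after the basic Arabic letters in the Unicode Arabic block, so they sort after letters such as ف and ق. A minimal Python sketch (not Elasticsearch code, only the names from this issue) that reproduces the same order:

```python
# Names from the issue, in the order they were indexed.
names = ["کرگدن", "فیل", "پاندا", "قناری", "گراز وحشی", "ژیان", "یوزپلنگ"]

# Python compares strings code point by code point, which yields the same
# order as comparing their UTF-8 bytes.
by_code_point = sorted(names)

# Print each name with the code point of its first character.
for name in by_code_point:
    print(f"U+{ord(name[0]):04X}  {name}")
```

In this order فیل and قناری come before پاندا, just as in the search response, which suggests the sort itself is behaving as designed and the missing piece is Persian-aware collation.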

@AylinNaebzadeh
Author

I have also tried the rebuilt Persian analyzer from the Elasticsearch documentation (linked below), but it does not work either.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#persian-analyzer
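
A possible reason that changing the analyzer makes no difference: analyzers apply to the `text` field, while the sort runs on `name.keyword`, which is not analyzed. One thing that may be worth checking is the `icu_collation_keyword` field type from the analysis-icu plugin, which is meant for language-aware sorting. As a stopgap, results could also be re-sorted on the client. The sketch below assumes a simplified 33-letter Persian ordering with آ placed before ا. `PERSIAN_ALPHABET` and `persian_sort_key` are hypothetical names, and Arabic variants such as ي and ك, diacritics, and hamza forms are not handled:

```python
# Assumed Persian dictionary order: آ first, then the 32 standard letters.
PERSIAN_ALPHABET = "آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی"
RANK = {ch: i for i, ch in enumerate(PERSIAN_ALPHABET)}


def persian_sort_key(text):
    """Build a sort key that ranks letters by their position in PERSIAN_ALPHABET.

    Characters outside the alphabet (spaces, digits, Latin letters) sort
    before all Persian letters, ordered among themselves by code point.
    """
    return [(1, RANK[ch]) if ch in RANK else (0, ord(ch)) for ch in text]


# Names from the issue, in the order they were indexed.
names = ["کرگدن", "فیل", "پاندا", "قناری", "گراز وحشی", "ژیان", "یوزپلنگ"]
persian_sorted = sorted(names, key=persian_sort_key)

for name in persian_sorted:
    print(name)
```

With this key پاندا comes first and ژیان second, which is the order the issue expects. Sorting on the client only works when the whole result set is fetched, so it does not replace index-side collation for paginated queries.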

@NarimanN2
Owner

Hi,
Thanks for bringing this up. I will see what I can do about it and, if possible, add it as a feature.
