
Problem using own analyzer configuration #30

Open
petersiman opened this issue Jul 14, 2015 · 5 comments

@petersiman

Hi, I am trying to set up Liferay with Elasticsearch and use hunspell as the analyzer for the Czech language. I have set up the index with the following analyzer definition:

PUT /liferay_0
{
   "settings": {
      "analysis": {
         "analyzer": {
            "cestina_hunspell": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "stopwords_CZ",
                  "cs_CZ",
                  "icu_folding",
                  "stopwords_CZ",
                  "remove_duplicities"
               ]
            }
         },
         "filter": {
            "stopwords_CZ": {
               "type": "stop",
               "stopwords": [
                  "právě",
                  "že",
                  "_czech_"
               ],
               "ignore_case": true
            },
            "cs_CZ": {
               "type": "hunspell",
               "locale": "cs_CZ",
               "dedup": true,
               "recursion_level": 0
            },
            "remove_duplicities": {
               "type": "unique",
               "only_on_same_position": true
             }
           }
      }
   }
}

It seems to work on Czech text when I call the analyzer through the REST API:

curl 'localhost:9200/i/_analyze?analyzer=cestina_hunspell&pretty=true' -d 'Právě se mi zdálo, že se kolem okna něco mihlo.'

I get tokens:

  • zdát
  • kolem, kolo
  • okno
  • něco
  • mihnout

which are the wanted tokens.

But when I try to search web content containing such text (indexed after the new settings), I don't get the right results (I have to provide the exact word to get a result).

Any ideas what could cause this behaviour?

Thanks.

@ajay-kottapally
Contributor

Hi,

Elasticray creates its indices in Liferay with the name liferay_{companyId}, so you must apply the settings not just to liferay_0 but to every liferay_{companyId}. Alternatively, use Elasticsearch index templates and apply a template to liferay_*.
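For reference, such a template could be registered like this (a sketch against the legacy Elasticsearch 1.x `_template` API that was current at the time; the template name `liferay_custom_analysis` and the `order` value are assumptions, and the analysis settings simply mirror the ones from the issue above):

```shell
# Register an index template that applies the custom analysis settings
# to every index whose name matches liferay_*. The template name
# "liferay_custom_analysis" is an example, not an Elasticray convention.
curl -XPUT 'localhost:9200/_template/liferay_custom_analysis' -d '{
  "template": "liferay_*",
  "order": 1,
  "settings": {
    "analysis": {
      "analyzer": {
        "cestina_hunspell": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["stopwords_CZ", "cs_CZ", "icu_folding",
                     "stopwords_CZ", "remove_duplicities"]
        }
      },
      "filter": {
        "stopwords_CZ": {
          "type": "stop",
          "stopwords": ["právě", "že", "_czech_"],
          "ignore_case": true
        },
        "cs_CZ": {
          "type": "hunspell",
          "locale": "cs_CZ",
          "dedup": true,
          "recursion_level": 0
        },
        "remove_duplicities": {
          "type": "unique",
          "only_on_same_position": true
        }
      }
    }
  }
}'
```

Note that a template only affects indices created after it is registered; existing documents are not re-analyzed, so a reindex is still needed.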

@ajay-kottapally
Contributor

Sorry, closed by mistake.

@petersiman
Author

Hi,
thanks for the quick reply. I have applied the settings to the indices of every company in Liferay (I chose the liferay_0 index only for illustration). However, I think the problem might be in the dynamic mapping by language defined in this file: https://github.com/R-Knowsys/elasticray/blob/master/webs/elasticray-web/docroot/WEB-INF/classes/com/rknowsys/portal/search/elastic/template.json. Is there any other way (some configuration) to bypass the

{
    "cs": {
        "match": "*_cs*",
        "match_mapping_type": "string",
        "mapping": {
            "type": "string",
            "analyzer": "czech"
        }
    }
}

mapping? Or do I have to modify this file and re-deploy the package?

@ajay-kottapally
Contributor

I am afraid we don't have a configuration for this. You can change the file, redeploy the package, and reindex, or you can:

  1. Delete the templates named liferay_template*.
  2. Then apply your own template and reindex.
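The two steps above might look like this (a sketch; the dynamic-template body mirrors the `cs` entry from template.json with the analyzer swapped to `cestina_hunspell`, and the exact names matched by `liferay_template*` as well as the replacement name `liferay_template_custom` are assumptions):

```shell
# 1. Delete the templates installed by Elasticray (assumed here to
#    match liferay_template*; the 1.x API accepts wildcard deletes).
curl -XDELETE 'localhost:9200/_template/liferay_template*'

# 2. Register a replacement template whose dynamic mapping routes
#    *_cs* string fields to the custom hunspell analyzer, then reindex.
curl -XPUT 'localhost:9200/_template/liferay_template_custom' -d '{
  "template": "liferay_*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "cs": {
            "match": "*_cs*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "analyzer": "cestina_hunspell"
            }
          }
        }
      ]
    }
  }
}'
```

The analyzer named in the mapping must also exist in the index settings (for example via the analysis block shown earlier in this thread), or index creation will fail.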

@petersiman
Author

Thank you for the reply. After defining my own analyzer in the dynamic template, the search worked well.
