lucene analyzer #4155

Closed
martingg88 opened this issue May 14, 2015 · 15 comments

@martingg88

Is it possible to apply multiple analyzers to a single field to cater for i18n?

As far as I know, each language may need its own analyzer for fuzzy search to return the most accurate results. So what strategy would apply to the following example?

i18n class

@Rid    title    locale
#3:11   testing  en-US
#3:12   测试     zh-CN
#3:13   ujian    ms-MY

Given the table above, how can I index "title" with the right analyzer for each record, so that fuzzy search follows the rules of that record's language?

@martingg88
Author

Any ideas about this question?

@wolf4ood
Member

Hi @martingg88, this is not possible.
When you create an index with an analyzer, that analyzer is applied to all the fields involved in the index.

@martingg88
Author

So what is a good solution for handling a field that holds multiple languages?

@wolf4ood
Member

How many languages do you have?

@martingg88
Author

I may need to support more than 10 languages on my website.

@martingg88
Author

It's really hard to maintain one index per language when there are many languages to support. Any idea what the best solution is for the case above?

@martingg88
Author

Any update on this issue?

@smolinari
Contributor

@maggiolo00 answered you above. It isn't possible with ODB. Also, having a single field with multiple languages is the worst solution of them all. Here is a list of the downsides (taken from the Lucidworks docs).

  • Requires Language Detection software, which will slow down indexing
  • Requires the query language to be specified beforehand, since language detection on queries is often inaccurate
  • May return irrelevant results, since words may have the same spelling but different meanings in different languages
  • May skew relevancy statistics
  • Hard to filter/search by language

The most common solution is to use a separate field for each language and index each field with the analyzer for that language.

Scott
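
As a rough illustration of the one-field-per-language approach, the sketch below routes each row of the table from the original question into its own field. The class name `PerLanguageFields`, the helper names, and the `title_<language>` naming scheme are assumptions made for this example, not anything OrientDB or Lucene prescribes.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: routes each translated title into its own per-language field.
final class PerLanguageFields {

    // Builds a field name such as "title_en" from a base name and a BCP 47 tag such as "en-US".
    static String fieldFor(String base, String localeTag) {
        String language = Locale.forLanguageTag(localeTag).getLanguage();
        if (language.isEmpty()) {
            throw new IllegalArgumentException("Unrecognized locale tag: " + localeTag);
        }
        return base + "_" + language;
    }

    // Collapses (title, locale) rows into one document with one field per language.
    static Map<String, String> toDocument(String[][] rows) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (String[] row : rows) {
            doc.put(fieldFor("title", row[1]), row[0]);
        }
        return doc;
    }

    public static void main(String[] args) {
        // Rows taken from the table in the original question.
        String[][] rows = {
            {"testing", "en-US"},
            {"测试", "zh-CN"},
            {"ujian", "ms-MY"},
        };
        System.out.println(toDocument(rows));
    }
}
```

With fields named this way, the query side only needs to know the user's language to pick the field to search, which sidesteps language detection at query time.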

@wolf4ood
Member

@martingg88 I'm marking this as an enhancement.

@wolf4ood removed the question label Sep 22, 2015
@wolf4ood added this to the 3.0 milestone Sep 22, 2015
@martingg88
Author

ok. thanks.

@smolinari
Contributor

Sorry to be a pain, but what is the feature being requested? As I understood @martingg88, he wants a single class with a single field, but with multiple analyzers working on that single field depending on the locale value stored in each document/vertex. Can ODB actually fulfill that request?

Scott

@wolf4ood
Member

@smolinari @martingg88 Sorry, you are right.
I was thinking about this:
https://lucene.apache.org/core/4_7_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/PerFieldAnalyzerWrapper.html
That would be a nice feature to have.
But it is slightly different from the request and will not cover @martingg88's single-field use case, since only one analyzer can be declared for a given field.
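
To make the per-field idea concrete, here is a toy stand-in written with the Java standard library only. It is not Lucene's API: each "analyzer" is just a function from text to tokens, and the class `PerFieldDispatch` and its two simplified analyzers are invented for illustration. What it tries to show is the dispatch rule: look up the analyzer registered for a field name and fall back to a default when none is registered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Toy stand-in for per-field analyzer dispatch. This is NOT Lucene's API;
// each "analyzer" here is just a function from text to tokens.
final class PerFieldDispatch {

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> byField;

    PerFieldDispatch(Function<String, List<String>> defaultAnalyzer,
                     Map<String, Function<String, List<String>>> byField) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.byField = new HashMap<>(byField);
    }

    // Picks the analyzer registered for the field, or the default if none is registered.
    List<String> analyze(String field, String text) {
        return byField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Simplified Latin-script analyzer: lowercase, then split on whitespace.
    static List<String> lowercaseWords(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Simplified CJK analyzer: one token per code point, skipping whitespace.
    static List<String> singleCharacters(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> byField = new HashMap<>();
        byField.put("title_zh", PerFieldDispatch::singleCharacters);
        PerFieldDispatch analyzers =
            new PerFieldDispatch(PerFieldDispatch::lowercaseWords, byField);

        // "title_de" has no dedicated analyzer, so the default is used.
        System.out.println(analyzers.analyze("title_de", "Der weisse Hai"));
        System.out.println(analyzers.analyze("title_zh", "测试"));
    }
}
```

Because the lookup key is the field name, a single field can only ever resolve to one analyzer, which is the limitation described above.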

@smolinari
Contributor

So, with this per-field wrapper, we'd have to save each translation in its own field? Like

title_en: Jaws
title_de: Der weisse Hai
title_fr: Les dents de la mer
title_gr: Τα σαγόνια του καρχαρία

Scott

@wolf4ood
Member

Yes, using a single index instead of four.

@lvca assigned robfrank and unassigned wolf4ood May 13, 2016
@robfrank modified the milestones: 2.2.0 GA, 3.0 May 16, 2016
@robfrank
Contributor

On 2.2.x you can select an analyzer for each field:
http://orientdb.com/docs/last/Full-Text-Index.html#analyzer

But if you are working in a multi-language environment, as @smolinari suggested, the best way is to have a separate field for each language. There's no other way, as far as I know.
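
Combining both suggestions, the sketch below assembles a `CREATE INDEX` statement that assigns one Lucene analyzer class to each per-language field. The analyzer class names are real Lucene classes, but several things are assumptions: the `Movie` class, the index name, and above all the `<field>_index` / `<field>_query` metadata key format, which should be checked against the Full-Text Index documentation linked above before use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch: assembles a CREATE INDEX statement with one analyzer per language field.
// The "<field>_index" / "<field>_query" metadata keys are an ASSUMPTION; confirm
// them against the OrientDB Full-Text Index documentation before use.
final class LuceneIndexStatement {

    static String build(String className, Map<String, String> analyzerByField) {
        StringJoiner metadata = new StringJoiner(", ", "{", "}");
        for (Map.Entry<String, String> e : analyzerByField.entrySet()) {
            // Use the same analyzer at index time and at query time.
            metadata.add("\"" + e.getKey() + "_index\": \"" + e.getValue() + "\"");
            metadata.add("\"" + e.getKey() + "_query\": \"" + e.getValue() + "\"");
        }
        String fields = String.join(", ", analyzerByField.keySet());
        // The index name "<class>.title" is a hypothetical choice for this example.
        return "CREATE INDEX " + className + ".title ON " + className + "(" + fields + ")"
            + " FULLTEXT ENGINE LUCENE METADATA " + metadata;
    }

    public static void main(String[] args) {
        // Field names follow the title_xx example from earlier in the thread.
        Map<String, String> analyzers = new LinkedHashMap<>();
        analyzers.put("title_en", "org.apache.lucene.analysis.en.EnglishAnalyzer");
        analyzers.put("title_de", "org.apache.lucene.analysis.de.GermanAnalyzer");
        analyzers.put("title_fr", "org.apache.lucene.analysis.fr.FrenchAnalyzer");
        analyzers.put("title_gr", "org.apache.lucene.analysis.el.GreekAnalyzer");
        System.out.println(build("Movie", analyzers));
    }
}
```

Generating the statement from a map keeps the list of supported languages in one place, which may ease the maintenance concern raised earlier when more than ten languages are involved.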
