Merging the terms from multiple sub-analyzers #1128
Comments
Consider this issue a pre-pull-request. The current implementation should be independent of the sub-analyzers used. |
Here is the patch I proposed to the Lucene community: We'll see how it goes. |
This is the solution to a multilingual "_all" field. Can't wait for it. |
This seems nice. Is the analyzer field/path available with your combo plugin? |
It is available as a standalone plugin now, see: https://github.com/yakaz/elasticsearch-analysis-combo. @slorber: Of course! The analyzer used for a field can be controlled using the analyzer field, then the analyzer is called, fed with some data. So any analyzer can be used with this feature. |
Hello, Thanks, yes it's obvious it can be used for the _analyzer field since your combo is... an analyzer... Thus I guess I just need to create a combo analyzer for each language instead of the classic "one analyzer per language". Btw I've had quite an appropriate result using multi fields, but I think it's a pain, and you noticed that too. Here's my mapping: The pain is:
Did you also notice that? And how does your combo analyzer solve these problems?
And the most important:
|
Storing a field means storing the original content. This content is then available for display (highlighting). This has little to do with the combo analyzer. Yes, if tokens get repeated by the combo analyzer, they take more space, but only for references, positions, frequencies for scoring, and the like, not in the dictionary (the index is inverted!), so this is negligible. During a Lucene search, the analyzer for the field transforms the query words into tokens, which are then matched against the documents in the index. It is always recommended to use the same analyzer for indexing and for search. Otherwise your search results become unpredictable. This also holds for the combo analyzer. The situation is more relaxed, though, as you will mostly get results if you just use a sub-analyzer on the combo-analyzed field. If you would like to follow up, I recommend asking questions on the Elasticsearch mailing list, because not everybody is able to monitor the GitHub issue tracker for interesting discussions. More info: https://groups.google.com/group/elasticsearch |
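To make the point about the inverted index concrete, here is a minimal Python sketch. It is a toy model, not Lucene's actual data structures, and `build_inverted_index`, `index_stats`, and `lookup` are hypothetical names made up for this illustration. It assumes that a combo analyzer emits the term "fast" twice at the same position and adds the stem "car" next to "cars".

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to its postings: a list of (doc_id, position) pairs.

    `docs` maps a document id to the (position, term) pairs an analyzer emitted.
    """
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for position, term in tokens:
            index[term].append((doc_id, position))
    return dict(index)


def index_stats(index):
    """Count dictionary entries (unique terms) and postings separately."""
    return {
        "dictionary_terms": len(index),
        "postings": sum(len(postings) for postings in index.values()),
    }


def lookup(index, term):
    """Return the ids of the documents whose postings contain `term`."""
    return {doc_id for doc_id, _ in index.get(term, [])}


# Output of a single simple analyzer for the text "fast cars".
plain_docs = {1: [(0, "fast"), (1, "cars")]}
# Assumed output of a combo analyzer: "fast" is repeated, "car" is added.
combo_docs = {1: [(0, "fast"), (0, "fast"), (1, "cars"), (1, "car")]}

plain_index = build_inverted_index(plain_docs)
combo_index = build_inverted_index(combo_docs)

print(index_stats(plain_index))
print(index_stats(combo_index))
print(lookup(combo_index, "car"))
```

In this toy model the repeated "fast" costs one extra posting but no extra dictionary entry, and a query analyzed by the stemming sub-analyzer alone ("car") still matches the combo-analyzed field.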
Thanks. By chance do you know if it's possible to embed your plugin in unit tests? |
Plugins can be tested, sure, with testng/surefire/junit... the jar and the deps must be on the classpath. |
Thanks, didn't know it was so easy, I thought we would have to deal with the plugin path property or something... |
So, this was closed because it is never being implemented in elastic? |
The proposed patch has never been integrated into Lucene. |
@nickminutello the reason we never implemented it was that we think it is a bad idea to mix analysis chains like this. |
@nickminutello note that we have been using the plugin in production since 2012 and it has worked well so far |
The combo analyzer has also been in production here since 2012 and we could not live without it. At least Elasticsearch uses the KeywordRepeatFilter #2753 |
@jprante the existence of a feature doesn't make it a good idea: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/stemming-in-situ.html#stemming-in-situ |
I see the points, but there are workarounds:
So, when strategies exist to work around the effects, mixing tokens from multiple analyzers is still a good idea, especially for multi language search. Many applications here use this, with success. |
@jprante what are you doing now in the 5.x versions, since that original yakaz plugin was never updated? |
@apatrida in the meantime, I could reorganize my simple use case to a more complex token filter chain, and I dropped multiple language analysis support in favor of ICU case folding, which is not a full substitution though. After the But if analyzer chaining is still the only possible method for some use cases, I may find time to try to implement such an über-analyzer for ES 5.x. |
@jprante I'm in the same situation now, with filter chains, but I do run into issues like the one someone mentioned on one of your projects, where you might want to protect a keyword from the next link in the chain, and yet want the rest of the chain to process that token. (Really, just adding exception lists to some of the plugins, like the decompounder, would solve this.) I'll hop over to your |
Lucene has a |
@s1monw but that blocks all future items in the chain from processing it, not just the next link in the chain, yes? The issue I was referring to would be better solved with an exclude list in his decompounder, because the rest of the chain needs to process the token, just not the decompounder. |
The way token filters work is that you can chain them, so you can also add one that resets keyword attributes. I think stuff like this should be addressed in a pluggable fashion, otherwise you just end up with legacy issues. Also, it seems not related to ES, so I wonder if you wanna discuss this on the repos where that langdetect is maintained? |
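The idea of marking a token as a keyword, skipping one filter, and then resetting the flag so the rest of the chain still processes the token can be sketched as follows. This is a toy Python model under stated assumptions, not the Lucene API: `Token`, `mark_keywords`, `keyword_aware`, `reset_keywords`, `toy_decompound`, and `lowercase` are all hypothetical names, and the "decompounder" simply splits on hyphens.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    term: str
    keyword: bool = False  # toy stand-in for a keyword attribute


def mark_keywords(protected):
    """Build a filter that flags every token whose term is in `protected`."""
    def apply(tokens):
        return [replace(t, keyword=True) if t.term in protected else t
                for t in tokens]
    return apply


def keyword_aware(expand):
    """Wrap a term -> list-of-terms function so it skips flagged tokens."""
    def apply(tokens):
        out = []
        for t in tokens:
            if t.keyword:
                out.append(t)  # protected: pass through untouched
            else:
                out.extend(replace(t, term=term) for term in expand(t.term))
        return out
    return apply


def reset_keywords(tokens):
    """Clear the keyword flag so later filters process every token again."""
    return [replace(t, keyword=False) for t in tokens]


def toy_decompound(term):
    # Assumption for illustration only: "decompounding" means splitting on "-".
    return term.split("-")


def lowercase(term):
    return [term.lower()]


def run_chain(terms, filters):
    """Feed the terms through each filter in order and return the final terms."""
    tokens = [Token(term) for term in terms]
    for token_filter in filters:
        tokens = token_filter(tokens)
    return [t.term for t in tokens]


with_reset = [mark_keywords({"Wi-Fi"}), keyword_aware(toy_decompound),
              reset_keywords, keyword_aware(lowercase)]
without_reset = [mark_keywords({"Wi-Fi"}), keyword_aware(toy_decompound),
                 keyword_aware(lowercase)]

print(run_chain(["Wi-Fi", "Text-Search"], with_reset))
print(run_chain(["Wi-Fi", "Text-Search"], without_reset))
```

With the reset step, "Wi-Fi" skips only the toy decompounder and is still lowercased. Without it, the flag also blocks every later keyword-aware filter, which is the concern raised above.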
@s1monw sure, I was writing here to get alternatives written that you might use instead of what was originally presented (sub-analyzers), then rejected, in this issue. Google leads here, and now this topic gives some alternatives from some of those who originally backed that idea. |
Multi-field is great, but searching with multiple analyzers against only one field is simpler/better.
If you have a multi-lingual index, where each document has its source language, you can analyze the text fields using a special analyzer, based on the detected language (maybe even using the _analyzer.path functionality). But what happens when you misdetected the language somehow, either at index- or at query-time? Some aggressive stemming can have devastating effects.
In such a scenario, having the original words indexed in parallel to the stemmed ones would help. Keeping them in the same field would even let phrase/slop queries work properly.
The only way to get multiple terms at the same position with ElasticSearch is through the synonym token filter, useless for stemming.
I've been working on a way to merge the terms that multiple analyzers output.
Say you want to use both a simple analyzer and any of the specialized language-specific analyzers, or anything else.
My plugin can make it as simple as the following index setting:
Here is a simple example of what it does:
Terms are sorted by position, then by start/end offset, so that the output can be consumed under the same reasonable assumptions as the output of a classical analyzer.
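The merging and ordering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's implementation: `Token`, `whitespace_analyzer`, `toy_stemming_analyzer`, and `combine` are hypothetical names, the "stemmer" just lowercases and drops a trailing "s", and the `deduplicate` flag is an assumption added for the example.

```python
from typing import NamedTuple


class Token(NamedTuple):
    position: int
    start: int
    end: int
    term: str


def whitespace_analyzer(text):
    """Split on whitespace, recording the position and offsets of each word."""
    tokens = []
    offset = 0
    for position, word in enumerate(text.split()):
        start = text.index(word, offset)
        end = start + len(word)
        tokens.append(Token(position, start, end, word))
        offset = end
    return tokens


def toy_stemming_analyzer(text):
    """Toy stand-in for a language analyzer: lowercase, drop one trailing 's'."""
    stemmed = []
    for token in whitespace_analyzer(text):
        term = token.term.lower()
        if len(term) > 1 and term.endswith("s"):
            term = term[:-1]
        stemmed.append(token._replace(term=term))
    return stemmed


def combine(text, analyzers, deduplicate=False):
    """Merge the output of all analyzers, sorted by position, then offsets."""
    merged = []
    for analyze in analyzers:
        merged.extend(analyze(text))
    # sorted() is stable, so ties keep the order of the analyzers list.
    merged = sorted(merged, key=lambda t: (t.position, t.start, t.end))
    if deduplicate:
        # Keep only the first occurrence of each identical token.
        merged = list(dict.fromkeys(merged))
    return merged


tokens = combine("Quick Foxes", [whitespace_analyzer, toy_stemming_analyzer])
print([(t.position, t.term) for t in tokens])
```

Both variants of each word end up at the same position in one stream, which is what lets phrase and slop queries keep working on a single field.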
Here is the good news! You can find my implementation here: https://github.com/ofavre/elasticsearch/tree/combo-analyzer-v0.16.4 (based on released ElasticSearch version 0.16.4).
EDIT: It is finally available as a plugin, thanks to jprante: https://github.com/yakaz/elasticsearch-analysis-combo.