Merging the terms from multiple sub-analyzers #1128

Closed
ofavre opened this issue Jul 18, 2011 · 24 comments

@ofavre

ofavre commented Jul 18, 2011

Multi-field is great, but searching with multiple analyzers against only one field is simpler/better.
If you have a multi-lingual index, where each document has its source language, you can analyze the text fields using a special analyzer, based on the detected language (maybe even using the _analyzer.path functionality).
But what happens when you misdetect the language somehow, either at index time or at query time? Aggressive stemming can have devastating effects.

In such a scenario, having the original words indexed in parallel with the stemmed ones would help. Having them in the same field would even let phrase/slop queries work properly.
The only way to get multiple terms at the same position with ElasticSearch is through the synonym token filter, useless for stemming.

I've been working on a way to merge the terms that multiple analyzers output.
Say you want to use both a simple analyzer and any of the specialized language-specific analyzers, or any other combination.
My plugin can make it as simple as the following index setting:

index:
  analysis:
    analyzer:
      # An analyzer using both the "simple" analyzer and the sophisticated "english" analyzer, combining the resulting terms
      combo_en:
        type: combo
        sub_analyzers: [simple, english]

Here is a simple example of what it does:

# What the "simple" analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=simple' -d 'An example'
{
  "tokens" : [ {
    "token" : "an",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "example",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  } ]
}
# What the "english" analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=english' -d 'An example'
{
  "tokens" : [ {
    "token" : "exampl",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

# Now what our combined analyzer outputs
curl -XGET 'localhost:9200/yakaz/_analyze?pretty=1&analyzer=combo_en' -d 'An example'
{
  "tokens" : [ {
    "token" : "an",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "example",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "exampl",
    "start_offset" : 3,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

Terms are sorted by position, then by start/end offset, so that the output is easier to consume under the reasonable assumptions one makes about a classical analyzer.
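
To make the ordering rule concrete, here is a minimal Python sketch of the merge step. It is not the plugin's actual Java code: the `Token` tuple and the `merge_tokens` helper are hypothetical names used only for illustration, and the input tokens are copied from the `_analyze` outputs above.

```python
from typing import NamedTuple


class Token(NamedTuple):
    # Hypothetical token record mirroring the fields shown by _analyze.
    term: str
    start_offset: int
    end_offset: int
    type: str
    position: int


def merge_tokens(*streams):
    """Merge several token lists, ordered by position, then start/end offset."""
    merged = [tok for stream in streams for tok in stream]
    # sorted() is stable, so tokens that tie keep the sub-analyzer order.
    return sorted(merged, key=lambda t: (t.position, t.start_offset, t.end_offset))


# Tokens taken from the "simple" and "english" outputs above.
simple = [Token("an", 0, 2, "word", 1), Token("example", 3, 10, "word", 2)]
english = [Token("exampl", 3, 10, "<ALPHANUM>", 2)]

combined = merge_tokens(simple, english)
print([t.term for t in combined])  # ['an', 'example', 'exampl']
```

Both "example" and "exampl" end up at position 2, which is what lets phrase and slop queries keep working on the combined field.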

Here is the good news! You can find my implementation here: https://github.com/ofavre/elasticsearch/tree/combo-analyzer-v0.16.4 (based on released ElasticSearch version 0.16.4).

EDIT: It is finally available as a plugin, thanks to jprante: https://github.com/yakaz/elasticsearch-analysis-combo.

@ofavre
Author

ofavre commented Jul 18, 2011

Consider this issue a pre-pull-request.

The current implementation should be independent of the used sub-analyzers.
However, I used some tricks to clone a Reader in optimized ways. I think some of those hacks should be integrated into Lucene's core, with the combo analyzer living as a contrib and a wrapper provided in ElasticSearch.

@ofavre
Author

ofavre commented Aug 26, 2011

Here is the patch I proposed to the Lucene community:
https://issues.apache.org/jira/browse/LUCENE-3392

We'll see how it goes.

@jprante
Contributor

jprante commented Nov 28, 2011

This is the solution to a multilingual "_all" field. Can't wait for it.

@slorber

slorber commented Jul 3, 2012

This seems nice.

Is the analyzer field/path available with your combo plugin?

@ofavre
Author

ofavre commented Jul 5, 2012

It is available as a standalone plugin now, see: https://github.com/yakaz/elasticsearch-analysis-combo.

@slorber: Of course! The analyzer used for a field can be controlled using the analyzer field, then the analyzer is called, fed with some data. So any analyzer can be used with this feature.
And this technique is indeed useful when you have a language field and want to combine a language-dependent analyzer (english, spanish, french, etc.) with a language-agnostic analyzer (simple, whitespace, etc.), just in case you misdetected the language in the first place.

@slorber

slorber commented Jul 5, 2012

Hello,

Thanks, yes it's obvious it can be used for the _analyzer field since your combo is... an analyzer... So I guess I just need to create a combo analyzer for each language instead of the classic "one analyzer per language".

Btw I've had quite good results using multi fields, but I think it's a pain, and you noticed that too.
Perhaps you can tell me how it works with multi fields? I think it's not "obvious" how store and highlights work on a multi field.

Here's my mapping:
https://gist.github.com/3053540

The pain is:

  • I need to use a boolean/text search on these 3 fields
  • I need to store=yes for all of them or i can't get any highlight
  • I need to add highlights for the 3 fields or i only get the highlights when it was matched by the field analyzer
  • My highlight map has now 3 fields and i must select/merge the most appropriate one (exact match > stemming > ngrams for me)

Did you also notice that?
When using store=yes for all subfields, are they stored as duplicates in ES?

And how does your combo analyzer solve these problems?

  • I will have only 1 field so only one store=true -> nice
  • But what will be the behavior of highlights?
  • What if 2 analyzers produce the same tokens: do they consume twice the token space in my index, or are they merged?
  • How will search behave? What kind of analysis will be performed on the search text for that field before trying to find matches?

And the most important:

  • Would you use that in production?
  • How "hacky" is your solution and is there an elegant integration with Lucene/ES planned?

@jprante
Contributor

jprante commented Jul 5, 2012

Storing a field means storing the original content. This content is then available for display (highlighting). This has not much to do with the combo analyzer.

Yes, tokens that get repeated by the combo analyzer take more space, but only for references, positions, frequencies for scoring, and the like, not in the dictionary (the index is inverted!), so this is negligible.

During a Lucene search, the query words are transformed into tokens for matching documents in the index by the analyzer for the field. It is always recommended to use the same analyzer for indexing and for search. Otherwise your search results become unpredictable. This also holds for the combo analyzer. The situation is more relaxed, though, as you will mostly get results if you just use a sub-analyzer on the combo-analyzed field.
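
As a rough illustration of why a single sub-analyzer at query time "mostly" still matches, here is a small Python sketch. Everything in it is a toy assumption: `simple_analyze` and `toy_english_analyze` are hypothetical stand-ins (the second one only drops a few stopwords and strips a trailing "e", "s" or "es", which is far cruder than the real english analyzer), and `matches` just checks that every query term was indexed.

```python
import re

STOPWORDS = {"a", "an", "the"}  # tiny illustrative list, not Lucene's


def simple_analyze(text):
    """Toy stand-in for the "simple" analyzer: lowercase, split on non-letters."""
    return re.findall(r"[a-z]+", text.lower())


def toy_english_analyze(text):
    """Toy stand-in for "english": drop stopwords, strip a trailing e/s/es."""
    return [re.sub(r"(es|e|s)$", "", t) for t in simple_analyze(text) if t not in STOPWORDS]


def combo_analyze(text):
    """Index-time term set of a field analyzed by both sub-analyzers."""
    return set(simple_analyze(text)) | set(toy_english_analyze(text))


def matches(query_terms, indexed_terms):
    """True if the query produced terms and all of them were indexed."""
    return bool(query_terms) and all(t in indexed_terms for t in query_terms)


indexed = combo_analyze("An example")  # {'an', 'example', 'exampl'}

print(matches(simple_analyze("An example"), indexed))     # exact words match
print(matches(toy_english_analyze("examples"), indexed))  # stemmed form matches
print(matches(simple_analyze("examples"), indexed))       # unstemmed plural does not
```

The last call shows the "mostly": a query analyzed only with the simple sub-analyzer misses inflected forms that were never indexed verbatim.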

If you like to follow up, I would recommend asking questions on the Elasticsearch mailing list, because not everybody will be able to monitor the github issue tracking system for interesting discussions. More info: https://groups.google.com/group/elasticsearch

@slorber

slorber commented Jul 9, 2012

Thanks.

By chance do you know if it's possible to embed your plugin in unit tests?

@jprante
Contributor

jprante commented Jul 9, 2012

Plugins can be tested, sure, with testng/surefire/junit... the jar and the deps must be on the classpath.

@slorber

slorber commented Jul 9, 2012

Thanks, I didn't know it was so easy. I thought we would have to deal with the plugin path property or something...

@ofavre ofavre closed this as completed Feb 1, 2013
@nickminutello

So, was this closed because it is never going to be implemented in Elasticsearch?
Or because it's solved via the plugin?

@ofavre
Author

ofavre commented Jul 16, 2014

The proposed patch has never been integrated into Lucene.
The feature has been implemented as a plugin. Get it here!

@clintongormley
Contributor

@nickminutello the reason we never implemented it was that we think it is a bad idea to mix analysis chains like this.

@slorber

slorber commented Jul 16, 2014

@nickminutello note that we have been using the plugin in production since 2012 and it has worked well so far

@jprante
Contributor

jprante commented Jul 16, 2014

The combo analyzer has also been in production here since 2012 and we could not live without it.

At least Elasticsearch uses the KeywordRepeatFilter #2753, which is a kind of lightweight version of the combo analyzer, since it handles the combination of stemmed/unstemmed tokens. So the idea of combining token streams is not a bad idea.

@clintongormley
Contributor

@jprante the existence of a feature doesn't make it a good idea: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/stemming-in-situ.html#stemming-in-situ

@jprante
Contributor

jprante commented Jul 16, 2014

I see the points, but there are workarounds:

  • boosting fields in the index is not the only boosting that is available. There is also query term boosting/weighting, or document boosting by function score
  • if tokens appear more than once in a field they can be radically filtered out by the unique filter (why only_on_same_position? phrase search is no longer reliable anyway when token streams are mixed)
  • skewed IDF is also a challenge when using multiple fields instead of just one field. The effect is small for short text input and BM25 Okapi which has some tunables

So, when strategies exist to work around the effects, mixing tokens from multiple analyzers is still a good idea, especially for multi language search. Many applications here use this, with success.
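
The difference between the two modes of the unique filter mentioned above can be sketched in a few lines of Python. This is a toy model, not Elasticsearch code: tokens are plain `(term, position)` pairs and the `unique` function is a hypothetical helper that only borrows the `only_on_same_position` option name.

```python
def unique(tokens, only_on_same_position=False):
    """Drop duplicate terms, either everywhere or only at the same position."""
    seen = set()
    kept = []
    for term, position in tokens:
        # With only_on_same_position, the same term at another position survives.
        key = (term, position) if only_on_same_position else term
        if key in seen:
            continue
        seen.add(key)
        kept.append((term, position))
    return kept


# Two sub-analyzers emitted "fox" at position 1; "fox" also recurs at position 3.
tokens = [("fox", 1), ("fox", 1), ("run", 2), ("fox", 3)]

print(unique(tokens, only_on_same_position=True))
# [('fox', 1), ('run', 2), ('fox', 3)]
print(unique(tokens))
# [('fox', 1), ('run', 2)]
```

The radical mode also removes genuine repetitions in the text, so term frequencies flatten to one per field, which is the trade-off behind the remark about phrase search.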

@apatrida
Contributor

@jprante what are you doing now in the 5.x versions, since that original yakaz plugin was never updated?

@jprante
Contributor

jprante commented Mar 12, 2017

@apatrida in the meantime, I could reorganize my simple use case to a more complex token filter chain, and I dropped multiple language analysis support in favor of ICU case folding, which is not a full substitution though.

After the language_to feature jprante/elasticsearch-langdetect#49 (comment) , I plan to extend my langdetect plugin by a new query similar to simple_query_string which tries to detect the language in a query and set the appropriate language field before the query is executed on the cluster.

But if analyzer chaining is still the only possible method for some use cases, I may find time to try to implement such an über-analyzer for ES 5.x.

@apatrida
Contributor

@jprante I'm in the same situation now, using filter chains, but I do run into issues like the one someone mentioned on one of your projects, where you might want to protect a keyword from the next link in the chain and yet want the rest of the chain to process that token. (Really, just adding exception lists to some of the plugins, like the decompounder, would solve this.) I'll hop over to your langdetect and see where you are headed and see if I can help out anywhere. Thanks

@s1monw
Contributor

s1monw commented Mar 13, 2017

Lucene has a KeywordAttribute that can be set by KeywordMarkerFilter and is respected by stemmers etc. for exactly that reason: to prevent certain terms from being modified by the next link in the chain. Maybe that is useful and already available.

@apatrida
Contributor

@s1monw but that blocks all future items in the chain from processing it, not just the next link in the chain, yes? The issue I was referring to would be better solved with an exclude list in his decompounder, because the rest of the chain needs to process the token, just not the decompounder.

@s1monw
Contributor

s1monw commented Mar 13, 2017

the way token filters work is that you can chain them so you can also add one that resets keyword attributes. I think stuff like this should be addressed in a pluggable fashion otherwise you just end up with legacy issues. Also it seems not related to ES so I wonder if you wanna discuss this on the repos where that langdetect is maintained?
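
The chaining idea discussed in the last few comments can be modelled with a toy Python sketch. None of this is Lucene code: `Token.keyword` only imitates the role of KeywordAttribute, and `keyword_marker`, `toy_decompounder`, `reset_keyword` and `toy_stemmer` are hypothetical filters written for illustration (in particular, the reset step is assumed to be a custom filter you would add yourself).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    term: str
    keyword: bool = False  # toy counterpart of Lucene's KeywordAttribute


def keyword_marker(tokens, protected):
    """Flag protected terms so that later filters leave them alone."""
    return [replace(t, keyword=t.keyword or t.term in protected) for t in tokens]


def toy_decompounder(tokens, parts):
    """Split known compounds, skipping tokens flagged as keywords."""
    out = []
    for t in tokens:
        if t.keyword or t.term not in parts:
            out.append(t)
        else:
            out.extend(Token(p) for p in parts[t.term])
    return out


def reset_keyword(tokens):
    """Hypothetical filter that clears the flag for the rest of the chain."""
    return [replace(t, keyword=False) for t in tokens]


def toy_stemmer(tokens):
    """Strip a trailing "s" unless the token is flagged as a keyword."""
    return [
        t if t.keyword or not t.term.endswith("s") else replace(t, term=t.term[:-1])
        for t in tokens
    ]


tokens = [Token("notebooks"), Token("sunflowers")]
parts = {"notebooks": ["note", "books"], "sunflowers": ["sun", "flowers"]}

marked = keyword_marker(tokens, protected={"notebooks"})
chain = toy_stemmer(reset_keyword(toy_decompounder(marked, parts)))
print([t.term for t in chain])  # ['notebook', 'sun', 'flower']
```

Here "notebooks" is shielded from the decompounder only: once the flag is cleared, the stemmer still processes it, which is the behaviour the exclude-list request was after.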

@apatrida
Contributor

@s1monw sure, I was writing here so that alternatives get written down that one might use instead of what was originally presented (sub-analyzers), then rejected, in this issue. Google leads here, and now this topic gives some alternatives from some of those who originally backed that idea.
