Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow TokenFilterFactories to rewrite themselves against their preceding chain #33702

Merged
merged 10 commits into from
Sep 19, 2018
Merged

Allow TokenFilterFactories to rewrite themselves against their preceding chain #33702

merged 10 commits into from
Sep 19, 2018

Conversation

romseygeek
Copy link
Contributor

@romseygeek romseygeek commented Sep 14, 2018

We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they appear before them in a chain, because they produce multiple tokens at the same position.

This commit adds two methods to the TokenFilterFactory interface.

  • getChainAwareTokenFilterFactory() allows a filter factory to rewrite itself against its preceding filter chain, or to resolve references to other filters. It replaces ReferringFilterFactory and CustomAnalyzerProvider.checkAndApplySynonymFilter, and by default returns this.
  • runForSynonyms() defines whether or not a filter should be applied when building a synonym list Analyzer. By default it returns true.

Fixes #33609

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@romseygeek
Copy link
Contributor Author

This would also allow us to move the Synonym filters back into the common analysis module, as the special-casing is no longer required.

Another follow-up could be to add a runForNormalizer method, replacing the slightly hacky MultiTermComponent instance checks, and making it much easier to add new filters to the normalization whitelist.

@jpountz
Copy link
Contributor

jpountz commented Sep 14, 2018

I like the idea in general. Maybe we should have eg. TokenFilterFactory getSynonymAnalysisComponent or something similar instead of runForSynonyms to be consistent with multi-term analysis?

Another follow-up could be to add a runForNormalizer method, replacing the slightly hacky MultiTermComponent instance checks, and making it much easier to add new filters to the normalization whitelist.

Agreed that MultiTermComponent is a bit hacky + not type safe which is a pity, we need to fix it. Hopefully we can do it in Lucene at the same time so that the way that this problem is handled on both sides remains similar.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, this is a great change for the ootb user-experience.

@@ -62,6 +62,11 @@ public MultiplexerTokenFilterFactory(IndexSettings indexSettings, Environment en
this.preserveOriginal = settings.getAsBoolean("preserve_original", true);
}

@Override
public TokenFilterFactory getSynonymFilter() {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be null or a dummy factory that doesn't wrap the token stream? I'm afraid that we might be exposing ourselves to null-pointer exceptions?

public TokenFilterFactory getChainAwareTokenFilterFactory(TokenizerFactory tokenizer, List<CharFilterFactory> charFilters,
List<TokenFilterFactory> previousTokenFilters,
Function<String, TokenFilterFactory> allFilters) {
if (filters == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we worry about concurrency? Caching makes me a little uncomfortable due to the fact that it implies we expect that this will always be called with the same arguments, should we change the API or try to detect misusage?

@romseygeek
Copy link
Contributor Author

test this please

@romseygeek
Copy link
Contributor Author

I refactored things a bit to remove the caching. ScriptedConditionTokenFilter and MultiplexerTokenFilter now generate everything from scratch and return new TokenFilterFactory instances.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@romseygeek romseygeek merged commit 5107949 into elastic:master Sep 19, 2018
@romseygeek romseygeek deleted the synonym-chains branch September 19, 2018 14:52
romseygeek added a commit that referenced this pull request Sep 19, 2018
…ing chain (#33702)

We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to
know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This
special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional
filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they
appear before them in a chain, because they produce multiple tokens at the same position.

This commit adds two methods to the TokenFilterFactory interface.

* `getChainAwareTokenFilterFactory()` allows a filter factory to rewrite itself against its preceding
  filter chain, or to resolve references to other filters. It replaces `ReferringFilterFactory` and
  `CustomAnalyzerProvider.checkAndApplySynonymFilter`, and by default returns `this`.
* `getSynonymFilter()` defines whether or not a filter should be applied when building a synonym
  list `Analyzer`. By default it returns `true`.

Fixes #33609
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

keyword_repeat and multiplexer don't play well with subsequent synonym filters
4 participants