Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keyword_repeat and multiplexer don't play well with subsequent synonym filters #33609

Closed
cbuescher opened this issue Sep 11, 2018 · 9 comments · Fixed by #33702
Closed

keyword_repeat and multiplexer don't play well with subsequent synonym filters #33609

cbuescher opened this issue Sep 11, 2018 · 9 comments · Fixed by #33702
Labels
:Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v6.5.0 v7.0.0-beta1

Comments

@cbuescher
Copy link
Member

I recently saw an issue where an anlyzer chain was set up to perform some stemming on the input and then apply a synonym filter afterwards.
In order to also keep the unstemmed tokens in the output (and apply synonyms as well there if possible), a keyword_repeat filter was used, but
this already leads to errors on index creating because the synonyms in the filter are validated by running through the analysis chain:

PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "light_english_stemmer",
            "my_synonyms"
          ]
        }
      }
    }
  }
}

Gives:

    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: optimised analyzed to a token (optimise) with position increment != 1 (got: 0)"
      }
    }

I also tried using a multipexer like so, but that is running into similar issues:

PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "my_multiplexer": {
          "type": "multiplexer",
          "filters": ["light_english_stemmer"]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_multiplexer",
            "my_synonyms"
            
          ]
        }
      }
    }
  }
}

I'm wondering if I'm using this the wrong way or if there are other ways to achieve similar effect.
Also I'm trying to understand what the position checks that are causing this rejection in SynonymMap#analyze are supposed to prevent
and if those checks could possibly be omitted for the case of the tokens generated by keyword_repeat or multiplexer.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@romseygeek
Copy link
Contributor

#30968 is supposed to address this sort of issue, although it's still not a great workaround, and I think in this case will end up with the synonym filter just being a no-op.

I wonder if we should make it possible to specify an analyzer to use for the synonym token filter? Mostly synonyms just need tokenization, with some light normalization (eg lowercasing) applied. Trying to re-use the analysis chain up to the point that the synonym filter appears ends up causing all sorts of weird issues if anything complex is being done.

@romseygeek
Copy link
Contributor

Thinking about this some more, including a synonym filter within a condition or multiplex filter currently won't work at all, because SynonymTokenFilterFactory isn't completely set up in its constructor like other filters, and needs an additional method call to load its dictionaries.

@jpountz
Copy link
Contributor

jpountz commented Sep 12, 2018

I wonder if we should make it possible to specify an analyzer to use for the synonym token filter?

+1 this would help work around these issues in the rare cases when today's default behavior doesn't work.

@jimczi
Copy link
Contributor

jimczi commented Sep 12, 2018

I wonder if we should make it possible to specify an analyzer to use for the synonym token filter?

Wouldn't this defeat the purpose of the change we made in 6x ? I agree with the fact that only light normalization should be applied to synonym but we also want to avoid the case where the analyzer that is used to build the synonym map is not compatible with the analyzer chain that is used by the field. If you put a stemmer before a synonym filter, any synonym rule must apply the same normalization beforehand otherwise we would miss synonyms.
The issue at play here is that the synonym_filter cannot be built if it contains rule where some terms are removed or expanded with variations (keyword_repeat with a stemmer for instance).
This is a limitation of the filter but it is also important to disambiguate expansions since otherwise we'd have weird cases to handle where the input and output have synonyms.
Also note that this example works fine if you set the stemmer after the synonym_filter. Maybe we can limit the filters that can appear before a synonym filter the same way we limit filters in the normalizer for keyword field ? This would be breaking and would limit what you can do with synonyms but at least it would be impossible to have an invalid chain with this rule in place.

@romseygeek
Copy link
Contributor

If you put a stemmer before a synonym filter, any synonym rule must apply the same normalization beforehand otherwise we would miss synonyms.

True, although I think the answer here is more "don't put a stemmer before a synonym filter". Limiting what can appear before the synonym filter is tricky though - what do we do with custom tokenfilters, for example?

@cbuescher
Copy link
Member Author

Also note that this example works fine if you set the stemmer after the synonym_filter. Maybe we can limit the filters that can appear before a synonym filter the same way we limit filters in the normalizer for keyword field ?

As far as I understand some use cases though, it makes sense to put a stemmer in front of a synonym_filter. Imagine you have the synonym rule "programmer => developer" and your input text is "A lot of programmers are freaks", you would want the synonym rule to match, so you would perform stemming before the synonym filter. If you were forced to put the stemmer afterwards, you would need to include all kinds of inflected word forms in the synonyms list, which I think is a practical issue.

romseygeek added a commit that referenced this issue Sep 19, 2018
…ing chain (#33702)

We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to 
know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This
special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional
filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they
appear before them in a chain, because they produce multiple tokens at the same position.

This commit adds two methods to the TokenFilterFactory interface.

* `getChainAwareTokenFilterFactory()` allows a filter factory to rewrite itself against its preceding
  filter chain, or to resolve references to other filters. It replaces `ReferringFilterFactory` and
  `CustomAnalyzerProvider.checkAndApplySynonymFilter`, and by default returns `this`.
* `getSynonymFilter()` defines whether or not a filter should be applied when building a synonym
  list `Analyzer`. By default it returns `true`.

Fixes #33609
romseygeek added a commit that referenced this issue Sep 19, 2018
…ing chain (#33702)

We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to
know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This
special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional
filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they
appear before them in a chain, because they produce multiple tokens at the same position.

This commit adds two methods to the TokenFilterFactory interface.

* `getChainAwareTokenFilterFactory()` allows a filter factory to rewrite itself against its preceding
  filter chain, or to resolve references to other filters. It replaces `ReferringFilterFactory` and
  `CustomAnalyzerProvider.checkAndApplySynonymFilter`, and by default returns `this`.
* `getSynonymFilter()` defines whether or not a filter should be applied when building a synonym
  list `Analyzer`. By default it returns `true`.

Fixes #33609
@sanjusagare
Copy link

I recently saw an issue where an anlyzer chain was set up to perform some stemming on the input and then apply a synonym filter afterwards.
In order to also keep the unstemmed tokens in the output (and apply synonyms as well there if possible), a keyword_repeat filter was used, but
this already leads to errors on index creating because the synonyms in the filter are validated by running through the analysis chain:

PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "light_english_stemmer",
            "my_synonyms"
          ]
        }
      }
    }
  }
}

Gives:

    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: optimised analyzed to a token (optimise) with position increment != 1 (got: 0)"
      }
    }

I also tried using a multipexer like so, but that is running into similar issues:

PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "my_multiplexer": {
          "type": "multiplexer",
          "filters": ["light_english_stemmer"]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_multiplexer",
            "my_synonyms"
            
          ]
        }
      }
    }
  }
}

I'm wondering if I'm using this the wrong way or if there are other ways to achieve similar effect.
Also I'm trying to understand what the position checks that are causing this rejection in SynonymMap#analyze are supposed to prevent
and if those checks could possibly be omitted for the case of the tokens generated by keyword_repeat or multiplexer.

I am facing the same issue . did u get solution?

@cbuescher
Copy link
Member Author

cbuescher commented Oct 12, 2020

@sanjusagare #33702 should have solved this. If you are having general questions or problems please use the support forums over at https://discuss.elastic.co. We prefer to use Github issues only for bug reports and feature requests, and we think it's more likely this is a question than a bug report and its better to continue the discussion there than on a closed issue.

@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v6.5.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants