`keyword_repeat` and `multiplexer` don't play well with subsequent synonym filters #33609

cbuescher · 2018-09-11T17:40:27Z

I recently saw an issue where an anlyzer chain was set up to perform some stemming on the input and then apply a synonym filter afterwards.
In order to also keep the unstemmed tokens in the output (and apply synonyms as well there if possible), a keyword_repeat filter was used, but
this already leads to errors on index creating because the synonyms in the filter are validated by running through the analysis chain:

PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "light_english_stemmer",
            "my_synonyms"
          ]
        }
      }
    }
  }
}

Gives:

    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: optimised analyzed to a token (optimise) with position increment != 1 (got: 0)"
      }
    }

I also tried using a multipexer like so, but that is running into similar issues:

PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "my_multiplexer": {
          "type": "multiplexer",
          "filters": ["light_english_stemmer"]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_multiplexer",
            "my_synonyms"
            
          ]
        }
      }
    }
  }
}

I'm wondering if I'm using this the wrong way or if there are other ways to achieve similar effect.
Also I'm trying to understand what the position checks that are causing this rejection in SynonymMap#analyze are supposed to prevent
and if those checks could possibly be omitted for the case of the tokens generated by keyword_repeat or multiplexer.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2018-09-11T17:40:28Z

Pinging @elastic/es-search-aggs

romseygeek · 2018-09-11T21:15:24Z

#30968 is supposed to address this sort of issue, although it's still not a great workaround, and I think in this case will end up with the synonym filter just being a no-op.

I wonder if we should make it possible to specify an analyzer to use for the synonym token filter? Mostly synonyms just need tokenization, with some light normalization (eg lowercasing) applied. Trying to re-use the analysis chain up to the point that the synonym filter appears ends up causing all sorts of weird issues if anything complex is being done.

romseygeek · 2018-09-11T21:30:07Z

Thinking about this some more, including a synonym filter within a condition or multiplex filter currently won't work at all, because SynonymTokenFilterFactory isn't completely set up in its constructor like other filters, and needs an additional method call to load its dictionaries.

jpountz · 2018-09-12T12:55:13Z

I wonder if we should make it possible to specify an analyzer to use for the synonym token filter?

+1 this would help work around these issues in the rare cases when today's default behavior doesn't work.

jimczi · 2018-09-12T13:18:06Z

I wonder if we should make it possible to specify an analyzer to use for the synonym token filter?

Wouldn't this defeat the purpose of the change we made in 6x ? I agree with the fact that only light normalization should be applied to synonym but we also want to avoid the case where the analyzer that is used to build the synonym map is not compatible with the analyzer chain that is used by the field. If you put a stemmer before a synonym filter, any synonym rule must apply the same normalization beforehand otherwise we would miss synonyms.
The issue at play here is that the synonym_filter cannot be built if it contains rule where some terms are removed or expanded with variations (keyword_repeat with a stemmer for instance).
This is a limitation of the filter but it is also important to disambiguate expansions since otherwise we'd have weird cases to handle where the input and output have synonyms.
Also note that this example works fine if you set the stemmer after the synonym_filter. Maybe we can limit the filters that can appear before a synonym filter the same way we limit filters in the normalizer for keyword field ? This would be breaking and would limit what you can do with synonyms but at least it would be impossible to have an invalid chain with this rule in place.

romseygeek · 2018-09-12T13:42:04Z

If you put a stemmer before a synonym filter, any synonym rule must apply the same normalization beforehand otherwise we would miss synonyms.

True, although I think the answer here is more "don't put a stemmer before a synonym filter". Limiting what can appear before the synonym filter is tricky though - what do we do with custom tokenfilters, for example?

cbuescher · 2018-09-12T14:20:14Z

Also note that this example works fine if you set the stemmer after the synonym_filter. Maybe we can limit the filters that can appear before a synonym filter the same way we limit filters in the normalizer for keyword field ?

As far as I understand some use cases though, it makes sense to put a stemmer in front of a synonym_filter. Imagine you have the synonym rule "programmer => developer" and your input text is "A lot of programmers are freaks", you would want the synonym rule to match, so you would perform stemming before the synonym filter. If you were forced to put the stemmer afterwards, you would need to include all kinds of inflected word forms in the synonyms list, which I think is a practical issue.

…ing chain (#33702) We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they appear before them in a chain, because they produce multiple tokens at the same position. This commit adds two methods to the TokenFilterFactory interface. * `getChainAwareTokenFilterFactory()` allows a filter factory to rewrite itself against its preceding filter chain, or to resolve references to other filters. It replaces `ReferringFilterFactory` and `CustomAnalyzerProvider.checkAndApplySynonymFilter`, and by default returns `this`. * `getSynonymFilter()` defines whether or not a filter should be applied when building a synonym list `Analyzer`. By default it returns `true`. Fixes #33609

sanjusagare · 2020-10-09T18:42:36Z

I recently saw an issue where an anlyzer chain was set up to perform some stemming on the input and then apply a synonym filter afterwards.
In order to also keep the unstemmed tokens in the output (and apply synonyms as well there if possible), a keyword_repeat filter was used, but
this already leads to errors on index creating because the synonyms in the filter are validated by running through the analysis chain:

PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "light_english_stemmer",
            "my_synonyms"
          ]
        }
      }
    }
  }
}

Gives:

    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "Invalid synonym rule at line 1",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: optimised analyzed to a token (optimise) with position increment != 1 (got: 0)"
      }
    }

I also tried using a multipexer like so, but that is running into similar issues:

PUT /index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "optimised => optimized"
          ]
        },
        "my_multiplexer": {
          "type": "multiplexer",
          "filters": ["light_english_stemmer"]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "blogs_synonyms_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_multiplexer",
            "my_synonyms"
            
          ]
        }
      }
    }
  }
}

I'm wondering if I'm using this the wrong way or if there are other ways to achieve similar effect.
Also I'm trying to understand what the position checks that are causing this rejection in SynonymMap#analyze are supposed to prevent
and if those checks could possibly be omitted for the case of the tokens generated by keyword_repeat or multiplexer.

I am facing the same issue . did u get solution?

cbuescher · 2020-10-12T08:42:54Z

@sanjusagare #33702 should have solved this. If you are having general questions or problems please use the support forums over at https://discuss.elastic.co. We prefer to use Github issues only for bug reports and feature requests, and we think it's more likely this is a question than a bug report and its better to continue the discussion there than on a closed issue.

cbuescher added :Search Relevance/Analysis How text is split into tokens v7.0.0 v6.5.0 labels Sep 11, 2018

romseygeek mentioned this issue Sep 14, 2018

Allow TokenFilterFactories to rewrite themselves against their preceding chain #33702

Merged

romseygeek closed this as completed in #33702 Sep 19, 2018

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`keyword_repeat` and `multiplexer` don't play well with subsequent synonym filters #33609

`keyword_repeat` and `multiplexer` don't play well with subsequent synonym filters #33609

cbuescher commented Sep 11, 2018

elasticmachine commented Sep 11, 2018

romseygeek commented Sep 11, 2018

romseygeek commented Sep 11, 2018

jpountz commented Sep 12, 2018

jimczi commented Sep 12, 2018

romseygeek commented Sep 12, 2018

cbuescher commented Sep 12, 2018

sanjusagare commented Oct 9, 2020

cbuescher commented Oct 12, 2020 •

edited

Loading

keyword_repeat and multiplexer don't play well with subsequent synonym filters #33609

keyword_repeat and multiplexer don't play well with subsequent synonym filters #33609

Comments

cbuescher commented Sep 11, 2018

elasticmachine commented Sep 11, 2018

romseygeek commented Sep 11, 2018

romseygeek commented Sep 11, 2018

jpountz commented Sep 12, 2018

jimczi commented Sep 12, 2018

romseygeek commented Sep 12, 2018

cbuescher commented Sep 12, 2018

sanjusagare commented Oct 9, 2020

cbuescher commented Oct 12, 2020 • edited Loading

`keyword_repeat` and `multiplexer` don't play well with subsequent synonym filters #33609

`keyword_repeat` and `multiplexer` don't play well with subsequent synonym filters #33609

cbuescher commented Oct 12, 2020 •

edited

Loading