apply unidirectional synonyms at query-time #411

Open
missinglink opened this issue Dec 12, 2019 · 4 comments
@missinglink
Member

missinglink commented Dec 12, 2019

as of today we finally removed all unidirectional synonyms (ones using the a=>b syntax) from our default synonyms file 🎉

unfortunately, I realized that there is a bug which is preventing those unidirectional synonyms from working properly when users specify them in a custom configuration.

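as per the example below, it's possible to index the term "hello" and then not be able to retrieve the document using the term "hello" 🤔 (at index-time "hello" is replaced by "world", but the standard search analyzer leaves the query term as "hello", so nothing matches)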

the solution to this problem is to split all the synonyms into two buckets: one for unidirectional synonyms (a=>b syntax) and one for bidirectional synonyms (a,b syntax). we will then need to apply both buckets at index-time and only the unidirectional synonyms at query-time.

curl -s -XDELETE "http://localhost:9200/foo?pretty=true"

curl -s -XPUT "http://localhost:9200/foo?pretty=true" \
  -H 'Content-Type: application/json' \
  -d '{
      "settings" : {
        "analysis": {
          "filter" : {
            "mySynonym" : {
              "type" : "synonym",
              "synonyms" : [
                "hello => world"
              ]
            }
          },
          "analyzer": {
            "myAnalyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "mySynonym"
              ]
            }
          }
        }
      },
      "mappings" : {
        "_doc" : {
          "properties" : {
            "field1": {
              "type": "text",
              "analyzer": "myAnalyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }'

curl -s -XPOST "http://localhost:9200/foo/_doc/example?pretty=true" \
  -H 'Content-Type: application/json' \
  -d '{
      "field1": "hello"
    }'

curl -s -XPOST "http://localhost:9200/foo/_refresh?pretty=true"

# search for the exact term that was indexed
curl -s -XGET "http://localhost:9200/foo/_search?pretty=true" \
  -H 'Content-Type: application/json' \
  -d '{
      "query": {
        "match": {
          "field1": "hello"
        }
      }
    }'
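
A rough sketch of what the two-bucket fix could look like for the same foo index, assuming it has been deleted first as in the example above. The filter and analyzer names (synonymsUnidirectional, synonymsBidirectional, myIndexAnalyzer, mySearchAnalyzer) and the "street, st" rule are made up for illustration and are not what Pelias uses:

```shell
# Sketch only: all filter/analyzer names here are hypothetical.
# Both buckets run at index-time; only the unidirectional bucket runs at query-time.
SETTINGS='{
  "settings": {
    "analysis": {
      "filter": {
        "synonymsUnidirectional": {
          "type": "synonym",
          "synonyms": [ "hello => world" ]
        },
        "synonymsBidirectional": {
          "type": "synonym",
          "synonyms": [ "street, st" ]
        }
      },
      "analyzer": {
        "myIndexAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "synonymsUnidirectional", "synonymsBidirectional" ]
        },
        "mySearchAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "synonymsUnidirectional" ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "field1": {
          "type": "text",
          "analyzer": "myIndexAnalyzer",
          "search_analyzer": "mySearchAnalyzer"
        }
      }
    }
  }
}'

# create the index (needs Elasticsearch on localhost:9200, as in the example above)
curl -s -XPUT "http://localhost:9200/foo?pretty=true" \
  -H 'Content-Type: application/json' \
  -d "$SETTINGS" || echo "Elasticsearch is not reachable on localhost:9200" >&2
```

With settings along these lines, "hello" should be rewritten to "world" both when indexing and when searching, so re-running the index and search commands above should find the document.
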
@missinglink
Member Author

missinglink commented Dec 12, 2019

a workaround, for now, is to duplicate the token from the left side of the => on the right side as such:

hello => hello, world
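
One way to check the workaround without creating an index is the _analyze API with an inline filter definition. This is only a sketch, assuming a local Elasticsearch on localhost:9200 as in the example above; the BODY variable name is arbitrary:

```shell
# Sketch: run the workaround rule through the _analyze API to see which
# tokens it emits for the input "hello".
BODY='{
  "tokenizer": "standard",
  "filter": [
    { "type": "synonym", "synonyms": [ "hello => hello, world" ] }
  ],
  "text": "hello"
}'

curl -s -XGET "http://localhost:9200/_analyze?pretty=true" \
  -H 'Content-Type: application/json' \
  -d "$BODY" || echo "Elasticsearch is not reachable on localhost:9200" >&2
```

The response should list both "hello" and "world" as tokens, which is why a query for "hello" analyzed with the standard search analyzer can match again.
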

@orangejulius
Member

So we've now done this for the name field, and the address_parts.street field with pelias/api#1444. Are there other fields we should do the same for, or is this all done?

@missinglink
Member Author

missinglink commented Jun 26, 2020

This is only really relevant for custom user-defined synonyms and doesn't affect stock-standard Pelias.

So if a user added a synonym foo => bar in custom_name, for instance, then all instances of 'foo' would be replaced by 'bar' at index-time, yet at query-time there is no such replacement. This means the doc doesn't match a query that is verbatim the same as what was in the source data.

Let's leave this open for now so we remember. I'll try and fix it at some point, but it's a relatively low priority because it may not even affect anyone!

@missinglink
Member Author

One totally valid fix is just to say we don't support the => syntax at all, or that we warn anyone who uses it.
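
The warning option could be as small as a grep over the synonyms file. A minimal sketch, assuming one rule per line and '#' for comment lines; the function name warn_unidirectional and the message wording are made up for illustration:

```shell
# Hypothetical check: print a warning for every unidirectional (a => b) rule
# found in a synonyms file passed as the first argument.
warn_unidirectional() {
  # skip '#' comment lines, then report each remaining line containing '=>'
  grep -v '^[[:space:]]*#' "$1" | grep -F '=>' | while IFS= read -r rule; do
    echo "warning: unidirectional synonym '$rule' is not applied at query-time"
  done
}

# usage example with a throwaway file
tmp=$(mktemp)
printf '%s\n' '# custom synonyms' 'hello => world' 'street, st' > "$tmp"
warnings=$(warn_unidirectional "$tmp")
echo "$warnings"
rm -f "$tmp"
```

Only the "hello => world" line is reported here; the bidirectional "street, st" rule and the comment line pass through silently.
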
