
major synonyms upgrade #453

Merged: 10 commits merged into master from synonyms_upgrade on Jul 13, 2020

Conversation

@missinglink (Member) commented Jun 18, 2020

This is a massive overhaul of the synonyms list, adding hundreds of new ones for English, French, Spanish and German.

It can probably replace these PRs:

After the success of #446 I'm going to do a full planet build of this and see how it performs.

cc/ @Joxit I added loads of French synonyms (no multi-word synonyms for now).

I wouldn't suggest merging this until after pelias/api#1444 because applying all these street synonyms at query-time when they aren't required is surely not going to be great from a performance perspective.

@missinglink (Member, Author) commented Jun 18, 2020

I've fixed the integration tests by loosening the test assertion method.

Previously it would ensure that the tokens listed in the $expected array were equal to the tokens in the Elasticsearch index.
Now it ensures that the tokens listed in the $expected array are all present in the Elasticsearch index.

I think this is preferable going forward as some of the tests were becoming brittle and hard to maintain.

The tests helped me catch some missing directional synonyms for English, which I've fixed up.
There was previously a synonym for see, s (German See, 'lake' in English) which I have removed since I've not seen that one in the wild.
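
A minimal sketch of the loosened assertion style, assuming the tokens are plain string arrays; the helper name and example tokens below are illustrative, not the actual test-harness code:

```js
const assert = require('assert');

// loose check: every expected token must be present in the indexed tokens,
// extra tokens in the index are allowed
function assertTokensPresent (expectedTokens, indexedTokens) {
  expectedTokens.forEach((token) => {
    assert.ok(
      indexedTokens.includes(token),
      `expected token "${token}" to be present in the elasticsearch index`
    );
  });
}

// previously (strict): the token lists had to match exactly
// assert.deepStrictEqual(indexedTokens, expectedTokens);

// now (loose): passes even though the index contains an extra token
assertTokensPresent(['east', 'e'], ['east', 'e', 'eastern']);
```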

@missinglink marked this pull request as ready for review June 18, 2020 12:55
@missinglink (Member, Author) commented:

I just added b537967 to ensure #452 is being addressed correctly.

@Joxit (Member) left a comment:

I say yes! Yes! Yes! Yes! 😄

BTW, I like the synonyms/folder_with_synonyms

@@ -0,0 +1,33 @@
nord, n
nördlich, nördl, nordl, nordlich, noerdlich
@orangejulius (Member) commented Jun 18, 2020:

aren't diacritics similar to punctuation in that they'll be stripped earlier in the analysis pipeline than synonym handling, and are therefore redundant?

I suppose it won't make a difference, since we will remove duplicate tokens even later, right?

@missinglink (Member, Author) replied:

Yeah I was actually going to leave a comment about that...

You're right, as-is this might produce some tokens which will never be matched, which isn't ideal since it will take up extra disk space, but won't do any harm.

There are a few reasons I chose to leave them in:

  • the icu_folding filter implementation is currently inconsistent; in some places it's run before the synonyms and in others after
  • I would consider removing the icu_folding filter at a later date since it can cause issues in some cases (like the recent Hütten vs. Hutten issue)
  • Generating these lists is super time-consuming and I'd prefer to not have to do it all again any time soon

Member:

Yeah, that makes sense. We can afford the extra disk space in the short/medium term. It makes sense to not sweat the details while we test out these synonyms.

Here's an idea for the longer term: what if we made a synonyms library that had many lists of synonyms and could process and expose them as a JavaScript module in many different ways? The processed output could automatically exclude synonyms that differ only by diacritics. (A rough sketch of what such a module might look like follows the list below.)

It would also be helpful since there are at least 3 very different places where we probably want many of the same synonyms:

  • here, in pelias/schema for Elasticsearch synonyms. These will ensure we match properly even when input/data have different values
  • In pelias/api's deduplication layer. This allows us to deduplicate records that have, for example, differing abbreviations but refer to the same place
  • In pelias/fuzzy-tester, so that we can generate resilient test cases that don't have to care if their expectations exactly match what we happen to have in our data.
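
Purely as an illustration, here's a rough sketch of what such a shared module might expose; the names, data, and API below are hypothetical and nothing like this exists in pelias yet:

```js
// hypothetical synonym groups; the real lists would live in per-language files
const groups = [
  ['street', 'st', 'str'],
  ['avenue', 'ave', 'av']
];

module.exports = {
  // contraction-style lines suitable for an Elasticsearch synonyms file
  // (this is where variants differing only by diacritics could be filtered out)
  asSynonymLines: () => groups.map((group) => group.join(', ')),

  // canonical-form lookup, e.g. for the deduplication layer in pelias/api
  // or for generating resilient expectations in pelias/fuzzy-tester
  asCanonicalLookup: () => {
    const lookup = new Map();
    groups.forEach((group) => {
      group.forEach((token) => lookup.set(token, group[0]));
    });
    return lookup;
  }
};
```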

@missinglink (Member, Author) commented Jun 18, 2020:

I also like the idea of running the brücke, brucke, bruecke synonyms before the icu_folding filter because it produces three forms (umlaut, ASCII, English-keyboard spelling) where we are currently only able to handle two of those.

@missinglink (Member, Author) replied:

Yeah, pelias/parser and libpostal also use these synonyms in some form, so they're used throughout.

@missinglink (Member, Author) replied:

I would prefer not to; the icu_folding filter ensures that only ASCII tokens are inserted in the index, and the unique_only_same_position filter removes any duplicate tokens at the same position, so café, cafe will still only generate one token.

@missinglink (Member, Author) replied:

There are benefits to having the synonyms come before the icu_folding filter, such as the brücke, brucke, bruecke case above.

My suspicion is that stripping the diacritics earlier in the pipeline will become an impediment to some languages in the future because there will be no way for users to generate tokens based on the native language representation of the word.

I'm not that well versed in languages, but I suspect it's not a big problem in Western Europe; the further east you go, the more valuable this extra control becomes.
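
A simplified sketch of the ordering being discussed; the synonym group is the brücke example above, but the inline filter definitions and analyzer name are illustrative rather than the actual schema:

```js
const analysis = {
  filter: {
    'synonyms/streets_de': {
      type: 'synonym',
      // all three written forms are kept as one equivalence group
      synonyms: ['brücke, brucke, bruecke']
    }
  },
  analyzer: {
    street_example: {
      tokenizer: 'whitespace',
      filter: [
        'lowercase',
        'synonyms/streets_de', // synonyms first: brücke -> brücke, brucke, bruecke
        'icu_folding'          // folding after: any remaining diacritics are stripped
      ]
    }
  }
};

console.log(JSON.stringify(analysis, null, 2));
```

Because the synonym group lists all three spellings explicitly, running it before icu_folding means an indexed brücke also produces the bruecke transliteration, which folding alone would never generate.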

@missinglink (Member, Author) replied:

One interesting thing I've found is that, regardless of whether the icu_folding filter comes before or after the synonyms, no tokens with diacritics are produced either way:

elyzer --index pelias --analyzer peliasPhrase 'east west café'
CHAR_FILTER: punctuation
{0:east west café}
CHAR_FILTER: nfkc_normalizer
{0:east west café}
TOKENIZER: peliasNameTokenizer
{0:east}	{1:west}	{2:café}
TOKEN_FILTER: lowercase
{0:east}	{1:west}	{2:café}
TOKEN_FILTER: trim
{0:east}	{1:west}	{2:café}
TOKEN_FILTER: remove_duplicate_spaces
{0:east}	{1:west}	{2:café}
TOKEN_FILTER: synonyms/punctuation
{0:east}	{1:west}	{2:café}
TOKEN_FILTER: synonyms/custom_name
{0:east}	{1:west}	{2:café}
TOKEN_FILTER: synonyms/personal_titles
{0:east}	{1:west}	{2:café}
TOKEN_FILTER: synonyms/place_names
{0:east}	{1:west}	{2:café,cafe}
TOKEN_FILTER: synonyms/streets
{0:east}	{1:west}	{2:café,cafe}
TOKEN_FILTER: synonyms/directionals
{0:east,e}	{1:west,w,wst}	{2:café,cafe}
TOKEN_FILTER: icu_folding
{0:east,e}	{1:west,w,wst}	{2:cafe,cafe}
TOKEN_FILTER: remove_ordinals
{0:east,e}	{1:west,w,wst}	{2:cafe,cafe}
TOKEN_FILTER: unique_only_same_position
{0:east,e}	{1:west,w,wst}	{2:cafe}
TOKEN_FILTER: notnull
{0:east,e}	{1:west,w,wst}	{2:cafe}
TOKEN_FILTER: flatten_graph
{0:east,e}	{1:west,w,wst}	{2:cafe}
elyzer --index pelias --analyzer peliasIndexOneEdgeGram 'east west café'
CHAR_FILTER: punctuation
{0:east west café}
CHAR_FILTER: nfkc_normalizer
{0:east west café}
TOKENIZER: peliasNameTokenizer
{0:east}	{1:west}	{2:café}
TOKEN_FILTER: lowercase
{0:east}	{1:west}	{2:café}
TOKEN_FILTER: icu_folding
{0:east}	{1:west}	{2:cafe}
TOKEN_FILTER: trim
{0:east}	{1:west}	{2:cafe}
TOKEN_FILTER: synonyms/custom_name
{0:east}	{1:west}	{2:cafe}
TOKEN_FILTER: synonyms/personal_titles
{0:east}	{1:west}	{2:cafe}
TOKEN_FILTER: synonyms/place_names
{0:east}	{1:west}	{2:cafe,cafe}
TOKEN_FILTER: synonyms/streets
{0:east}	{1:west}	{2:cafe,cafe}
TOKEN_FILTER: synonyms/directionals
{0:east,e}	{1:west,w,wst}	{2:cafe,cafe}
TOKEN_FILTER: synonyms/punctuation
{0:east,e}	{1:west,w,wst}	{2:cafe,cafe}
TOKEN_FILTER: remove_ordinals
{0:east,e}	{1:west,w,wst}	{2:cafe,cafe}
TOKEN_FILTER: removeAllZeroNumericPrefix
{0:east,e}	{1:west,w,wst}	{2:cafe,cafe}
TOKEN_FILTER: peliasOneEdgeGramFilter
{0:e,ea,eas,east,e}	{1:w,we,wes,west,w,w,ws,wst}	{2:c,ca,caf,cafe,c,ca,caf,cafe}
TOKEN_FILTER: unique_only_same_position
{0:e,ea,eas,east}	{1:w,we,wes,west,ws,wst}	{2:c,ca,caf,cafe}
TOKEN_FILTER: notnull
{0:e,ea,eas,east}	{1:w,we,wes,west,ws,wst}	{2:c,ca,caf,cafe}
TOKEN_FILTER: flatten_graph
{0:e,ea,eas,east}	{1:w,we,wes,west,ws,wst}	{2:c,ca,caf,cafe}

@missinglink (Member, Author) commented Jun 22, 2020:

Another interesting thing I found is that the synonyms don't 'explode' like I assumed they did.
(I was assuming that because of street, str and straße, str we'd end up with street == straße; fortunately this isn't true!)

This is great: as long as the street names are indexed in their expanded form (i.e. the source data contains 'Street', not 'St.') they only get the minimal set of synonyms needed to be correct, whereas when the source provides an abbreviated form which is ambiguous, more synonyms are produced:

elyzer --index pelias --analyzer peliasStreet 'main str'
CHAR_FILTER: punctuation
{0:main str}
CHAR_FILTER: nfkc_normalizer
{0:main str}
TOKENIZER: peliasStreetTokenizer
{0:main}	{1:str}
TOKEN_FILTER: lowercase
{0:main}	{1:str}
TOKEN_FILTER: trim
{0:main}	{1:str}
TOKEN_FILTER: remove_duplicate_spaces
{0:main}	{1:str}
TOKEN_FILTER: synonyms/custom_street
{0:main}	{1:str}
TOKEN_FILTER: synonyms/personal_titles
{0:main}	{1:str}
TOKEN_FILTER: synonyms/streets
{0:main}	{1:str,straße,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: synonyms/directionals
{0:main}	{1:str,straße,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: icu_folding
{0:main}	{1:str,strasse,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: remove_ordinals
{0:main}	{1:str,strasse,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: trim
{0:main}	{1:str,strasse,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: unique_only_same_position
{0:main}	{1:str,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: notnull
{0:main}	{1:str,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: flatten_graph
{0:main}	{1:str,strasse,street,st,stre,stree,strt}
elyzer --index pelias --analyzer peliasStreet 'main street'
CHAR_FILTER: punctuation
{0:main street}
CHAR_FILTER: nfkc_normalizer
{0:main street}
TOKENIZER: peliasStreetTokenizer
{0:main}	{1:street}
TOKEN_FILTER: lowercase
{0:main}	{1:street}
TOKEN_FILTER: trim
{0:main}	{1:street}
TOKEN_FILTER: remove_duplicate_spaces
{0:main}	{1:street}
TOKEN_FILTER: synonyms/custom_street
{0:main}	{1:street}
TOKEN_FILTER: synonyms/personal_titles
{0:main}	{1:street}
TOKEN_FILTER: synonyms/streets
{0:main}	{1:street,st,str,stre,stree,strt}
TOKEN_FILTER: synonyms/directionals
{0:main}	{1:street,st,str,stre,stree,strt}
TOKEN_FILTER: icu_folding
{0:main}	{1:street,st,str,stre,stree,strt}
TOKEN_FILTER: remove_ordinals
{0:main}	{1:street,st,str,stre,stree,strt}
TOKEN_FILTER: trim
{0:main}	{1:street,st,str,stre,stree,strt}
TOKEN_FILTER: unique_only_same_position
{0:main}	{1:street,st,str,stre,stree,strt}
TOKEN_FILTER: notnull
{0:main}	{1:street,st,str,stre,stree,strt}
TOKEN_FILTER: flatten_graph
{0:main}	{1:street,st,str,stre,stree,strt}
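
To make the non-viral behaviour concrete, the groups involved look roughly like this; they're reconstructed from the analyzer output above, so the actual file contents may differ slightly:

```js
// each line is an independent equivalence group; sharing the abbreviation 'str'
// between the two lines does not create an association street == straße
const streetGroups = [
  'street, st, str, stre, stree, strt',
  'straße, strasse, str'
];

console.log(streetGroups.join('\n'));
```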

Member:

That sounds great! 🎉

@orangejulius (Member) commented Jun 22, 2020

Okay, I've run a bunch of acceptance tests against this change and the results are about as we might have expected:

On the positive side, there are many queries that previously didn't work at all due to abbreviation issues and are now passing.

However, there are also a few tests showing regressions. Overall, there were more regressions than improvements, so the total test pass rate actually went down (for autocomplete tests at least).

Regressions

A good example of the type of regression we're seeing is a query like 15 Call Street:

[screenshot: failing acceptance test for '15 Call Street']

The c, calle, call, [etc] synonym line from the new Spanish file is probably what's causing this particular issue. We knew that single-character synonyms are especially dangerous, but c is very commonly used as an abbreviation for calle in Spain, so 🤷.

Performance

There's also almost certainly a performance impact to this change: I wasn't able to run the regular autocomplete acceptance tests, even very slowly, on our standard-sized dev Elasticsearch instance. I was seeing timeouts and had to jump to a larger instance size. We'll need to do some more testing to quantify exactly what the difference is, though.

Going forward

I think at the very least we'll want to back out some of the new synonyms in this PR, particularly any single-character synonyms. While we definitely want a system to support c/calle and other abbreviations, I don't believe Elasticsearch synonyms are the way to do that. We will need something that allows us to only handle that abbreviation for results in areas where we know the abbreviation is common, such as Spain.

I think a way we can do this is to move towards a system where we are generating multiple Elasticsearch documents from a single record in our data, and generating variations of the input at query time. This sort of system is already under discussion for, and would solve, our current top issue with alt names and scoring penalties. While more complex, it would allow us to handle abbreviations with more precision and flexibility.

Edit: another thing that might let us buy a little more time with the current system is to use synonym expansion (at index time only, so we'll need pelias/api#1444) so that only calle is replaced with calle, c (something like calle => calle, c, call, etc).
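
A sketch of that explicit-mapping idea, using Elasticsearch's Solr-style synonym syntax; the filter shape and token list here are illustrative, not the actual schema:

```js
const streetsEsFilter = {
  type: 'synonym',
  synonyms: [
    // equivalence group: every form maps to every other form, so a query
    // containing 'call' also picks up 'calle'
    // 'calle, c, call',

    // explicit mapping: only an input token of 'calle' is expanded; bare 'c' or
    // 'call' pass through untouched, so '15 Call Street' is unaffected
    'calle => calle, c, call'
  ]
};

console.log(streetsEsFilter);
```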

@missinglink (Member, Author) commented:

Nice testing notes! Just so we're on the same page, are those perf/quality measurements with or without pelias/api#1444?

@orangejulius (Member) replied:

@missinglink I actually ended up running everything with and without pelias/api#1444. There are only minor differences in the results, but we should examine them more closely too.

@missinglink (Member, Author) replied:

Okay, cool, let's chat about it on a call. I just added pelias/api@9d4c6db; prior to that, the peliasQuery change was only for search.

I also just added e135146 here to remove those single-letter street synonyms.

@missinglink (Member, Author) commented Jun 24, 2020

I discovered an undesirable behaviour...

As mentioned above in #453 (comment), I was pleased to find that synonyms don't 'explode' (i.e. they don't apply virally).

Turns out this is only true within a single token filter, so it's true that street, st and straße, st will not create an association street == straße.

The problem is that where we have saint, st in another token filter, indexing saint will result in saint, st, street, straße, ...

And to add to the confusion, this depends on the order in which the token filters are specified in the pipeline.

Logically this makes sense, but when the synonyms/personal_titles filter emits a st token and passes it on to the synonyms/streets filter, I would have preferred that, since the token is of type SYNONYM, it didn't generate further synonyms; unfortunately that's not the case 🤷

Good news is that I've been looking for an excuse to use the multiplexer token filter and this is the perfect situation.

The multiplexer takes the incoming tokens, runs each of its filters independently of the others, and pipes the resulting tokens back into the main pipeline.

So where the regular token filter pipeline is sequential, the multiplexer runs its branches in parallel.

This is all fixed up in 7b01801. I also cleaned up some of the logic of which filter runs when, to make it more consistent.
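
A simplified sketch of the multiplexer approach; the synonym filter names mirror this PR's style, but the inline definitions, the multiplexer name, and the analyzer are illustrative rather than the actual schema:

```js
const analysis = {
  filter: {
    'synonyms/personal_titles': { type: 'synonym', synonyms: ['saint, st'] },
    'synonyms/streets': { type: 'synonym', synonyms: ['street, st, str'] },

    // each branch receives the original tokens, so a 'st' emitted by the
    // personal_titles branch is never fed back into the streets branch
    street_synonyms_multiplexer: {
      type: 'multiplexer',
      filters: ['synonyms/personal_titles', 'synonyms/streets']
    }
  },
  analyzer: {
    street_example: {
      tokenizer: 'whitespace',
      filter: ['lowercase', 'street_synonyms_multiplexer', 'unique']
    }
  }
};

console.log(JSON.stringify(analysis, null, 2));
```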

@missinglink (Member, Author) commented Jun 24, 2020

That was a bit wordy, here's a picture:

# before

elyzer --index pelias --analyzer peliasStreet 'saint'
...
{0:saint,st,street,str,stre,stree,strt}
# after

elyzer --index pelias --analyzer peliasStreet 'saint'
...
{0:saint,st}

@missinglink (Member, Author) commented:

rebased against master

orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Jun 25, 2020
Most of these should be fixed by pelias/schema#453
@orangejulius (Member) commented Jun 26, 2020

Just ran our tests against the latest build from this branch, and the results are positive! Of course queries like r gay lussac, paris no longer work, because we've disabled those single-character synonyms (for now), but otherwise things are good.

I also confirmed that this PR pairs really well with pelias/api#1444 where we use the peliasQuery analyzer at search time. With that PR, our abbreviation suite shows lots of improvements:
[screenshot: abbreviation test suite showing improvements]

With the master branch of API as a baseline, there are actually a few regressions, since we end up being a bit more loose with synonyms than we'd currently like:
[screenshot: a few regressions against the master branch baseline]

orangejulius added commits to pelias/acceptance-tests that referenced this pull request Jun 30, 2020
@orangejulius (Member) commented:

Okay so, after quite a bit of testing of both result quality and performance, we are happy with this PR.

From our tests, 99th percentile response times for the autocomplete endpoint increase by about 7-8%. This is notable but not significant enough to avoid merging this PR. Otherwise, results are much improved, and this PR provides a foundation on which to add more synonyms as needed. 🎉

@orangejulius merged commit e1ce956 into master Jul 13, 2020
@orangejulius deleted the synonyms_upgrade branch July 13, 2020 21:12
orangejulius added commits to pelias/acceptance-tests that referenced this pull request Jul 13, 2020