major synonyms upgrade #453
Conversation
I've fixed the integration tests by loosening the test assertion method. Previously it would ensure that the tokens listed in the $expected array were equal to the tokens in the Elasticsearch index. I think this is preferable going forward, as some of the tests were becoming brittle and hard to maintain. The tests helped me catch some missing directional synonyms for English, which I've fixed up.
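One plausible form of such a loosened assertion (a sketch using a hypothetical helper and a tape-style test object, not the actual pelias/schema test code) is to check that every expected token appears in the analysis output, rather than requiring an exact match:

```js
// Hypothetical helper (not the real pelias/schema test code): assert that each
// token in `expectedTokens` is present in the analyzer output, without
// requiring the two arrays to be identical.
function assertContainsTokens(t, actualTokens, expectedTokens) {
  expectedTokens.forEach((token) => {
    t.ok(actualTokens.includes(token), `token '${token}' present in analysis output`);
  });
}

// example usage with a tape-style test object `t`:
// assertContainsTokens(t, ['east', 'e'], ['east']); // passes even though 'e' is extra
```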
force-pushed acaabed to 1b9eb74
I say yes! Yes! Yes! Yes! 😄
BTW, I like the synonyms/folder_with_synonyms
@@ -0,0 +1,33 @@
nord, n
nördlich, nördl, nordl, nordlich, noerdlich
Aren't diacritics similar to punctuation in that they'll be stripped earlier in the analysis pipeline than synonym handling, and are therefore redundant?
I suppose it won't make a difference, since we will remove duplicate tokens even later, right?
Yeah I was actually going to leave a comment about that...
You're right, as-is this might produce some tokens which will never be matched, which isn't ideal since it will take up extra disk space, but won't do any harm.
There are a few reasons I chose to leave them in:
- the icu_folding filter implementation is currently inconsistent: in some places it's run before the synonyms and in others after
- I would consider removing the icu_folding filter at a later date, since it can cause issues in some cases (like the recent Hütten vs. Hutten issue)
- generating these lists is super time-consuming and I'd prefer not to have to do it all again any time soon
Yeah, that makes sense. We can afford the extra disk space in the short/medium term. It makes sense to not sweat the details while we test out these synonyms.
Here's an idea for the longer term: what if we made a synonyms library that held many lists of synonyms and could process and expose them as a JavaScript module in many different ways? The processed output could automatically exclude synonyms that differ only by diacritical marks; a rough sketch of such a module follows the list below.
It would also be helpful since there are at least 3 very different places where we probably want many of the same synonyms:
- here, in pelias/schema for Elasticsearch synonyms. These will ensure we match properly even when input/data have different values
- In pelias/api's deduplication layer. This allows us to deduplicate records that have, for example, differing abbreviations but refer to the same place
- In pelias/fuzzy-tester, so that we can generate resilient test cases that don't have to care if their expectations exactly match what we happen to have in our data.
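As a rough illustration of that idea (the file layout and API are assumptions, not an existing pelias package), such a library could load the plain-text synonym lists and optionally strip variants that differ only by diacritics:

```js
// Sketch of a hypothetical shared synonyms library (module layout and API are
// assumptions, not an existing pelias package).
const fs = require('fs');

// fold combining diacritics to ASCII, e.g. 'nördlich' -> 'nordlich'
// (note: unlike icu_folding this does not handle cases like 'ß' -> 'ss')
function fold(token) {
  return token.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// parse a file where each non-empty line is a comma-separated synonym group
function load(filename) {
  return fs.readFileSync(filename, 'utf8')
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length && !line.startsWith('#'))
    .map((line) => line.split(',').map((t) => t.trim()));
}

// drop tokens which differ from an earlier token in the group only by diacritics
function withoutDiacriticDuplicates(group) {
  const seen = new Set();
  return group.filter((token) => {
    const key = fold(token);
    if (seen.has(key)) { return false; }
    seen.add(key);
    return true;
  });
}

module.exports = { load, withoutDiacriticDuplicates };
```

A consumer like pelias/schema could keep the diacritic variants, while something like pelias/fuzzy-tester could request only the folded view.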
I also like the idea of running the brücke, brucke, bruecke synonym before the icu_folding filter, because it produces 3 forms (umlaut, ASCII, English-keyboard) where currently we are only able to handle two of those.
Yeah, also pelias/parser and libpostal use these synonyms in some form, so they're used throughout.
I would prefer not to: the icu_folding filter ensures that only ASCII tokens are inserted into the index, and the unique_only_same_position filter removes any duplicate tokens at the same position, so café, cafe will still only generate one token.
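As a minimal sketch of that interaction (illustrative names and settings, not the actual pelias/schema configuration), the relevant pieces of an Elasticsearch analyzer could be wired like this:

```js
// Minimal sketch (illustrative, not the actual pelias/schema settings.js):
// synonyms expand the token, icu_folding reduces everything to ASCII, and
// unique_only_same_position removes duplicates emitted at the same position.
const settings = {
  analysis: {
    filter: {
      // assumed to mirror how pelias/schema defines this filter
      unique_only_same_position: { type: 'unique', only_on_same_position: 'true' },
      example_place_name_synonyms: { type: 'synonym', synonyms: ['café, cafe'] }
    },
    analyzer: {
      example_name_analyzer: {
        tokenizer: 'standard',
        filter: [
          'lowercase',
          'example_place_name_synonyms', // 'café' -> 'café', 'cafe'
          'icu_folding',                 // -> 'cafe', 'cafe' (requires the analysis-icu plugin)
          'unique_only_same_position'    // -> 'cafe'
        ]
      }
    }
  }
};

module.exports = settings;
```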
There are benefits to having the synonyms come before the icu_folding filter, such as the brücke, brucke, bruecke case above.
My suspicion is that stripping the diacritics earlier in the pipeline will become an impediment to some languages in the future because there will be no way for users to generate tokens based on the native language representation of the word.
I'm not that well versed in languages, but I suspect it's not a big problem in Western Europe; the further east you go, the more valuable it would be to have this level of control.
One interesting thing I've found is that, regardless of whether the icu_folding filter comes before or after the synonyms, no tokens with diacritics are produced as a result:
elyzer --index pelias --analyzer peliasPhrase 'east west café'
CHAR_FILTER: punctuation
{0:east west café}
CHAR_FILTER: nfkc_normalizer
{0:east west café}
TOKENIZER: peliasNameTokenizer
{0:east} {1:west} {2:café}
TOKEN_FILTER: lowercase
{0:east} {1:west} {2:café}
TOKEN_FILTER: trim
{0:east} {1:west} {2:café}
TOKEN_FILTER: remove_duplicate_spaces
{0:east} {1:west} {2:café}
TOKEN_FILTER: synonyms/punctuation
{0:east} {1:west} {2:café}
TOKEN_FILTER: synonyms/custom_name
{0:east} {1:west} {2:café}
TOKEN_FILTER: synonyms/personal_titles
{0:east} {1:west} {2:café}
TOKEN_FILTER: synonyms/place_names
{0:east} {1:west} {2:café,cafe}
TOKEN_FILTER: synonyms/streets
{0:east} {1:west} {2:café,cafe}
TOKEN_FILTER: synonyms/directionals
{0:east,e} {1:west,w,wst} {2:café,cafe}
TOKEN_FILTER: icu_folding
{0:east,e} {1:west,w,wst} {2:cafe,cafe}
TOKEN_FILTER: remove_ordinals
{0:east,e} {1:west,w,wst} {2:cafe,cafe}
TOKEN_FILTER: unique_only_same_position
{0:east,e} {1:west,w,wst} {2:cafe}
TOKEN_FILTER: notnull
{0:east,e} {1:west,w,wst} {2:cafe}
TOKEN_FILTER: flatten_graph
{0:east,e} {1:west,w,wst} {2:cafe}
elyzer --index pelias --analyzer peliasIndexOneEdgeGram 'east west café'
CHAR_FILTER: punctuation
{0:east west café}
CHAR_FILTER: nfkc_normalizer
{0:east west café}
TOKENIZER: peliasNameTokenizer
{0:east} {1:west} {2:café}
TOKEN_FILTER: lowercase
{0:east} {1:west} {2:café}
TOKEN_FILTER: icu_folding
{0:east} {1:west} {2:cafe}
TOKEN_FILTER: trim
{0:east} {1:west} {2:cafe}
TOKEN_FILTER: synonyms/custom_name
{0:east} {1:west} {2:cafe}
TOKEN_FILTER: synonyms/personal_titles
{0:east} {1:west} {2:cafe}
TOKEN_FILTER: synonyms/place_names
{0:east} {1:west} {2:cafe,cafe}
TOKEN_FILTER: synonyms/streets
{0:east} {1:west} {2:cafe,cafe}
TOKEN_FILTER: synonyms/directionals
{0:east,e} {1:west,w,wst} {2:cafe,cafe}
TOKEN_FILTER: synonyms/punctuation
{0:east,e} {1:west,w,wst} {2:cafe,cafe}
TOKEN_FILTER: remove_ordinals
{0:east,e} {1:west,w,wst} {2:cafe,cafe}
TOKEN_FILTER: removeAllZeroNumericPrefix
{0:east,e} {1:west,w,wst} {2:cafe,cafe}
TOKEN_FILTER: peliasOneEdgeGramFilter
{0:e,ea,eas,east,e} {1:w,we,wes,west,w,w,ws,wst} {2:c,ca,caf,cafe,c,ca,caf,cafe}
TOKEN_FILTER: unique_only_same_position
{0:e,ea,eas,east} {1:w,we,wes,west,ws,wst} {2:c,ca,caf,cafe}
TOKEN_FILTER: notnull
{0:e,ea,eas,east} {1:w,we,wes,west,ws,wst} {2:c,ca,caf,cafe}
TOKEN_FILTER: flatten_graph
{0:e,ea,eas,east} {1:w,we,wes,west,ws,wst} {2:c,ca,caf,cafe}
Another interesting thing I found is that the synonyms don't 'explode' like I assumed they did. (I was assuming that because street, str and straße, str are both defined, this meant street, straße would also be implied; fortunately this isn't true!)
This is great: as long as street names are indexed in their expanded form (i.e. the source data contains 'Street', not 'St.') they only get the minimal set of synonyms needed to be correct, whereas when the source provides an abbreviated form which is ambiguous, more synonyms are produced:
elyzer --index pelias --analyzer peliasStreet 'main str'
CHAR_FILTER: punctuation
{0:main str}
CHAR_FILTER: nfkc_normalizer
{0:main str}
TOKENIZER: peliasStreetTokenizer
{0:main} {1:str}
TOKEN_FILTER: lowercase
{0:main} {1:str}
TOKEN_FILTER: trim
{0:main} {1:str}
TOKEN_FILTER: remove_duplicate_spaces
{0:main} {1:str}
TOKEN_FILTER: synonyms/custom_street
{0:main} {1:str}
TOKEN_FILTER: synonyms/personal_titles
{0:main} {1:str}
TOKEN_FILTER: synonyms/streets
{0:main} {1:str,straße,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: synonyms/directionals
{0:main} {1:str,straße,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: icu_folding
{0:main} {1:str,strasse,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: remove_ordinals
{0:main} {1:str,strasse,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: trim
{0:main} {1:str,strasse,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: unique_only_same_position
{0:main} {1:str,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: notnull
{0:main} {1:str,strasse,street,st,stre,stree,strt}
TOKEN_FILTER: flatten_graph
{0:main} {1:str,strasse,street,st,stre,stree,strt}
elyzer --index pelias --analyzer peliasStreet 'main street'
CHAR_FILTER: punctuation
{0:main street}
CHAR_FILTER: nfkc_normalizer
{0:main street}
TOKENIZER: peliasStreetTokenizer
{0:main} {1:street}
TOKEN_FILTER: lowercase
{0:main} {1:street}
TOKEN_FILTER: trim
{0:main} {1:street}
TOKEN_FILTER: remove_duplicate_spaces
{0:main} {1:street}
TOKEN_FILTER: synonyms/custom_street
{0:main} {1:street}
TOKEN_FILTER: synonyms/personal_titles
{0:main} {1:street}
TOKEN_FILTER: synonyms/streets
{0:main} {1:street,st,str,stre,stree,strt}
TOKEN_FILTER: synonyms/directionals
{0:main} {1:street,st,str,stre,stree,strt}
TOKEN_FILTER: icu_folding
{0:main} {1:street,st,str,stre,stree,strt}
TOKEN_FILTER: remove_ordinals
{0:main} {1:street,st,str,stre,stree,strt}
TOKEN_FILTER: trim
{0:main} {1:street,st,str,stre,stree,strt}
TOKEN_FILTER: unique_only_same_position
{0:main} {1:street,st,str,stre,stree,strt}
TOKEN_FILTER: notnull
{0:main} {1:street,st,str,stre,stree,strt}
TOKEN_FILTER: flatten_graph
{0:main} {1:street,st,str,stre,stree,strt}
That sounds great! 🎉
Okay, I've run a bunch of acceptance tests against this change and the results are about as we might have expected: on the positive side, there are many, many queries that didn't work at all due to abbreviation issues and are now passing. However, there are also a few tests showing regressions. Overall, there were more regressions than improvements, so the total test pass rate actually went down (for the autocomplete tests at least).

Regressions

A good example of the type of regressions we can see are queries like 15 Call Street, The …

Performance

There's also almost certainly a performance impact to this change: I wasn't able to run the regular autocomplete acceptance tests, even very slowly, on our standard sized dev Elasticsearch instance. I was seeing timeouts and had to jump to a larger instance size. We'll need to do some more testing to quantify exactly what the difference is, though.

Going forward

I think at the very least we'll want to back out some of the new synonyms in this PR, particularly any single character synonyms. While we definitely want a system to support these sorts of abbreviations, I think a way we can do this is to move towards a system where we generate multiple Elasticsearch documents from a single record in our data, and generate variations of the input at query time. This sort of system is already under discussion for, and would solve, our current top issue with alt names and scoring penalties. While more complex, it would allow us to handle abbreviations with more precision and flexibility.

Edit: another thing that might let us buy a little more time with the current system is to use synonym expansion (at index time only, so we'll need pelias/api#1444) so that only …
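For reference, "synonym expansion at index time only" can be expressed in an Elasticsearch mapping by giving the field a separate search_analyzer; the analyzer names below are illustrative assumptions, not the actual pelias/schema configuration:

```js
// Sketch only: the synonym-expanding analyzer runs when documents are indexed,
// while a separate analyzer without synonym expansion runs on query input.
// Analyzer names are illustrative assumptions.
const nameFieldMapping = {
  type: 'text',
  analyzer: 'example_index_analyzer_with_synonyms', // used at index time
  search_analyzer: 'example_query_analyzer_plain'   // used at query time
};

module.exports = nameFieldMapping;
```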
Nice testing notes! Just so we're on the same page, are those perf/quality measurements with or without pelias/api#1444?
@missinglink I actually ended up running everything with and without pelias/api#1444. There are only minor differences in the results, but we should examine them more closely too.
Okay cool, let's chat about it on a call. I just added pelias/api@9d4c6db; prior to that, the … I also just added e135146 here to remove those single-letter street synonyms.
I discovered an undesirable behaviour... As mentioned above in #453 (comment), I was pleased to find that synonyms don't 'explode' (i.e. they don't apply virally). It turns out that this is only true within a single token filter, so it's true that …

The problem is that where we have …

And to add to the confusion, this is dependent on the order in which the token filters are specified in the pipeline. This makes sense logically, because if the …

The good news is that I've been looking for an excuse to use the multiplexer token filter, and this is the perfect situation. How the … So where the regular token filter pipeline is sequential, the multiplexer is synchronic.

This is all fixed up in 7b01801. I also cleaned up some of the logic of which filter runs when, to make it more consistent.
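For context, an Elasticsearch multiplexer token filter looks roughly like this (filter names are illustrative, not the exact configuration from 7b01801):

```js
// Each entry in `filters` is a comma-separated chain of existing token filters.
// The chains run in parallel on a copy of each incoming token and all of their
// outputs are emitted at the same position (the original token is kept by default).
const exampleMultiplexer = {
  type: 'multiplexer',
  filters: [
    'example_street_synonyms',              // e.g. 'str' -> 'str', 'straße', 'street', ...
    'example_street_synonyms, icu_folding'  // the same synonyms, then folded to ASCII
  ]
};

module.exports = exampleMultiplexer;
```

This matches the "synchronic" behaviour described above: both chains see the same original token, rather than one filter's output feeding the next.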
That was a bit wordy, here's a picture:
force-pushed 38b8ac8 to 7b01801
rebased against master
Most of these should be fixed by pelias/schema#453
Just ran our tests against the latest build from this branch, and the results are positive! Of course queries like r gay lussac, paris no longer work, because we've disabled those single character synonyms (for now), but otherwise things are good.

I also confirmed that this PR pairs really well with pelias/api#1444, where we use the …

With the master branch of API as a baseline, there are actually a few regressions, since we end up being a bit more loose with synonyms than we'd currently like:
Fixed by pelias/schema#453 Connects pelias/api#783
Fixed by pelias/schema#453 Connects pelias/schema#301
These are all fixed by pelias/schema#453
…ynonyms which seem to only be present for the purpose of spelling correction
…ist for en,fr,es,de
…oken filter ordering
force-pushed d843af2 to 9ad64f7
Okay so, after quite a bit of testing of both result quality and performance, we are happy with this PR. From our tests, 99th percentile response times for the autocomplete endpoint increase by about 7-8%. This is notable, but not significant enough to avoid merging this PR. Otherwise, results are much improved, and it also includes a foundation on which to add more synonyms as needed. 🎉
This is a massive overhaul of the synonyms list, adding hundreds of new ones for English, French, Spanish and German.
It can probably replace these PRs:
After the success of #446 I'm going to do a full planet build of this and see how it performs.
cc/ @Joxit I added loads of French synonyms (no multi-word synonyms for now).
I wouldn't suggest merging this until after pelias/api#1444 because applying all these street synonyms at query-time when they aren't required is surely not going to be great from a performance perspective.