WIP: fuzzy autocomplete #1268
@missinglink it might be useful to expose
Hi @mihneadb, I think it's better to leave it as configurable on startup. Having it variable per-request could cause quality-of-service issues for cloud providers like ourselves as there would be huge variability in the CPU usage of each request. I think it's important that operations engineers can control this setting. You can adjust the settings for your docker installation without generating a new docker image by copying the If you're not familiar with docker-compose you can look in the You'll then need to bring the API service down and up again to apply that configuration.
Makes sense, thanks!
Hi @missinglink, I'm testing this PR, and I found something for the parameters We may need some benchmarks to weigh the pros and cons 🤔
Thanks @Joxit, interesting feedback, that's certainly unexpected behaviour.
f0bbd61 to 2ff49d4
So I tested this out a bit yesterday, and it's quite good! I do think @Joxit your example makes sense. With /v1/autocomplete?text=40 Rue De l arsenal, Bordeaux, France So I think we should try increasing the
Connects pelias/schema#301 Connects pelias/api#1268 Connects pelias/api#1279
2ff49d4 to 6e528a1
Some new query types were added since the fuzziness PR was created; this updates them with proper test expectations.
I just rebased this branch against the latest in
Hi there, here are the changes I've made:
- 'fuzzy:prefix_length': 1,
- 'fuzzy:max_expansions': 10,
+ 'fuzzy:prefix_length': 0,
+ 'fuzzy:max_expansions': 25,
Source: Joxit/pelias-api jawg/v3.63.0-fuzzy. I will see if we have some performance issues or incorrect responses 😄.
I just found something that can be weird when we use fuzzy: perfect matches may not be in the first positions. (This is completely random)
{
  "function_score": {
    "query": {
      "match": {
        "name.default": {
          "analyzer": "peliasQueryPartialToken",
          "query": "vannes",
          "operator": "and"
        }
      }
    },
    "max_boost": 20,
    "functions": [
      {
        "weight": 1
      }
    ],
    "score_mode": "first",
    "boost_mode": "replace"
  }
}
Hi @missinglink, is there a way I could help with this PR?
Yes, please build a docker project from pelias/docker and compare the differences in result quality between Let us know what's better/worse
This branch also needs to rebase
The issue is that I already migrated to the version with Elastic 7, so testing it out won't be a quick one either..
I'm not sure I follow, why would that be an issue? Are you saying the queries are not compatible with ES7?
I get this error when running a query:
Hi, any progress with this PR? I believe it is a very interesting functionality.
Any news on this? Would love to see this implemented. Any issues preventing it from being merged?
The technical answer is that fuzziness increases recall at the cost of precision, and this alone does not increase relevance. The main issue is the way Lucene implements edit-distance queries (what they call fuzziness): these alternative terms are given the same score as exact matches. This is obviously not ideal, since you can spell a word perfectly and still have misspellings ranked higher due to other properties influencing the score. This is further complicated by the complexity of our queries: often a single query contains multiple sub-queries, each of which targets a different field of the document. So if you set an edit distance of 2 (for instance), it's not clear how many total edits will be made to the input text. For instance, if the parser detected 4 different classifications (e.g. number, street, locality, country) in the input, then the text may be mutated up to 8 times! I believe that the result of these two issues will only serve to slow down existing queries while also reducing precision, so it's lose-lose. I'd be happy to find a solution which provides the increased recall without the ranking issues.
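The "up to 8 times" figure above is simple arithmetic, sketched here under the assumption that each parsed classification is matched by its own fuzzy sub-query and contributes a single token. The helper name `worstCaseTotalEdits` is hypothetical and does not exist in the codebase.

```javascript
// Hypothetical helper: upper bound on the number of character edits across
// the whole input, assuming one fuzzy sub-query (one token) per classification.
function worstCaseTotalEdits(classifications, fuzziness) {
  return classifications.length * fuzziness;
}

// The example from the comment above: 4 classifications, edit distance 2.
const parsed = ['number', 'street', 'locality', 'country'];
console.log(worstCaseTotalEdits(parsed, 2)); // 8
```

With the more conservative edit distance of 1 the same input could still be mutated up to 4 times, which is why the per-field setting says little about the total drift from what the user typed.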
Hello..
This PR is WIP and I wouldn't suggest merging it as-is; it's open for testing, discussion and feedback.
One confusing and under-documented feature of the elasticsearch match query is that it supports type: phrase and also fuzziness: * but not both at the same time 🤷‍♀️
There are unfortunately no warnings about this, which is what's contributing to the confusion; it might be the same for cutoff_frequency and phrase.
So this PR simply rewrites the top MUST query for most queries generated for autocomplete to remove phrase and enable fuzziness.
I've chosen conservative values for fuzziness in order to avoid increasing the CPU usage significantly. A fuzziness of 1 means that the Levenshtein edit distance is 1, so a single character can be added, removed or replaced.
The prefix_length is set to 1, which means that the first character is not considered for edits. The default is 0, which would obviously generate a lot more permutations and require a lot more CPU; the tradeoff here is that if someone mistypes the first letter then they're doomed from the get-go.
Lastly, the max_expansions is set to 10. I'm not 100% sure what the correct setting for this should be, so I again chose something conservative; my understanding is that only a maximum of 10 tokens in the index will be used to generate an 'OR' type condition.
From what I've read online, the vast majority of typos are Levenshtein 1 and rarely on the first character, so this should catch a wide range of typing errors without increasing the CPU requirements too much.
Things I would like to change before merging:
- The other phrase queries now have 'fuzziness': 1 which, as I explained above, will do nothing and only serves to confuse.
- Consider the implications of not using phrase and whether it and slop are even required here.
- Move the autocomplete_defaults values out of the phrase:* namespace, maybe in to a fuzzy:* namespace?
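As a rough sketch of how settings in a fuzzy:* namespace could feed a match clause: the key names 'fuzzy:prefix_length' and 'fuzzy:max_expansions' appear in the diff posted in this thread, while 'fuzzy:fuzziness' and the helper `buildFuzzyMatch` are assumptions made for illustration. The field and analyzer are borrowed from the query shown earlier in the conversation; the actual queries generated by this PR may differ.

```javascript
// Assumed defaults, mirroring the conservative values described in the PR.
// 'fuzzy:fuzziness' is a hypothetical key name.
const defaults = {
  'fuzzy:fuzziness': 1,
  'fuzzy:prefix_length': 1,
  'fuzzy:max_expansions': 10
};

// Hypothetical helper: build a non-phrase match clause with fuzziness enabled.
function buildFuzzyMatch(field, text, vars = defaults) {
  return {
    match: {
      [field]: {
        query: text,
        analyzer: 'peliasQueryPartialToken',
        operator: 'and',
        fuzziness: vars['fuzzy:fuzziness'],
        prefix_length: vars['fuzzy:prefix_length'],
        max_expansions: vars['fuzzy:max_expansions']
      }
    }
  };
}

console.log(JSON.stringify(buildFuzzyMatch('name.default', 'vannes'), null, 2));
```

Keeping the three values in one namespace means an operator can loosen them at startup, for example a prefix length of 0 and 25 expansions, without touching the query-building code.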