duplicate terms in name affecting scoring #507
Comments
Oh yeah, this is a good one. It's kinda inherent to the scoring mechanisms of Elasticsearch. Maybe we can solve this by modifying the scoring settings of Elasticsearch, which is easier and more flexible in newer versions. We might be able to split altnames into separate fields in a clever way as well. But I think this sort of case will be pretty hard, though not impossible, to solve.
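As a rough illustration of the "modifying the scoring settings" idea: newer Elasticsearch versions let you register a custom BM25 similarity in the index settings and point individual fields at it. This is only a sketch, not Pelias's actual schema; the similarity name and field name here are made up.

```javascript
// Hypothetical sketch: a custom BM25 similarity with field-length
// normalization disabled (b: 0), assigned to one text field.
// 'peliasNameSimilarity' and 'name_default' are illustrative names only.
const indexSettings = {
  settings: {
    similarity: {
      peliasNameSimilarity: {
        type: 'BM25',
        k1: 1.2, // term-frequency saturation
        b: 0.0   // 0 disables field-length normalization
      }
    }
  },
  mappings: {
    properties: {
      name_default: {
        type: 'text',
        similarity: 'peliasNameSimilarity'
      }
    }
  }
};

console.log(indexSettings.settings.similarity.peliasNameSimilarity);
```

With `b: 0`, a field bloated by duplicate alt names would no longer be penalized for its length, though duplicated tokens would still inflate term frequency.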
Last night I went into a deep-dive on tuning the BM25 algorithm; I suspect that we could reduce … It's a big topic in itself so I'll start a discussion over on … [edit] I think maybe we actually want to set …
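To make the BM25 tuning discussion concrete, here's a small sketch of the per-term BM25 score (idf factor omitted for simplicity, `k1`/`b` at Elasticsearch's defaults). It shows why a near-duplicate name can still over-rank: doubling a token's term frequency also doubles the field length, but the term-frequency gain outweighs the length penalty.

```javascript
// Simplified BM25 term score (without the idf factor).
// k1 = 1.2 and b = 0.75 are the Elasticsearch defaults.
function bm25TermScore(tf, fieldLen, avgFieldLen, k1 = 1.2, b = 0.75) {
  const lengthNorm = 1 - b + b * (fieldLen / avgFieldLen);
  return (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}

// "whole foods market" indexed once: tf("whole") = 1, field length 3
const single = bm25TermScore(1, 3, 3);

// name plus a near-duplicate alt name: tf doubles, but so does field length
const doubled = bm25TermScore(2, 6, 3);

console.log(single, doubled); // the doubled variant scores higher
```

Lowering `b` reduces the length penalty but makes the duplicate-token boost even stronger, while raising `b` penalizes documents with many legitimate alt names, which is why this is a genuine trade-off.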
I was wondering why no tests alerted us to the focus point balance issues in pelias/pelias#849 and pelias/openstreetmap#507. It turned out that the `priorityThresh` value in the relevant tests was too generous. This is now changed and we should be able to tell if/when progress is made.
#118 added support for removing duplicate values from the name field. This logic was not also applied to the `phrase` field. Duplicate values do not affect whether or not a particular document will match for a given query, but they _do_ affect the scoring. In some cases, the scoring boost for having tokens match twice from duplicates will over-rank a particular result. In other cases, the scoring penalty for having longer fields will under-rank a particular result. To make sure our scoring is as fair as possible (pending other issues such as pelias/openstreetmap#507), we should apply our current deduplication on both the `name` and `phrase` fields.
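A minimal sketch of what "apply the deduplication to both fields" could look like; the document shape and helper name are illustrative, not the actual pelias/model code.

```javascript
// Hypothetical sketch: remove duplicate values from both the `name`
// and `phrase` fields before indexing. Document shape is illustrative.
function dedupeValues(values) {
  return [...new Set(values)];
}

const doc = {
  name:   { default: ['Whole Foods Market', 'Whole Foods Market'] },
  phrase: { default: ['Whole Foods Market', 'Whole Foods Market'] }
};

for (const field of ['name', 'phrase']) {
  doc[field].default = dedupeValues(doc[field].default);
}

console.log(doc.name.default);   // one value remains
console.log(doc.phrase.default); // one value remains
```

Exact-duplicate removal like this shortens the `phrase` field, but near-duplicates (prefixes, whitespace variants) would still slip through, which is what the mitigation ideas below address.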
Since we've decided this issue is mostly unsolvable without major changes to how we index data in Elasticsearch, let's talk mitigation: In this particular case, one name is a prefix of the other. Is it worth it to detect that and skip the shorter version? In the other main case I've noticed with OSM data, the names only differ by whitespace. That one might be trickier, but we could probably still do it.
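Both mitigations mentioned above could be handled in one pass. This is only a sketch of the idea (function and sample names are made up): skip a name when it is a prefix of another, and keep only one copy of names that differ only by whitespace.

```javascript
// Hypothetical mitigation sketch: drop near-duplicate names that are
// either a prefix of another name or differ from one only by whitespace.
function mitigateNearDuplicates(names) {
  const squash = (s) => s.replace(/\s+/g, ' ').trim().toLowerCase();
  return names.filter((name, i) =>
    !names.some((other, j) => {
      if (i === j) return false;
      const a = squash(name);
      const b = squash(other);
      // whitespace-only difference: keep only the first occurrence
      if (a === b) return j < i;
      // this name is a prefix of a longer one: skip the shorter version
      return b.startsWith(a);
    })
  );
}

console.log(mitigateNearDuplicates(['Whole Foods', 'Whole Foods Market']));
// ['Whole Foods Market']
console.log(mitigateNearDuplicates(['Whole  Foods', 'Whole Foods']));
// ['Whole  Foods']
```

The trade-off is recall: dropping the shorter prefix means a query matching only the short form relies on the analyzer tokenizing the longer name the same way.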
This should help fix some scoring issues identified in #511. While not a complete fix, it should mitigate the effects of pelias/openstreetmap#507 somewhat.
This should help fix _some_ issues associated with #507, as it will reduce the field length of the `phrase` fields where duplicate values were previously making it longer than it should have been.
Just like with venues, adding many alt names can create scoring penalties (pelias/pelias#862) or boosts (pelias/openstreetmap#507) that are undesirable. Unfortunately we don't currently have a great way to handle all intersection searches without _some_ alt-names, but this change tests removing some of them to see if we can stabilize scoring a bit.
It seems as though having duplicate tokens in a name is causing Elasticsearch to score the result higher (this is due to how the TF/IDF scoring works).
While it's impossible to have exact duplicate names (this is taken care of by pelias/model in a 'post' step), it is possible to have two terms which are very similar, such as this: https://www.openstreetmap.org/way/432890745
I'm opening this issue so I don't forget; we can either try to solve this during import or during search.
A query such as `/v1/search?text=whole foods market, NY` illustrates the issue (although there may be other things at play here).
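To show the mechanism behind the higher score, here is a toy illustration (the actual names on the linked OSM way may differ): two near-duplicate names, tokenized together into one field, inflate the term frequencies that TF/IDF rewards.

```javascript
// Illustrative only: near-duplicate names tokenized into one field.
const names = ['Whole Foods Market', 'Whole Foods'];
const tokens = names.flatMap((n) => n.toLowerCase().split(/\s+/));

// count term frequencies, as a TF/IDF scorer would see them
const tf = {};
for (const t of tokens) tf[t] = (tf[t] || 0) + 1;

console.log(tf); // { whole: 2, foods: 2, market: 1 }
```

A document with `tf = 2` for "whole" and "foods" gets a term-frequency boost over one that lists the name only once, even though both match the same query.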