duplicate terms in name affecting scoring #507

Open
missinglink opened this issue Nov 15, 2019 · 3 comments

Comments

@missinglink
Member

missinglink commented Nov 15, 2019

Having duplicate tokens in a name appears to cause Elasticsearch to score the result higher. This is due to how TF/IDF scoring works: a term that occurs more often in a field contributes more to the score.
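To make the effect concrete, here is a minimal sketch of the term-frequency and length-normalisation parts of classic Lucene-style TF/IDF (`tf = sqrt(freq)`, `norm = 1 / sqrt(field length)`). The token lists are hypothetical, the IDF factor is left out because it is the same for both documents, and real Elasticsearch scoring involves more than this.

```python
import math

def classic_tf_norm(tokens, term):
    """Simplified classic TF/IDF contribution of `term` for one field.

    Only term frequency and field-length normalisation are modelled;
    IDF and query-time boosts are deliberately ignored.
    """
    freq = tokens.count(term)
    tf = math.sqrt(freq)
    length_norm = 1 / math.sqrt(len(tokens))
    return tf * length_norm

# Hypothetical name fields: one clean, one with a near-duplicate alt name.
single = "whole foods market".split()
duplicated = "whole foods market whole foods".split()

score_single = classic_tf_norm(single, "whole")
score_duplicated = classic_tf_norm(duplicated, "whole")
```

In this sketch the document with the repeated token scores higher for `whole`, even though the field is longer.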

While it's impossible to have exact duplicate names (this is taken care of by pelias/model in a 'post' step), it is possible to have two terms which are very similar such as this:

Screenshot 2019-11-15 at 16 18 39

https://www.openstreetmap.org/way/432890745

I'm opening this issue so I don't forget. We can try to solve this either during import or during search.

A query such as `/v1/search?text=whole foods market, NY` illustrates the issue (although there may be other things at play here).

Screenshot 2019-11-15 at 16 22 28

Screenshot 2019-11-15 at 16 22 16

@orangejulius
Member

Oh yeah, this is a good one. It's kinda inherent to the scoring mechanisms of Elasticsearch.

Maybe we can solve this by modifying the scoring settings of Elasticsearch, which is easier and more flexible in newer versions. We might be able to split altnames into separate fields in a clever way as well. But I think this sort of case will be pretty hard, though not impossible, to solve.

@orangejulius changed the title from "duplicate terms in name effecting scoring" to "duplicate terms in name affecting scoring" Nov 15, 2019
@missinglink
Member Author

missinglink commented Dec 11, 2019

Last night I did a deep dive on tuning the BM25 algorithm. I suspect that we could reduce k1 to 0, or near 0, to resolve this issue.

It's a big topic in itself, so I'll start a discussion over on pelias/schema.

[edit] I think maybe we actually want to set k1=1?
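As a rough illustration of why k1 matters, the sketch below uses the textbook BM25 term-frequency component (Lucene's implementation differs in constant factors, and IDF is omitted). The defaults `k1=1.2` and `b=0.75` are the usual BM25 defaults; the field lengths are made up for the example.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency saturation, without the IDF factor."""
    if tf == 0:
        return 0.0
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Hypothetical fields: a 3-token name matching once,
# and a 5-token name (with a near-duplicate) matching twice.
default_once = bm25_tf_component(tf=1, doc_len=3, avg_doc_len=4)
default_twice = bm25_tf_component(tf=2, doc_len=5, avg_doc_len=4)

# With k1 = 0 both term frequency and field length drop out of the formula.
flat_once = bm25_tf_component(tf=1, doc_len=3, avg_doc_len=4, k1=0)
flat_twice = bm25_tf_component(tf=2, doc_len=5, avg_doc_len=4, k1=0)
```

With the default k1 the duplicated name still wins. With k1 at 0 every matching document gets the same contribution for the term, which removes the duplicate boost but also discards any legitimate signal from term frequency and field length.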

orangejulius added a commit to pelias/acceptance-tests that referenced this issue May 8, 2020
I was wondering why no tests alerted us of the focus point balance
issues in pelias/pelias#849 and
pelias/openstreetmap#507.

It turned out that the `priorityThresh` value in the relevant tests was
too generous.

This is now changed and we should be able to tell if/when progress is
made.
orangejulius added a commit to pelias/model that referenced this issue Jun 9, 2020
#118 added support for removing
duplicate values from the `name` field. However, this logic was not applied to the `phrase` field.

Duplicate values do not affect whether or not a particular document will
match for a given query, but they _do_ affect the scoring.

In some cases, the scoring boost for having tokens match twice from
duplicates will over-rank a particular result.

In other cases, the scoring penalty for having longer fields will
under-rank a particular result.

To make sure our scoring is as fair as possible (pending other issues
such as pelias/openstreetmap#507), we should
apply our current deduplication on both the `name` and `phrase` fields.
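The idea can be sketched as follows. This is not the pelias/model implementation (which is JavaScript); the document shape, `dedupe_values` and `dedupe_fields` are hypothetical names used only to show the same order-preserving deduplication being applied to both fields.

```python
def dedupe_values(values):
    """Remove exact duplicates while keeping the first occurrence of each value."""
    seen = set()
    unique = []
    for value in values:
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique

def dedupe_fields(doc, fields=("name", "phrase")):
    """Return a copy of `doc` with duplicates removed from each listed field.

    Assumes each field maps a language code to a string or a list of strings.
    """
    cleaned = dict(doc)
    for field in fields:
        per_lang = doc.get(field, {})
        cleaned[field] = {
            lang: dedupe_values(values) if isinstance(values, list) else values
            for lang, values in per_lang.items()
        }
    return cleaned

# Hypothetical document with the same value indexed twice in both fields.
doc = {
    "name": {"default": ["Whole Foods Market", "Whole Foods Market", "Whole Foods"]},
    "phrase": {"default": ["Whole Foods Market", "Whole Foods Market", "Whole Foods"]},
}
cleaned = dedupe_fields(doc)
```

Note that this only removes exact duplicates; near-duplicates such as a shorter prefix of another name survive and still affect scoring.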
@orangejulius
Member

Since we've decided this issue is mostly unsolvable without major changes to how we index data in Elasticsearch, let's talk mitigation:

In this particular case, one name is a prefix of the other. Is it worth it to detect that and skip the shorter version?

In the other main case I've noticed with OSM data, the names only differ by whitespace. That one might be trickier, but we could probably still do it.
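A minimal sketch of both mitigations, under some assumptions: names are compared case-sensitively, "differ by whitespace" is taken to mean the names are identical once all whitespace is removed, the first variant seen is the one kept, and prefixes are checked on whole tokens. `drop_redundant_names` is a hypothetical helper, not an existing Pelias function.

```python
def drop_redundant_names(names):
    """Drop whitespace-only variants and names that are a token prefix of a longer name."""
    # Step 1: collapse names that are identical once whitespace is removed.
    unique = []
    seen = set()
    for name in names:
        key = "".join(name.split())
        if key and key not in seen:
            seen.add(key)
            unique.append(name)

    # Step 2: skip any name whose tokens are a strict prefix of another name's tokens.
    token_lists = [name.split() for name in unique]
    kept = []
    for i, tokens in enumerate(token_lists):
        shadowed = any(
            len(other) > len(tokens) and other[:len(tokens)] == tokens
            for j, other in enumerate(token_lists)
            if j != i
        )
        if not shadowed:
            kept.append(unique[i])
    return kept

prefix_case = drop_redundant_names(["Whole Foods Market", "Whole Foods"])
whitespace_case = drop_redundant_names(["Whole Foods", "WholeFoods"])
unrelated_case = drop_redundant_names(["Main St", "Main Street"])
```

Skipping the shorter name trades recall for more stable scoring: a query for only the shorter form still matches the longer name's tokens, but the shorter form is no longer indexed as its own value.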

orangejulius added a commit to pelias/whosonfirst that referenced this issue Jun 10, 2020
This should help fix some scoring issues identified in
#511

While not a complete fix, it should mitigate the effects of
pelias/openstreetmap#507 somewhat.
orangejulius added a commit that referenced this issue Jun 10, 2020
This should help fix _some_ issues associated with
#507, as it will reduce
the field length of the `phrase` fields where duplicate values were
previously making it longer than it should have been.
orangejulius added a commit to pelias/model that referenced this issue Jun 24, 2020
Just like with venues, adding many alt names can create scoring
penalties (pelias/pelias#862) or
boosts (pelias/openstreetmap#507) that are
undesirable.

Unfortunately we don't currently have a great way to handle all
intersection searches without _some_ alt-names, but this change tests
removing some of them to see if we can stabilize scoring a bit.