duplicate terms in name affecting scoring #507
Comments
Oh yeah, this is a good one. It's kinda inherent to the scoring mechanisms of Elasticsearch. Maybe we can solve this by modifying the scoring settings of Elasticsearch, which is easier and more flexible in newer versions. We might be able to split altnames into separate fields in a clever way as well. But I think this sort of case will be pretty hard, though not impossible, to solve.
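As a rough illustration of the "modifying the scoring settings" idea: newer Elasticsearch versions let you register a custom BM25 similarity in the index settings and point individual fields at it. This is only a sketch, not Pelias's actual schema; the similarity name and field name here are made up.

```javascript
// Hypothetical sketch: a custom BM25 similarity with field-length
// normalization disabled (b: 0), assigned to one text field.
// 'peliasNameSimilarity' and 'name_default' are illustrative names only.
const indexSettings = {
  settings: {
    similarity: {
      peliasNameSimilarity: {
        type: 'BM25',
        k1: 1.2, // term-frequency saturation
        b: 0.0   // 0 disables field-length normalization
      }
    }
  },
  mappings: {
    properties: {
      name_default: {
        type: 'text',
        similarity: 'peliasNameSimilarity'
      }
    }
  }
};

console.log(indexSettings.settings.similarity.peliasNameSimilarity);
```

With `b: 0`, a field bloated by duplicate alt names would no longer be penalized for its length, though duplicated tokens would still inflate term frequency.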
Last night I went into a deep-dive on tuning the BM25 algorithm; I suspect that we could reduce … It's a big topic in itself so I'll start a discussion over on … [edit] I think maybe we actually want to set …
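To make the BM25 tuning discussion concrete, here's a small sketch of the per-term BM25 score (idf factor omitted for simplicity, `k1`/`b` at Elasticsearch's defaults). It shows why a near-duplicate name can still over-rank: doubling a token's term frequency also doubles the field length, but the term-frequency gain outweighs the length penalty.

```javascript
// Simplified BM25 term score (without the idf factor).
// k1 = 1.2 and b = 0.75 are the Elasticsearch defaults.
function bm25TermScore(tf, fieldLen, avgFieldLen, k1 = 1.2, b = 0.75) {
  const lengthNorm = 1 - b + b * (fieldLen / avgFieldLen);
  return (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}

// "whole foods market" indexed once: tf("whole") = 1, field length 3
const single = bm25TermScore(1, 3, 3);

// name plus a near-duplicate alt name: tf doubles, but so does field length
const doubled = bm25TermScore(2, 6, 3);

console.log(single, doubled); // the doubled variant scores higher
```

Lowering `b` reduces the length penalty but makes the duplicate-token boost even stronger, while raising `b` penalizes documents with many legitimate alt names, which is why this is a genuine trade-off.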
I was wondering why no tests alerted us to the focus point balance issues in pelias/pelias#849 and pelias/openstreetmap#507. It turned out that the `priorityThresh` value in the relevant tests was too generous. This is now changed and we should be able to tell if/when progress is made.
#118 added support for removing duplicate values from the name field. This logic was not also applied to the `phrase` field. Duplicate values do not affect whether or not a particular document will match for a given query, but they _do_ affect the scoring. In some cases, the scoring boost for having tokens match twice from duplicates will over-rank a particular result. In other cases, the scoring penalty for having longer fields will under-rank a particular result. To make sure our scoring is as fair as possible (pending other issues such as pelias/openstreetmap#507), we should apply our current deduplication on both the `name` and `phrase` fields.
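A minimal sketch of what "apply the deduplication to both fields" could look like; the document shape and helper name are illustrative, not the actual pelias/model code.

```javascript
// Hypothetical sketch: remove duplicate values from both the `name`
// and `phrase` fields before indexing. Document shape is illustrative.
function dedupeValues(values) {
  return [...new Set(values)];
}

const doc = {
  name:   { default: ['Whole Foods Market', 'Whole Foods Market'] },
  phrase: { default: ['Whole Foods Market', 'Whole Foods Market'] }
};

for (const field of ['name', 'phrase']) {
  doc[field].default = dedupeValues(doc[field].default);
}

console.log(doc.name.default);   // one value remains
console.log(doc.phrase.default); // one value remains
```

Exact-duplicate removal like this shortens the `phrase` field, but near-duplicates (prefixes, whitespace variants) would still slip through, which is what the mitigation ideas below address.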
Since we've decided this issue is mostly unsolvable without major changes to how we index data in Elasticsearch, let's talk mitigation: In this particular case, one name is a prefix of the other. Is it worth it to detect that and skip the shorter version? In the other main case I've noticed with OSM data, the names only differ by whitespace. That one might be trickier, but we could probably still do it.
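Both mitigations mentioned above could be handled in one pass. This is only a sketch of the idea (function and sample names are made up): skip a name when it is a prefix of another, and keep only one copy of names that differ only by whitespace.

```javascript
// Hypothetical mitigation sketch: drop near-duplicate names that are
// either a prefix of another name or differ from one only by whitespace.
function mitigateNearDuplicates(names) {
  const squash = (s) => s.replace(/\s+/g, ' ').trim().toLowerCase();
  return names.filter((name, i) =>
    !names.some((other, j) => {
      if (i === j) return false;
      const a = squash(name);
      const b = squash(other);
      // whitespace-only difference: keep only the first occurrence
      if (a === b) return j < i;
      // this name is a prefix of a longer one: skip the shorter version
      return b.startsWith(a);
    })
  );
}

console.log(mitigateNearDuplicates(['Whole Foods', 'Whole Foods Market']));
// ['Whole Foods Market']
console.log(mitigateNearDuplicates(['Whole  Foods', 'Whole Foods']));
// ['Whole  Foods']
```

The trade-off is recall: dropping the shorter prefix means a query matching only the short form relies on the analyzer tokenizing the longer name the same way.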
This should help fix some scoring issues identified in #511. While not a complete fix, it should mitigate the effects of pelias/openstreetmap#507 somewhat.
This should help fix _some_ issues associated with #507, as it will reduce the field length of the `phrase` fields where duplicate values were previously making it longer than it should have been.
Just like with venues, adding many alt names can create scoring penalties (pelias/pelias#862) or boosts (pelias/openstreetmap#507) that are undesirable. Unfortunately we don't currently have a great way to handle all intersection searches without _some_ alt-names, but this change tests removing some of them to see if we can stabilize scoring a bit.
It seems as though having duplicate tokens in a name is causing Elasticsearch to score the result higher (this is due to how the TF/IDF scoring works).
While it's impossible to have exact duplicate names (this is taken care of by pelias/model in a 'post' step), it is possible to have two terms which are very similar, such as this: https://www.openstreetmap.org/way/432890745
I'm opening this issue so I don't forget; we can either try to solve this during import or during search.
A query such as `/v1/search?text=whole foods market, NY` illustrates the issue (although there may be other things at play here).
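To show the mechanism behind the higher score, here is a toy illustration (the actual names on the linked OSM way may differ): two near-duplicate names, tokenized together into one field, inflate the term frequencies that TF/IDF rewards.

```javascript
// Illustrative only: near-duplicate names tokenized into one field.
const names = ['Whole Foods Market', 'Whole Foods'];
const tokens = names.flatMap((n) => n.toLowerCase().split(/\s+/));

// count term frequencies, as a TF/IDF scorer would see them
const tf = {};
for (const t of tokens) tf[t] = (tf[t] || 0) + 1;

console.log(tf); // { whole: 2, foods: 2, market: 1 }
```

A document with `tf = 2` for "whole" and "foods" gets a term-frequency boost over one that lists the name only once, even though both match the same query.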