discuss: similarity algorithms #408

missinglink · 2019-12-11T14:00:54Z

For years now we've been fighting the TF/IDF algorithm and more recently we've changed to the BM25 similarity algo which is much better for short texts like ours but it's still not perfect.

There is a really great article here which talks about the caveats of scoring short title fields.

The cool thing about BM25 (and other similarity algos) is that they have some tunable parameters, albeit considered 'expert settings'.

One setting that interests me, in particular, is the k1 value which "Controls non-linear term frequency normalization (saturation).".

The default settings for BM25 are k1=1.2 and b=0.75, which are really nice settings for general use of elasticsearch, they work well for short fields like titles as well as for large fields like indexing a whole chapter of a book.

For geocoding specifically we almost exclusively deal with short strings (<50 chars).
I also personally feel that Term Frequencies are much less relevant for geocoding because they can cause issues like this.

I'd like to open this up to @pelias/contributors to discuss introducing our own custom similarity configuration (or multiple if required).
In particular, I would like to investigate the effects of setting k1=0 (or very very low).

Thoughts?

The text was updated successfully, but these errors were encountered:

missinglink · 2019-12-11T14:07:18Z

These articles are great, especially in the second one:

missinglink · 2020-01-10T08:40:33Z

Some examples of existing TF/IDF scoring I don't like (which I'll update over time):

search term	indexed term	description
"mcdonalds"	"mcdonalds - mcdonalds cafe"	TF causes duplicate terms to rank higher
"ave"	"avenue" (ngrams)	IDF of ngrams saturated by prefixes of other terms
"new york"	"new york" (city) "new york" (state)	multi-field matches rank higher in boolean `SHOULD` mode & using `multi_match` (in `cross_fields` mode)

shuaibiyy · 2020-01-21T20:50:31Z

Hi @missinglink we've found IDF to be useful in cases where documents contain words identifying their category. Let's say we have a query hotel park where the user is searching for a hotel called Park Inn. Some hotels might prefix their name with the word Hotel, while some might not. Assuming these are some of the documents retrieved:

Ibis Hotel
Hotel am Hauptbahnhof
Park Inn
...
We'd want to penalize the word hotel due to its commonality. Same applies to other categorical names such as Café.

shuaibiyy · 2020-01-21T21:06:28Z

Also, in our bayesian optimization experiments, we've generally found higher values of k1 (around 5) and very low values of b (around 0.05) to perform really well. We haven't run enough experiments to claim statistical significance though.

missinglink · 2020-03-05T11:55:29Z

I merged #430 a few months back so it should be available on all indices built this year.
In that PR I demonstrated how to close an index, change the setting and re-open it, so it's now easier than ever to experiment with this.
It does not require re-indexing the data 🚀

orangejulius · 2020-05-04T22:46:12Z

I did a little experimenting today and confirmed its pretty easy to modify similarity algorithm parameters on a development cluster.

I also confirmed that setting k1 to 0 removes any impact of the number of matches of a term on the score. This means it effectively eliminates issues such as pelias/pelias#849.

However, it also removes any impact of a document's length on the score, and as a result it's probably too drastic of a change for our needs. We should definitely investigate further.

orangejulius · 2020-05-08T22:31:23Z

A little more experimentation today confirms k1 at 0 is not going to be a cure all.

Here's an autocomplete query for United States:

Some stats from behind the scenes of the scoring

Obviously, both documents match the tokens united and states:

Mexico

          "name": {
            "default": [
              "Mexico",
              "México",
              "Estados Unidos Mexicanos",
              "Méjico",
              "Nueva España",
              "República Mejicana",
              "el país azteca",
              "Mexican",
              "Mexican Republic",
              "United Mexican States"
            ],

USA

     "name": {
            "default": [
              "United States",
              "America",
              "American",
              "United States of America",
              "Unitedstates",
              "Unitedstatesofamerica"
            ],

Relevant scoring values

	frequency of 'united'	field length
Mexico	1	20
United States	3	12

Elasticsearch reports the field length different than how I counted it, but aside from that oddity, it's clear what's happening. Previously the multiple matches and shorter field length would provide a healthy scoring advantage to USA. Now both of those are gone, they both match equally well. Population boosts are at the max of 20 for each record, so their total scores are identical as well.

Here's how Elasticsearch describes its calculation of term frequency:

"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:"

Setting k1 to zero essentially always makes tf 1, regardless of the length of matching fields or the number of times the desired token is in the field.

Setting b to zero might be better: it would ignore the length of the field, but prefer more frequent matches: freq / (freq + k1) will grow with freq as k1 is held constant.

However it seems like based on all this it's not worth it to explore extreme values of k1 and b, and instead we should try a couple more reasonable values and then move on to other optimizations. Personally I feel like no customization of these parameters is going to be a huge breakthrough.

Here's the full explain output for this query for further investigation:
usa-explain.txt

orangejulius · 2020-05-11T19:10:08Z

Today I tested out reducing k1 to 1 and b to 0.05 The idea with this less extreme configuration is that it would mostly eliminate penalties due to a longer field length (pelias/pelias#862), but not change scoring too drastically otherwise.

Again, there were pretty mixed results. Looking at the Search POI tests, there were nice improvements for some airport queries:

(this is output from diff, so the way to read this is to look at lines starting with - as "before" and + as after the change in scoring parameters. So for example, because there is more green on the + line for the "Phoenix international airport" query, that means the correct result comes up much sooner)

However, some very basic autocomplete queries broke, most notably San Francisco:

Looking into the explain output for this query, things are quite complicated, and there are lots of issues we should improve related to this query: San Francisco should really get a popularity boost, the alt names from Who's on First don't seem ideal, we should do a better job of de-duplicating similar alt names, and also the query generated the API is not that great. It looks like the pelias parser is working against us too:

"parsed_text": {
                "subject": "san",
                "locality": "san",
                "admin": "franci"
            },

However, through all of that, the underlying issue of scoring a single field with multiple alt names remains.

Here are the name.default values for the San Francisco and South San Francisco records:

South San Francisco

        "South San Francisco",
        "S S Fran",
        "S S Francisco",
        "S S. Fran",
        "S S. Francisco",
        "S San Fran",
        "S San Francisco",
        "South S Francisco",
        "South S. Francisco",
        "South San Francis",
        "Southsanfrancisco",
        "Ssf"

San Francisco

        "San Francisco",
        "S Francisco",
        "S. Francisco",
        "SFO",
        "Sanfran",
        "Sanfrancisco",
        "Frisco"

There's a lot of issues with both, but in this case it looks like the main problem is that for the substring francis South SF matches 7 times but SF matches only 3.

Fundamentally it looks like this exposes the tradeoff between solving pelias/openstreetmap#507 and pelias/pelias#862. If we penalize records with high term lengths, we lose some important results from records with lots of valid alt names. If we don't, we allow unimportant results that happen to have near-identical alt names to be boosted far too high.

missinglink · 2021-02-23T21:05:41Z

One possibility worth mentioning is to create a custom similarity (in Java) which adjusts the lengthNorm calculation.

Disabling/enabling 'norms' is already possible via norms: false but we could potentially improve on that by inspecting the FieldInvertState passed to lengthNorm() and modifying how it is calculated.

For example, the public float lengthNorm(FieldInvertState state) function could only return the longest alias in the field by looking for the position-increment-gap holes in the positions.

So instead of returning the 'number of words indexed in the field for this doc' it would return the 'longest number of words with consecutive positions indexed in the field for this doc'.

https://stackoverflow.com/a/35901213/339280

missinglink added the discuss label Dec 11, 2019

missinglink mentioned this issue Jan 10, 2020

add peliasDefaultSimilarity #430

Merged

missinglink mentioned this issue Mar 5, 2020

Similar OSM alt names affecting scoring for focus point pelias/pelias#849

Open

orangejulius mentioned this issue May 4, 2020

Records with multiple alternate names get a scoring penalty pelias/pelias#862

Open

missinglink mentioned this issue Jan 12, 2021

disable term frequencies for admin abbreviation fields #471

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

discuss: similarity algorithms #408

discuss: similarity algorithms #408

missinglink commented Dec 11, 2019 •

edited

Loading

missinglink commented Dec 11, 2019

missinglink commented Jan 10, 2020 •

edited

Loading

shuaibiyy commented Jan 21, 2020

shuaibiyy commented Jan 21, 2020

missinglink commented Mar 5, 2020 •

edited

Loading

orangejulius commented May 4, 2020 •

edited

Loading

orangejulius commented May 8, 2020 •

edited

Loading

orangejulius commented May 11, 2020 •

edited

Loading

missinglink commented Feb 23, 2021

discuss: similarity algorithms #408

discuss: similarity algorithms #408

Comments

missinglink commented Dec 11, 2019 • edited Loading

missinglink commented Dec 11, 2019

missinglink commented Jan 10, 2020 • edited Loading

shuaibiyy commented Jan 21, 2020

shuaibiyy commented Jan 21, 2020

missinglink commented Mar 5, 2020 • edited Loading

orangejulius commented May 4, 2020 • edited Loading

orangejulius commented May 8, 2020 • edited Loading

Some stats from behind the scenes of the scoring

Mexico

USA

Relevant scoring values

orangejulius commented May 11, 2020 • edited Loading

South San Francisco

San Francisco

missinglink commented Feb 23, 2021

missinglink commented Dec 11, 2019 •

edited

Loading

missinglink commented Jan 10, 2020 •

edited

Loading

missinglink commented Mar 5, 2020 •

edited

Loading

orangejulius commented May 4, 2020 •

edited

Loading

orangejulius commented May 8, 2020 •

edited

Loading

orangejulius commented May 11, 2020 •

edited

Loading