discuss: similarity algorithms #408
These articles are great, especially in the second one:

…

Some examples of existing TF/IDF scoring I don't like (which I'll update over time):

…
Hi @missinglink, we've found IDF to be useful in cases where documents contain words identifying their category. Let's say we have a query containing one of those category words: since the category word appears in many documents, IDF down-weights it and lets the rarer, more distinctive tokens drive the ranking.
Also, in our Bayesian optimization experiments, we've generally found higher values of …
I merged #430 a few months back, so it should be available on all indices built this year.
I did a little experimenting today and confirmed it's pretty easy to modify similarity algorithm parameters on a development cluster. I also confirmed that setting `k1=0` removes the impact of term frequency on scoring. However, it also removes any impact of a document's length on the score, and as a result it's probably too drastic a change for our needs. We should definitely investigate further.
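For anyone who wants to reproduce this, here's a minimal sketch of changing the parameters on an existing dev index (the index name `pelias` and the client setup are placeholders; similarity settings are static, so the index has to be closed first):

```python
# Sketch: tweak BM25 parameters on an existing development index.
# Assumes a local cluster and an index named 'pelias' (hypothetical).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.close(index="pelias")  # similarity settings are static, so close first
es.indices.put_settings(
    index="pelias",
    body={
        "index": {
            "similarity": {
                "default": {   # override the default similarity for all fields
                    "type": "BM25",
                    "k1": 0.0,  # disable term-frequency saturation entirely
                    "b": 0.75,  # length normalization (has no effect when k1=0)
                }
            }
        }
    },
)
es.indices.open(index="pelias")
```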
A little more experimentation today confirms the `k1=0` behavior. Here's an autocomplete query for United States:

…

Some stats from behind the scenes of the scoring: obviously, both documents match the tokens `united` and `states`.

Mexico
…

USA
…

Relevant scoring values:

…

Elasticsearch reports the field length differently from how I counted it, but aside from that oddity, it's clear what's happening. Previously, the multiple matches and shorter field length would provide a healthy scoring advantage to USA. Now that both of those are gone, they both match equally well. Population boosts are at the max of 20 for each record, so their total scores are identical as well. Here's how Elasticsearch describes its calculation of term frequency:
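The tf part of the score ("tfNorm" in the explain output) is computed as:

$$\mathrm{tfNorm} = \frac{\mathrm{freq} \cdot (k_1 + 1)}{\mathrm{freq} + k_1 \cdot \left(1 - b + b \cdot \frac{\mathrm{fieldLength}}{\mathrm{avgFieldLength}}\right)}$$

With $k_1 = 0$ this reduces to 1 for every matching term, regardless of freq, $b$, and field length, which is exactly the behavior above.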
Setting `k1=0` removes the impact of term frequency on the score. Setting `b` makes no difference at that point, since the length normalization term is multiplied by `k1`. However, based on all this, it seems like it's not worth exploring extreme values of `k1`. Here's the full explain output for this query for further investigation:

…
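To make the effect concrete, here's a small standalone sketch of that tfNorm formula; the frequencies and lengths below are made up for illustration, not taken from the explain output above:

```python
# Sketch of Lucene's BM25 tf normalization, to show what k1=0 does.
# The frequencies and field lengths below are illustrative only.

def tf_norm(freq: float, k1: float, b: float, field_len: float, avg_field_len: float) -> float:
    """(freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))"""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_len / avg_field_len))

# A short name field vs. a long one with a repeated matching term:
for k1 in (1.2, 0.5, 0.0):
    short_doc = tf_norm(freq=1, k1=k1, b=0.75, field_len=2, avg_field_len=8)
    long_doc = tf_norm(freq=3, k1=k1, b=0.75, field_len=20, avg_field_len=8)
    print(f"k1={k1}: short={short_doc:.3f} long={long_doc:.3f}")

# With k1=1.2 the two values differ; with k1=0 both print 1.000,
# so term frequency and field length no longer affect the score.
```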
Today I tested out reducing `k1`. Again, there were pretty mixed results. Looking at the Search POI tests, there were nice improvements for some airport queries:

…

However, some very basic autocomplete queries broke, most notably San Francisco:

…

Looking into the …

However, through all of that, the underlying issue of scoring a single field with multiple alt names remains. Here are the name values for both records:

South San Francisco
…

San Francisco
…

There are a lot of issues with both, but in this case it looks like the main problem is that for the substring `san francisco`, the South San Francisco record gets credit for many matching terms across its alt names. Fundamentally, it looks like this exposes the tradeoff between solving pelias/openstreetmap#507 and pelias/pelias#862. If we penalize records with long name fields, we lose some important results from records with lots of valid alt names. If we don't, we allow unimportant results that happen to have near-identical alt names to be boosted far too high.
One possibility worth mentioning is to create a custom similarity that computes norms differently. Disabling/enabling norms is already possible via the field mappings. For example, the norm for a field is currently based on the total number of terms indexed in it. So instead of returning the 'number of words indexed in the field for this doc', it would return the 'longest number of words with consecutive positions indexed in the field for this doc'.
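To illustrate the idea, here's a sketch of the proposed norm calculation (not an actual Lucene `Similarity` implementation; the position gap of 100 between array values mirrors Elasticsearch's default `position_increment_gap`):

```python
# Sketch of the proposed norm: instead of counting all indexed words,
# count only the longest run of tokens with consecutive positions.
# In Elasticsearch, each value in an array field (e.g. a list of alt
# names) is separated by a position gap, so a "run" corresponds to a
# single alt name rather than the concatenation of all of them.

def proposed_norm(token_positions: list[int]) -> int:
    """Return the length of the longest run of consecutive positions."""
    if not token_positions:
        return 0
    longest = current = 1
    for prev, pos in zip(token_positions, token_positions[1:]):
        current = current + 1 if pos == prev + 1 else 1
        longest = max(longest, current)
    return longest

# Hypothetical positions for a doc with alt names
# ["USA", "United States", "United States of America"],
# assuming a position gap of 100 between values:
positions = [0, 101, 102, 203, 204, 205, 206]

print(len(positions))            # 7  <- what the current norm reflects
print(proposed_norm(positions))  # 4  <- longest single alt name
```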
For years now we've been fighting the TF/IDF algorithm, and more recently we've changed to the BM25 similarity algo, which is much better for short texts like ours, but it's still not perfect. There is a really great article here which talks about the caveats of scoring short title fields.

The cool thing about BM25 (and other similarity algos) is that it has some tunable parameters, albeit considered 'expert settings'. One setting that interests me in particular is the `k1` value, which "Controls non-linear term frequency normalization (saturation)."

The default settings for BM25 are `k1=1.2` and `b=0.75`, which are really nice settings for general use of Elasticsearch: they work well for short fields like titles as well as for large fields like a whole chapter of an indexed book. For geocoding specifically, we almost exclusively deal with short strings (<50 chars). I also personally feel that term frequencies are much less relevant for geocoding because they can cause issues like this.

I'd like to open this up to @pelias/contributors to discuss introducing our own custom similarity configuration (or multiple if required). In particular, I would like to investigate the effects of setting `k1=0` (or very, very low). Thoughts?
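For anyone who wants to experiment, a custom similarity can be declared in the index settings and referenced from a field mapping. A minimal sketch, where the index name, field name, and similarity name are all made up for illustration:

```python
# Sketch: define a custom BM25 similarity with k1=0 and apply it to a
# name field. The index, field, and similarity names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="placenames",
    body={
        "settings": {
            "similarity": {
                "geocoder_bm25": {
                    "type": "BM25",
                    "k1": 0,     # ignore term frequency entirely
                    "b": 0.75,   # irrelevant when k1=0, kept for clarity
                }
            }
        },
        "mappings": {
            "properties": {
                "name": {
                    "type": "text",
                    "similarity": "geocoder_bm25",
                }
            }
        },
    },
)
```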