-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added dot product everywhere were cosine similarity was used #676
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution. I left some initial comments.
Here are some other items that don't quite fit as a comment:
- what happens when the vectors are not normalized? I don't want to return a negative score to Elasticsearch. I'm not sure how that would behave. I would rather detect this failure state and return an explicit exception. I'm not totally sure where we should detect it. I'm thinking maybe add an if-statement here and here. If the similarity is < 0, we throw an ElastiknnRuntimeException.
- We should add some regression tests to the RecallSuite, similar to these.
Three of these are problematic with respect to this scoring requirement. | ||
|
||
Specifically, L1 and L2 are generally defined as _distance_ functions, rather than similarity functions, | ||
which means that higher relevance (i.e., lower distance) yields _lower_ scores. | ||
Cosine similarity is defined over $$[-1, 1]$$, and we can't have negative scores. | ||
|
||
Dot similarity is defined over $$[-1, 1]$$, and we can't have negative scores, if vectors have a magnitude of 1, then it's equivalent to cosine similarity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rewrite this part slightly:
Cosine similarity is defined over $$[-1, 1]$$.
Dot similarity is defined over $$[-1, 1]$$, and if vectors have a magnitude of 1, then it's equivalent to cosine similarity.
Elasticsearch does not allow negative scores.
To work around this, Elastiknn applies simple transformations to produce L1, L2, Cosine, and Dot scores in accordance with the Elasticsearch requirements.
docs/pages/api.md
Outdated
|L1|`1 / (1 + l1 distance)`|0|1| | ||
|L2|`1 / (1 + l2 distance)`|0|1| | ||
|
||
Dot similirarity will produce negative scores if the vectors are not normalized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make sure to catch this and return an error in the plugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i added max(0,distance) so, it will always be positive.
@@ -110,13 +110,16 @@ class XContentCodecSuite extends AnyFreeSpec with Matchers { | |||
("L2", Similarity.L2), | |||
("cosine", Similarity.Cosine), | |||
("Cosine", Similarity.Cosine), | |||
("COSINE", Similarity.Cosine) | |||
("COSINE", Similarity.Cosine), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment for this file: Let's also add a block of tests below similar to this:
elastiknn/elastiknn-api4s/src/test/scala/com/klibisz/elastiknn/api/XContentCodecSuite.scala
Lines 371 to 406 in 39fa610
"CosineLsh" - { | |
"roundtrip" in { | |
for { | |
_ <- 1 to 100 | |
(dims, l, k) = (rng.nextInt(), rng.nextInt(), rng.nextInt()) | |
mapping = Mapping.CosineLsh(dims, l, k) | |
expected = Json.obj( | |
"type" -> "elastiknn_dense_float_vector".asJson, | |
"elastiknn" -> Json.obj( | |
"model" -> "lsh".asJson, | |
"dims" -> dims.asJson, | |
"similarity" -> "cosine".asJson, | |
"L" -> l.asJson, | |
"k" -> k.asJson | |
) | |
) | |
} { | |
roundtrip[Mapping](expected, mapping) | |
} | |
} | |
"errors" in { | |
val ex1 = intercept[XContentParseException](decodeUnsafeFromString[Mapping](""" | |
|{ | |
| "type": "elastiknn_dense_float_vector", | |
| "elastiknn": { | |
| "model": "lsh", | |
| "dims": 33, | |
| "similarity": "cosine", | |
| "L": "33", | |
| "k": 3 | |
| } | |
|} | |
|""".stripMargin)) | |
ex1.getMessage shouldBe "Expected [L] to be one of [VALUE_NUMBER] but found [VALUE_STRING]" | |
} | |
} |
elastiknn-models/src/main/java/com/klibisz/elastiknn/models/DotLshModel.java
Outdated
Show resolved
Hide resolved
Hi @alexklibisz About the changes you asked. I did all of them . Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick fixes! I can help with the tests once we're happy with the implementation.
I'll have to think about more about the strategy for accounting for non-normalized vectors. As a user I think I would rather get a failed response than a bunch of missing results. Also, we say that the similarity is in [0, 2], but AFAICT the current strategy would allow for scores > 2, wouldn't it?
docs/pages/api.md
Outdated
@@ -446,9 +470,12 @@ The exact transformations are described below. | |||
|Jaccard|N/A|0|1.0| | |||
|Hamming|N/A|0|1.0| | |||
|Cosine[^note-angular-cosine]|`cosine similarity + 1`|0|2| | |||
|Dot[^note-dot-product]|`Dot similarity + 1`|0|2| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @joancf can you try adding the exceptions for scores outside [0, 2]? If you're having trouble I can try this, but probably not til the weekend or next week. |
hi. @alexklibisz my company doesn't want to use it. So, I will finish it out-hours |
@alexklibisz Now I'm back with this code, I wanted to extract some logs, |
Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
…tLshModel.java Co-authored-by: Alex Klibisz <aklibisz@protonmail.com>
281e6c4
to
a75e983
Compare
Hi, @alexklibisz , maybe this a problem of the version of Elasticsearch I'm used ( I'm using 12.2 instead of 13) but I need to work on that version and I don't want to upgrade my environment. |
In order to get this merged, it needs to be working with the latest version. If that doesn't work, you're welcome to fork and build your own artifacts for older versions. I'm not maintaining multiple/older versions of Elasticsearch. |
ok, |
Related Issue
Support for inner product similarity measures
Changes
almost everywhere (except some tests) where there was a cosine reference a parallel dot function/option is added
What changed?
most files where cosine was used
Testing and Validation
Not done , needs to be done, it is pending.
I'll try to generate and replace the plugin in my ES isntallation and check it.
but not sure which prodedures I must follow.