Add l1norm and l2norm distances for vectors #40255
Conversation
Add L1norm - Manhattan distance
Add L2norm - Euclidean distance

relates to #37947
Pinging @elastic/es-search
> Note that, unlike `cosineSimilarity` that represent
> similarity, `l1norm` and the shown below `l2norm` represent distances or
> differences. This means, that the mose similar are vectors,
mose -> more
... the more similar two vectors are, the smaller is the score produced by the ...
cbuescher left a comment
Hi @mayya-sharipova, very interesting and a great addition. I left a couple of comments, the most important of which is that I don't understand how the calculation of the l1/l2 distances is handled for sparse vectors when the dimension entries don't exactly overlap (e.g. I believe an "empty" document vector and a query vector of ("1:10", "2:10") should have a Manhattan distance of 20 if I'm not mistaken).
If you agree with this general behaviour, I think the calculations need to be updated slightly. In that case I would also prefer to add a couple more tests to check these and other edge cases. Maybe a small randomized test would also help make sure strange edge cases are covered.
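To make the expected behaviour concrete, here is a tiny self-contained check of that example, treating a missing dimension as zero. The helper class and names below are purely illustrative and are not code from this PR:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SparseL1Example {
    // L1 distance over the union of dimensions, treating absent entries as 0.0.
    static double l1(Map<Integer, Double> query, Map<Integer, Double> doc) {
        Set<Integer> dims = new HashSet<>(query.keySet());
        dims.addAll(doc.keySet());
        double sum = 0;
        for (int dim : dims) {
            sum += Math.abs(query.getOrDefault(dim, 0.0) - doc.getOrDefault(dim, 0.0));
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Double> query = Map.of(1, 10.0, 2, 10.0);
        Map<Integer, Double> emptyDoc = new HashMap<>();
        System.out.println(l1(query, emptyDoc)); // prints 20.0
    }
}
```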
> // NOTCONSOLE
> Note that, unlike `cosineSimilarity` that represent
> similarity, `l1norm` and the shown below `l2norm` represent distances or
nit: not exactly sure since I'm no native speaker, but I would expect "the l2norm shown below".
> Note that, unlike `cosineSimilarity` that represent
> similarity, `l1norm` and the shown below `l2norm` represent distances or
> differences. This means, that the mose similar are vectors,
... the more similar two vectors are, the smaller is the score produced by the ...
> similarity, `l1norm` and the shown below `l2norm` represent distances or
> differences. This means, that the mose similar are vectors,
> the less will be the scores produced by `l1norm` and `l2norm` functions.
> Thus, if you need more similar vectors to score higher, you should
Probably simpler and easier to understand: "This means you need to reverse ... if you want more similar vectors to increase the search score."
| `"source": " 1/ l1norm(params.queryVector, doc['my_dense_vector'])"` | ||
|
|
||
| For sparse_vector fields, `l1normSparse` calculates L^1^ distance | ||
| between a given query vector and document vectors. |
Maybe it's just me, but this sounds a bit like the distance calculation is different from the above? I guess it's not, you just need this for sparse vectors?
@cbuescher Thanks, Christoph. Right, this is exactly the same function as l1norm but for sparse vectors. All the functions for dense vectors are duplicated with a Sparse suffix for sparse vectors.
> */
> public static double l1norm(List<Number> queryVector, VectorScriptDocValues.DenseVectorScriptDocValues dvs){
>     BytesRef value = dvs.getEncodedValue();
>     if (value == null) return 0;
nit: enclose block in brackets
> i++;
> }
> // Sort dimensions in the ascending order and sort values in the same order as their corresponding dimensions
> sortSparseDimsDoubleValues(queryDims, queryValues, n);
This looks almost identical to the constructor in the class above. Maybe this can be shared to a large extent.
@cbuescher Thanks, Christoph. Indeed, I was also thinking about how to share the code, but I am working under two constraints here:
- Painless scripting: Painless has requirements on how the code should be structured in order to use caching and bindings.
- Performance: I was considering a single function that iterates over the query and document vectors and takes a lambda, a `BiFunction`, as an argument that tells it what computation to do. But `BiFunction` accepts only classes, and I did not want to convert primitive floats to `Float` instances, as this would significantly slow down the computations.

Still, I will keep thinking about how this code can be restructured.
Sure, I didn't think about the performance impact of boxing in lambdas. Does Painless prohibit e.g. L1NormSparse and L2NormSparse sharing a common abstract superclass, e.g. for sharing the common code in the constructor?
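For illustration, a minimal sketch of what such a shared superclass could look like, assuming the constructors of L1NormSparse and L2NormSparse differ only in the class name. The `Map<String, Number>` query-vector type, the field names, and the stand-in sort helper are assumptions for the sketch, not the actual PR code:

```java
import java.util.Map;

// Hypothetical shared base class; L1NormSparse and L2NormSparse would extend it
// and keep only their own distance computation.
abstract class SparseDistanceFunction {
    final int[] queryDims;
    final double[] queryValues;

    SparseDistanceFunction(Map<String, Number> queryVector) {
        int n = queryVector.size();
        queryDims = new int[n];
        queryValues = new double[n];
        int i = 0;
        for (Map.Entry<String, Number> entry : queryVector.entrySet()) {
            queryDims[i] = Integer.parseInt(entry.getKey());
            queryValues[i] = entry.getValue().doubleValue();
            i++;
        }
        // Sort dimensions ascending and keep values aligned with their dimensions,
        // as the quoted sortSparseDimsDoubleValues(queryDims, queryValues, n) does.
        sortDimsAndValues(queryDims, queryValues);
    }

    // Each concrete norm supplies its own computation; in the PR the document side
    // would come from VectorScriptDocValues rather than raw arrays.
    abstract double distance(int[] docDims, double[] docValues);

    // Simple insertion-sort stand-in for the PR's sortSparseDimsDoubleValues helper.
    private static void sortDimsAndValues(int[] dims, double[] values) {
        for (int i = 1; i < dims.length; i++) {
            int d = dims[i];
            double v = values[i];
            int j = i - 1;
            while (j >= 0 && dims[j] > d) {
                dims[j + 1] = dims[j];
                values[j + 1] = values[j];
                j--;
            }
            dims[j + 1] = d;
            values[j + 1] = v;
        }
    }
}
```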
> docIndex++;
> } else {
> queryIndex++;
> }
Same general remarks as above. Also, the implementations of the two norms look very similar with the exception of how the vector diffs are treated (squared vs. just summing them). I think it would be worth trying to share huge parts of the function and maybe only use a differing lambda to do the different calculations.
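A rough sketch of that idea, using `java.util.function.DoubleBinaryOperator` so the shared loop stays on primitive doubles and avoids the boxing concern raised above. The dims/values array layout is an assumption and the class below is illustrative only; the only per-norm difference is then the pair of lambdas passed to the shared merge:

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

final class SparseDistances {

    // Walks two sparse vectors sorted by dimension, feeding each value difference
    // (0.0 where a dimension is missing on one side) into `accumulate`, then
    // applies `finish` to the running total.
    static double merge(int[] queryDims, double[] queryValues,
                        int[] docDims, double[] docValues,
                        DoubleBinaryOperator accumulate, DoubleUnaryOperator finish) {
        double total = 0;
        int q = 0, d = 0;
        while (q < queryDims.length && d < docDims.length) {
            if (queryDims[q] == docDims[d]) {
                total = accumulate.applyAsDouble(total, queryValues[q++] - docValues[d++]);
            } else if (queryDims[q] < docDims[d]) {
                total = accumulate.applyAsDouble(total, queryValues[q++]); // dim missing in doc
            } else {
                total = accumulate.applyAsDouble(total, -docValues[d++]);  // dim missing in query
            }
        }
        while (q < queryDims.length) total = accumulate.applyAsDouble(total, queryValues[q++]);
        while (d < docDims.length) total = accumulate.applyAsDouble(total, -docValues[d++]);
        return finish.applyAsDouble(total);
    }

    static double l1(int[] qd, double[] qv, int[] dd, double[] dv) {
        return merge(qd, qv, dd, dv, (acc, diff) -> acc + Math.abs(diff), x -> x);
    }

    static double l2(int[] qd, double[] qv, int[] dd, double[] dv) {
        return merge(qd, qv, dd, dv, (acc, diff) -> acc + diff * diff, Math::sqrt);
    }
}
```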
> // test l2norm
> double result4 = l2norm(queryVector, dvs);
> assertEquals("l2norm result is not equal to the expected value!", 301.36, result4, 0.1);
I'd like to see an additional test for differing vector lengths, probably asserting that this throws an error if we decide to go that route.
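Something along these lines could work as a starting point. It is a fragment meant to sit inside the existing test and reuse its `dvs` fixture, and the `IllegalArgumentException` is an assumption about what a mismatch should do rather than what the PR currently guarantees:

```java
// Query vector shorter than the stored document vector; assumes the function
// should reject a dimension mismatch rather than silently compute something.
List<Number> shorterQueryVector = Arrays.asList(1.0f, 2.5f);
expectThrows(IllegalArgumentException.class, () -> l1norm(shorterQueryVector, dvs));
expectThrows(IllegalArgumentException.class, () -> l2norm(shorterQueryVector, dvs));
```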
> // test l2norm
> L2NormSparse l2Norm = new L2NormSparse(queryVector);
> double result4 = l2Norm.l2normSparse(dvs);
> assertEquals("l2normSparse result is not equal to the expected value!", 301.36, result4, 0.1);
These tests don't cover the cases mentioned above where the queryVector contains dimensions not present in the document vector and vice versa.
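A sketch of such a test fragment, assuming the constructor takes the query vector as a `Map<String, Number>` (the REST test params suggest that, but I haven't verified the fixture types) and that missing dimensions count as zero:

```java
// Query vector sharing only dimension "2" with the document and adding "9999",
// which the document does not have. Under the "missing dimension is zero"
// reading, both the unmatched query dimension and the unmatched document
// dimensions should still contribute to the distance.
Map<String, Number> partialQueryVector = new HashMap<>();
partialQueryVector.put("2", 0.5);
partialQueryVector.put("9999", 3.0);
L1NormSparse partialL1 = new L1NormSparse(partialQueryVector);
double partialResult = partialL1.l1normSparse(dvs);
// Expected value left symbolic: it depends on the document fixture and should
// equal the sum of |q_i - d_i| over the union of dimensions.
// assertEquals("l1normSparse with partial overlap", expectedDistance, partialResult, 0.1);
```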
> script:
>   source: "l1normSparse(params.query_vector, doc['my_sparse_vector'])"
>   params:
>     query_vector: {"2": 0.5, "10" : 111.3, "50": -13.0, "113": 14.8, "4545": -156.0}
Same here, please add a few more tests that don't only check the behaviour when all dimensions are matching.
Closing this PR in favour of #40473