Add l1norm and l2norm distances for vectors #40255
Conversation
Add L1norm - Manhattan distance
Add L2norm - Euclidean distance

relates to #37947
Pinging @elastic/es-search
> Note that, unlike `cosineSimilarity` that represent
> similarity, `l1norm` and the shown below `l2norm` represent distances or
> differences. This means, that the mose similar are vectors,
mose -> more
... the more similar two vectors are, the smaller is the score produced by the ...
cbuescher left a comment
Hi @mayya-sharipova, very interesting and a great addition. I left a couple of comments, the most important of which is that I don't understand how the calculation of the l1/l2 distances is handled for sparse vectors when the dimension entries don't exactly overlap (e.g. I believe an "empty" document vector and a query vector of ("1:10", "2:10") should have a Manhattan distance of 20 if I'm not mistaken).
If you agree with this general behaviour, I think the calculations need to be updated slightly. In that case I would also prefer to add a couple more tests to check these and other edge cases. Maybe a small randomized test would also help make sure strange edge cases are covered.
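To make the expected behaviour concrete, here is a tiny self-contained check of that example, treating a missing dimension as zero. The helper class and names below are purely illustrative and are not code from this PR:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SparseL1Example {
    // L1 distance over the union of dimensions, treating absent entries as 0.0.
    static double l1(Map<Integer, Double> query, Map<Integer, Double> doc) {
        Set<Integer> dims = new HashSet<>(query.keySet());
        dims.addAll(doc.keySet());
        double sum = 0;
        for (int dim : dims) {
            sum += Math.abs(query.getOrDefault(dim, 0.0) - doc.getOrDefault(dim, 0.0));
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Double> query = Map.of(1, 10.0, 2, 10.0);
        Map<Integer, Double> emptyDoc = new HashMap<>();
        System.out.println(l1(query, emptyDoc)); // prints 20.0
    }
}
```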
> // NOTCONSOLE
> Note that, unlike `cosineSimilarity` that represent
> similarity, `l1norm` and the shown below `l2norm` represent distances or
nit: not exactly sure since I'm no native speaker, but I would expect "the l2norm shown below".
> Note that, unlike `cosineSimilarity` that represent
> similarity, `l1norm` and the shown below `l2norm` represent distances or
> differences. This means, that the mose similar are vectors,
... the more similar two vectors are, the smaller is the score produced by the ...
> similarity, `l1norm` and the shown below `l2norm` represent distances or
> differences. This means, that the mose similar are vectors,
> the less will be the scores produced by `l1norm` and `l2norm` functions.
> Thus, if you need more similar vectors to score higher, you should
Probably simpler and easier to understand: "This means you need to reverse ... if you want more similar vectors to increase the search score."
| `"source": " 1/ l1norm(params.queryVector, doc['my_dense_vector'])"` | ||
|
|
||
| For sparse_vector fields, `l1normSparse` calculates L^1^ distance | ||
| between a given query vector and document vectors. |
Maybe it's just me, but this sounds a bit like the distance calculation is different from the above? I guess it's not, you just need this for sparse vectors?
@cbuescher Thanks, Christoph. Right, this is exactly the same function as l1norm but for sparse vectors. All the functions for dense vectors are duplicated with a Sparse suffix for sparse vectors.
> */
> public static double l1norm(List<Number> queryVector, VectorScriptDocValues.DenseVectorScriptDocValues dvs){
>     BytesRef value = dvs.getEncodedValue();
>     if (value == null) return 0;
nit: enclose block in brackets
> i++;
> }
> // Sort dimensions in the ascending order and sort values in the same order as their corresponding dimensions
> sortSparseDimsDoubleValues(queryDims, queryValues, n);
This looks almost identical to the constructor in the class above. Maybe this can be shared to a large extent.
@cbuescher Thanks, Christoph. Indeed, I was also thinking about how to share the code, but I am working under two constraints here:
- Painless scripting: Painless has requirements on how the code should be structured in order to use caching and bindings.
- Performance: I was considering a single function that iterates over the query and document vectors and takes a lambda, a `BiFunction`, as an argument that tells it what computation to do. But `BiFunction` accepts only classes, and I did not want to convert primitive floats to `Float` instances, as this would significantly slow down the computations.

Still, I will keep thinking about how this code can be restructured.
Sure, I didn't think about the performance impact of boxing in lambdas. Does Painless prohibit e.g. L1NormSparse and L2NormSparse sharing a common abstract superclass, e.g. for sharing the common code in the constructor?
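For illustration, a minimal sketch of what such a shared superclass could look like, assuming the constructors of L1NormSparse and L2NormSparse differ only in the class name. The `Map<String, Number>` query-vector type, the field names, and the stand-in sort helper are assumptions for the sketch, not the actual PR code:

```java
import java.util.Map;

// Hypothetical shared base class; L1NormSparse and L2NormSparse would extend it
// and keep only their own distance computation.
abstract class SparseDistanceFunction {
    final int[] queryDims;
    final double[] queryValues;

    SparseDistanceFunction(Map<String, Number> queryVector) {
        int n = queryVector.size();
        queryDims = new int[n];
        queryValues = new double[n];
        int i = 0;
        for (Map.Entry<String, Number> entry : queryVector.entrySet()) {
            queryDims[i] = Integer.parseInt(entry.getKey());
            queryValues[i] = entry.getValue().doubleValue();
            i++;
        }
        // Sort dimensions ascending and keep values aligned with their dimensions,
        // as the quoted sortSparseDimsDoubleValues(queryDims, queryValues, n) does.
        sortDimsAndValues(queryDims, queryValues);
    }

    // Each concrete norm supplies its own computation; in the PR the document side
    // would come from VectorScriptDocValues rather than raw arrays.
    abstract double distance(int[] docDims, double[] docValues);

    // Simple insertion-sort stand-in for the PR's sortSparseDimsDoubleValues helper.
    private static void sortDimsAndValues(int[] dims, double[] values) {
        for (int i = 1; i < dims.length; i++) {
            int d = dims[i];
            double v = values[i];
            int j = i - 1;
            while (j >= 0 && dims[j] > d) {
                dims[j + 1] = dims[j];
                values[j + 1] = values[j];
                j--;
            }
            dims[j + 1] = d;
            values[j + 1] = v;
        }
    }
}
```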
> docIndex++;
> } else {
> queryIndex++;
> }
Same general remarks as above. Also, the implementations of the two norms look very similar with the exception of how the vector diffs are treated (squared vs. just summing them). I think it would be worth trying to share huge parts of the function and maybe only use a differing lambda to do the different calculations.
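A rough sketch of that idea, using `java.util.function.DoubleBinaryOperator` so the shared loop stays on primitive doubles and avoids the boxing concern raised above. The dims/values array layout is an assumption and the class below is illustrative only; the only per-norm difference is then the pair of lambdas passed to the shared merge:

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

final class SparseDistances {

    // Walks two sparse vectors sorted by dimension, feeding each value difference
    // (0.0 where a dimension is missing on one side) into `accumulate`, then
    // applies `finish` to the running total.
    static double merge(int[] queryDims, double[] queryValues,
                        int[] docDims, double[] docValues,
                        DoubleBinaryOperator accumulate, DoubleUnaryOperator finish) {
        double total = 0;
        int q = 0, d = 0;
        while (q < queryDims.length && d < docDims.length) {
            if (queryDims[q] == docDims[d]) {
                total = accumulate.applyAsDouble(total, queryValues[q++] - docValues[d++]);
            } else if (queryDims[q] < docDims[d]) {
                total = accumulate.applyAsDouble(total, queryValues[q++]); // dim missing in doc
            } else {
                total = accumulate.applyAsDouble(total, -docValues[d++]);  // dim missing in query
            }
        }
        while (q < queryDims.length) total = accumulate.applyAsDouble(total, queryValues[q++]);
        while (d < docDims.length) total = accumulate.applyAsDouble(total, -docValues[d++]);
        return finish.applyAsDouble(total);
    }

    static double l1(int[] qd, double[] qv, int[] dd, double[] dv) {
        return merge(qd, qv, dd, dv, (acc, diff) -> acc + Math.abs(diff), x -> x);
    }

    static double l2(int[] qd, double[] qv, int[] dd, double[] dv) {
        return merge(qd, qv, dd, dv, (acc, diff) -> acc + diff * diff, Math::sqrt);
    }
}
```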
> // test l2norm
> double result4 = l2norm(queryVector, dvs);
> assertEquals("l2norm result is not equal to the expected value!", 301.36, result4, 0.1);
I'd like to see an additional test for differing vector lengths, probably asserting that this throws an error if we decide to go that route.
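Something along these lines could work as a starting point. It is a fragment meant to sit inside the existing test and reuse its `dvs` fixture, and the `IllegalArgumentException` is an assumption about what a mismatch should do rather than what the PR currently guarantees:

```java
// Query vector shorter than the stored document vector; assumes the function
// should reject a dimension mismatch rather than silently compute something.
List<Number> shorterQueryVector = Arrays.asList(1.0f, 2.5f);
expectThrows(IllegalArgumentException.class, () -> l1norm(shorterQueryVector, dvs));
expectThrows(IllegalArgumentException.class, () -> l2norm(shorterQueryVector, dvs));
```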
> // test l2norm
> L2NormSparse l2Norm = new L2NormSparse(queryVector);
> double result4 = l2Norm.l2normSparse(dvs);
> assertEquals("l2normSparse result is not equal to the expected value!", 301.36, result4, 0.1);
These tests don't cover the cases mentioned above where the queryVector contains dimensions not present in the document vector and vice versa.
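A sketch of such a test fragment, assuming the constructor takes the query vector as a `Map<String, Number>` (the REST test params suggest that, but I haven't verified the fixture types) and that missing dimensions count as zero:

```java
// Query vector sharing only dimension "2" with the document and adding "9999",
// which the document does not have. Under the "missing dimension is zero"
// reading, both the unmatched query dimension and the unmatched document
// dimensions should still contribute to the distance.
Map<String, Number> partialQueryVector = new HashMap<>();
partialQueryVector.put("2", 0.5);
partialQueryVector.put("9999", 3.0);
L1NormSparse partialL1 = new L1NormSparse(partialQueryVector);
double partialResult = partialL1.l1normSparse(dvs);
// Expected value left symbolic: it depends on the document fixture and should
// equal the sum of |q_i - d_i| over the union of dimensions.
// assertEquals("l1normSparse with partial overlap", expectedDistance, partialResult, 0.1);
```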
> script:
>   source: "l1normSparse(params.query_vector, doc['my_sparse_vector'])"
>   params:
>     query_vector: {"2": 0.5, "10" : 111.3, "50": -13.0, "113": 14.8, "4545": -156.0}
Same here, please add a few more tests that don't only check the behaviour when all dimensions are matching.
Closing this PR in favour of #40473