
Conversation

@mayya-sharipova
Contributor

Add L1norm - Manhattan distance
Add L2norm - Euclidean distance

relates to #37947

@mayya-sharipova mayya-sharipova added the :Search Relevance/Ranking Scoring, rescoring, rank evaluation. label Mar 20, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search


Note that, unlike `cosineSimilarity` that represent
similarity, `l1norm` and the shown below `l2norm` represent distances or
differences. This means, that the mose similar are vectors,
Contributor Author

mose -> more

Member

... the more similar two vector are, the smaller is the score produced by the ...

@cbuescher cbuescher self-assigned this Mar 20, 2019
Member

@cbuescher cbuescher left a comment


Hi @mayya-sharipova, very interesting and a great addition. I left a couple of comments, the most important of which is that I don't understand how the calculation of the l1/l2 distances is handled for sparse vectors when the dimension entries don't exactly overlap (e.g. I believe an "empty" document vector and a query vector of e.g. ("1:10", "2:10") should have a Manhattan distance of 20 if I'm not mistaken).
If you agree with this general behaviour, I think the calculations need to be updated slightly. In that case I would also prefer to add a couple more tests to check these and other edge cases. Maybe a small randomized test would also help make sure strange edge cases are covered.
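The non-overlapping case described here can be sketched as a merge walk over dimension arrays sorted by dimension. This is an illustrative sketch only, not the PR's actual implementation; the class and method names are invented:

```java
// Illustrative sketch (names invented): L1 distance between two sparse
// vectors given as parallel (dims, values) arrays sorted by dimension.
// A dimension present in only one vector still contributes |value|,
// so an empty document vector vs. a query of {1:10, 2:10} yields 20.
final class SparseL1Sketch {
    static double l1(int[] qDims, double[] qVals, int[] dDims, double[] dVals) {
        double dist = 0;
        int qi = 0, di = 0;
        while (qi < qDims.length && di < dDims.length) {
            if (qDims[qi] == dDims[di]) {
                dist += Math.abs(qVals[qi++] - dVals[di++]); // shared dimension
            } else if (qDims[qi] < dDims[di]) {
                dist += Math.abs(qVals[qi++]);               // dim only in query
            } else {
                dist += Math.abs(dVals[di++]);               // dim only in document
            }
        }
        while (qi < qDims.length) dist += Math.abs(qVals[qi++]); // query leftovers
        while (di < dDims.length) dist += Math.abs(dVals[di++]); // document leftovers
        return dist;
    }

    public static void main(String[] args) {
        // empty document vector vs. query {1:10, 2:10}
        System.out.println(l1(new int[]{1, 2}, new double[]{10, 10},
                              new int[0], new double[0])); // prints 20.0
    }
}
```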

// NOTCONSOLE

Note that, unlike `cosineSimilarity` that represent
similarity, `l1norm` and the shown below `l2norm` represent distances or
Member

nit: not exactly sure since I'm not a native speaker, but I would expect "the l2norm shown below".



similarity, `l1norm` and the shown below `l2norm` represent distances or
differences. This means, that the mose similar are vectors,
the less will be the scores produced by `l1norm` and `l2norm` functions.
Thus, if you need more similar vectors to score higher, you should
Member

Probably simpler and easier to understand: This means you need to reverse... if you want vectors to increase the search score.

`"source": " 1/ l1norm(params.queryVector, doc['my_dense_vector'])"`

For sparse_vector fields, `l1normSparse` calculates L^1^ distance
between a given query vector and document vectors.
Member

Maybe it's just me, but this sounds a bit like the distance calculation is different from the one above? I guess it's not; you just need this for sparse vectors?

Contributor Author

@cbuescher Thanks, Christoph. Right, this is exactly the same function as l1norm, but for sparse vectors. Every function for dense vectors is duplicated with a Sparse ending for sparse vectors.

*/
public static double l1norm(List<Number> queryVector, VectorScriptDocValues.DenseVectorScriptDocValues dvs){
BytesRef value = dvs.getEncodedValue();
if (value == null) return 0;
Member

nit: enclose block in brackets

i++;
}
// Sort dimensions in the ascending order and sort values in the same order as their corresponding dimensions
sortSparseDimsDoubleValues(queryDims, queryValues, n);
Member

This looks almost identical to the constructor in the class above. Maybe this can be shared to a large extent.

Contributor Author

@mayya-sharipova mayya-sharipova Mar 21, 2019


@cbuescher Thanks, Christoph. Indeed, I was also thinking about how to share the code. But I am working under 2 constraints here:

  1. Painless scripts. Painless has requirements on how the code should be structured to use caching and bindings.
  2. Performance. I was thinking of having a single function that iterates over the query and document vectors, and passing it a lambda (a BiFunction) as an argument that tells it what computation needs to be done. But BiFunction accepts only classes, and I did not want to convert primitive floats to Float instances, as that would significantly slow down computations.

Still, I will keep thinking about how this code can be restructured.

Member

Sure, I didn't think about the performance issues with boxing in lambdas. Does Painless prohibit e.g. L1NormSparse and L2NormSparse from sharing a common abstract superclass, e.g. to share the common code in the constructor?
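For what it's worth, the superclass idea could look roughly like this. This is a sketch under the assumption that Painless permits such a hierarchy; all names are invented, and the per-dimension combination is an abstract method on primitives, so no boxing is involved:

```java
// Hypothetical sketch (not the PR's classes): shared constructor work lives
// in the base class, subclasses supply only the per-dimension combination.
abstract class SparseNormBase {
    final int[] queryDims;
    final double[] queryValues;

    SparseNormBase(int[] dims, double[] values) {
        // shared: copy and co-sort dims/values in ascending dimension order
        this.queryDims = dims.clone();
        this.queryValues = values.clone();
        sortByDims(queryDims, queryValues);
    }

    /** How a single |query - doc| difference contributes to the sum. */
    abstract double combine(double diff);

    /** Post-processing of the accumulated sum (e.g. sqrt for L2). */
    double finish(double sum) { return sum; }

    private static void sortByDims(int[] dims, double[] values) {
        // simple insertion co-sort, standing in for the real shared helper
        for (int i = 1; i < dims.length; i++) {
            for (int j = i; j > 0 && dims[j - 1] > dims[j]; j--) {
                int d = dims[j]; dims[j] = dims[j - 1]; dims[j - 1] = d;
                double v = values[j]; values[j] = values[j - 1]; values[j - 1] = v;
            }
        }
    }
}

final class L1NormSparseSketch extends SparseNormBase {
    L1NormSparseSketch(int[] dims, double[] values) { super(dims, values); }
    @Override double combine(double diff) { return Math.abs(diff); }
}

final class L2NormSparseSketch extends SparseNormBase {
    L2NormSparseSketch(int[] dims, double[] values) { super(dims, values); }
    @Override double combine(double diff) { return diff * diff; }
    @Override double finish(double sum) { return Math.sqrt(sum); }
}
```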

docIndex++;
} else {
queryIndex++;
}
Member

Same general remarks as above. Also, the implementations of the two norms look very similar, with the exception of how the vector diffs are treated (squared vs. just summed). I think it would be worth trying to share large parts of the function and maybe only use a differing lambda to do the different calculations.
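The "single loop plus differing lambda" idea can be sketched with the JDK's primitive functional specializations, which sidestep the boxing concern raised earlier in this thread. Illustrative only; the names are invented and this is not the PR's code:

```java
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch: DoubleBinaryOperator is a primitive specialization,
// so passing it as the per-dimension accumulator avoids the Float/Double
// boxing that a generic BiFunction would require.
final class NormSketch {
    // Assumes equal-length vectors, as for dense vectors.
    static double accumulate(float[] query, float[] doc, DoubleBinaryOperator perDim) {
        double sum = 0;
        for (int i = 0; i < query.length; i++) {
            sum = perDim.applyAsDouble(sum, query[i] - doc[i]);
        }
        return sum;
    }

    static double l1norm(float[] q, float[] d) {
        return accumulate(q, d, (sum, diff) -> sum + Math.abs(diff));
    }

    static double l2norm(float[] q, float[] d) {
        return Math.sqrt(accumulate(q, d, (sum, diff) -> sum + diff * diff));
    }

    public static void main(String[] args) {
        float[] q = {1f, 2f}, d = {4f, 6f};
        System.out.println(l1norm(q, d)); // prints 7.0
        System.out.println(l2norm(q, d)); // prints 5.0
    }
}
```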


// test l2norm
double result4 = l2norm(queryVector, dvs);
assertEquals("l2norm result is not equal to the expected value!", 301.36, result4, 0.1);
Member

I'd like to see an additional test for differing vector lengths, probably asserting that this throws an error if we decide to go that route.

// test l2norm
L2NormSparse l2Norm = new L2NormSparse(queryVector);
double result4 = l2Norm.l2normSparse(dvs);
assertEquals("l2normSparse result is not equal to the expected value!", 301.36, result4, 0.1);
Member

These tests don't cover the cases mentioned above where the queryVector contains dimensions not present in the document vector and vice versa.

script:
  source: "l1normSparse(params.query_vector, doc['my_sparse_vector'])"
  params:
    query_vector: {"2": 0.5, "10" : 111.3, "50": -13.0, "113": 14.8, "4545": -156.0}
Member

Same here, please add a few more tests that don't only check the behaviour when all dimensions are matching.

@mayya-sharipova
Contributor Author

Closing this PR in favour of #40473
