Make similarities dynamically updatable where possible #6727

Closed
clintongormley opened this issue Jul 4, 2014 · 17 comments
Labels: >enhancement, help wanted, adoptme, high hanging fruit, :Search Foundations/Mapping, Team:Search Foundations

Comments

@clintongormley
Contributor

The core similarities can be swapped in dynamically on an existing index, as long as discount_overlaps is the same. Currently we disallow updating similarities, because custom similarities may not be compatible.

The logic for deciding whether a similarity can be changed should be more fine-grained.

@ghost

ghost commented Jun 17, 2015

This would be a really nice addition. For example, we now have millions of documents and would like to experiment with the scoring to deliver that last 10%, but we need to reindex the whole lot each time we want to change the similarity. This means we will probably end up with as many clusters as there are scoring algorithms available, which costs time, money, effort and motivation.

@ghost

ghost commented Jun 17, 2015

Maybe you should add an interface to the Similarity implementations and add

/**
 * Returns the type names whose metadata is compatible with this Similarity.
 */
String[] getMetadataCompatibleTypeNames();

This would allow logic to be added to determine whether the change should be allowed, solving the custom similarity compatibility problem (custom implementations would obviously have to implement it too). The downside is that you'd need a wrapper implementation for the Lucene-provided Similarities, and it would break existing custom ones (though a default wrapper that declares compatibility with none would allow everything to work as it does now).
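The proposal can be sketched roughly as follows. This is only an illustration of the idea in this comment: `MetadataCompatible`, `IncompatibleByDefault` and `CompatibilityCheck` are hypothetical names and do not exist in Lucene or Elasticsearch.

```java
import java.util.Arrays;

/** Hypothetical interface from the proposal; not part of Lucene or Elasticsearch. */
interface MetadataCompatible {
    /** Returns the type names whose metadata is compatible with this Similarity. */
    String[] getMetadataCompatibleTypeNames();
}

/** Default wrapper for similarities that declare nothing: compatible with none, so behavior stays as it is now. */
final class IncompatibleByDefault implements MetadataCompatible {
    @Override
    public String[] getMetadataCompatibleTypeNames() {
        return new String[0];
    }
}

final class CompatibilityCheck {
    /** Allows a change only if the current similarity lists the requested type name. */
    static boolean changeAllowed(MetadataCompatible current, String newTypeName) {
        return Arrays.asList(current.getMetadataCompatibleTypeNames()).contains(newTypeName);
    }
}
```

With the default wrapper, every change is rejected, which matches the current behavior of disallowing updates.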

I'm not familiar enough with the Lucene classes to know whether there is already a built-in way to infer that knowledge, though. Why isn't this solved at the Lucene level, by the way?

@rmuir
Contributor

rmuir commented Jun 17, 2015

Because it's not an issue with Lucene. You just call IndexWriterConfig.setSimilarity() and that is what IndexWriter will use to encode normalization factors.

Similarity impls can shove whatever they want in there, up to 64 bits of stuff encoded in whatever form they want. So ES does the right thing to prevent you from changing this here (in general). It is the same as changing the index analyzer for a field: it's generally just an unsafe thing to do.

But the core similarities introduced in Lucene 4.0 have a special property, in that by default they all encode the index-time information (normalization factor) in a backwards-compatible way, just as DefaultSimilarity historically did: as 1/sqrt(length) with a certain single-byte encoding.

This was done intentionally to make experimentation and "simple" testing of these ranking algorithms easier. It should not be enforced with any interface or anything like that, because subclasses and even setter methods can easily break it. It is just a way to quickly experiment with different ranking algorithms without reindexing.

I think it's nice to expose (safely) this optimization to users of ES, too, so they can play in the same way. But it does not need any additional APIs for experts or custom implementations; that is misleading and dangerous.

If you are really trying to get the last 10% then I don't think this issue is really relevant: it's just not going to hold for "tuning". If you are really tuning, you will likely break this property yourself anyway: the default encoding used here is very general purpose and must support a crazy range for documents large and small and various values for index-time boost. If those assumptions don't hold, in many cases you can tweak normalization to be better by adjusting the encoding.

@ghost

ghost commented Jun 17, 2015

Hi, thanks for the response!

I meant that I'm trying to provide a better search experience for end users by tuning the relevance ranking, which for me, as a user of ES, is the last 10%. (It does pretty well with the defaults, but there are cases where simple per-field/query boosting is not delivering. Hence the need to experiment with similarities.)

I'd love to have this exposed to ES users too if possible, though I understand that I'm asking here for permission to (possibly) shoot myself in the foot :)

The API thingy was just a proposal to formalize the currently unofficial contract about which similarities are interchangeable, but as I said, I don't know if it makes sense or not. (Well, now I do know that it does not.)

@rmuir
Contributor

rmuir commented Jun 17, 2015

I don't think we should give users the ability to shoot themselves in the foot, ever. It's easily prevented.

A common use case for this issue would be to allow someone to switch from the default similarity to BM25 and then tweak k1 and b parameter values all without reindexing. This is totally safe, and expert enough!

Having a custom similarity (subclass) is a much more expert thing and we don't need to make things complicated for that. If you already know enough to make your own similarity class, then you already have an expert way to tune without reindexing: you can tune your parameters by changing some constant in your code and ES is none the wiser.
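To illustrate why tweaking k1 and b is safe, here is a minimal sketch of the textbook BM25 term score. It is not Lucene's BM25Similarity implementation, and the class and method names are hypothetical. The point is that the document length comes from the stored norm, while k1 and b only enter the arithmetic at query time.

```java
final class Bm25Sketch {
    /**
     * Textbook BM25 score of one term in one document.
     * docLength stands in for the length recovered from the stored norm;
     * k1 and b are supplied at query time and are never written to the index.
     */
    static double score(double idf, double tf, double docLength, double avgDocLength,
                        double k1, double b) {
        double lengthNorm = 1.0 - b + b * (docLength / avgDocLength);
        return idf * (tf * (k1 + 1.0)) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        // Same stored statistics, two parameter settings: only the query-time math differs.
        double idf = 2.0, tf = 3.0, docLength = 120.0, avgDocLength = 100.0;
        System.out.println(score(idf, tf, docLength, avgDocLength, 1.2, 0.75));
        System.out.println(score(idf, tf, docLength, avgDocLength, 2.0, 0.3));
    }
}
```

Setting b to 0 removes the length normalization entirely, and setting k1 to 0 reduces the score to the idf alone; neither change needs different data on disk.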

@s1monw
Contributor

s1monw commented Jun 17, 2015

> I don't think we should give users the ability to shoot themselves in the foot, ever. It's easily prevented.

👍

I agree with Rob here, and I don't really see a need to do much on this issue.

@ghost

ghost commented Jun 17, 2015

Did I understand it correctly that you wish to close this as won't fix?

@s1monw
Contributor

s1monw commented Jun 17, 2015

@villeapvirtanen yeah that is what I propose

@rmuir
Contributor

rmuir commented Jun 17, 2015

I think the simple case is nice to have for the core similarities from Lucene (see my BM25 example above). But I have no idea how tricky it is to implement this.

@rjernst
Member

rjernst commented Jun 17, 2015

The similarity parameters are set outside of the mappings (they are in a parallel section called "similarity"). But glancing at the code, I cannot see how they could be updated (or how new ones could even be added after index creation). I agree this should be fixed: like with mappings, you should be able to tweak the parameters of the similarity, but not change the type, for a given name.

@rmuir
Contributor

rmuir commented Jun 17, 2015

> like with mappings, you should be able to tweak the parameters of the similarity, but not change the type, for a given name.

It's not like mappings at all though.

Changing DefaultSimilarity to BM25Similarity is ok: the on-disk encoding is the same.
Changing BM25Similarity k1/b parameters is ok: the on-disk encoding is the same.
Changing BM25Similarity.discountOverlaps is not ok, you need to reindex.
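The three rules above could be captured in a check along these lines. This is a sketch only: `SimilaritySwapCheck` and `SimilarityConfig` are hypothetical, not Elasticsearch or Lucene classes, and the set of type names assumed to share the default norm encoding is limited to the two named in this comment.

```java
import java.util.Set;

/** Hypothetical helper, not an Elasticsearch or Lucene API. */
final class SimilaritySwapCheck {

    /** Minimal description of a configured similarity; k1 and b are omitted because they do not affect the on-disk encoding. */
    static final class SimilarityConfig {
        final String type;
        final boolean discountOverlaps;

        SimilarityConfig(String type, boolean discountOverlaps) {
            this.type = type;
            this.discountOverlaps = discountOverlaps;
        }
    }

    // Assumption: these names stand in for the core similarities that share the default norm encoding.
    private static final Set<String> SHARED_NORM_ENCODING = Set.of("default", "BM25");

    /** Returns true when replacing oldSim with newSim leaves the on-disk encoding valid. */
    static boolean canSwapWithoutReindex(SimilarityConfig oldSim, SimilarityConfig newSim) {
        boolean bothCore = SHARED_NORM_ENCODING.contains(oldSim.type)
                && SHARED_NORM_ENCODING.contains(newSim.type);
        return bothCore && oldSim.discountOverlaps == newSim.discountOverlaps;
    }
}
```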

@jpountz
Contributor

jpountz commented Mar 13, 2018

cc @elastic/es-search-aggs

jimczi removed their assignment Aug 21, 2018

@robinp

robinp commented Sep 17, 2019

Hello - is this still on the plate? Changing b and k1 on the fly for BM25 would be really nice. It seems a waste to reindex if no actual on-disk data would change.

@missinglink
Contributor

I found my way here for the same reason: I would like to experiment with tweaking k1 for BM25, but it currently requires a full reindex.

@robinp

robinp commented Jan 10, 2020

To ease the pain, I found that if you define a custom similarity, then you can later change the parameters (after closing the index, using the API). So it is safest to add a custom similarity with the stock parameters before indexing.

Once indexed, you can easily change the parameters.
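As a rough sketch of that workaround, the close / update settings / reopen sequence could be built with the JDK's `java.net.http` client as below. The cluster address, the index name `my_index` and the similarity name `my_bm25` are assumptions for illustration, and the sketch assumes the index was created with `my_bm25` already defined as a BM25 similarity and used by the fields in question.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;
import java.util.List;

final class SimilarityUpdateRequests {

    // Hypothetical values: adjust to your own cluster, index and similarity name.
    static final String BASE = "http://localhost:9200";
    static final String INDEX = "my_index";
    static final String SIMILARITY = "my_bm25";

    /** JSON body for the update-settings call that carries the new k1 and b values. */
    static String settingsBody(double k1, double b) {
        return "{\"index\":{\"similarity\":{\"" + SIMILARITY + "\":"
                + "{\"type\":\"BM25\",\"k1\":" + k1 + ",\"b\":" + b + "}}}}";
    }

    /** Builds one request; a null body means the call carries no payload. */
    static HttpRequest request(String method, String path, String jsonBody) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(BASE + path));
        if (jsonBody == null) {
            return builder.method(method, BodyPublishers.noBody()).build();
        }
        return builder.header("Content-Type", "application/json")
                .method(method, BodyPublishers.ofString(jsonBody))
                .build();
    }

    /** Close the index, update the similarity parameters, then reopen it. */
    static List<HttpRequest> updateSequence(double k1, double b) {
        return List.of(
                request("POST", "/" + INDEX + "/_close", null),
                request("PUT", "/" + INDEX + "/_settings", settingsBody(k1, b)),
                request("POST", "/" + INDEX + "/_open", null));
    }
}
```

Each request can then be sent in order with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The index is unavailable for search while it is closed.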

@missinglink
Contributor

missinglink commented Jan 10, 2020

Excellent, thank you! Sounds like that solves my issue.

I've added some examples of how to achieve this in this PR.

rjernst added the Team:Search label May 4, 2020

@javanna
Member

javanna commented May 31, 2024

We have no plans to implement this for the time being. Closing.

javanna closed this as not planned May 31, 2024
javanna added the Team:Search Foundations label and removed the Team:Search label Jul 16, 2024
10 participants