Make similarities dynamically updatable where possible #6727
Comments
This would be a really nice addition. For example, we now have millions of documents and would like to experiment with the scoring to deliver that last 10%, but we need to reindex the whole lot each time we want to change the similarity. This means that we will probably end up with as many clusters as there are scoring algorithms available, which costs time, money, effort and motivation. |
Maybe you should add an interface to the Similarity implementations and add
/**
 * Returns the type names that are compatible metadata-wise with this Similarity.
 */
String[] getMetadataCompatibleTypeNames();
This would allow logic to be added to determine whether the change should be allowed or not, solving the custom similarity compatibility problem. (Obviously the custom implementations should implement it too.) The downside is that you'd need a wrapper implementation for the Lucene-provided Similarities, and it would break existing custom ones. (Though a default wrapper that declares compatibility with none would allow everything to work as it does now.) I'm not familiar enough with the Lucene classes to know whether there is already a built-in way to infer that knowledge, though. Why is this not solved at the Lucene level, btw? |
Because it's not an issue with Lucene. You just call IndexWriterConfig.setSimilarity() and that is what IndexWriter will use to encode normalization factors. A Similarity impl can shove whatever it wants in there, up to 64 bits of stuff encoded in whatever form it wants. So ES does the right thing to prevent you from changing this here (in general). It is the same as changing the index analyzer for a field: it's generally just an unsafe thing to do. But the core similarities introduced in Lucene 4.0 have a special property, in that by default they all encode the index-time information (normalization factor) in a backwards-compatible way, as DefaultSimilarity historically did: as 1/sqrt(length) with a certain single-byte encoding. This was done intentionally to make experimentation and "simple" testing of these ranking algorithms easier. It should not be enforced with any interface or anything like that, because subclasses and even setter methods can easily break it. It is just a way to quickly experiment with different ranking algorithms without reindexing. I think it's nice to expose (safely) this optimization to users of ES, too, so they can play in the same way. But it does not need any additional APIs for experts or custom implementations; that is misleading and dangerous. If you are really trying to get the last 10% then I don't think this issue is really relevant: it's just not going to hold for "tuning". If you are really tuning, you will likely break this property yourself anyway: the default encoding used here is very general purpose and must support a crazy range of documents, large and small, and various values for index-time boost. If those assumptions don't hold, in many cases you can tweak normalization to be better by adjusting the encoding. |
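To make the point about the shared norm concrete, here is a minimal Java sketch of two ranking functions reading the same stored byte. Everything in it is an assumption for illustration: the class and method names are hypothetical, the linear 0..255 quantization is a toy stand-in and not Lucene's actual single-byte encoding, and both scoring formulas are simplified textbook forms rather than Lucene's implementations.

```java
/** Toy illustration only: NOT Lucene's real norm encoding or scoring code. */
final class SharedNormSketch {

    /** Index time: quantize 1/sqrt(length) into one unsigned byte (hypothetical linear scheme). */
    static byte encodeNorm(int fieldLength) {
        if (fieldLength <= 0) {
            throw new IllegalArgumentException("fieldLength must be positive");
        }
        double norm = 1.0 / Math.sqrt(fieldLength);           // value in (0, 1]
        long quantized = Math.max(1L, Math.round(norm * 255.0)); // clamp so valid docs never store 0
        return (byte) quantized;
    }

    /** Search time: turn the stored byte back into an approximate 1/sqrt(length). */
    static double decodeNorm(byte stored) {
        return (stored & 0xFF) / 255.0;
    }

    /** Simplified TF-IDF-style score: uses the decoded norm directly as length normalization. */
    static double classicScore(double tf, double idf, byte stored) {
        return Math.sqrt(tf) * idf * decodeNorm(stored);
    }

    /** Simplified BM25: recovers an approximate length from the same byte, then applies k1 and b. */
    static double bm25Score(double tf, double idf, byte stored,
                            double avgLength, double k1, double b) {
        double norm = decodeNorm(stored);            // assumes stored came from encodeNorm (never 0)
        double approxLength = 1.0 / (norm * norm);   // invert 1/sqrt(length)
        return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * approxLength / avgLength));
    }
}
```

Because both scorers consume the byte written by `encodeNorm`, switching between them, or changing `k1` and `b`, needs no reindexing in this toy model. A similarity that stored something else in that byte would break the assumption, which is the risk described above for custom implementations.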
Hi, thanks for the response! I meant that I'm trying to provide a better search experience for the end users by tuning the relevance ranking, which for me as a user of ES is the last 10%. (It does pretty well with the defaults, but there are cases where simple per-field/query boosting is not delivering. Hence the need to experiment with similarities.) I'd love to have this exposed to ES users too if possible, though I understand that I'm asking here for permission to (possibly) shoot myself in the foot :) The API thingy was just a proposal to formalize the currently unofficial contract about which similarities are interchangeable, but as I said, I don't know if it makes sense or not. (Well, now I do know that it does not.) |
I don't think we should give users the ability to shoot themselves in the foot, ever. It's easily prevented. A common use case for this issue would be to allow someone to switch from the default similarity to BM25 and then tweak Having a custom similarity (subclass) is a much more expert thing and we don't need to make things complicated for that. If you already know enough to make your own similarity class, then you already have an expert way to tune without reindexing: you can tune your parameters by changing some constant in your code and ES is none the wiser. |
👍 I agree with Rob here and I don't really see a need to do much on this issue. |
Did I understand it correctly that you wish to close this as won't fix? |
@villeapvirtanen yeah that is what I propose |
I think the simple case is nice to have for the core similarities from Lucene (see my BM25 example above). But I have no idea how tricky it is to implement this. |
The similarity parameters are set outside of the mappings (they are in a parallel section called "similarity"). But glancing at the code, I cannot see how they could be updated (or how new ones could be added after index creation). I agree this should be fixed: as with mappings, you should be able to tweak the parameters of the similarity, but not change the type, for a given name. |
It's not like mappings at all though. Changing DefaultSimilarity to BM25Similarity is OK: the on-disk encoding is the same. |
cc @elastic/es-search-aggs |
Hello - is this still on the plate? Changing |
I found my way here for the same reason, I would like to experiment with tweaking |
To ease the pain, I found that if you define a custom similarity, then you can later change the parameters (after closing the index, using the API). So it is safest to add a custom similarity with the stock parameters before indexing. Once indexed, you can easily change the parameters. |
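The workaround described above can be sketched as a short sequence of REST requests, built here with Java's standard `java.net.http` classes. The host `localhost:9200`, the index name `my_index` and the similarity name `my_bm25` are all hypothetical, and the field mapping that would reference the similarity by name is left out. Whether a given Elasticsearch version accepts the parameter update on a closed index is taken from the comment above, not verified here.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;
import java.util.List;

/** Sketch of the close / update settings / open sequence; names and host are assumptions. */
final class SimilarityUpdateSketch {

    // Hypothetical cluster address and index name.
    static final String BASE = "http://localhost:9200/my_index";

    /** Builds (but does not send) a JSON request against the hypothetical index. */
    static HttpRequest json(String method, String path, String body) {
        return HttpRequest.newBuilder(URI.create(BASE + path))
                .header("Content-Type", "application/json")
                .method(method, body.isEmpty()
                        ? BodyPublishers.noBody()
                        : BodyPublishers.ofString(body))
                .build();
    }

    /** Settings body for a custom-named BM25 similarity ("my_bm25" is a made-up name). */
    static String bm25Settings(double k1, double b) {
        return "{\"index\":{\"similarity\":{\"my_bm25\":"
                + "{\"type\":\"BM25\",\"k1\":" + k1 + ",\"b\":" + b + "}}}}";
    }

    /** Step 1, before indexing: create the index with the stock BM25 parameters. */
    static HttpRequest createIndex() {
        return json("PUT", "", "{\"settings\":" + bm25Settings(1.2, 0.75) + "}");
    }

    /** Step 2, later: close the index, change the parameters, reopen it. */
    static List<HttpRequest> retune(double k1, double b) {
        return List.of(
                json("POST", "/_close", ""),
                json("PUT", "/_settings", bm25Settings(k1, b)),
                json("POST", "/_open", ""));
    }
}
```

Each request could then be sent in order with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, checking the response status before moving to the next step.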
Excellent, thank you! Sounds like that solves my issue. I've added some examples of how to achieve this in this PR. |
We have no plans to implement this for the time being. Closing. |
The core similarities can be swapped in dynamically on an existing index, as long as discount_overlaps is the same. Currently we disallow updating similarities, because custom similarities may not be compatible. The logic for deciding whether a similarity can be changed should be more fine-grained.
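As a rough illustration of what "more fine-grained" could mean, the sketch below allows a change only when both the old and new similarity are core types and `discount_overlaps` is unchanged. The class, the method and the list of type names are assumptions made for this example; this is not Elasticsearch's actual validation code, and the real set of core types depends on the version.

```java
import java.util.Set;

/** Hypothetical compatibility check; not taken from the Elasticsearch code base. */
final class SimilarityChangeCheckSketch {

    // Assumed names of the built-in similarity types that share the default norm encoding.
    static final Set<String> CORE_TYPES =
            Set.of("default", "BM25", "DFR", "IB", "LMDirichlet", "LMJelinekMercer");

    /**
     * Returns true when swapping is considered safe in this sketch:
     * both types are core types and discount_overlaps is the same on both sides.
     * Any custom type is rejected, since its on-disk norm format is unknown.
     */
    static boolean canSwap(String oldType, boolean oldDiscountOverlaps,
                           String newType, boolean newDiscountOverlaps) {
        return CORE_TYPES.contains(oldType)
                && CORE_TYPES.contains(newType)
                && oldDiscountOverlaps == newDiscountOverlaps;
    }
}
```

For example, `canSwap("default", true, "BM25", true)` is accepted, while a change that flips `discount_overlaps` or involves a custom type name is refused.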