[Performance] Possible 2.x search regression related to mandatory soft deletes #7621
Comments
Hey @jainankitk, thanks for reporting this. I would like to know more about how OpenSearch would handle deletes if soft deletes were disabled. My understanding is that soft deletes are required because of the immutable nature of Lucene segment files. @nknize any thoughts?
Yes, it was a conscious decision to reduce storage cost and improve the performance of peer recoveries, and I have yet to see a detailed reproducible scenario with the open source OpenSearch distribution that substantiates the need. So I don't suggest we do this without strong justification. Even then, I'd look at other mechanisms before unilaterally rolling back mandatory soft deletes.
@jainankitk can you post a reproducible benchmark including segment sizes, geometries, merge policy, etc. for further investigation? Full details are needed to determine this isn't a red herring request. Out of curiosity what is the behavior when you lower |
Reporting a slightly different side effect (regression) of soft deletes, this time on indexing latency. I observed that p100 indexing latency goes as high as 10+ seconds with soft deletes enabled, whereas it was approximately 300 ms with soft deletes disabled. A thread dump revealed that the write thread is blocked on the IndexWriter object, which is held by the refresh thread. The thread dumps are from OpenSearch 1.3, but I validated that the same issue exists in the latest OpenSearch version (2.11).
Further inspection revealed that the refresh thread is writing doc values for the soft delete field.
Digging further into why the doc value writes take so long showed that the min, max and gcd computation for the soft delete field consumes the time. Although the soft delete field value is hard-coded to 1 in https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/common/lucene/Lucene.java#L980, the computation still takes time because it iterates over millions of documents.
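To make the cost concrete, here is a rough Python model of that kind of pass. It is a simplified sketch, not Lucene's actual code: the function name `numeric_field_stats` is made up for illustration, and the only point is that the scan touches every value even when all of them are the constant 1.

```python
from math import gcd


def numeric_field_stats(values):
    """One pass over every value: min, max, and gcd of offsets from the first value.

    Hypothetical helper modelling a min/max/gcd scan; not Lucene's implementation.
    """
    it = iter(values)
    first = next(it)
    lo = hi = first
    g = 0
    visited = 1
    for v in it:
        lo = min(lo, v)
        hi = max(hi, v)
        g = gcd(g, abs(v - first))  # stays 0 when every value is identical
        visited += 1
    return lo, hi, g, visited


# Every soft-deleted document carries the same value (1), yet the amount of
# work is still linear in the number of documents visited.
small = numeric_field_stats([1] * 1_000)
large = numeric_field_stats([1] * 1_000_000)
```

The result is trivial in both cases (min and max are both 1), but the second call visits a thousand times more values to arrive at it.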
Because the soft delete field is a numeric doc values field (https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/common/lucene/Lucene.java#L980), each doc values write for the soft delete field computes the min, max and gcd over all soft-deleted documents in the segment (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesConsumer.java#L203-L230). As a result, the time taken to write a new soft-deleted document grows proportionally with the number of soft-deleted documents in the segment.
Mitigation Questions
@nknize - Thoughts?
Describe the bug
Since 2.x, soft deletes have been mandatory, with no option to disable them (#1903). Enabling soft deletes can leave segments with a large number of deleted documents, which impacts search performance.
The workaround is to lower the flush threshold from its 512MB default to a smaller value such as 1MB, so that deleted documents become eligible to be expunged. The soft deletes retention policy retains all deleted documents above a sequence number derived from the minimum of the global checkpoint (which depends on all follower shard copies) and the local checkpoint of the shard's safe (durable) commit. The latter is not updated when flushes are infrequent, so the policy keeps retaining the documents.
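The retention rule described above can be modelled in a few lines of Python. This is a deliberately simplified sketch of the behavior as stated in this issue, not the actual OpenSearch policy code: the function names are hypothetical, and anything else the real policy may take into account is ignored.

```python
def retention_bound(global_checkpoint, safe_commit_local_checkpoint):
    """Sequence number above which soft-deleted documents are retained.

    Simplified model: the bound is the minimum of the two checkpoints.
    """
    return min(global_checkpoint, safe_commit_local_checkpoint)


def retained_soft_deletes(deleted_seq_nos, global_checkpoint, safe_commit_local_checkpoint):
    """Return the soft-deleted sequence numbers that cannot be expunged yet."""
    bound = retention_bound(global_checkpoint, safe_commit_local_checkpoint)
    return [seq_no for seq_no in deleted_seq_nos if seq_no > bound]


deleted = [5, 50, 500]

# No recent flush: the safe commit's local checkpoint is stale (10), so it
# pins the bound even though the global checkpoint has advanced to 1000.
before_flush = retained_soft_deletes(deleted, global_checkpoint=1000,
                                     safe_commit_local_checkpoint=10)

# After a flush the safe commit catches up and the bound moves forward.
after_flush = retained_soft_deletes(deleted, global_checkpoint=1000,
                                    safe_commit_local_checkpoint=1000)
```

In this model `before_flush` still holds two of the three deletes while `after_flush` is empty, which is why more frequent flushes let the deleted documents be expunged.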
To Reproduce
Steps to reproduce the behavior:
The behavior can be reproduced with an update-heavy workload in which the documents are small. OpenSearch flushes are not triggered, because the default flush threshold requires the translog to reach 512MB before the local checkpoint of the shard's safe commit is updated.
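A back-of-the-envelope calculation shows why small updates delay the flush for so long. The sketch below assumes a hypothetical 200 bytes of translog per update, treats 512MB as 512 MiB, and assumes a flush is triggered only by translog size. None of these numbers come from a measurement. The settings body at the end uses the `index.translog.flush_threshold_size` index setting, and the 1mb value is the workaround mentioned above.

```python
MIB = 1024 * 1024


def ops_before_flush(threshold_bytes, bytes_per_op):
    """Number of whole operations that fit in the translog before the threshold is hit."""
    return threshold_bytes // bytes_per_op


BYTES_PER_OP = 200  # hypothetical translog footprint of one small update

default_ops = ops_before_flush(512 * MIB, BYTES_PER_OP)  # default threshold
lowered_ops = ops_before_flush(1 * MIB, BYTES_PER_OP)    # workaround threshold

# Request body for an index settings update that lowers the threshold.
settings_body = {"index": {"translog": {"flush_threshold_size": "1mb"}}}
```

Under these assumptions the default threshold lets millions of updates accumulate between flushes, while the lowered threshold flushes after a few thousand.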
Expected behavior
Since this can impact search performance, and changing the flush threshold can have other consequences, we should provide an option to disable soft deletes, as in OpenSearch 1.x.