Deleting an index during concurrent taking of more than one snapshot causes future restore to fail #1779
Comments
This is specific to Amazon OpenSearch Service, and it is documented here: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/supported-operations.html#version_7_10 |
Is there something we should/can do in OpenSearch itself? If not, close this? |
@dblock bare-bones OpenSearch works just fine:
As per @Bukhtawar, it is specific to the AWS OpenSearch offering: only limited settings are supported. |
I understand that the mitigation setting not working is a problem of Amazon OpenSearch Service. However, my question is "Does OpenSearch have the known issue that Elasticsearch 7.10.2 has?" To clarify, I quote the known issue from the Elasticsearch 7.10.2 release notes:
...
|
@aYukiSekiguchi OpenSearch is a fork of Elasticsearch as of 7.10.2, so it is very likely that the issue is still present. |
@aYukiSekiguchi @Bukhtawar let's reopen and re-describe the original problem? It sounds like the root issue is that “If an index is deleted while the cluster is concurrently taking more than one snapshot then there is a risk that one of the snapshots may never complete and also that some shard data may be lost from the repository, causing future restore operations to fail.” We should fix this. Please note that we cannot take non-AL2 code from ES. |
I updated the description and removed the part about the mitigation setting because it confused some people. Note for future readers: |
Updating my thoughts on the root cause: OpenSearch/server/src/main/java/org/opensearch/snapshots/SnapshotsService.java Lines 2897 to 2905 in 658f7a6
If there is a snapshot creation in the queue during the snapshot deletion, which means there are unfinished shard snapshots in
When an index is deleted before the check,
The method withStartedShards assumes that the ShardSnapshotStatus is completed, see: OpenSearch/server/src/main/java/org/opensearch/cluster/SnapshotsInProgress.java Lines 552 to 556 in 6f6e84e
This leads to the completed snapshots not being finalized. This is the code for finalizing the completed snapshots: OpenSearch/server/src/main/java/org/opensearch/snapshots/SnapshotsService.java Lines 2972 to 2974 in 6f6e84e
The code defect may cause a state consistency issue when restoring the snapshot: the snapshot status is incomplete, while the
|
Describe the bug
Elasticsearch 7.10.2, from which OpenSearch was forked, has a known issue with snapshot and restore:
https://www.elastic.co/guide/en/elasticsearch/reference/7.10/release-notes-7.10.2.html#known-issues-7.10.2
To Reproduce
Steps to reproduce the behavior:
I haven't reproduced the behavior, but I guess...
Delete an index while the cluster is concurrently taking more than one snapshot.
There is a risk that one of the snapshots may never complete and also that some shard data may be lost from the repository, causing future restore operations to fail.
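The steps above could be scripted roughly as follows. This is only a minimal sketch and has not been confirmed to trigger the bug: it assumes a cluster reachable over plain HTTP and an already registered snapshot repository. The helper names (`build_repro_requests`, `send`, `run_repro`) and any repository, index, or snapshot names passed in are hypothetical. It relies only on the standard create-snapshot and delete-index REST endpoints. Whether the snapshots are actually still in progress when the delete arrives depends on timing and data size.

```python
import json
import urllib.request


def build_repro_requests(repo, index, snapshots):
    """Return the ordered (method, path, body) requests for the scenario.

    Each snapshot is started with wait_for_completion=false so the call
    returns immediately and the snapshots can overlap. The index delete is
    issued last, while the snapshots may still be running.
    """
    requests = []
    for name in snapshots:
        path = f"/_snapshot/{repo}/{name}?wait_for_completion=false"
        # Every snapshot includes the index that will be deleted.
        requests.append(("PUT", path, {"indices": index}))
    requests.append(("DELETE", f"/{index}", None))
    return requests


def send(base_url, method, path, body=None):
    """Send one request to the cluster and return (status, response text)."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.read().decode()


def run_repro(base_url, repo, index, snapshots):
    """Issue the requests back to back (assumes a test cluster at base_url)."""
    return [
        send(base_url, method, path, body)
        for method, path, body in build_repro_requests(repo, index, snapshots)
    ]
```

For example, `run_repro("http://localhost:9200", "my_repo", "my_index", ["snap_1", "snap_2"])` would start two snapshots and then delete `my_index`; afterwards the state of each snapshot can be inspected with `GET /_snapshot/my_repo/_all`.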
Expected behavior
Concurrent snapshot and restore work.
Plugins
I haven't reproduced the behavior, but I guess no plugin is needed.
Screenshots
None
Host/Environment (please complete the following information):
I haven't reproduced the behavior, but I guess all environments are affected.
Additional context
I checked the patch in Elasticsearch and the same file in OpenSearch main branch. It looks like OpenSearch has the same issue.