Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot _status API to return correct status for partial snapshots #12812

Merged
merged 17 commits into from
Apr 17, 2024

Conversation

aggarwalShivani
Copy link
Contributor

@aggarwalShivani aggarwalShivani commented Mar 21, 2024

Description

Issue: The /_snapshot/repo/snapshot/_status API endpoint returns snapshot.state=SUCCESS for snapshots that have failed shards and are actually in PARTIAL status.
Change - To fix this behaviour, a new PARTIAL state is added in SnapshotsInProgress.State enum.
Also, currently, for both successful and partial snapshots, success state was being returned as the response. This is now changed to return PARTIAL for snapshots in partial status.
Test case is also added, to test for partial snapshots.

Related Issues

Resolves #7952

Check List

  • New functionality includes testing.
    • All tests pass
  • [ ] New functionality has been documented.
    • [ ] New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • [ ] Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: aggarwalShivani <shivani.aggarwal@nokia.com>
Copy link
Contributor

❌ Gradle check result for d1e6656: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@aggarwalShivani
Copy link
Contributor Author

@aggarwalShivani there is one issue that I am concerned about, the snapshot state is stored in cluster state:

                final SnapshotsInProgress snapshots = currentState.custom(SnapshotsInProgress.TYPE);

I believe the cluster state is read by passing local node version, if we backport to 2.x, we may need to check that cluster state for snapshots will be readable in 3.x and 2.x irrespectively which node created it (was cluster manager).

@reta okay thanks, I'll have to dig deeper to understand that :) But do you think we should not backport it to 2.x?

@reta
Copy link
Collaborator

reta commented Apr 13, 2024

@aggarwalShivani there is one issue that I am concerned about, the snapshot state is stored in cluster state:

                final SnapshotsInProgress snapshots = currentState.custom(SnapshotsInProgress.TYPE);

I believe the cluster state is read by passing local node version, if we backport to 2.x, we may need to check that cluster state for snapshots will be readable in 3.x and 2.x irrespectively which node created it (was cluster manager).

@reta okay thanks, I'll have to dig deeper to understand that :) But do you think we should not backport it to 2.x?

@aggarwalShivani the answer depends on if cluster in mixed cluster is correctly handled

@aggarwalShivani
Copy link
Contributor Author

@aggarwalShivani there is one issue that I am concerned about, the snapshot state is stored in cluster state:

                final SnapshotsInProgress snapshots = currentState.custom(SnapshotsInProgress.TYPE);

I believe the cluster state is read by passing local node version, if we backport to 2.x, we may need to check that cluster state for snapshots will be readable in 3.x and 2.x irrespectively which node created it (was cluster manager).

@reta okay thanks, I'll have to dig deeper to understand that :) But do you think we should not backport it to 2.x?

@aggarwalShivani the answer depends on if cluster in mixed cluster is correctly handled

Hi @reta ,
I have managed to run some tests on a mixed-cluster - where 2 nodes were on v3.0.0-SNAPSHOT and 2 were on v2.14.0-SNAPSHOT.
I reproduced the PARTIAL snapshot scenario and as expected, the status API returns SUCCESS in this case. I have issued the snapshot _status REST API to each of the 4 nodes' IPs individually and they all return SUCCESS and not PARTIAL.

I have repeated this test on 2.x branch too - where 2 nodes were on v2.14.0-SNAPSHOT and 2 were on v2.13.0 - there too the results were same.
So I think it is okay to backport it to 2.x too. Please let me know if this is fine.


Additionally, since we are backporting it to 2.14, I think we can change the version condition right away as suggested by @owaiskazi19,

out.getVersion().before(Version.V_2_14_0)).

The same PR could be backported to 2.x branch as-is. Please let me know if this understanding is right.

Copy link
Contributor

✅ Gradle check result for 054e53e: SUCCESS

@owaiskazi19
Copy link
Member

Additionally, since we are backporting it to 2.14, I think we can change the version condition right away as suggested by @owaiskazi19,

As @reta mentioned since we are backporting this to 2.x, we can change the version in the backport PR. For main we can keep it as 3.0. @reta looks good to merge?

@reta
Copy link
Collaborator

reta commented Apr 15, 2024

As @reta mentioned since we are backporting this to 2.x, we can change the version in the backport PR. For main we can keep it as 3.0. @reta looks good to merge?

Yeah, I approved, thanks @owaiskazi19 !

Signed-off-by: aggarwalShivani <shivani.aggarwal@nokia.com>
@aggarwalShivani
Copy link
Contributor Author

Additionally, since we are backporting it to 2.14, I think we can change the version condition right away as suggested by @owaiskazi19,

As @reta mentioned since we are backporting this to 2.x, we can change the version in the backport PR. For main we can keep it as 3.0. @reta looks good to merge?

Oh okay sure. Thanks :)

Although my understanding was slightly different. Please correct me where i went wrong.
If there is a mixed cluster of version 3.0.0 and 2.14.0:

  • both the versions would support PARTIAL status change.
  • In 3.0.0 src code, condition would always check for version>=3.0.0
  • So for anything older than 3.0.0, it would set SUCCESS (instead of PARTIAL) due to bwc. Even though 2.14.0 supports it, it wouldn't be able to set PARTIAL.
    So I assumed in the main branch (3.0.0) itself, we could check for >= 2.14.0.

Copy link
Contributor

✅ Gradle check result for c0b1ae1: SUCCESS

@owaiskazi19
Copy link
Member

@peternied any idea about the failure in detect breaking change GHA?

@peternied
Copy link
Member

The failure is from a change in how snapshots are resolved with the Lucene version bump [1], another merge from main will resolve this. I was able to validate the failure on my local machine and confirm it was resolved with a merge from main.

@aggarwalShivani Could you fetch and merge from main to be sure that 39ac2df is included?

@reta
Copy link
Collaborator

reta commented Apr 16, 2024

So I assumed in the main branch (3.0.0) itself, we could check for >= 2.14.0.

We could do that right after backporting your code to 2.x, otherwise mixed cluster tests will fail since 2.14.0 (2.x) does not know what to do with PARTIAL

Copy link
Contributor

✅ Gradle check result for 9066b85: SUCCESS

@peternied peternied added the backport 2.x Backport to 2.x branch label Apr 16, 2024
@owaiskazi19 owaiskazi19 removed the backport 2.x Backport to 2.x branch label Apr 17, 2024
Copy link
Member

@owaiskazi19 owaiskazi19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @aggarwalShivani thanks for being patient with the review. I have removed the auto backport 2.x label. Can you raise a manual backport PR to 2.x branch with Version.V_2_14_0 change as discussed here?

@owaiskazi19 owaiskazi19 added backport 2.x Backport to 2.x branch and removed backport 2.x Backport to 2.x branch labels Apr 17, 2024
@aggarwalShivani
Copy link
Contributor Author

LGTM! @aggarwalShivani thanks for being patient with the review. I have removed the auto backport 2.x label. Can you raise a manual backport PR to 2.x branch with Version.V_2_14_0 change as discussed here?

Thanks @owaiskazi19 for your review and suggestions. Sure, I will raise another PR on 2.x branch with these changes.

However, I see the changelog-verifier check is now failing with this error -

Run # The check was possibly skipped leading to success for both the jobs
error: Please make sure that the PR has a backport label associated with it when making an entry to the CHANGELOG.md file

I'm assuming this is due to removing the auto backport 2.x label. How do we proceed with the merge then?

@owaiskazi19
Copy link
Member

owaiskazi19 commented Apr 17, 2024

I'm assuming this is due to removing the auto backport 2.x label. How do we proceed with the merge then?

Yes, that's correct. I'll go ahead and merge this. You can raise a manual 2.x PR after that

@owaiskazi19 owaiskazi19 merged commit b899a27 into opensearch-project:main Apr 17, 2024
43 of 47 checks passed
reta pushed a commit that referenced this pull request Apr 17, 2024
…ial snapshots (#12812) (#13260)

* Snapshot _status API to return correct status for partial snapshots

Signed-off-by: aggarwalShivani <shivani.aggarwal@nokia.com>

* Updated CHANGELOG.md

Signed-off-by: aggarwalShivani <shivani.aggarwal@nokia.com>

---------

Signed-off-by: aggarwalShivani <shivani.aggarwal@nokia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working distributed framework good first issue Good for newcomers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Request _snapshot/repo/snapshot/_status returns wrong status
8 participants