docs: Graceful storage controller cluster restarts RFC #7704
Conversation
2946 tests run: 2829 passed, 0 failed, 117 skipped (full report)

Code coverage* (full report)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.

8faa437 at 2024-07-01T09:46:16.764Z :recycle:
Great and thoughtful RFC.

It might also make sense to mention that draining is a no-op for tenants that lack secondaries, i.e. non-HA ones. It makes sense to keep this as a TODO in the implementation, but ideally the RFC would mention it as well.

Also, I wonder what to do if the secondary of a to-be-drained tenant is on an unresponsive pageserver. Do we want to drain that secondary? Maybe the answer is still yes, and generations will take care of it, but we should be robust to that in the implementation, so that the draining process doesn't stall indefinitely while waiting for an okay from an unresponsive pageserver.
Right, in theory we can cater to non-HA tenants as well by changing their attached location. Depending on tenant size, …

This is basically the same scenario as previously: if the node we are moving to is unresponsive, the reconciliation will fail, leaving the tenant on the original node. It's a good point, though. I think it's a good idea to make sure the node is online before moving anything to it.
Yeah, my point was mainly about that: there is a difference between retrying indefinitely and retrying but giving up at some point. We need to implement the latter because of the scenario I mentioned above.
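A minimal sketch of the drain behavior discussed in this thread: skip tenants without secondaries, and bound the retries so an unresponsive pageserver cannot stall the drain forever. The names (`Node`, `TenantShard`, `try_migrate`, `MAX_DRAIN_ATTEMPTS`) are illustrative stand-ins, not the storage controller's actual types or API.

```rust
// Hypothetical sketch; not the storage controller's real code.
struct Node {
    id: u64,
}

struct TenantShard {
    secondary: Option<Node>, // non-HA tenants have no secondary location
}

const MAX_DRAIN_ATTEMPTS: u32 = 3;

// Placeholder for the reconciliation that cuts a shard over to its
// secondary location; in practice this can fail if the destination
// pageserver is unresponsive.
fn try_migrate(_shard: &TenantShard, _dest: &Node) -> Result<(), String> {
    Ok(())
}

fn drain_shard(shard: &TenantShard) -> Result<(), String> {
    // Draining is a no-op for tenants without secondaries (non-HA).
    let Some(dest) = shard.secondary.as_ref() else {
        return Ok(());
    };

    // Bounded retries: give up after a fixed number of attempts rather
    // than retrying indefinitely against an unresponsive pageserver.
    let mut last_err = String::from("drain not attempted");
    for _ in 0..MAX_DRAIN_ATTEMPTS {
        match try_migrate(shard, dest) {
            Ok(()) => return Ok(()),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```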
…r restarts (#8014)

## Problem

Pageserver restarts cause read availability downtime for tenants. See the `Motivation` section in the [RFC](#7704).

## Summary of changes

* Introduce a new `NodeSchedulingPolicy`: `PauseForRestart`
* Implement the first take of the drain and fill algorithms
* Add a node status endpoint which can be polled to figure out when an operation is done

The implementation follows the RFC, so it might be useful to peek at it as you're reviewing. Since the PR is rather chunky, I've made sure all commits build (with warnings), so you can review commit by commit if you prefer.

RFC: #7704
Related: #7387
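To illustrate how the pieces above fit together, here is a hedged orchestration sketch of a graceful restart: ask the controller to drain a node, poll the node status endpoint until the drain completes, restart the pageserver out of band, then trigger the fill. The endpoint paths, the `PauseForRestart` status check, and the use of `reqwest`/`tokio` are assumptions for illustration, not the controller's documented API.

```rust
// Hypothetical client-side sketch; endpoint paths and response shape
// are assumed, not taken from the storage controller's API docs.
use std::time::Duration;

async fn graceful_restart(base: &str, node_id: u64) -> anyhow::Result<()> {
    let client = reqwest::Client::new();

    // Ask the storage controller to drain the node.
    client
        .put(format!("{base}/control/v1/node/{node_id}/drain"))
        .send()
        .await?
        .error_for_status()?;

    // Poll the node status endpoint until the drain is reported done.
    loop {
        let status: serde_json::Value = client
            .get(format!("{base}/control/v1/node/{node_id}"))
            .send()
            .await?
            .json()
            .await?;
        // Assumed status field: the node parks in `PauseForRestart`
        // once its drain has finished.
        if status["scheduling_policy"] == "PauseForRestart" {
            break;
        }
        tokio::time::sleep(Duration::from_secs(2)).await;
    }

    // ... restart the pageserver out of band, then trigger the fill ...
    client
        .put(format!("{base}/control/v1/node/{node_id}/fill"))
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```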
RFC for "Graceful Restarts of Storage Controller Managed Clusters".
A POC which implements nearly everything mentioned here, apart from the optimizations
and some of the failure handling: #7682
Related #7387