Support Ephemeral Storage #887
Sorry for the delay @vlerenc. There are a few strategies that we can take to solve this issue, and I will describe the pros and cons of each. I've also performed tests with each of these strategies, and at the moment I'm not convinced there's a clear winner between the two remaining strategies, but I've eliminated the rest. Kubernetes provides several volume types that are relevant to running etcd pods utilizing storage on the node.
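For orientation, here is a minimal sketch of how the node-local options Kubernetes offers (emptyDir, hostPath, and local PersistentVolumes) can appear in a pod spec. This is not etcd-druid's actual manifest; the pod name, image, and paths are made up for illustration:

```sh
# Illustrative only: a throwaway pod with an emptyDir volume (data lives and
# dies with the pod). A hostPath alternative is shown commented out; a "local"
# PersistentVolume would additionally require a PersistentVolume object with
# nodeAffinity and is omitted here for brevity.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: etcd-storage-demo
spec:
  containers:
  - name: etcd
    image: gcr.io/etcd-development/etcd:v3.5.9   # image chosen only for the example
    volumeMounts:
    - name: data
      mountPath: /var/etcd/data
  volumes:
  # Ephemeral: removed together with the pod.
  - name: data
    emptyDir: {}
  # Node-local alternative that survives pod restarts on the same node:
  # - name: data
  #   hostPath:
  #     path: /var/etcd/data
  #     type: DirectoryOrCreate
EOF
```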
@renormalize thank you very much for the very detailed analysis and for penning down your thoughts so clearly. One small correction from my side regarding
During hibernation / scale-down of the etcd cluster, i.e., when

To me, the more pressing question is: how do we scale the etcd cluster back up? With backups disabled, it's fairly straightforward, since there is no restoration of snapshots involved. So the sts can simply be scaled up from 0 to 3 replicas, and the cluster should be back up, in a fresh state, with no data from the previous run. This could make sense for

If a user chooses to use ephemeral storage for their etcd clusters, then we must assume that they are OK with losing the etcd cluster data, and we should make this clear to them, because Kubernetes pods can die at any time (due to evictions, which can be blocked by PDBs, or node failures, which are out of anybody's control), and if the storage is tied to the lifecycle of the pod, then there is a possibility of losing data from all the etcd cluster members at any given point in time.

So recovering from quorum loss becomes fairly easy: all we need to do is scale the sts down to 0 replicas, and once all the pods are terminated, simply scale the sts back up to 3 replicas, and the etcd cluster is as good as new, which is what we would expect even from a hibernation scenario when using ephemeral storage here.
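As a rough sketch of that recovery, assuming a manually managed StatefulSet (the names and namespace below are hypothetical, and in practice etcd-druid drives this scaling itself):

```sh
# Hypothetical names: StatefulSet "etcd-main" in namespace "shoot--foo--bar".
# Scale in to 0 so that every member pod, and with it the pod-local ephemeral
# data, is removed.
kubectl -n shoot--foo--bar scale statefulset etcd-main --replicas=0

# Wait until all member pods are actually gone (label selector is assumed).
kubectl -n shoot--foo--bar wait pod -l instance=etcd-main --for=delete --timeout=5m

# Scale back out to 3; with ephemeral storage the members start with empty
# data directories and bootstrap a fresh cluster.
kubectl -n shoot--foo--bar scale statefulset etcd-main --replicas=3
```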
@shreyas-s-rao thanks for your feedback! When I was speaking of hibernation, I was speaking from a Shoot cluster (a managed seed) point of view; I should have clarified that. However, I don't really think what I was talking about is too relevant (apologies for it being unclear), and etcd cluster hibernation is a far more important aspect to be discussed. What I did which led me to talk about Pending pods was the following:
After this, I observed that the etcd cluster's pods were stuck in Pending. So with regards to:
yes, you're 100% correct. If a shoot cluster (managed seed) is ever to be hibernated, I don't think it's much of a stretch to think that the etcd cluster is explicitly set
Agreed. It should be made extremely clear that etcd-druid does not guarantee etcd cluster state to any extent when using ephemeral storage. This need not be the case in the future if the necessary changes are made to the scale-out logic, after which backups can be used to successfully scale out from 0 -> 1 and then 1 -> 3.
@renormalize this is a completely valid scenario, where the underlying nodes can be destroyed at any time. But did you observe that the existing pod went into Pending?
@shreyas-s-rao
The shoot cluster where etcd-druid from https://github.com/renormalize/etcd-druid/tree/storage and an etcd cluster are running is then hibernated. This shoot cluster is then woken up from hibernation.
I'm seeing different behavior now - the existing pods do not go to Pending, but have entered the Running state. The older PVs which were created before hibernating the shoot cluster still exist, and checking their events shows that these PVs could not be deleted (since the backing nodes no longer exist).
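For reference, the kind of inspection meant here can be done along these lines (the PV name is a placeholder):

```sh
# List PersistentVolumes that are left over from before the hibernation.
kubectl get pv

# Placeholder PV name; the Events section explains why deletion did not
# proceed (e.g. the backing node no longer exists).
kubectl describe pv <pv-name-from-the-list-above>
```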
Finally, the etcd cluster does not become healthy and ready even after 10 minutes. I will also update my original comment with the corresponding findings.
Sure @renormalize, thanks.
The following changes were required in etcd-backup-restore to handle the case when etcd clusters are scaled in to 0 replicas and scaled back out to 3 replicas: gardener/etcd-backup-restore@master...renormalize:etcd-backup-restore:storage

In essence, without these changes, when the etcd cluster is scaled back out from 0 to 3 replicas, there is no data directory, which causes etcd-backup-restore to enter a restoration flow. The reason etcd-backup-restore enters the restoration flow is that the member leases for the etcd pods are still present when the cluster is scaled in to 0.
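To make the triggering condition a bit more tangible, a hedged sketch of how it can be observed on a running cluster (namespace, pod/container names, and the data directory path are assumptions for illustration, not necessarily the values etcd-druid uses):

```sh
# After scaling back out from 0, a member pod backed by ephemeral storage
# starts with an empty data directory (path assumed for the example).
kubectl -n shoot--foo--bar exec etcd-main-0 -c backup-restore -- ls -A /var/etcd/data

# ...while the member leases created for the previous incarnation of the
# cluster are still present, which is what steers etcd-backup-restore into
# its restoration flow instead of bootstrapping fresh members.
kubectl -n shoot--foo--bar get leases
```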
After a discussion with @vlerenc and @gardener/etcd-druid-maintainers, it was decided that support for ephemeral storage will be provided with

However, there are still a few areas of concern which need to be addressed:
What would you like to be added:
Please support the operation of ETCD with ephemeral persistent volumes (sounds like a contradiction), e.g. hostPath or, better/safer yet, local, so that network-attached persistent volumes can be avoided, as they are often a scarce machine resource (e.g. AWS can attach only 26 or 32 volumes for most machine types; Alicloud and Azure even fewer).
Why is this needed:
We observe that machines can rarely be fully utilised because of the high ratio of pods-with-volumes to pods-without-volumes in a Gardener-managed shoot cluster control plane. If the ETCD for events could be configured to avoid network-attached persistent volumes, we could improve the machine utilisation considerably (at the expense of only limited additional network costs to "catch up" when a pod is moved to another node).
Considerations: