Two providers reported that their akash-provider closed ALL leases while the control plane node was offline.
7 Aug 2024 - DCNorse provider - control plane node crashed (was down for some time).
9 Aug 2024 - Europlots provider - control plane node was unavailable for 15 minutes during provider migration.
Both have only 1 etcd and 1 control plane in total (etcd is running on the control plane node).
My guess is that one of the following is happening:
A. akash-provider can't reach K8s to query the deployments in Running state, which triggers the monitorMaxRetries counter and closes the leases after 8 minutes.
B. akash-provider can't reach etcd (since it's also running on the control plane node) to query the manifests CRD and decides to remove them.
When akash-provider does not have enough information (from the control plane / etcd) about the deployments (either Pods or Deployment / StatefulSet resources, and their state), it should not make critical decisions such as terminating all the leases. Ideally, it should first make sure that K8s/etcd is healthy.
I guess the biggest issue would be akash-provider receiving wrong information from the control plane / etcd for whatever reason, for example that K8s has no Akash Deployments running at all. But I believe this wasn't the case here.
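To illustrate the behaviour suggested above, here is a minimal sketch in Go. It is not the actual akash-provider code: `Monitor`, `Observe`, `ActionKeep`, `ActionClose` and `errUnreachable` are hypothetical names, and the only assumption carried over from this issue is that a retry counter (like monitorMaxRetries) decides when a lease gets closed. The idea is that a failed query to K8s/etcd counts as "no information" and never increments the counter, so only confirmed unhealthy observations can lead to closing a lease.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a hypothetical error standing in for any failure
// to query the control plane / etcd (timeout, connection refused, ...).
var errUnreachable = errors.New("control plane unreachable")

// Action is the decision the monitor takes for a lease.
type Action int

const (
	ActionKeep  Action = iota // leave the lease untouched
	ActionClose               // close the lease
)

// Monitor counts consecutive confirmed-unhealthy observations of one lease.
// Hypothetical type, for illustration only.
type Monitor struct {
	maxRetries int
	failures   int
}

// NewMonitor returns a monitor that closes a lease after maxRetries
// consecutive confirmed-unhealthy observations.
func NewMonitor(maxRetries int) *Monitor {
	return &Monitor{maxRetries: maxRetries}
}

// Observe records one health check result.
// If queryErr is non-nil, the state of the deployment is unknown, so the
// failure counter is left as-is and the lease is kept.
func (m *Monitor) Observe(healthy bool, queryErr error) Action {
	if queryErr != nil {
		// No trustworthy information: do not make a critical decision.
		return ActionKeep
	}
	if healthy {
		m.failures = 0
		return ActionKeep
	}
	m.failures++
	if m.failures >= m.maxRetries {
		return ActionClose
	}
	return ActionKeep
}

func main() {
	m := NewMonitor(3)

	// Simulate a control plane outage: every query fails.
	closed := false
	for i := 0; i < 20; i++ {
		if m.Observe(false, errUnreachable) == ActionClose {
			closed = true
		}
	}
	fmt.Println("lease closed during outage:", closed)
}
```

Note that this sketch only covers the case where the control plane cannot be reached. It does not protect against the control plane answering with wrong information, which would need a separate sanity check.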