Dask Kubernetes Summer Roadmap
Note: Creating in rapidsai/deployment so I can use tasklists. When tasklists are GA I'll migrate this issue to the dask/dask-kubernetes repo.
At the end of the summer I want to release V2 of the Dask Kubernetes Operator and fully remove the deprecated classic implementations. This issue outlines the roadmap that we need to complete to get us to a point where we can do that.
High-level goals:
Improve stability
Ensure feature completeness compared to other implementations
Some of the sections here may want to be split off into separate issues, and some tasks may want to be broken down into smaller chunks. But this will be the high-level milestone tracker issue for this work.
Features
Cluster idle timeout
Cleaning up idle clusters automatically becomes critical for cost reduction when deploying at scale, especially when using GPUs.
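As a rough illustration, the core of an idle timeout could reduce to comparing the time since the cluster was last busy against a configured limit. The sketch below is hypothetical: `last_active` and `idle_timeout` are assumed names rather than fields defined in this issue, and how the controller learns when the scheduler was last active is left out.

```python
from datetime import datetime, timedelta, timezone


def is_idle_expired(last_active: datetime, now: datetime, idle_timeout: timedelta) -> bool:
    """Return True when the cluster has been idle for longer than idle_timeout.

    A timeout of zero (or a negative value) is treated as "never time out".
    """
    if idle_timeout <= timedelta(0):
        return False
    return now - last_active > idle_timeout


# Example: a cluster last active 45 minutes ago, checked against a 30 minute timeout.
now = datetime(2023, 6, 1, 12, 0, tzinfo=timezone.utc)
last_active = now - timedelta(minutes=45)
expired = is_idle_expired(last_active, now, timedelta(minutes=30))  # True
```

Keeping the decision in a pure function like this would make it easy to unit test separately from whatever deletes the cluster.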
Tasks
Full Istio support
Currently we have partial Istio support where the scheduler uses it but workers do not. This can be a blocker for clusters that enforce Istio on all comms.
Tasks
UX Improvements
UX can always be improved.
Tasks
Fixes
Replace Pod resources with higher abstractions like Deployment or at least ReplicaSet
Currently we manage bare Pods. There are downsides to this, such as Pods not being recreated when they are evicted from a node. It would be good to explore higher-level resources and how they could be used to simplify our controller logic.
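To make the idea concrete, here is a minimal sketch of wrapping a worker Pod spec in an `apps/v1` Deployment manifest. The function name and the `example.dask.org/worker-group` label key are made up for illustration and are not the operator's actual naming.

```python
import copy


def worker_deployment(name: str, namespace: str, pod_spec: dict, replicas: int) -> dict:
    """Wrap a worker Pod spec in an apps/v1 Deployment manifest."""
    # Hypothetical label used to tie the Deployment to its Pods.
    labels = {"example.dask.org/worker-group": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": dict(labels)},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the Pod template.
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                # Copy so later edits to the manifest do not touch the input spec.
                "spec": copy.deepcopy(pod_spec),
            },
        },
    }


pod_spec = {"containers": [{"name": "worker", "image": "ghcr.io/dask/dask:latest"}]}
manifest = worker_deployment("example-default", "default", pod_spec, replicas=3)
```

One thing to weigh when exploring this: a Deployment decides for itself which Pods to remove when scaling down, which may matter if specific workers need to be retired first.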
Tasks
Ensure patches to DaskCluster and DaskWorkerGroup are propagated to child resources
In the context of CRUD we only have create, read and delete implemented for our resources. We also need to correctly handle updating them.
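One possible building block for propagating updates is to apply the change made on the parent to each child's spec using JSON Merge Patch semantics (RFC 7386), where a `null` value removes a key. This is only a sketch of that merge step. Which fields should flow from a DaskCluster or DaskWorkerGroup to which child resources is not specified here, and the example spec below is invented.

```python
def apply_merge_patch(target: dict, patch: dict) -> dict:
    """Return a new dict with patch applied to target using JSON Merge Patch rules.

    The input dicts are not modified. Untouched nested values are shared
    with target rather than copied.
    """
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            # None (JSON null) means "remove this key".
            result.pop(key, None)
        elif isinstance(value, dict):
            # Merge into the existing object, or into an empty one if the
            # current value is missing or not an object.
            current = result.get(key)
            base = current if isinstance(current, dict) else {}
            result[key] = apply_merge_patch(base, value)
        else:
            result[key] = value
    return result


# Invented child spec and patch, for illustration only.
child_spec = {"image": "a", "env": {"A": "1", "B": "2"}}
patch = {"image": "b", "env": {"B": None, "C": "3"}}
patched = apply_merge_patch(child_spec, patch)
# patched == {"image": "b", "env": {"A": "1", "C": "3"}}
```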
Tasks
Ensure scaling/autoscaling is solid
Some users are reporting unwanted behaviour when autoscaling at scale. This needs to be solid.
Tasks
Input sanitisation
Currently we rely on the Kubernetes API to reject bad configuration, but this doesn't always happen as we expect. We should do more checking and sanitisation before calling the Kubernetes API.
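As one example of the kind of check that could run before any API call, the sketch below validates a cluster name, assuming names must be valid RFC 1123 DNS labels (at most 63 characters, lowercase alphanumerics or `-`, starting and ending with an alphanumeric). Whether that is the right rule for every resource the operator creates is an assumption, and `validate_cluster_name` is a hypothetical helper.

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics or '-', alphanumeric at both ends.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
_MAX_LABEL_LENGTH = 63


def validate_cluster_name(name: str) -> str:
    """Return name unchanged if it is a valid DNS label, else raise ValueError."""
    if len(name) > _MAX_LABEL_LENGTH:
        raise ValueError(
            f"Cluster name {name!r} is longer than {_MAX_LABEL_LENGTH} characters"
        )
    # fullmatch avoids accepting a trailing newline, which '$' would allow.
    if not _DNS_LABEL.fullmatch(name):
        raise ValueError(
            f"Cluster name {name!r} must contain only lowercase alphanumerics "
            "or '-', and must start and end with an alphanumeric character"
        )
    return name


validate_cluster_name("my-cluster")  # passes
# validate_cluster_name("My_Cluster") would raise ValueError
```

Failing early like this lets the error message name the offending value directly instead of surfacing whatever the API server returns.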
Tasks
Controller idempotency
The controller event handlers should be idempotent, so that they are safe to call multiple times. Today they are not, which can cause problems when the controller is restarted while operations are running.
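A common way to get there is to write handlers as "ensure" operations that check the current state before acting. The sketch below uses an in-memory `FakeAPI` class in place of the Kubernetes API, purely for illustration. `ensure_scheduler` and the `-scheduler` naming are hypothetical.

```python
class FakeAPI:
    """In-memory stand-in for the Kubernetes API, for illustration only."""

    def __init__(self):
        self.objects = {}
        self.create_calls = 0

    def get(self, name):
        return self.objects.get(name)

    def create(self, name, spec):
        if name in self.objects:
            raise RuntimeError(f"{name} already exists")
        self.create_calls += 1
        self.objects[name] = spec


def ensure_scheduler(api, cluster_name, spec):
    """Create the scheduler resource only if it is missing, then return its name."""
    name = f"{cluster_name}-scheduler"
    if api.get(name) is None:
        api.create(name, spec)
    return name


api = FakeAPI()
# Calling the handler twice, as could happen after a controller restart,
# leaves exactly one resource and performs exactly one create.
ensure_scheduler(api, "example", {"image": "scheduler-image"})
ensure_scheduler(api, "example", {"image": "scheduler-image"})
```

Against a real API the check and the create can still race, so a handler would likely also need to treat an "already exists" response as success.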
Tasks
Hygiene/Tech Debt
Migrate Kubernetes client library to kr8s
Today we use pykube-ng, dask_kubernetes.aiopykube, kubernetes_asyncio and subprocess/kubectl to interact with the Kubernetes API. We should consolidate everything around kr8s, which was spun out from here with the intention of unifying our API usage.
Tasks
Other
Tasks