Replies: 1 comment
Dynamic sticky states seem great, and seem to solve a lot of the challenges that current orchestrators have with modern ML workloads! Even the static variant alone is quite useful; it enables a bunch of use cases, e.g.:
Conceptually, I think the design seems similar to k8s's node affinity feature using the […]. I do quite like the […]. Unless I'm misunderstanding the current proposal, since everything is done via string matching on the resource, a tag list design should be equivalent in selection power. I think where the […]
k8s's design allows for a node affinity weight. If the design were modified to the tag list, the API call could then be […]. More ambitious would be to modify to […]
Problem
Hatchet customers often have steps that require specific worker states or resources, such as loaded machine learning models in memory. Currently, when executing steps, Hatchet does not consider the state of the workers, leading to potential issues such as:
To address these problems, we propose a feature called "Sticky Steps" that allows steps to express their resource requirements and favors assigning them to workers that are already in a compatible state.
Proposed Solution
Step Resource Requirements
Introduce a new field called requiredResources in the step definition object. This field will be an array of strings representing the resources or states required by the step.
Example usage:
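A minimal sketch of what such a definition might look like, written in TypeScript. The `StepDefinition` and `WorkflowDefinition` shapes, the `id`, `name`, and `run` fields, and the workflow id are assumptions for illustration; only `requiredResources` comes from the proposal.

```typescript
// Hypothetical shapes for illustration; only `requiredResources` is from the proposal.
interface StepDefinition {
  name: string;
  // Resources or states the step needs on the worker, matched by string.
  requiredResources?: string[];
  run: () => Promise<string>;
}

interface WorkflowDefinition {
  id: string;
  steps: StepDefinition[];
}

const workflow: WorkflowDefinition = {
  id: "sticky-steps-example",
  steps: [
    {
      name: "step1",
      requiredResources: ["modelA", "datasetX"],
      run: async () => "step1 done",
    },
    {
      name: "step2",
      requiredResources: ["modelB", "datasetY"],
      run: async () => "step2 done",
    },
  ],
};
```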
In this example, step1 requires the resources "modelA" and "datasetX", while step2 requires the resources "modelB" and "datasetY".
Worker State Management
Introduce a new method called updateWorkerResourceState in the context object passed to the step function. This method allows the step to update the worker's state, indicating the resources or states that the worker has acquired during the step execution.
Example usage:
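A minimal sketch of a step function calling this method. The `StepContext` interface, the in-memory `FakeContext`, and the `loadModel` and `loadDataset` helpers are hypothetical stand-ins; the proposal only names `updateWorkerResourceState`.

```typescript
// Worker state as described in the proposal: resource name -> loaded resource.
type WorkerResourceState = Record<string, unknown>;

// Hypothetical context shape; only the method name comes from the proposal.
interface StepContext {
  updateWorkerResourceState(state: WorkerResourceState): void;
}

// Stand-in loaders; a real step would load a model or dataset here.
async function loadModel(name: string): Promise<{ kind: "model"; name: string }> {
  return { kind: "model", name };
}

async function loadDataset(name: string): Promise<{ kind: "dataset"; name: string }> {
  return { kind: "dataset", name };
}

async function step1(context: StepContext): Promise<string> {
  const modelA = await loadModel("modelA");
  const datasetX = await loadDataset("datasetX");

  // Report what this worker now holds so later steps can be routed to it.
  context.updateWorkerResourceState({ modelA, datasetX });

  return `ran with ${modelA.name} and ${datasetX.name}`;
}

// In-memory context used only to exercise the sketch.
class FakeContext implements StepContext {
  state: WorkerResourceState = {};

  updateWorkerResourceState(state: WorkerResourceState): void {
    this.state = state;
  }
}
```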
In this example, the step function loads the required resources (e.g., modelA and datasetX) and updates the worker state using the context.updateWorkerResourceState method. The worker state is represented as an object where the keys are the resource names and the values are the loaded resources.
Sticky Step Scheduling
When a step is ready to be executed, the Hatchet engine will follow these steps:
- Check whether the step has any requiredResources specified.
- If requiredResources are specified, search for workers that have all the required resources available in their state.
- If no requiredResources are specified, follow the default worker assignment logic.
When a step is assigned to a worker that doesn't have the required resources in its state, the worker will load and initialize the necessary resources before executing the step. After the step is completed, the worker will update its state using the context.updateWorkerResourceState method to reflect the acquired resources.
Resource Eviction and Replacement
If a worker needs to execute a step that requires resources different from its current state, it will replace the existing resources with the newly required ones. This ensures that workers can adapt to changing step requirements while optimizing for resource reuse when possible.
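The scheduling preference and the replacement behaviour can be sketched together. This is a minimal in-memory model, not the Hatchet engine: the `WorkerInfo` type, `pickWorker`, and `runOnWorker` are hypothetical names, falling back to the first worker merely stands in for the default assignment logic, and replacing the whole state is only one possible reading of the replacement rule.

```typescript
// Hypothetical in-memory model; none of these names are Hatchet APIs.
interface WorkerInfo {
  id: string;
  // Names of the resources currently held by the worker.
  resources: Set<string>;
}

// Prefer a worker that already holds every required resource.
// Falling back to the first worker is a placeholder for the default assignment logic.
function pickWorker(
  workers: WorkerInfo[],
  requiredResources: string[] = [],
): WorkerInfo | undefined {
  if (requiredResources.length === 0) {
    return workers[0];
  }
  const match = workers.find((w) =>
    requiredResources.every((r) => w.resources.has(r)),
  );
  return match ?? workers[0];
}

// If the chosen worker lacks a required resource, replace its state with the
// newly required set (assumed policy: drop everything that is not required).
// Returns true when the worker had to load resources.
function runOnWorker(worker: WorkerInfo, requiredResources: string[] = []): boolean {
  const missing = requiredResources.some((r) => !worker.resources.has(r));
  if (missing) {
    worker.resources = new Set(requiredResources);
  }
  return missing;
}

const workers: WorkerInfo[] = [
  { id: "w1", resources: new Set(["modelB", "datasetY"]) },
  { id: "w2", resources: new Set(["modelA", "datasetX"]) },
];

// A step needing modelA and datasetX is routed to w2, which needs no reload.
const sticky = pickWorker(workers, ["modelA", "datasetX"]);

// No worker holds modelC, so the fallback worker (w1) is chosen and must reload.
const fallback = pickWorker(workers, ["modelC"]);
const reloaded = fallback ? runOnWorker(fallback, ["modelC"]) : false;
```

A real scheduler would likely also weigh worker load and the cost of reloading, which this sketch ignores.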
Risks and Considerations
Step implementations need to use the context.updateWorkerResourceState method and handle resource initialization and cleanup efficiently.