Description
🚀 Describe the new functionality needed
This is a follow-up to this discussion. It didn't attract much attention, so I'm moving it to an issue and assuming we can proceed with the implementation unless I hear otherwise.
This is a tracker for relevant work.
Specific details about the proposed design can be found in the link above, so I won't repeat them here.
Items to take care of:
Canary integration for post_training:
- P0: land "naive" scheduler implementation integrated with torchtune provider (@booxter looking)
- P0: move the scheduler module to common space for reuse (done above)
- P0: fix torchtune post_training integration tests (@booxter looking)
- P0: enable post_training integration tests in CI (@booxter looking) (stuck because llama models are not available in CI)
- P1: implement job cancel verb for post_training jobs
Other APIs' consistency:
- P1: adjust eval job API to use a consistent field name (job_id in eval vs job_uuid in post_training)
- P1: switch synthetic data generation API to jobs pattern (@booxter looking)
- P1: don't require passing job ID when creating a job (generate it for the user and return)
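The two consistency items above (one field name for the job ID, and server-generated IDs) could look roughly like the sketch below. This is only an illustration: `JobHandle`, `create_job` and `new_job_id` are hypothetical names, and the choice of `job_id` over `job_uuid` is an assumption, since this issue doesn't settle which name wins.

```python
import uuid
from dataclasses import dataclass, field


def new_job_id(prefix: str = "job") -> str:
    """Return a server-generated, unique job identifier (hypothetical format)."""
    return f"{prefix}-{uuid.uuid4()}"


@dataclass
class JobHandle:
    # Hypothetical response type: the same `job_id` field for every job kind
    # (eval, post_training, synthetic data generation, ...).
    type: str
    job_id: str = field(default_factory=new_job_id)


def create_job(job_type: str) -> JobHandle:
    # The caller passes no ID; the server allocates one and returns it.
    return JobHandle(type=job_type)
```

With this shape, a client calls `create_job("eval")` and reads the ID from the returned handle instead of inventing one up front.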
Expanded features:
- P2: expose collected logs as a job artifact
- P2: implement persistence for jobs state (server may restart and recover knowledge about jobs); probably sqlite for transactional consistency
- P2: implement logs streaming for jobs
- P2: expand possible jobs states to reflect expanded needs (e.g. paused, resumed)
- P3: implement resuming of jobs after server crash
- P3: consider if using FastAPI BackgroundTasks is applicable for scheduler needs (a new backend?)
- P4: implement jobs pause (may need addressing of hardware management story first!)
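For the persistence item, a minimal sketch of what a sqlite-backed job store might look like is below. The table layout, the `JobStore` class and the status strings (`scheduled`, `running`, `completed`) are all assumptions made for illustration, not an agreed design.

```python
import sqlite3
from typing import List, Tuple


class JobStore:
    """Hypothetical sqlite-backed store for job state."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path to survive restarts; ":memory:" is only for demos.
        self._db = sqlite3.connect(path)
        with self._db:
            self._db.execute(
                "CREATE TABLE IF NOT EXISTS jobs ("
                "job_id TEXT PRIMARY KEY, type TEXT NOT NULL, status TEXT NOT NULL)"
            )

    def save(self, job_id: str, job_type: str, status: str) -> None:
        # `with self._db` wraps the write in a transaction (commit or rollback).
        with self._db:
            self._db.execute(
                "INSERT OR REPLACE INTO jobs (job_id, type, status) VALUES (?, ?, ?)",
                (job_id, job_type, status),
            )

    def unfinished(self) -> List[Tuple[str, str, str]]:
        # Jobs a restarted server would need to look at again.
        cur = self._db.execute(
            "SELECT job_id, type, status FROM jobs "
            "WHERE status IN ('scheduled', 'running') ORDER BY job_id"
        )
        return cur.fetchall()
```

On startup, a server could call `unfinished()` to find jobs that were in flight when it stopped; what to do with them (mark as failed, resume) is the separate P3 item.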
New /jobs API:
- P2: introduce new /jobs API definition (bare spec) (@booxter looking)
- P2: plug provider schedulers into /jobs API (iterate over schedulers to collect all jobs)
- P2: implement /jobs DELETE API (clean up artifacts, logs etc.)
- P3: enable filtering by job type for /jobs API
- P4: model dependencies between jobs in API
- P4: implement job dependencies in scheduler module
- P5: retire old job management APIs from post_training, eval, sdg etc. (get_artifacts, get_status, cancel etc.)
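To make the "iterate over schedulers to collect all jobs" and "filtering by job type" items concrete, here is a rough sketch. `JobInfo`, the `Scheduler` protocol, `InMemoryScheduler` and `list_jobs` are hypothetical names for illustration; the actual /jobs spec is still to be defined.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Protocol


@dataclass(frozen=True)
class JobInfo:
    job_id: str
    type: str    # e.g. "post_training", "eval"
    status: str


class Scheduler(Protocol):
    # Assumed minimal interface each provider scheduler would expose.
    def get_jobs(self) -> Iterable[JobInfo]: ...


class InMemoryScheduler:
    """Stand-in for a provider scheduler, used only for this example."""

    def __init__(self, jobs: Iterable[JobInfo]) -> None:
        self._jobs = list(jobs)

    def get_jobs(self) -> List[JobInfo]:
        return list(self._jobs)


def list_jobs(
    schedulers: Iterable[Scheduler], job_type: Optional[str] = None
) -> List[JobInfo]:
    """Collect jobs from every scheduler, optionally keeping one job type."""
    result: List[JobInfo] = []
    for scheduler in schedulers:
        for job in scheduler.get_jobs():
            if job_type is None or job.type == job_type:
                result.append(job)
    return result
```

A `GET /jobs` handler could then return `list_jobs(all_schedulers)`, with the job type filter supplied as an optional query parameter.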
💡 Why is this needed? What if we don't build it?
The existing APIs behave inconsistently for long-running tasks (jobs). For example, synthetic data generation doesn't return job IDs at all. Training does, but currently (for torchtune) it blocks the server while the job runs, which makes the API call time out. Eval has the same problem and also uses a different field name for the job ID than post_training (job_id vs job_uuid).
The API for working with jobs can be generalized across job types: every job runs, needs to be monitored, and may need to be cancelled or removed, and every job produces some kind of artifacts as its result. That's why a common API is proposed here: to give users a consistent experience.
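As an illustration of the blocking problem described above, the sketch below runs each job in a background thread so that the scheduling call returns an ID immediately and the status can be polled afterwards. It is a simplified stand-in written under my own assumptions (the `NaiveScheduler` and `JobStatus` names and the set of states are hypothetical), not the scheduler from the linked design.

```python
import threading
import uuid
from enum import Enum
from typing import Callable, Dict, Optional


class JobStatus(Enum):
    # Hypothetical state set; the issue proposes expanding it later.
    scheduled = "scheduled"
    running = "running"
    completed = "completed"
    failed = "failed"


class NaiveScheduler:
    def __init__(self) -> None:
        self._status: Dict[str, JobStatus] = {}
        self._threads: Dict[str, threading.Thread] = {}
        self._lock = threading.Lock()

    def _set(self, job_id: str, status: JobStatus) -> None:
        with self._lock:
            self._status[job_id] = status

    def schedule(self, handler: Callable[[], None]) -> str:
        """Start `handler` in the background and return its job ID right away."""
        job_id = str(uuid.uuid4())
        self._set(job_id, JobStatus.scheduled)

        def run() -> None:
            self._set(job_id, JobStatus.running)
            try:
                handler()
            except Exception:
                self._set(job_id, JobStatus.failed)
            else:
                self._set(job_id, JobStatus.completed)

        thread = threading.Thread(target=run, daemon=True)
        self._threads[job_id] = thread
        thread.start()
        return job_id

    def status(self, job_id: str) -> JobStatus:
        with self._lock:
            return self._status[job_id]

    def wait(self, job_id: str, timeout: Optional[float] = None) -> None:
        # Convenience for tests and demos; an API handler would not block here.
        self._threads[job_id].join(timeout)
```

Cancellation is deliberately left out: Python threads can't be killed from outside, so a cancel verb would need a cooperative stop signal or a process-based backend.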
Other thoughts
No response