Streamline Job management for long running tasks (eval, synthetic data generation, post_training etc.) #1587

@booxter

🚀 Describe the new functionality needed

This is a follow-up to this discussion. It didn't attract much attention, so I'm moving it to an issue and will assume we can proceed with the implementation unless I hear otherwise.

This is a tracker for relevant work.

Specific details about the proposed design can be found in the linked discussion, so I won't repeat them here.

Items to take care of:

Canary integration for post_training:

Other APIs' consistency:

Expanded features:

  • P2: expose collected logs as a job artifact
  • P2: implement persistence for job state (so the server can restart and recover its knowledge about jobs); probably sqlite for transactional consistency
  • P2: implement log streaming for jobs
  • P2: expand the set of possible job states to reflect expanded needs (e.g. paused, resumed)
  • P3: implement resuming of jobs after a server crash
  • P3: consider whether FastAPI BackgroundTasks is suitable for scheduler needs (a new backend?)
  • P4: implement pausing of jobs (may require addressing the hardware management story first!)
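To make the persistence item above more concrete, here is a minimal sketch of how job state could be kept in sqlite so that it survives a server restart. Everything in it is an assumption for illustration: the `JobStatus` values, the `JobStore` class, and the table layout are hypothetical and are not taken from the existing scheduler code.

```python
import sqlite3
from enum import Enum


class JobStatus(Enum):
    # Hypothetical state set; PAUSED reflects the expanded states listed above.
    SCHEDULED = "scheduled"
    RUNNING = "running"
    PAUSED = "paused"
    COMPLETED = "completed"
    FAILED = "failed"


# States that a restarted server would still need to care about.
UNFINISHED = (JobStatus.SCHEDULED, JobStatus.RUNNING, JobStatus.PAUSED)


class JobStore:
    """Hypothetical sqlite-backed store for job state."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS jobs ("
                "job_id TEXT PRIMARY KEY, "
                "job_type TEXT NOT NULL, "
                "status TEXT NOT NULL)"
            )

    def save(self, job_id: str, job_type: str, status: JobStatus) -> None:
        # Insert a new job or overwrite the stored row in one transaction.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO jobs (job_id, job_type, status) "
                "VALUES (?, ?, ?)",
                (job_id, job_type, status.value),
            )

    def unfinished(self) -> list:
        # Jobs a restarted server might want to resume or mark as failed.
        rows = self._conn.execute(
            "SELECT job_id, job_type, status FROM jobs "
            "WHERE status IN (?, ?, ?) ORDER BY job_id",
            tuple(s.value for s in UNFINISHED),
        ).fetchall()
        return [(job_id, job_type, JobStatus(status)) for job_id, job_type, status in rows]
```

With a file path instead of `:memory:`, a restarted server could call `unfinished()` on startup to find jobs that were scheduled, running, or paused when it went down. What to do with them (resume or mark as failed) is the separate P3 item.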

New /jobs API:

  • P2: introduce new /jobs API definition (bare spec) (@booxter looking)
  • P2: plug provider schedulers into /jobs API (iterate over schedulers to collect all jobs)
  • P2: implement /jobs DELETE API (clean up artifacts, logs etc.)
  • P3: enable filtering by job type for /jobs API
  • P4: model dependencies between jobs in API
  • P4: implement job dependencies in scheduler module
  • P5: retire old job management APIs from post_training, eval, sdg etc. (get_artifacts, get_status, cancel etc.)
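As a rough illustration of the "iterate over schedulers to collect all jobs" and "filtering by job type" items above, the listing logic could look something like the sketch below. The `JobInfo` record, the `Scheduler` protocol, and `list_all_jobs` are hypothetical names invented for this example; the real scheduler interface may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Protocol


@dataclass(frozen=True)
class JobInfo:
    # Hypothetical minimal job record returned by a /jobs listing.
    job_id: str
    job_type: str  # e.g. "post_training", "eval"
    status: str


class Scheduler(Protocol):
    # Assumed interface: each provider scheduler can enumerate its own jobs.
    def list_jobs(self) -> Iterable[JobInfo]: ...


class InMemoryScheduler:
    """Toy scheduler used only to exercise the aggregation logic."""

    def __init__(self, jobs: Iterable[JobInfo]) -> None:
        self._jobs = list(jobs)

    def list_jobs(self) -> Iterable[JobInfo]:
        return list(self._jobs)


def list_all_jobs(
    schedulers: Iterable[Scheduler], job_type: Optional[str] = None
) -> List[JobInfo]:
    """Collect jobs from every scheduler, optionally keeping one job type."""
    jobs: List[JobInfo] = []
    for scheduler in schedulers:
        for job in scheduler.list_jobs():
            if job_type is None or job.job_type == job_type:
                jobs.append(job)
    return jobs
```

A `/jobs` handler could then call `list_all_jobs(schedulers)` for the full list and pass `job_type` through from a query parameter once filtering is enabled.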

💡 Why is this needed? What if we don't build it?

The existing APIs behave inconsistently for long-running tasks (jobs). For example, synthetic data generation doesn't return job IDs at all. Training does, but currently (for torchtune) it blocks the server while the job runs, which makes the API call time out. Eval has the same problem and also uses a field name for the job ID that is different from post_training (job_id vs job_uuid).

The API for working with jobs can be generalized across job types: every job runs, needs to be monitored, and may need to be cancelled or removed, and every job produces some kind of artifacts as its result. That's why a common API is proposed here: to give users a consistent experience.
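One way to picture that common surface is a single registry whose operations do not depend on the job type. The sketch below is illustrative only: the `Job` and `JobRegistry` names, the status strings, and the choice of a single `job_id` field are assumptions, and the registry only tracks state rather than executing anything.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    job_id: str  # one field name for every job type (assumed naming)
    job_type: str
    status: str = "scheduled"
    artifacts: List[str] = field(default_factory=list)


class JobRegistry:
    """Hypothetical type-agnostic job bookkeeping; it does not run jobs."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def schedule(self, job_type: str) -> str:
        # Every job type gets an ID back, so callers can always follow up.
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = Job(job_id=job_id, job_type=job_type)
        return job_id

    def status(self, job_id: str) -> str:
        return self._jobs[job_id].status

    def cancel(self, job_id: str) -> None:
        job = self._jobs[job_id]
        # Only jobs that have not finished can be cancelled.
        if job.status in ("scheduled", "running"):
            job.status = "cancelled"

    def artifacts(self, job_id: str) -> List[str]:
        return list(self._jobs[job_id].artifacts)

    def delete(self, job_id: str) -> None:
        # A real implementation would also clean up artifacts and logs.
        del self._jobs[job_id]
```

The same five calls would apply whether the job is an eval run, a synthetic data generation task, or a training job, which is the consistency the proposal is after.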
