Add a timeout to the workflow API #937

Open
javiermtorres wants to merge 6 commits into main from issue-934-allow-configurable-timeout

Conversation

@javiermtorres javiermtorres commented Feb 19, 2025

What's changing

Jobs in a workflow currently time out after a fixed 600 seconds, and jobs that time out are not stopped. This PR adds a timeout field to the workflow that sets the waiting time for every job in the workflow, keeping 600 seconds as the default. It should also stop jobs that have timed out.
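
For reviewers, the intended behaviour can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `wait_for_job`, the status names, and the injected `get_status` / `stop_job` callables are all hypothetical stand-ins.

```python
import time
from typing import Callable

# Hypothetical terminal status names; the real service may use different ones.
TERMINAL_STATUSES = {"succeeded", "failed", "stopped"}


def wait_for_job(
    get_status: Callable[[], str],
    stop_job: Callable[[], None],
    timeout_seconds: float = 600,
    poll_interval_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it finishes or the timeout elapses.

    On timeout the job is stopped explicitly so it does not keep running.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            # The job outlived its budget: stop it, then report the timeout.
            stop_job()
            return "timed_out"
        sleep(poll_interval_seconds)
```

Injecting `clock` and `sleep` is only there so the loop can be exercised in tests without real waiting.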

Closes #934

How to test it

Steps to test the changes:

  1. Post a new job with a timeout value very different from 600 (for example, 1 or 90000)
  2. Check that the job reaches the expected status once the configured time has elapsed

Additional notes for reviewers

N/A

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
  • Updated the documentation (both comments in code and product documentation under /docs)
  • Checked if a (backend) DB migration step was required and included it if required
    • No DB update is needed

@github-actions github-actions bot added the backend and schemas (Changes to schemas, which may be public facing) labels Feb 19, 2025
@javiermtorres javiermtorres force-pushed the issue-934-allow-configurable-timeout branch from 6cbb975 to 96b3c55 Compare February 21, 2025 07:49
@javiermtorres javiermtorres marked this pull request as ready for review February 21, 2025 07:52
@njbrake njbrake left a comment

The new parameter in the schema should have a unit associated with it, imo. Either sec or min.
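
One way to read this suggestion, sketched with a plain dataclass: the unit goes into the field name so callers cannot mistake seconds for minutes. The names `WorkflowRequest` and `job_timeout_sec` are hypothetical, chosen only for illustration, and the project's actual schema classes may look different.

```python
from dataclasses import dataclass

# Matches the current fixed timeout described in the PR.
DEFAULT_JOB_TIMEOUT_SEC = 600


@dataclass(frozen=True)
class WorkflowRequest:
    """Hypothetical request schema with the unit spelled out in the field name."""

    name: str
    job_timeout_sec: int = DEFAULT_JOB_TIMEOUT_SEC

    def __post_init__(self) -> None:
        # Reject zero or negative values early instead of timing out instantly.
        if self.job_timeout_sec <= 0:
            raise ValueError("job_timeout_sec must be a positive number of seconds")
```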

Comment on lines +63 to +72
# Maybe move to the job service?
def _stop_job(self, job_id: UUID):
    resp = requests.post(urljoin(settings.RAY_JOBS_URL, f"{job_id}/stop"), timeout=5)  # 5 seconds
    if resp.status_code == HTTPStatus.NOT_FOUND:
        raise JobUpstreamError("ray", "job_id not found when retrieving logs") from None
    elif resp.status_code != HTTPStatus.OK:
        raise JobUpstreamError(
            "ray",
            f"Unexpected status code getting job logs: {resp.status_code}, error: {resp.text or ''}",
        ) from None
Hmm. Yeah this does seem like something that should be in the job service.
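
A rough sketch of what that move could look like. Everything here is an assumption for illustration: the `JobService` shape, the injected `post` callable (which stands in for `requests.post` and returns a status code and body), and the local `JobUpstreamError` stand-in. The error messages in the quoted snippet mention logs although the call stops a job, so the sketch words them for the stop case.

```python
from http import HTTPStatus
from typing import Callable, Tuple
from urllib.parse import urljoin
from uuid import UUID


class JobUpstreamError(Exception):
    """Local stand-in for the project's exception of the same name."""

    def __init__(self, service: str, message: str):
        super().__init__(f"{service}: {message}")


class JobService:
    """Hypothetical job service that owns the stop call."""

    def __init__(self, ray_jobs_url: str, post: Callable[[str], Tuple[int, str]]):
        # ray_jobs_url must end with "/" so urljoin appends instead of replacing.
        self._ray_jobs_url = ray_jobs_url
        self._post = post

    def stop_job(self, job_id: UUID) -> None:
        status_code, text = self._post(urljoin(self._ray_jobs_url, f"{job_id}/stop"))
        if status_code == HTTPStatus.NOT_FOUND:
            raise JobUpstreamError("ray", "job_id not found when stopping job")
        if status_code != HTTPStatus.OK:
            raise JobUpstreamError(
                "ray",
                f"Unexpected status code stopping job: {status_code}, error: {text or ''}",
            )
```

The workflow code would then call `stop_job` on the service instead of issuing the HTTP request itself.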

Labels
backend, schemas (Changes to schemas, which may be public facing)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG]: Long-running jobs time out after 600 seconds.
2 participants