Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature sbg task retries #5

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

yfan2012
Copy link

This PR adds task retry functionality to the SBG platform as the launcher waits for tasks to complete. This is implemented in the monitor_task() function which does the following:

  • for SBG: the function checks the task status every 60 seconds. If it finds that a task exited with a pre-specified list of error messages (defined inSevenBridgesInstance class), it will resubmit the task up to three times on progressively larger instances also defined in the SevenBridgesInstance class. It returns either a completed task or a failed task that has been retried.
  • for arvados: the function checks the state of the task every 60 seconds. It exits and returns either a completed or failed task.

@yfan2012 yfan2012 marked this pull request as ready for review April 26, 2024 16:21
@golharam
Copy link

We already have a function call get_task_state(). This PR proposes a higher level method called monitor_task() that will return when the task succeeds or fails.

]
settings = {
0: {
"instance_type": "m5.8xlarge;ebs-gp2;4096",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there instance types hard-coded here? This assumes that is what should be used. There doesn't seem to be a way to override that either.

Copy link

@ryanporterfield ryanporterfield May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @golharam and @yfan2012,

I see what you're saying here. When I wrote this, I just had the 10X launcher in mind. I forgot that the sevenbridges_platform.py should work for all the launchers. I can implement something similar to how the launcher functioned previously and move the instance types to the config file. I can also add arguments to specify whether or not a task should be retried, as well as a specific instance type to re-run the task on.

Is this something you'd like us to work on?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I want to make sure this package stays generic enough to not include anything specific to an individual pipeline. The config settings should be part of the launcher for the pipeline.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @golharam,

I rewrote this logic to be launcher agnostic, but need permissions to create a PR on this repo. Would you mind granting me those? And the rest of our team if that's okay with you?

Ryan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants