Time limits for each state for a task #618

Open
kmavrommatis opened this issue Oct 5, 2019 · 3 comments

Comments

@kmavrommatis

Hi,
is there a way to set some default time limits for each state of a job?
e.g. if a job stays in the INITIALIZING state for over 6h, then consider it failed, cancel it, and transition it to a CANCELLED or ERROR state?
Thanks in advance for your help

@adamstruck
Contributor

If tasks are getting stuck in the INITIALIZING state, it probably indicates that the scheduler (e.g. AWS Batch) is killing the jobs for some reason or another and the state update from the worker isn't making it to the database. You can turn on state reconciliation for your backend which may help:

Relevant config section:
https://github.com/ohsu-comp-bio/funnel/blob/master/config/default-config.yaml#L276-L282

Code doc:
https://github.com/ohsu-comp-bio/funnel/blob/master/compute/batch/backend.go#L140-L155
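
Roughly, enabling it for the AWS Batch backend would look something like the sketch below. I'm writing the key names from memory, so double-check them against the linked default-config.yaml for your Funnel version:

```yaml
# Sketch only -- confirm the exact key names against config/default-config.yaml.
AWSBatch:
  # The reconciler is disabled by default; turn it on here.
  DisableReconciler: false
  # How often to reconcile Funnel task state against AWS Batch job state.
  ReconcileRate: 10m
```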

It probably wouldn't be all that hard to implement a routine that periodically scans QUEUED/INITIALIZING/RUNNING tasks and cancels them if they hit some sort of wall time specified in the config. However, it seems to me that this would just be masking an underlying issue.
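
If someone did want to go that route, a minimal sketch of such a routine might look like the following. The TaskService interface, Task fields, and state names here are placeholders for illustration, not Funnel's actual internal API:

```go
// Sketch of a periodic wall-time watchdog. TaskService and Task are
// hypothetical stand-ins for whatever provides task listing and cancellation.
package watchdog

import (
	"context"
	"fmt"
	"time"
)

// TaskService is a placeholder for the task storage/control layer.
type TaskService interface {
	ListTasksInStates(ctx context.Context, states ...string) ([]Task, error)
	CancelTask(ctx context.Context, id string, reason string) error
}

// Task is a placeholder for the stored task record.
type Task struct {
	ID              string
	State           string
	LastStateChange time.Time
}

// Run scans tasks on every tick and cancels any task that has spent longer
// than the configured wall-time limit in its current state.
func Run(ctx context.Context, svc TaskService, interval time.Duration, limits map[string]time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			tasks, err := svc.ListTasksInStates(ctx, "QUEUED", "INITIALIZING", "RUNNING")
			if err != nil {
				continue // log and retry on the next tick
			}
			for _, t := range tasks {
				limit, ok := limits[t.State]
				if !ok {
					continue
				}
				if time.Since(t.LastStateChange) > limit {
					reason := fmt.Sprintf("exceeded %s wall time of %s", t.State, limit)
					_ = svc.CancelTask(ctx, t.ID, reason)
				}
			}
		}
	}
}
```

The limits map would be populated from config, e.g. mapping INITIALIZING to 6h as in the original question.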

@kmavrommatis
Author

Thanks for the pointers.

I enabled reconciliation (set to check every 30m) but could not see any improvement.

I occasionally have jobs that are stuck in either the INITIALIZING or RUNNING state for days (until I kill them). The only common thread I have found among them is that they are stuck at stages that require transfers of many files (e.g. >40 files), each several GB in size.
Unfortunately, this is not reproducible, i.e. if I start the same job again, it will probably go through.
I was wondering if this is really a network problem. See, for example, the following plot.

[Plot: upload throughput over time for a stuck task; the transfer starts fast, then drops to a constant very low rate]

It comes from a job that has finished running and has been stuck transferring files to S3 for hours. It initially transfers at high speed and then drops to a constant, very slow speed. I have seen similar plots for all the other stuck jobs I checked.
I wonder if this is the result of attempting many parallel transfers, or if there is an I/O block somewhere. In all these cases the Funnel task process is in uninterruptible sleep, caused (presumably) by I/O.

@adamstruck
Contributor

I've added an option to the worker config to limit the number of concurrent uploads/downloads. The default value is 10.

#619
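
In the worker config that looks roughly like the snippet below; see #619 for the exact option name and placement, since I'm sketching this from memory:

```yaml
# Sketch -- see #619 and the current default config for the exact key name.
Worker:
  # Limit the number of concurrent file uploads/downloads.
  MaxParallelTransfers: 10
```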
