cylc message should have submit number info #2528

Closed
matthewrmshin opened this issue Jan 8, 2018 · 5 comments · Fixed by #2582
Labels
bug Something is wrong :(

@matthewrmshin
Contributor

Long story short, something bad happened to a task host, and we ended up with something like this:

  1. Suite got a message from (submit 1 of) a task when it died with a TERM signal.
  2. Task host became unavailable for a time.
  3. Task host came back.
  4. Suite (retry) submitted job 2 of the task to the task host.
  5. Suite got a message from (submit 2 of) the task that it succeeded.
  6. For some reason, job 1 was resumed on the task host, which then sent a message back to the suite to say that it failed.
  7. Tasks downstream of this task were a bit messed up.

If we have the submit number in the task message, the suite should be able to discard messages from earlier submits.
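A minimal sketch of that idea, assuming each message carries the submit number of the job that sent it. `TaskProxy`, `TaskMessage` and `handle_message` are hypothetical names for illustration, not the actual cylc API:

```python
from dataclasses import dataclass


@dataclass
class TaskProxy:
    """Hypothetical stand-in for the suite's record of a task."""
    name: str
    submit_num: int = 0  # number of the latest submit
    status: str = "waiting"


@dataclass
class TaskMessage:
    """Hypothetical message payload that includes the submit number."""
    task_name: str
    submit_num: int
    status: str


def handle_message(task: TaskProxy, msg: TaskMessage) -> bool:
    """Apply msg to task; return False if it was discarded as stale."""
    if msg.submit_num < task.submit_num:
        # Message from an earlier submit (e.g. a resumed job 1): ignore it.
        return False
    task.status = msg.status
    return True


# Replay of steps 5 and 6 above: job 2 succeeds, then job 1 reports failure.
task = TaskProxy("foo", submit_num=2, status="submitted")
handle_message(task, TaskMessage("foo", 2, "succeeded"))
late = handle_message(task, TaskMessage("foo", 1, "failed"))
```

With this check the late failure message from job 1 is dropped, and the task keeps the state reported by job 2.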

@matthewrmshin matthewrmshin added this to the soon milestone Jan 8, 2018
@matthewrmshin matthewrmshin self-assigned this Jan 8, 2018
@matthewrmshin
Contributor Author

matthewrmshin commented Jan 10, 2018

May be worth looking at #439, #2214 and/or #2502 when we look at this one.

@matthewrmshin
Contributor Author

The other usual way to trigger this problem is when users reset the state of a submitted or running task to ready without killing the original job first. This should also be handled correctly. Should we have the suite make an automatic attempt to kill the original job when a user resets the state of a submitted or running task?

@hjoliver
Member

hjoliver commented Jan 19, 2018

> Should we have the suite make an automatic attempt to kill the original job when a user resets the state of a submitted or running task?

I think yes. It doesn't really make sense to have two instances of the same task job running at once.
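A rough sketch of what that could look like, assuming the suite can tell whether a task is active and has some way to kill its job. `ACTIVE_STATES`, `reset_state` and the `kill_job` callback are hypothetical, not existing cylc functions:

```python
from typing import Callable

# Assumed names for the states in which a job may still exist on the host.
ACTIVE_STATES = {"submitted", "running"}


def reset_state(current: str, new: str, kill_job: Callable[[], None]) -> str:
    """Return the new state, attempting to kill the job first if active."""
    if current in ACTIVE_STATES:
        kill_job()  # best-effort kill of the original job
    return new


killed = []
state = reset_state("running", "ready", lambda: killed.append("job 1"))
```

A kill attempt can fail or arrive late, so this would complement the submit number check rather than replace it.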

@hjoliver
Member

hjoliver commented Jan 19, 2018

Somewhat relatedly (although not a submit number issue): careless (but common) use of suicide triggers can result in suiciding an active task proxy. Currently we just log a warning about this; we should probably kill the active job as well.

@matthewrmshin
Contributor Author

#2505 is not really related, but I'll put a link here anyway.
