-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix polling of jobs submitted to Loadleveler. #1762
Conversation
a4c25d9
to
32dac39
Compare
Remote poll command failure is not detected in the current batch-system-specific cylc-poll tests because they aren't using polling to detect unexpected task states (i.e. the outcome is the same even if the remote poll command fails). This was probably because a silent failed task may stay in the batch queue for a few minutes, during which time the failure won't be detected by polling. So: this change does the following:
@matthewrmshin - please review or reassign. |
try: | ||
bad_ids.remove(id_) | ||
except: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LL is currently the only batch system that defines filter_poll_many_output()
.
There is an entry like this in the suite log in each case, but nothing for
|
The change looks OK. Test OK (with and without site/user global configuration) apart from the failures mentioned above. @arjclark please review 2. |
How long does an exited job stay in the pbs and slurm batch queues - is my 4 minute wait before polling not long enough? |
I think you may need to update their |
Ah, that'll be it - I just assumed the suite.rc's were symlinked like the .t files! (will fix tomorrow...) |
The batch system polling tests are now "as one". |
Now the PBS one dies with something like this:
And the SLURM one behaves the same as before. I am puzzled. |
Bouncing back to you @matthewrmshin while these problems are ongoing. |
@matthewrmshin - that's the expected result if the finished job is still visible in the PBS queue after PT4M. If that's what's happening then this test might need a ridiculously long timeout, which is not desirable, so maybe we need to think of a better way of testing this. One idea I had was this:
Here, However, this doesn't currently work because:
Actually it seems to me we should allow polling of both |
Yes, we should allow poll of failed tasks. The main problem with message failed is that of #1514 when we have the batch scheduler pre-empting a job, but later resurrect it. The job would have sent a failed message, but this would have been a false failure. |
OK, but you didn't comment on the cause of your test suite problem (and therefore, do we need polling of failed tasks on this branch, to enable the alternative test idea). |
(Sorry, just to clarify. I think the test failure here is simply because jobs are left with the queueing system for a ridiculous amount of time. The comment on #1514 is a problem in general, but is probably the main use case to support polling for failure.) |
|
|
42a8fad
to
53727ec
Compare
Grep for polled message in the suite logs, and use polling to detect a changed task state in the batch-system-specific tests. CHERRY PICKED FROM cylc#1762 - NEXT STEP CHANGE TESTS TO NOT RELY ON JOBS EXITING THE BATCH SYS LISTING (e.g. poll failed -> running).
53727ec
to
80a9b27
Compare
@matthewrmshin, @arjclark - as discussed above, tests that detect a change of state by polling but do not rely on jobs quickly disappearing from the batch system listing require the ability to poll non-active tasks. I have therefore cherry-picked my test changes to a new branch that will end up as a PR for #1792, removed those commits from this branch, and rebased it to current master. So, please review this as a simple bug fix for Loadleveler jobs (that I need in for next release, if possible). This will conflict with #1775; I'll deal with that depending on which is merged first. |
Currently affects loadleveler jobs, via filter_poll_many_output, and at jobs if the same jobs are listed as both queued and running tasks by atq (I've seen it happen, not sure what caused it).
80a9b27
to
c621c1f
Compare
Test battery passes here. |
Change + test battery now good. |
Looks OK to me. No problems from the test-battery in my environment. |
Loadleveler job poll failing in 6.8.1: