
Fix polling of jobs submitted to Loadleveler. #1762

Merged
merged 2 commits into cylc:master from the fix.ll-polling branch on Apr 19, 2016

Conversation

hjoliver
Member

Loadleveler job poll failing in 6.8.1:

2016-03-16T10:55:52+13 ERROR - 2016-03-16T10:55:52+13 [jobs-poll cmd] cylc jobs-poll --host=wrh-1.hpcf.niwa.co.nz -- '$HOME/cylc-run/foo/log/job' 1/foo/01
2016-03-16T10:55:52+13 [jobs-poll ret_code] 1
2016-03-16T10:55:52+13 [jobs-poll err]

Traceback (most recent call last):
  File "/home/oliverh/cylc/cylc.git/bin/cylc-jobs-poll", line 43, in <module>
    main()
  File "/home/oliverh/cylc/cylc.git/bin/cylc-jobs-poll", line 37, in main
    BATCH_SYS_MANAGER.jobs_poll(args[0], args[1:])
  File "/gpfs_hpcf/filesets/hpcf/home/oliverh/cylc/cylc.git/lib/cylc/batch_sys_manager.py", line 321, in jobs_poll
    job_log_root, batch_sys_name, my_ctx_list)
  File "/gpfs_hpcf/filesets/hpcf/home/oliverh/cylc/cylc.git/lib/cylc/batch_sys_manager.py", line 638, in _jobs_poll_batch_sys
    bad_ids.remove(id_)
ValueError: list.remove(x): x not in list
ERROR: remote command terminated by signal 1

@hjoliver added the "bug: Something is wrong :(" label on Mar 15, 2016
@hjoliver
Member Author

Remote poll command failure is not detected in the current batch-system-specific cylc-poll tests because they aren't using polling to detect unexpected task states (i.e. the outcome is the same even if the remote poll command fails). The tests were probably written that way because a silently failed task may stay in the batch queue for a few minutes, during which time polling won't detect the failure.

So, this change does the following:

  • fixes the bug that causes the traceback above for polled Loadleveler jobs
  • logs poll results with "(polled)" appended to the message
  • improves the cylc-poll tests by:
    • grepping for the expected poll results in the suite log
    • waiting 4 minutes before polling for silent task failure in the batch-system-specific tests (this probably isn't too extreme given how long the entire test battery takes to run?)

@matthewrmshin - please review or reassign.

@hjoliver added this to the next-release milestone on Mar 15, 2016
try:
    bad_ids.remove(id_)
except:
    pass
@hjoliver
Member Author


LL is currently the only batch system that defines filter_poll_many_output().
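For context, here is a minimal sketch of the bookkeeping that try/except protects, paraphrased from the traceback above; the function name and arguments are illustrative, not the actual batch_sys_manager.py code:

def unseen_job_ids(found_ids, expected_ids):
    """Return the expected job ids that did NOT appear in the queue listing."""
    bad_ids = list(expected_ids)
    for id_ in found_ids:
        # filter_poll_many_output() can yield the same id twice (e.g. a job
        # listed as both queued and running) or an id that was never expected,
        # in which case a plain list.remove() raises ValueError.
        try:
            bad_ids.remove(id_)
        except ValueError:
            pass
    return bad_ids

# Without the try/except a duplicate id crashes the poll; with it:
# unseen_job_ids(["123", "123"], ["123", "456"]) == ["456"]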

@matthewrmshin
Contributor

./tests/cylc-poll/07-pbs.t and ./tests/cylc-poll/08-slurm.t are failing with something like this:

Can't find a\.1.*failed (polled) in /home/h01/frsn/cylc-run/20160316T094524Z_cylc_test_cylc-poll_07-pbs/log/suite/log

There is an entry like this in the suite log in each case, but nothing for a.1 failed (polled).

2016-03-16T09:46:07Z INFO - [a.1] -(current:running)> a.1 started (polled)

@matthewrmshin
Contributor

The change looks OK. Test OK (with and without site/user global configuration) apart from the failures mentioned above.

@arjclark please review 2.

@hjoliver
Member Author

How long does an exited job stay in the PBS and SLURM batch queues - is my 4-minute wait before polling not long enough?

@matthewrmshin
Contributor

I think you may need to update their suite.rc as well. (Or make them symbolic links to the loadleveler suite?)

@hjoliver
Member Author

Ah, that'll be it - I just assumed the suite.rc files were symlinked like the .t files! (will fix tomorrow...)

@hjoliver
Member Author

The batch system polling tests are now "as one".

@matthewrmshin
Contributor

Now the PBS one dies with something like this:

WARNING: self-suicide is not recommended: a:fail => !a.
2016-03-17T12:12:51Z WARNING - suite timed out after PT5M
'Abort on suite timeout is set'
2016-03-17T12:12:51Z WARNING - some active tasks will be orphaned
---
+++
@@ -1,1 +1,2 @@
 [a.1] -triggered off []
+[b.1] -triggered off ['a.1']
'ERROR: triggering is NOT consistent with the reference log'
ERROR: Triggering check FAILED
ERROR: shutdown EVENT HANDLER FAILED
ERROR: SUITE REFERENCE TEST FAILED

And the SLURM one behaves the same as before. I am puzzled.

@arjclark
Contributor

Bouncing back to you @matthewrmshin while these problems are ongoing.

@arjclark assigned matthewrmshin and unassigned arjclark on Mar 17, 2016
@hjoliver
Member Author

@matthewrmshin - that's the expected result if the finished job is still visible in the PBS queue after PT4M. If that's what's happening then this test might need a ridiculously long timeout, which is not desirable, so maybe we need to think of a better way of testing this. One idea I had was this:

graph = foo:start => force_failer => poller

Here, foo doesn't fail, but force_failer uses cylc reset to force its state to failed, and then poller is supposed to correct the state, allowing the suite to shut down normally.
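A rough suite.rc sketch of that idea (illustrative only: the task scripts, the environment variable, and the exact cylc reset / cylc poll arguments are my assumptions about the cylc 6 CLI and may not be exact):

[scheduling]
    [[dependencies]]
        graph = "foo:start => force_failer => poller"
[runtime]
    [[foo]]
        script = sleep 120   # keeps running while its state is forced
    [[force_failer]]
        # Force foo's state to "failed" even though its job is still running.
        script = cylc reset --state=failed "$CYLC_SUITE_NAME" 'foo.1'
    [[poller]]
        # Polling should find foo still running and correct the false
        # "failed" state, so the suite can shut down normally.
        script = cylc poll "$CYLC_SUITE_NAME" 'foo.1'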

However, this doesn't currently work because:

  • we don't allow failed tasks to be polled
  • we ignore "late polling results", to handle the case of polling a running task that happens to be in the process of sending its succeeded message.

Actually it seems to me we should allow polling of both failed and succeeded tasks, to verify those states against what really happened (we might not want to poll all succeeded tasks by default in a "poll all" operation, ... but perhaps even that wouldn't be a problem now that we poll multiple tasks with a single command?). Thoughts?

@matthewrmshin
Contributor

Actually it seems to me we should allow polling of both failed and succeeded tasks...

Yes, we should allow polling of failed tasks. The main problem with a failed message is that of #1514, where the batch scheduler pre-empts a job but later resurrects it. The job would have sent a failed message, but it would have been a false failure.

@hjoliver
Member Author

OK, but you didn't comment on the cause of your test suite problem (and therefore on whether we need polling of failed tasks on this branch, to enable the alternative test idea).

@matthewrmshin
Contributor

(Sorry, just to clarify: I think the test failure here is simply because jobs remain listed by the queueing system for a ridiculously long time. The comment on #1514 is a general problem, but it is probably the main use case for supporting polling of failed tasks.)

@hjoliver assigned hjoliver and unassigned matthewrmshin on Mar 22, 2016
@hjoliver
Member Author

hjoliver commented Mar 23, 2016

polling of failed tasks in progress... [UPDATE: see #1792]

@hjoliver
Member Author

hjoliver commented Mar 30, 2016

This branch is temporarily on hold while I do some minor refactoring of the task state module (which this depends on somewhat). [UPDATE: see #1775 ]

hjoliver added a commit to hjoliver/cylc-flow that referenced this pull request Apr 14, 2016
Grep for polled message in the suite logs, and use polling to
detect a changed task state in the batch-system-specific tests.

CHERRY PICKED FROM cylc#1762 - NEXT STEP CHANGE TESTS TO NOT RELY
ON JOBS EXITING THE BATCH SYS LISTING (e.g. poll failed -> running).
@hjoliver
Member Author

@matthewrmshin, @arjclark - as discussed above, tests that detect a change of state by polling but do not rely on jobs quickly disappearing from the batch system listing require the ability to poll non-active tasks. I have therefore cherry-picked my test changes to a new branch that will end up as a PR for #1792, removed those commits from this branch, and rebased it to current master.

So, please review this as a simple bug fix for Loadleveler jobs (that I need in for next release, if possible). This will conflict with #1775; I'll deal with that depending on which is merged first.

@hjoliver assigned matthewrmshin and unassigned hjoliver on Apr 14, 2016
Currently affects Loadleveler jobs, via filter_poll_many_output, and "at" jobs if the same job is listed as both queued and running by atq (I've seen it happen; not sure what caused it).
@hjoliver
Member Author

Test battery passes here.

@matthewrmshin
Contributor

Change + test battery now good.

@arjclark
Contributor

Looks OK to me. No problems from the test-battery in my environment.

@arjclark merged commit 8246a5d into cylc:master on Apr 19, 2016
@hjoliver deleted the fix.ll-polling branch on June 24, 2016
@hjoliver mentioned this pull request on Feb 10, 2022