
Fix polling of jobs submitted to Loadleveler. #1762

Merged
merged 2 commits into cylc:master from the fix.ll-polling branch on Apr 19, 2016

Conversation

hjoliver
Member

Loadleveler job poll failing in 6.8.1:

2016-03-16T10:55:52+13 ERROR - 2016-03-16T10:55:52+13 [jobs-poll cmd] cylc jobs-poll --host=wrh-1.hpcf.niwa.co.nz -- '$HOME/cylc-run/foo/log/job' 1/foo/01
2016-03-16T10:55:52+13 [jobs-poll ret_code] 1
2016-03-16T10:55:52+13 [jobs-poll err]

Traceback (most recent call last):
  File "/home/oliverh/cylc/cylc.git/bin/cylc-jobs-poll", line 43, in <module>
    main()
  File "/home/oliverh/cylc/cylc.git/bin/cylc-jobs-poll", line 37, in main
    BATCH_SYS_MANAGER.jobs_poll(args[0], args[1:])
  File "/gpfs_hpcf/filesets/hpcf/home/oliverh/cylc/cylc.git/lib/cylc/batch_sys_manager.py", line 321, in jobs_poll
    job_log_root, batch_sys_name, my_ctx_list)
  File "/gpfs_hpcf/filesets/hpcf/home/oliverh/cylc/cylc.git/lib/cylc/batch_sys_manager.py", line 638, in _jobs_poll_batch_sys
    bad_ids.remove(id_)
ValueError: list.remove(x): x not in list
ERROR: remote command terminated by signal 1

@hjoliver added the "bug: Something is wrong :(" label on Mar 15, 2016
@hjoliver
Member Author

Remote poll command failure is not detected in the current batch-system-specific cylc-poll tests because they aren't using polling to detect unexpected task states (i.e. the outcome is the same even if the remote poll command fails). The tests were probably written that way because a silently failed task may stay in the batch queue for a few minutes, during which time polling won't detect the failure.

So, this change does the following:

  • fixes the bug that causes the traceback above for polled Loadleveler jobs
  • logs poll results with "(polled)" appended to the message
  • improves the cylc-poll tests by:
    • grepping for the expected poll results in the suite log
    • waiting 4 minutes before polling for silent task failure in the batch-system-specific tests (this probably isn't too extreme given how long the entire test battery takes to run?)

@matthewrmshin - please review or reassign.

@hjoliver added this to the next-release milestone on Mar 15, 2016
try:
    bad_ids.remove(id_)
except:
    pass
@hjoliver
Member Author


LL is currently the only batch system that defines filter_poll_many_output().
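For context, here is a minimal sketch of the bookkeeping that try/except protects, paraphrased from the traceback above; the function name and arguments are illustrative, not the actual batch_sys_manager.py code:

def unseen_job_ids(found_ids, expected_ids):
    """Return the expected job ids that did NOT appear in the queue listing."""
    bad_ids = list(expected_ids)
    for id_ in found_ids:
        # filter_poll_many_output() can yield the same id twice (e.g. a job
        # listed as both queued and running) or an id that was never expected,
        # in which case a plain list.remove() raises ValueError.
        try:
            bad_ids.remove(id_)
        except ValueError:
            pass
    return bad_ids

# Without the try/except a duplicate id crashes the poll; with it:
# unseen_job_ids(["123", "123"], ["123", "456"]) == ["456"]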

@matthewrmshin
Contributor

./tests/cylc-poll/07-pbs.t and ./tests/cylc-poll/08-slurm.t are failing with something like this:

Can't find a\.1.*failed (polled) in /home/h01/frsn/cylc-run/20160316T094524Z_cylc_test_cylc-poll_07-pbs/log/suite/log

There is an entry like this in the suite log in each case, but nothing for a.1 failed (polled).

2016-03-16T09:46:07Z INFO - [a.1] -(current:running)> a.1 started (polled)

@matthewrmshin
Contributor

The change looks OK. Test OK (with and without site/user global configuration) apart from the failures mentioned above.

@arjclark please review 2.

@hjoliver
Member Author

How long does an exited job stay in the PBS and SLURM batch queues - is my 4-minute wait before polling not long enough?

@matthewrmshin
Contributor

I think you may need to update their suite.rc as well. (Or make them symbolic links to the loadleveler suite?)

@hjoliver
Member Author

Ah, that'll be it - I just assumed the suite.rc files were symlinked like the .t files! (will fix tomorrow...)

@hjoliver
Member Author

The batch system polling tests are now "as one".

@matthewrmshin
Contributor

Now the PBS one dies with something like this:

WARNING: self-suicide is not recommended: a:fail => !a.
2016-03-17T12:12:51Z WARNING - suite timed out after PT5M
'Abort on suite timeout is set'
2016-03-17T12:12:51Z WARNING - some active tasks will be orphaned
---
+++
@@ -1,1 +1,2 @@
 [a.1] -triggered off []
+[b.1] -triggered off ['a.1']
'ERROR: triggering is NOT consistent with the reference log'
ERROR: Triggering check FAILED
ERROR: shutdown EVENT HANDLER FAILED
ERROR: SUITE REFERENCE TEST FAILED

And the SLURM one behaves the same as before. I am puzzled.

@arjclark
Contributor

Bouncing back to you @matthewrmshin while these problems are ongoing.

@arjclark assigned matthewrmshin and unassigned arjclark on Mar 17, 2016
@hjoliver
Member Author

@matthewrmshin - that's the expected result if the finished job is still visible in the PBS queue after PT4M. If that's what's happening then this test might need a ridiculously long timeout, which is not desirable, so maybe we need to think of a better way of testing this. One idea I had was this:

graph = foo:start => force_failer => poller

Here, foo doesn't fail, but force_failer uses cylc reset to force its state to failed, and then poller is supposed to correct the state, allowing the suite to shut down normally.
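A rough suite.rc sketch of that idea (illustrative only: the task scripts, the environment variable, and the exact cylc reset / cylc poll arguments are my assumptions about the cylc 6 CLI and may not be exact):

[scheduling]
    [[dependencies]]
        graph = "foo:start => force_failer => poller"
[runtime]
    [[foo]]
        script = sleep 120   # keeps running while its state is forced
    [[force_failer]]
        # Force foo's state to "failed" even though its job is still running.
        script = cylc reset --state=failed "$CYLC_SUITE_NAME" 'foo.1'
    [[poller]]
        # Polling should find foo still running and correct the false
        # "failed" state, so the suite can shut down normally.
        script = cylc poll "$CYLC_SUITE_NAME" 'foo.1'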

However, this doesn't currently work because:

  • we don't allow failed tasks to be polled
  • we ignore "late polling results", to handle the case of polling a running task that happens to be in the process of sending its succeeded message.

Actually it seems to me we should allow polling of both failed and succeeded tasks, to verify those states against what really happened (we might not want to poll all succeeded tasks by default in a "poll all" operation, ... but perhaps even that wouldn't be a problem now that we poll multiple tasks with a single command?). Thoughts?

@matthewrmshin
Contributor

Actually it seems to me we should allow polling of both failed and succeeded tasks...

Yes, we should allow polling of failed tasks. The main problem with a failed message is that of #1514, where the batch scheduler pre-empts a job but later resurrects it. The job would have sent a failed message, but it would have been a false failure.

@hjoliver
Member Author

OK, but you didn't comment on the cause of your test suite problem (and therefore on whether we need polling of failed tasks on this branch, to enable the alternative test idea).

@matthewrmshin
Contributor

(Sorry, just to clarify: I think the test failure here is simply because jobs remain listed by the queueing system for a ridiculously long time. The comment on #1514 is a general problem, but it is probably the main use case for supporting polling of failed tasks.)

@hjoliver assigned hjoliver and unassigned matthewrmshin on Mar 22, 2016
@hjoliver
Member Author

hjoliver commented Mar 23, 2016

polling of failed tasks in progress... [UPDATE: see #1792]

@hjoliver
Member Author

hjoliver commented Mar 30, 2016

This branch is temporarily on hold while I do some minor refactoring of the task state module (which this depends on somewhat). [UPDATE: see #1775 ]

hjoliver added a commit to hjoliver/cylc-flow that referenced this pull request Apr 14, 2016
Grep for polled message in the suite logs, and use polling to
detect a changed task state in the batch-system-specific tests.

CHERRY PICKED FROM cylc#1762 - NEXT STEP CHANGE TESTS TO NOT RELY
ON JOBS EXITING THE BATCH SYS LISTING (e.g. poll failed -> running).
@hjoliver
Member Author

@matthewrmshin, @arjclark - as discussed above, tests that detect a change of state by polling but do not rely on jobs quickly disappearing from the batch system listing require the ability to poll non-active tasks. I have therefore cherry-picked my test changes to a new branch that will end up as a PR for #1792, removed those commits from this branch, and rebased it to current master.

So, please review this as a simple bug fix for Loadleveler jobs (that I need in for next release, if possible). This will conflict with #1775; I'll deal with that depending on which is merged first.

@hjoliver assigned matthewrmshin and unassigned hjoliver on Apr 14, 2016
Currently affects Loadleveler jobs, via filter_poll_many_output, and "at" jobs if the same job is listed as both queued and running by atq (I've seen it happen; not sure what caused it).
@hjoliver
Member Author

Test battery passes here.

@matthewrmshin
Contributor

Change + test battery now good.

@arjclark
Contributor

Looks OK to me. No problems from the test-battery in my environment.

@arjclark merged commit 8246a5d into cylc:master on Apr 19, 2016
@hjoliver deleted the fix.ll-polling branch on June 24, 2016
@hjoliver mentioned this pull request on Feb 10, 2022