Proposed new task polling logic. #1792

Closed
hjoliver opened this issue Apr 14, 2016 · 6 comments

@hjoliver
Member

hjoliver commented Apr 14, 2016

This arises from the need to allow polling of failed tasks, as discussed in #1762, partly in order to write tests that detect failure of a remote poll operation (given that batch schedulers typically list tasks for some minutes after they have exited, during which time they will poll as running if the batch queue has to be interrogated).

  • Allow all tasks with a Job ID (i.e. 'submitted' or later) to be polled
    • but only active ones ('submitted' or 'running') by default in the poll-all case, to avoid unnecessary mass polling of succeeded tasks.
  • Allow all tasks to be resurrect-able (see #1514, "Poll tasks when 'allow resurrection' is True")
    • i.e. any 'failed' task can be returned to 'submitted' or 'running' as a result of polling.
    • ditch the current "enable resurrection" config item.
  • Always believe a poll result if it takes the task state forward
    • e.g. 'running' => 'succeeded' or 'failed'.
  • If a poll result would take the state backwards, e.g. 'succeeded' => 'running', it could mean the poll result was late (the task was sending "succeeded" while it was being polled as running); in that case, ignore the poll and immediately issue a second poll.
    • Always believe a second poll.

[UPDATE] the last (more difficult) bit is not needed, because the job status file (reliably) records job success or failure - we only interrogate the batch queue if this information is not in the status file yet [I think that misses the point - removing the strike-through over the last bullet point]

[UPDATE 2] - if the batch scheduler preempts by kill and re-queue, we might want a poll to take the job state backwards ('failed' => 'submitted') - but we would need to interrogate the batch scheduler rather than the job status file.
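
As a rough illustration, the rules in the bullet list above could be sketched in Python as follows. Everything here is hypothetical: the `RANK` table, `should_poll` and `apply_poll` are illustrative names, not cylc code, and the sketch assumes that every non-forward move (including 'failed' => 'submitted' or 'running') waits for a confirming second poll.

```python
# Hypothetical state ranks: a higher rank means "further forward".
# Only states that imply a job ID ('submitted' or later) are listed.
RANK = {"submitted": 1, "running": 2, "succeeded": 3, "failed": 3}
ACTIVE = {"submitted", "running"}


def should_poll(state, poll_all=False):
    """Decide whether a task in `state` should be polled.

    Any task with a job ID may be polled on explicit request, but a
    poll-all only touches active tasks, to avoid mass polling of
    succeeded tasks.
    """
    if state not in RANK:
        return False  # no job ID yet, so there is nothing to poll
    if poll_all:
        return state in ACTIVE
    return True


def apply_poll(current, polled, is_second_poll=False):
    """Return (new_state, repoll_needed) for a single poll result."""
    if polled == current:
        return current, False
    if RANK[polled] > RANK[current]:
        # Forward move, e.g. 'running' => 'succeeded': always believe it.
        return polled, False
    if is_second_poll:
        # A second poll is always believed, even if it moves backwards.
        return polled, False
    # Non-forward move on a first poll: the result may simply be late,
    # so keep the current state and ask for an immediate second poll.
    return current, True
```

For example, `apply_poll("succeeded", "running")` keeps the task as 'succeeded' and requests a re-poll, while the same call with `is_second_poll=True` accepts 'running'.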

@hjoliver hjoliver self-assigned this Apr 14, 2016
@hjoliver hjoliver added this to the soon milestone Apr 14, 2016
@hjoliver
Member Author

(TBD - polling of 'retrying' tasks, or not)

@matthewrmshin
Contributor

Good idea. Note that job poll results should contain time information from the job status file, so we should know what to trust or not.

@matthewrmshin
Contributor

matthewrmshin commented Apr 14, 2016

I have just realised that my argument above will fall apart if multiple entries are written to the job status file, e.g. pipe issue #1783, pre-emption/resurrection #1514, etc. In normal circumstances, however, the time information from the job status file should be trustworthy.

@hjoliver
Member Author

Regarding polling and pre-emption (/resurrection) see #1514

@hjoliver
Member Author

hjoliver commented Apr 19, 2016

I'll return to this once #1762 and #1775 are merged. [update: these are DONE]

@hjoliver
Member Author

hjoliver commented Jun 24, 2016

@matthewrmshin says:

  • poll results are now recorded in the job status file - if a status line exists, it can be trusted (if it records that the job was polled as still queued or running, we interrogate the batch queue again, of course).
  • signalled kills may be untrustworthy - need to poll to confirm?
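
A minimal sketch of the first point, under stated assumptions: `resolve_job_state` and its arguments are hypothetical names, the last recorded status is passed in as a plain string (the real status file format is not modelled), and the batch queue lookup is a caller-supplied function. The open question about signalled kills is not addressed here.

```python
def resolve_job_state(recorded_state, query_batch_queue):
    """Return the job state, trusting the status file where possible.

    recorded_state: last state found in the job status file, or None
        if nothing has been recorded yet (hypothetical representation).
    query_batch_queue: zero-argument callable that interrogates the
        batch queue and returns a state string (hypothetical).
    """
    if recorded_state in ("succeeded", "failed"):
        # A recorded final outcome is trusted as-is.
        return recorded_state
    # Nothing recorded, or the job was last seen queued or running:
    # interrogate the batch queue again.
    return query_batch_queue()
```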
