
reload: fix submission errors for jobs awaiting preparation #4984

Merged
merged 5 commits on Jul 21, 2022

Conversation

oliver-sanders
Member

@oliver-sanders commented Jul 15, 2022

Closes #4974

When we reload a workflow we create new TaskProxy instances for all non-active tasks in the pool and replace the previous instances with these new ones. The job submission pipeline maintains references to the pre-reload TaskProxy instances, which means that if the workflow is reloaded whilst a task is going through job prep, the pre-reload task gets submitted but the post-reload task is left in the pool. This results in submission issues as the submit numbers of the two instances are off by one.

For busy workflows where the job submission pipeline is constantly churning away this bug is almost guaranteed to strike with any reload operation.

This PR attempts to resolve this issue by:

  1. Ensuring job submission is fed the same TaskProxy instance as present in the task pool.
  2. Incrementing the job submission number at preparation time (rather than after job submit).

Explanation (rough sketches of both changes follow this list):

  1. scheduler: re-compute pre_prep_tasks for each iteration

    • Addresses stuck in preparing state #4974
    • Tasks which are awaiting job preparation used to be stored in
      Scheduler.pre_prep_tasks; however, this effectively created an
      intermediate "task pool" which had nasty interactions with reload.
    • This commit removes the pre_prep_tasks list by merging the listing
      of these tasks in with TaskPool.release_queued_tasks (to avoid
      unnecessary task pool iteration).
    • waiting_on_job_prep now defaults to False rather than True.
  2. job: increment the submission number at preparation time

    • Addresses stuck in preparing state #4974
    • Job submission number used to be incremented after submission
      (i.e. only once there is a "submission" of which to speak).
    • However, we also incremented the submission number if submission
      (or preparation) failed (in which cases there isn't really a
      "submission" but we need one for internal purposes).
    • Now the submission number is incremented when tasks enter the
      "preparing" state.
    • This resolves an issue where jobs which were going through the
      submission pipeline during a reload got badly broken in the scheduler
      (until restarted).
    • It also makes the scheduler logs nicer as the submission number for
      preparing tasks now matches that of the subsequent submitted or
      submit-failed outcome.
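
Rough sketches of both changes (illustrative only; apart from the names already mentioned above, the helpers and attributes below are invented for the example and are not the real cylc-flow code).

For (1), the tasks awaiting job preparation are recomputed from the live task pool on every pass rather than cached between main-loop iterations, so job submission only ever sees the TaskProxy instances that are actually in the pool:

def release_queued_tasks(pool):
    """Return the tasks to feed to the job submission pipeline."""
    # Tasks released from the queues on this pass start job preparation;
    # "queue_release" is a stand-in for whatever does the releasing.
    for itask in pool.queue_release():
        itask.waiting_on_job_prep = True  # attribute now defaults to False
    # Tasks still part-way through preparation (e.g. waiting on remote
    # init) are picked up from the *live* pool here, instead of from a
    # separate Scheduler.pre_prep_tasks list, so a reload that swaps
    # TaskProxy instances cannot leave stale references behind.
    return [
        itask for itask in pool.get_tasks()
        if itask.waiting_on_job_prep
    ]

For (2), the submit number is bumped at the point a task enters the preparing state rather than after the (possibly failed) submission:

def set_preparing(itask):
    """Move a task into the 'preparing' state."""
    # Bumping the number here means the value logged while preparing
    # matches the eventual submitted / submit-failed outcome, and failed
    # preparation/submission paths no longer need their own adjustment.
    itask.submit_num += 1
    itask.state_reset("preparing")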

Testing this one is a nightmare; I can't really see a way to do it meaningfully. Look at the issue for example tests which demonstrate how tasks could get stuck.

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • Already covered by existing tests, cannot really test this
  • Appropriate change log entry included.
  • No documentation update required.

* Addresses cylc#4974
* Job submission number used to be incremented *after* submission
  (i.e. only once there is a "submission" of which to speak).
* However, we also incremented the submission number if submission
  (or preparation) failed (in which cases there isn't really a
  "submission" but we need one for internal purposes).
* Now the submission number is incremented when tasks enter the
  "preparing" state.
* This resolves an issue where jobs which were going through the
  submission pipeline during a reload got badly broken in the scheduler
  (until restarted).
@oliver-sanders added the bug Something is wrong :( label Jul 15, 2022
@oliver-sanders added this to the cylc-8.0.0 milestone Jul 15, 2022
@oliver-sanders self-assigned this Jul 15, 2022
@oliver-sanders
Member Author

This leaves one remaining question of reload safety:

if itask.state(*TASK_STATUSES_ACTIVE, TASK_STATUS_PREPARING):
    LOG.warning(
        f"[{itask}] active with pre-reload settings"
    )

Should preparing tasks be included here? I think there is the potential for preparation to begin with pre-reload settings but possibly to receive post-reload settings later on?

@hjoliver
Member

It also makes the scheduler logs nicer as the submission number for
preparing tasks now matches that of the subsequent submitted or
submit-failed outcome.

👍 must admit this annoyed me!

@hjoliver
Member

Your explanation sounds reasonable, but the functional tests aren't happy.

This leaves one remaining question of reload safety:...

While thinking about how to test this, I discovered this bug: #4987

@hjoliver
Member

(Agreed good to have this fix in 8.0 if we can nail it in time).

@oliver-sanders
Member Author

oliver-sanders commented Jul 19, 2022

the functional tests aren't happy.

Couple of small breaks, will investigate, expect I can shift them quickly.

(Agreed good to have this fix in 8.0 if we can nail it in time).

Unfortunately I think this is 8.0 essential as at the moment cylc reload is almost guaranteed to break workflows every time. The workaround is to restart.

* Addresses cylc#4974
* Tasks which are awaiting job preparation used to be stored in
  `Scheduler.pre_prep_tasks`; however, this effectively created an
  intermediate "task pool" which had nasty interactions with reload.
* This commit removes the pre_prep_tasks list by merging the listing
  of these tasks in with TaskPool.release_queued_tasks (to avoid
  unnecessary task pool iteration).
* `waiting_on_job_prep` now defaults to `False` rather than `True`.
@oliver-sanders
Member Author

I had got my if/elif branches the wrong way around, fixed, think that should hold it.

* Previously, if submission on a host failed with a 255 (SSH error), we put
  a submission retry on it to allow the task to retry on another host.
  We decremented the submission number to make it look like the same
  attempt.
* Now we set the flag which sends the task back through the submission
  pipeline allowing it to retry without intermediate state changes.
@hjoliver
Member

hjoliver commented Jul 20, 2022

This leaves one remaining question of reload safety:

if itask.state(*TASK_STATUSES_ACTIVE, TASK_STATUS_PREPARING):
    LOG.warning(
        f"[{itask}] active with pre-reload settings"
    )

Should preparing tasks be included here? I think there is the potential for preparation to begin with pre-reload settings but possibly to receive post-reload settings later on?

I think we include preparing here because pre-reload settings are fixed once written to the job file. Presumably it's possible to reload a preparing task before it writes the job file, though. Maybe we should have a separate warning for preparing tasks: "may be active with pre-reload settings". Or be specific by flagging whether the job file has been written yet or not. (Can be a follow-up if needed).
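
If that follow-up were picked up, one possible shape for it (purely illustrative, extending the snippet quoted above; "jobfile_written" is an invented flag standing in for "has the job file been written yet?"):

if itask.state(*TASK_STATUSES_ACTIVE):
    LOG.warning(f"[{itask}] active with pre-reload settings")
elif itask.state(TASK_STATUS_PREPARING):
    if getattr(itask, 'jobfile_written', False):
        # Settings are fixed once the job file has been written.
        LOG.warning(f"[{itask}] active with pre-reload settings")
    else:
        LOG.warning(f"[{itask}] may be active with pre-reload settings")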

Member

@hjoliver left a comment


Makes sense to me, and the tests are happy now.

@hjoliver
Member

(The failed functional test is just codecov upload)

@dwsutherland self-requested a review July 20, 2022 03:08
Member

@dwsutherland left a comment


Works, tests passed, code makes sense..
From what I can see, the order of events in the main loop is:

  • Do reload if flagged (reloads task pool at this point (along with DB and store etc))
  • Process the command queue:
    -- the reload command sets up the new config, then reconfigures broadcasts, the pool (just the config), task events, and the DB... which flags the reload.
  • Releases/computes runahead (uses the pre-reload pool with the new post-reload config)
  • Proc Pool Processed
  • Checks triggers (xtrigger etc), and sets expired tasks
  • Releases queued tasks (self.release_queued_tasks() changed with this PR) (still pre-reload pool)
  • . . .
  • Processes queued task messages
  • Processes the command queue (Again)
  • Processes task events
  • DB, data-store, health checks, shutdown ...etc
  • Sleep interval..

END LOOP

Probably a bigger issue to look into the ideal order of events; however, there are (intentionally or otherwise) a number of things that happen between processing a reload command (reconfiguring) and reloading the task pool.

Would it be better to do the reload straight after processing the reload command (setting up the new config)?
Because at present tasks are potentially released/prepped between setting up a new config and reloading the task pool. Perhaps the first two steps can be switched? (or does it not matter?)
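
For orientation, the ordering described above in very rough, runnable pseudo-Python (the method names are paraphrases of the steps listed, not the real Scheduler API):

class LoopSketch:
    """No-op stand-in whose only purpose is to show the call order."""
    reload_flagged = False

    def __getattr__(self, name):
        # Every step is a no-op here; only the ordering matters.
        return lambda *args, **kwargs: None


def main_loop_once(sched):
    if sched.reload_flagged:
        sched.reload_task_pool()        # pool reloaded at the top of the loop
    sched.process_command_queue()       # "reload" reconfigures and only
                                        # *flags* the pool reload here
    sched.release_runahead_tasks()      # pre-reload pool, post-reload config
    sched.process_proc_pool()
    sched.check_xtriggers_and_expired_tasks()
    sched.release_queued_tasks()        # changed by this PR
    sched.process_queued_task_messages()
    sched.process_command_queue()       # again
    sched.process_task_events()
    sched.housekeep()                   # DB, data store, health checks, ...
    # The question above amounts to: could reload_task_pool() move to
    # directly after the first process_command_queue(), so tasks cannot
    # be released/prepped between reconfiguring and reloading the pool?


main_loop_once(LoopSketch())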

@oliver-sanders
Member Author

Or be specific by flagging if the job file was written yet, or not. (Can be a follow-up if needed).

Not sure, needs a bit more investigation - #4990

@oliver-sanders
Member Author

oliver-sanders commented Jul 20, 2022

Probably a bigger issue to look into the ideal order of events

Definitely, this code is fragile as the order will be critical to particular niche behaviours, so it would need some time to unravel :(

Because at present tasks are potentially released/prepped between setting up a new config and reloading the task pool

Because of remote init etc, prep can span multiple main loop iterations so I don't think swapping the order will solve the problem outright.

Comment on lines -1040 to -1048
itask.submit_num -= 1
self.task_events_mgr._retry_task(
    itask, time(), submit_retry=True
)
return
Member Author


@wxtim could you take a look at this bit to make sure you're happy.

  • It used to reset the task back to waiting and slap a submission-retry trigger on it.
  • It now leaves the task unchanged and sets a flag to send it back through job submission (roughly sketched below).
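
A minimal sketch of the new behaviour (assuming the flag in question is waiting_on_job_prep, which this PR introduces with a default of False; the function name is invented for illustration):

def handle_ssh_255_failure(itask):
    """Handle a 255 (SSH) error from job submission on a host.

    Before: itask.submit_num was decremented and a submission retry was
    scheduled via task_events_mgr._retry_task(..., submit_retry=True),
    i.e. the task was reset to waiting with a retry trigger.

    Now: the task's state and submit number are left alone and it is
    simply flagged to go back through the job submission pipeline, where
    host selection can pick a different host.
    """
    itask.waiting_on_job_prep = True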

Member

@wxtim left a comment


Looks legit and much nicer than my original solution 👍🏼

Labels
bug Something is wrong :(
Development

Successfully merging this pull request may close these issues: stuck in preparing state (#4974)
4 participants