platforms: broadcasted platform ignored after ssh failure #6320

oliver-sanders · 2024-08-22T14:19:55Z

We can use broadcasts to change the platform a task submits to.

Under normal circumstances this works fine. However, when hosts go down and the submission is retried, the broadcast appears to be forgotten and the new submission uses the configured platform.

This could lead to jobs being submitted to the wrong platform.

Reproducible example:

Run the following workflow.

Once the "remote_init_one" and "remote_init_two" tasks have submitted, break your SSH config to force subsequent calls to fail.

[scheduling]
    [[graph]]
        R1 = remote_init_one & remote_init_two & local => remote

[runtime]
    # ensure that the workflow has remote-init'ed on platforms "one" and "two"
    [[remote_init_one]]
        platform = one-bg
    [[remote_init_two]]
        platform = two-bg

    # change the platform of "remote" via broadcast
    [[local]]
        script = """
            cylc broadcast "${CYLC_WORKFLOW_ID}" -n remote -p "${CYLC_TASK_CYCLE_POINT}" -s 'platform=one'
            sleep 10
        """

    [[remote]]
        platform = localhost

The "remote" task should attempt to submit to each of the hosts in the "one" platform. All SSH connections will fail so the task will run out of hosts and become submit-failed.

However, that's not what happens! Running this command reveals that after running out of hosts, the task then attempted to submit to localhost (the platform defined before the broadcast):

$ grep 'DEBUG - \[jobs-submit cmd\].*1/remote/01' --color=never ~/cylc-run/<workflow>/log/scheduler/log
... ssh ... one.01 ... cylc jobs-submit ... 1/remote/01
... ssh ... one.02 ... cylc jobs-submit ... 1/remote/01
... cylc jobs-submit ... 1/remote/01

Note: This erroneous submission appears to happen after all the hosts of the broadcasted platform have been exhausted, which may help pin down the offending code pathway.
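To make the check above repeatable, here is a rough Python sketch that applies the same pattern as the grep command and labels each matching line as a remote or local submission. It assumes a remote submission always contains a bare `ssh` token, which may not hold if the SSH command is configured as a full path. The sample lines are simplified stand-ins modelled on the elided output above, not real scheduler log lines.

```python
import re

# Same pattern as the grep above; "1/remote/01" is the job from the example.
SUBMIT_RE = re.compile(r"DEBUG - \[jobs-submit cmd\].*1/remote/01")


def classify_submissions(log_lines):
    """Label each matching jobs-submit line as 'remote' or 'local'.

    Assumption: a remote submission is wrapped in an ssh call, so a line
    without a bare "ssh" token is treated as a local submission.
    """
    labels = []
    for line in log_lines:
        if not SUBMIT_RE.search(line):
            continue  # not a jobs-submit command for this job
        labels.append("remote" if "ssh" in line.split() else "local")
    return labels


# Simplified stand-ins for the three lines shown above (not real log output).
sample = [
    "... DEBUG - [jobs-submit cmd] ssh one.01 cylc jobs-submit 1/remote/01",
    "... DEBUG - [jobs-submit cmd] ssh one.02 cylc jobs-submit 1/remote/01",
    "... DEBUG - [jobs-submit cmd] cylc jobs-submit 1/remote/01",
]
print(classify_submissions(sample))
```

A trailing "local" entry after the "remote" ones is the symptom described in this issue.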

Interestingly, when I try this, the attempted submission to localhost actually fails because the qsub command is not in $PATH. In my case, platform "one" uses PBS, so this suggests that it is attempting to submit to localhost but with the configuration of "one"?!

wxtim · 2024-08-23T10:13:03Z

Looks like it's remote initing on the same host?

    [[remote_init_one]]
        platform = one-bg
    [[remote_init_two]]
        platform = one-bg

oliver-sanders · 2024-08-23T10:20:57Z

Typo, corrected in OP

wxtim · 2024-08-23T13:47:06Z

Replicated it with local site installation. Now working out how to replicate in a more debuggable way.

oliver-sanders · 2024-08-23T14:06:53Z

I think this example should be enough to debug with. Here's my stab in the dark at a debugging strategy, if it helps...

I would start by identifying the bits of the code where a host is selected and logging each of these. This should allow you to pinpoint the particular branch / method where the incorrect host comes from. Given the convoluted nature of the call/callback code, the same method can be called multiple times, so this might not actually be that much help. If so, I would then try to log the relevant function calls (likely prep/submit methods and their 255 callbacks) so you can map out the callchain. After that, no idea!
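One lightweight way to do the logging described above is a tracing decorator applied temporarily to the candidate methods. This is only a sketch: `trace_calls` and `select_host` are hypothetical names for illustration and are not part of Cylc. The real targets would be whichever host-selection and prep/submit methods are under suspicion.

```python
import functools
import logging

LOG = logging.getLogger("host-selection-trace")


def trace_calls(func):
    """Log every call to ``func`` with its arguments and return value."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        LOG.debug("call %s args=%r kwargs=%r", func.__qualname__, args, kwargs)
        result = func(*args, **kwargs)
        LOG.debug("return %s -> %r", func.__qualname__, result)
        return result
    return wrapper


# Hypothetical stand-in for a host-selection function; not Cylc code.
@trace_calls
def select_host(platform_name, bad_hosts):
    hosts = {"one": ["one.01", "one.02"]}.get(platform_name, ["localhost"])
    for host in hosts:
        if host not in bad_hosts:
            return host
    return None  # every host on the platform has been exhausted


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    select_host("one", bad_hosts={"one.01"})
```

Because each log record carries the qualified function name, repeated calls to the same method from different callbacks still show up in order, which helps when mapping out the call chain.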

wxtim · 2024-08-29T09:22:17Z

Checks for similar bugs:

Search for rtconfig\[["']platform["']\]:

data_store_mgr.runtime_from_config - Looks like it's used to initialize fields at startup so should be safe to not check for broadcasts. Checked by looking at TUI.
subprocpool.SubProcPool.run_command_exit - Functionally safe because it's only used for logging. Might conceivably produce strange log output, but even this shouldn't happen if the callback is given sensible arguments. An apparent bug found in this code during the investigation disappeared once the fix in "Ensure that platform from group selection checks broadcast manager #6330" was made.
All other lookups are in task_job_mgr.TaskJobManager._prep_submit_task_job on a function scoped copy of the rtconfig which has broadcasts applied.
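The distinction behind these checks is between reading `rtconfig["platform"]` from the stored task config and reading it from a copy that has broadcasts applied. The sketch below illustrates that difference with plain dictionaries. `apply_overrides` and the sample values are hypothetical and greatly simplified; this is not how Cylc's broadcast manager is implemented.

```python
from copy import deepcopy


def apply_overrides(rtconfig, overrides):
    """Return a copy of ``rtconfig`` with broadcast-style overrides merged in."""
    merged = deepcopy(rtconfig)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections rather than replacing them wholesale.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


configured = {"platform": "localhost", "script": "true"}  # as written in the config
broadcast = {"platform": "one"}                           # as set via broadcast

effective = apply_overrides(configured, broadcast)
print(effective["platform"])   # one
print(configured["platform"])  # localhost
```

Any code path that consults `configured` instead of `effective` would see "localhost", which matches the symptom reported in this issue.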

oliver-sanders · 2024-09-03T14:45:59Z

Here's a version of the workflow in the OP that has been adapted to use [remote]host and [job]batch system rather than platform.

This example does not replicate the bug (presumably it uses a different code pathway):

[scheduling]
    [[graph]]
        R1 = remote_init_one & remote_init_two & local => remote

[runtime]
    # ensure that the workflow has remote-init'ed on platforms "one" and "two"
    [[remote_init_one]]
        [[[remote]]]
            host = one.login.01
    [[remote_init_two]]
        [[[remote]]]
            host = two.login.01

    # change the platform of "remote" via broadcast
    [[local]]
        script = """
            cylc broadcast "${CYLC_WORKFLOW_ID}" -n remote -p "${CYLC_TASK_CYCLE_POINT}" -s '[remote]host=one.login.01'
            sleep 10
        """

    [[remote]]
        [[[remote]]]
            host = localhost
        [[[job]]]
            batch system = pbs

Posting this here as I'm using this to test the fix to ensure it still works as intended.

oliver-sanders · 2024-09-26T16:38:39Z

Closed by #6330

oliver-sanders added the bug Something is wrong :( label Aug 22, 2024

oliver-sanders added this to the 8.3.x milestone Aug 22, 2024

wxtim self-assigned this Aug 23, 2024

wxtim mentioned this issue Aug 27, 2024

Ensure that platform from group selection checks broadcast manager #6330

Merged

8 tasks

wxtim linked a pull request Sep 5, 2024 that will close this issue

oliver-sanders closed this as completed Sep 26, 2024

oliver-sanders modified the milestones: 8.3.x, 8.3.4 Sep 26, 2024
