platforms: broadcasted platform ignored after ssh failure #6320
Comments
Looks like it's remote initing on the same host?
|
Typo, corrected in OP |
Replicated it with local site installation. Now working out how to replicate in a more debuggable way. |
I think this example should be enough to debug with. Here's my stab in the dark at a debugging strategy, if it helps. I would start by identifying the parts of the code where a host is selected and logging each of them. This should allow you to pinpoint the particular branch / method where the incorrect host comes from. Given the convoluted nature of the call/callback code, the same method can be called multiple times, so this might not actually be that much help. If so, I would then try to log the relevant function calls (likely the prep/submit methods and their 255 callbacks) so you can map out the call chain. After that, no idea! |
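To make the logging idea above concrete, here is a minimal sketch of a tracing decorator that could be applied temporarily to the functions where a host is chosen. It is not Cylc code: `select_host`, the platform table and the host `one.login.02` are hypothetical stand-ins for illustration only, and the real method names would need to be found in the scheduler source.

```python
import functools
import logging

LOG = logging.getLogger("host-trace")


def trace_calls(func):
    """Log every call to ``func`` with its arguments and return value."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        LOG.debug("CALL %s args=%r kwargs=%r", func.__qualname__, args, kwargs)
        result = func(*args, **kwargs)
        LOG.debug("RETURN %s -> %r", func.__qualname__, result)
        return result
    return wrapper


# Hypothetical platform table (NOT read from any real Cylc configuration).
PLATFORM_HOSTS = {
    "one": ["one.login.01", "one.login.02"],
    "localhost": ["localhost"],
}


@trace_calls
def select_host(platform_name, bad_hosts):
    """Hypothetical host selection: first host not known to be bad, else None."""
    for host in PLATFORM_HOSTS[platform_name]:
        if host not in bad_hosts:
            return host
    return None
```

With `logging.basicConfig(level=logging.DEBUG)` enabled, each selection is logged alongside its inputs, so a call that suddenly receives the configured platform rather than the broadcasted one should stand out in the log.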
Checks for similar bugs: Search for
|
Here's a version of the workflow in the OP that has been adapted to use
This example does not replicate the bug (presumably uses a different code pathway):
[scheduling]
    [[graph]]
        R1 = remote_init_one & remote_init_two & local => remote
[runtime]
    # ensure that the workflow has remote-init'ed on platforms "one" and "two"
    [[remote_init_one]]
        [[[remote]]]
            host = one.login.01
    [[remote_init_two]]
        [[[remote]]]
            host = two.login.01
    # change the platform of "remote" via broadcast
    [[local]]
        script = """
            cylc broadcast "${CYLC_WORKFLOW_ID}" -n remote -p "${CYLC_TASK_CYCLE_POINT}" -s '[remote]host=one.login.01'
            sleep 10
        """
    [[remote]]
        [[[remote]]]
            host = localhost
        [[[job]]]
            batch system = pbs
Posting this here as I'm using this to test the fix to ensure it still works as intended. |
Closed by #6330 |
We can use broadcasts to change the platform a task submits to.
Under normal circumstances this works fine. However, when hosts go down and the submission is retried, the broadcast seems to be forgotten and the new submission uses the configured platform.
This could lead to jobs being submitted to the wrong platform.
Reproducible example:
Run the following workflow.
Once the "remote_init_one" and "remote_init_two" tasks have submitted, break your SSH config to force subsequent calls to fail.
The "remote" task should attempt to submit to each of the hosts in the "one" platform. All SSH connections will fail so the task will run out of hosts and become submit-failed.
However, that's not what happens! Running this command reveals that after running out of hosts, the task then attempted to submit to localhost (the platform defined before the broadcast):
Note: This erroneous submission appears to happen after all the hosts of the broadcasted platform have been exhausted, which may help pin down the offending code pathway.
Interestingly, when I try this, the attempted submission to localhost actually fails due to the qsub command not being in $PATH. In my case platform one uses PBS, so this suggests that it is attempting to submit to localhost, but with the configuration of one?!
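For reference, the expected behaviour described in this report can be written down as a small toy model. This is only a sketch under stated assumptions, not the scheduler's implementation: the function names, the dictionaries and the second host `one.login.02` are hypothetical, and "reachable" simply stands in for whether an SSH connection would succeed.

```python
def effective_platform(task, configured, broadcasts):
    """Return the broadcasted platform for ``task`` if one is set, else the configured one."""
    return broadcasts.get(task, configured)


def submit(task, configured, broadcasts, platform_hosts, reachable):
    """Try each host of the effective platform in turn.

    Returns ("submitted", attempts) on the first reachable host, or
    ("submit-failed", attempts) once every host has been tried. There is
    deliberately no fallback to the configured platform.
    """
    platform = effective_platform(task, configured, broadcasts)
    attempts = []
    for host in platform_hosts[platform]:
        attempts.append(host)
        if host in reachable:
            return "submitted", attempts
    return "submit-failed", attempts


# Hypothetical setup mirroring the report: "remote" is configured for
# localhost, but a broadcast has moved it to platform "one".
platform_hosts = {
    "one": ["one.login.01", "one.login.02"],
    "localhost": ["localhost"],
}
broadcasts = {"remote": "one"}

# SSH is broken, so only localhost is reachable.
state, attempts = submit(
    "remote", "localhost", broadcasts, platform_hosts, reachable={"localhost"}
)
```

In this model `state` is "submit-failed" and `attempts` contains only the hosts of platform "one". The bug reported here corresponds to localhost also appearing in the list of attempted hosts.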