Broadcast - GraphQLLocatedError: dictionary changed size during iteration #6222
Comments
That last broadcast to D isn't actually needed, so I'm going to remove that, but there is still some race condition at play here I think.
Unfortunately, the promise library is rather good at hiding the origin of the actual error (which is not line 10 in iterate_promise.py), making this tricky to debug. I suspect this isn't a new bug in 8.3.2 but something that's been lurking for a while and is hard to trigger. The best thing I can think to do is to hammer a workflow with the commands above until one of them fails. If we manage to replicate it that way, we can start subtracting commands until we have a minimal reproducible example, then single-step the logic from within the scheduler to locate the point of breakage.
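If it helps, here is a rough sketch of that hammer-until-failure idea as a small Python harness. The workflow ID, cycle point, and setting below are placeholders; substitute the actual commands from the failing task.

```python
#!/usr/bin/env python3
"""Re-run a command until it fails, to catch an intermittent error.

Sketch only: WORKFLOW and the broadcast arguments are placeholders.
"""
import subprocess
import sys

WORKFLOW = "my/workflow"           # placeholder workflow ID
CMD = [
    "cylc", "broadcast", WORKFLOW,
    "-p", "1",                     # placeholder cycle point
    "-s", "[environment]FOO=bar",  # placeholder setting
]

attempt = 0
while True:
    attempt += 1
    proc = subprocess.run(CMD, capture_output=True, text=True)
    if proc.returncode != 0:
        # Stop at the first failure so it can be matched against the
        # traceback in the scheduler log.
        print(f"failed on attempt {attempt}", file=sys.stderr)
        sys.stdout.write(proc.stdout)
        sys.stderr.write(proc.stderr)
        break
```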
I think I may know the cause of me triggering this. I'll try to find time to create a simple test workflow to help. It was exposed on my end because I accidentally made an infinite task loop following the above steps.
Hi! Just wanted to say I encountered it too. Looks like a race condition to me, since the broadcasting task succeeded on a retry.

[runtime]
    [[_catch_raw]]
        script = """
            cylc broadcast "${CYLC_WORKFLOW_ID}" \
                -p "${CYLC_TASK_CYCLE_POINT}" \
                -s "[environment]RAWFILE_PATH=${catch_raw_file}"
            cylc broadcast "${CYLC_WORKFLOW_ID}" \
                -p "${CYLC_TASK_CYCLE_POINT}" \
                -s "[environment]RAWFILE_STEM=$(basename "$catch_raw_file" .raw)"
        """
        [[[meta]]]
            title = Catch Raw
            description = """
                This helper task follows the `catch_raw` external trigger, and propagates
                raw file path and stem to downstream tasks.
            """
I've tried to reproduce this problem by performing multiple broadcasts in a task, and running that task in parallel. So far I've not managed to replicate the bug. There's probably some other factor involved.
Sadly, I just cannot reproduce this one. Here's my latest attempt:
I've also been scanning for other reports of this error. Unfortunately, it doesn't look like there's anything we can do about this for now. If you are still experiencing the issue, please let us know and drop any context that might help here. I doubt it will reveal much, but running workflows in debug mode may capture a little more detail.

The traceback reported is not actually coming from the Cylc code; it's coming from the "promise" package, which is a dependency of the GraphQL tools that we use. We will need to refresh our Python GraphQL toolchain soon, which will remove this dependency, so if this is an issue in the underlying library, we should be rid of it then.
Not that big of a deal honestly, especially as automated retries mitigate it 👍
I just hit this three times in succession (30-second retries; after the third failure the task gave up retrying and failed). Only a broadcast was used. This is with Cylc 8.3.4.
Context: there would have been lots of tasks running at once, maybe broadcasts from different sources in parallel. Looking in the logs, I've seen it a few times before, but this is the first time it's happened multiple times in a row for the same task. From the logs I can grep out this:
In this particular case, here is the grep for
And a different, once-off case
I'm not running in debug mode and would prefer not to if I can avoid it. If there is anything else to grep from the logs, I can do that though. My best guess is either broadcasts from a set of tasks running around the same time, or broadcasts running while lots of other tasks are running, with new tasks being inserted into and old ones removed from the N-window.
Thanks for the report. I don't think there's going to be much more info you can glean from the logs (with or without debug mode) in this case. I just need to find a way to reproduce this locally. I'll try scaling up parallel broadcasts as far as I can and see if that does it.
One suggestion, in case you aren't already: use remote platforms for the broadcasts. Maybe the extra latency causes an issue?
Replicated!!! I had to push the scaling really, really far to encounter the issue (probably why I had failed to replicate it before). This example seems to reliably reproduce the issue within ~60 seconds. It's running ~50,000 broadcasts in parallel! I put the tasks onto an external platform (don't try running these locally, they will take out your box), but used a local job runner:

[task parameters]
    x = 1..100
    [[templates]]
        x = x_%(x)03d
[scheduling]
    initial cycle point = 1
    cycling mode = integer
    [[graph]]
        P1 = """
            <x>[-P1] => <x>
        """
[runtime]
    [[<x>]]
        script = """
            for x in $(seq 1 50); do
                cylc broadcast "${CYLC_WORKFLOW_ID}" \
                    -n "x_$(printf '%03d' "$x")" \
                    -p "$(( CYLC_TASK_CYCLE_POINT + 1 ))" \
                    -s "[environment]x_${RANDOM}=${RANDOM}" \
                    -s "[environment]x_${RANDOM}=${RANDOM}" \
                    -s "[environment]x_${RANDOM}=${RANDOM}" \
                    -s "[environment]x_${RANDOM}=${RANDOM}" \
                    -s "[environment]x_${RANDOM}=${RANDOM}" \
                    -s "[environment]x_${RANDOM}=${RANDOM}" \
                    -s "[environment]x_${RANDOM}=${RANDOM}" \
                    -s "[environment]x_${RANDOM}=${RANDOM}" \
                    -s "[environment]x_${RANDOM}=${RANDOM}" \
                    -s "[environment]x_${RANDOM}=${RANDOM}" \
                    -s 'pre-script=true' \
                    -s 'env-script=true' \
                    -s 'post-script=' \
                    -s 'exit-script=' \
                    -s 'err-script=' &
            done
            wait
            sleep $(( RANDOM % 3 ))
        """
Simple fix, will put this into 8.3.5.
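For anyone curious about this class of error in general: the usual way to make an iteration safe against concurrent insertions and removals is to iterate over a snapshot of the dictionary rather than the live object. Below is a minimal sketch of that general pattern with made-up names; it is illustrative only and not necessarily how the actual fix is implemented.

```python
# Illustrative sketch only: the generic "iterate over a snapshot" pattern
# that avoids "RuntimeError: dictionary changed size during iteration"
# when a dict may be mutated while it is being read.

broadcasts = {  # hypothetical nested store of broadcast settings
    "x_001": {"environment": {"RAWFILE_PATH": "/path/to/file.raw"}},
}

def expand_settings(store):
    """Flatten the nested settings, reading from snapshots of each dict."""
    expanded = []
    # list(...) copies the items up front, so later insertions/deletions
    # in `store` cannot invalidate these loops.
    for namespace, sections in list(store.items()):
        for section, settings in list(sections.items()):
            for key, value in list(settings.items()):
                expanded.append(f"{namespace}: [{section}]{key}={value}")
    return expanded

print(expand_settings(broadcasts))
```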
I believe this can be closed now?
Closed by #6397 |
@hjoliver and @oliver-sanders - any chance I could convince you to put out 8.3.5 for this fix? The bug is hitting me somewhat regularly in my large workflows and I'm wanting to start routine distribution of data to downstream users for UAT. |
Yes, we were looking to release 8.3.5 soon. How're you placed for that at the UK end, @oliver-sanders? (I'm out of time for today but can check the milestone status tomorrow.)
Description
This is in CYLC_VERSION=8.3.2
It appears to be a race condition with a dictionary not being thread-safe.
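For reference, this is the error CPython raises whenever a dict's size changes while something is iterating over it. Here is a minimal, self-contained illustration of that race; nothing Cylc-specific, just a demonstration of the error class.

```python
# Minimal illustration (not Cylc code): one thread grows and shrinks a dict
# while another iterates it, which sooner or later raises
# "RuntimeError: dictionary changed size during iteration".
import threading

shared = {i: i for i in range(1000)}
stop = threading.Event()

def writer():
    # Repeatedly insert and remove keys so the dict's size keeps changing.
    while not stop.is_set():
        for n in range(1000, 2000):
            shared[n] = n
        for n in range(1000, 2000):
            shared.pop(n, None)

thread = threading.Thread(target=writer, daemon=True)
thread.start()
try:
    for _ in range(100_000):
        for key in shared:  # iterate the live dict while it is mutated
            pass
    print("no error this time (the race is timing dependent)")
except RuntimeError as exc:
    print("caught:", exc)
finally:
    stop.set()
    thread.join()
```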
Reproducible Example
I've not had time to explore this in great detail. I am roughly doing
e.g.
Expected Behaviour
No failure should be seen.