[bugfix] Restart k8s log stream on urllib3 failure #26760
Conversation
couple of small things inline and then this looks good!
There are actually several tests of execute_k8s_job fwiw: https://github.com/dagster-io/dagster/blob/master/integration_tests/test_suites/k8s-test-suite/tests/test_k8s_job_op.py#L1-L586
```
context.log.warning(
    f"urllib3.exceptions.ProtocolError. Pausing and will reconnect. {e}"
)
```
can this be `context.log.warning(f"urllib3.exceptions.ProtocolError. Pausing and will reconnect.", exc_info=True)` to provide the full stack trace?
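For illustration, a minimal sketch of what this suggestion could look like in context (the surrounding try/except and the `log_stream` name are assumptions, not the actual code in the PR):

```python
from urllib3.exceptions import ProtocolError

try:
    for line in log_stream:  # hypothetical iterator returned by watch.stream()
        context.log.info(line)
except ProtocolError:
    # exc_info=True attaches the active exception's stack trace to the warning,
    # so interpolating {e} into the message is no longer needed.
    context.log.warning(
        "urllib3.exceptions.ProtocolError. Pausing and will reconnect.",
        exc_info=True,
    )
```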
```
context.log.warning(
    f"urllib3.exceptions.ProtocolError. Pausing and will reconnect. {e}"
)
time.sleep(5)
```
the other timeouts in this file are all parameterized via env var. can this do the same? `time.sleep(int(os.getenv("DAGSTER_EXECUTE_K8S_JOB_WAIT_AFTER_STREAM_LOGS_FAILURE")))`
(apologies for the delay in response)
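A hedged sketch of the env-var pattern being suggested (the 5-second fallback and the standalone snippet shape are assumptions for illustration, not the actual diff):

```python
import os
import time

# Pause before reconnecting; configurable via env var, falling back to
# 5 seconds when the variable is unset (the default here is an assumption).
wait_seconds = int(
    os.getenv("DAGSTER_EXECUTE_K8S_JOB_WAIT_AFTER_STREAM_LOGS_FAILURE", "5")
)
time.sleep(wait_seconds)
```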
```
else:
except ProtocolError as e:
    context.log.warning(
        f"urllib3.exceptions.ProtocolError. Pausing and will reconnect. {e}"
```
I do have one question actually - will this cause the logs to start over from the beginning if some have already been output before this error has been raised?
You could also imagine saying that on any error reading the logs, including this one, it logs a warning with the stack trace explaining what the failure was, but doesn't fail the whole op (or try to start over) and just continues on, waiting for the pod to finish
It's hard to tell from the docs, but I just checked our logs and I can confirm that it creates duplicate logs. I can't find an easy way to fix that (we could maybe use `since_seconds`, but according to the docs it's relative to "now", so we'd need to be careful there). Still, I'd rather have duplicated logs than a failed op :)
As for your suggestion - are you suggesting that if it encounters this (or other) failures, it'll just give a warning saying that it failed getting the logs (so logs could be partial), and then wait for the job to complete? I guess it's an option too. If that's the preferred behavior, I can modify the code, just let me know what you prefer.
(also - woohoo! I just confirmed that this fix prevented some of our jobs from failing)
that's great that it fixed it! I think there are two things we could do (or both):
- the thing i suggested above where we just log the error and continue on any error, not just this specific protocol error
- adding some kind of retry limit here before it gives up. The only case i'm a little worried about here is if there is a repeated networking error of some sort for whatever reason and it just retries indefinitely
I have a mild preference for the first one just because it's simpler and I could imagine it being helpful for other transient issues, but I could go either way.
I agree - the first option makes more sense. It'll keep things simple for something that isn't very common and is hard to test/recreate. I'll update the PR
…t keep waiting for the pod to complete
Updated so it simply stops reading the logs. It's a much simpler change. Let me know if that's what you had in mind.
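For illustration, a hedged sketch of the behavior described here (names and structure are assumptions, not the actual diff):

```python
try:
    for line in log_stream:  # hypothetical iterator returned by watch.stream()
        context.log.info(line)
except Exception:
    # Any failure while reading logs is logged with its stack trace, but the op
    # is not failed and the stream is not restarted; execution continues and
    # simply keeps waiting for the pod to complete.
    context.log.warning(
        "Error while reading pod logs; continuing to wait for the pod to complete.",
        exc_info=True,
    )
```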
nice and simple, love it
Summary & Motivation
Should fix the bug described here - #26626
The `execute_k8s_job` method uses `watch.stream()` to stream logs from k8s pods. When the client enters a stale state, we should call `stream` again. See the bug report for more information.
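For illustration, a minimal sketch of the restart-on-failure approach described above (the function shape, names, and fixed pause are assumptions, not the actual implementation in the PR):

```python
import time

from kubernetes import client, watch
from urllib3.exceptions import ProtocolError


def stream_pod_logs(context, pod_name: str, namespace: str):
    """Sketch: restart watch.stream() when the underlying connection goes stale."""
    core_api = client.CoreV1Api()
    while True:
        try:
            log_stream = watch.Watch().stream(
                core_api.read_namespaced_pod_log,
                name=pod_name,
                namespace=namespace,
            )
            for line in log_stream:
                context.log.info(line)
            return  # stream ended normally; the pod finished writing logs
        except ProtocolError as e:
            context.log.warning(
                f"urllib3.exceptions.ProtocolError. Pausing and will reconnect. {e}"
            )
            time.sleep(5)  # fixed pause here; the review discussion parameterizes this
```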
How I Tested These Changes
I was unable to find a repeatable way to recreate the issue, and there are no existing tests for `execute_k8s_job`. I deployed a similar fix to our dev and prod environments, and the problem has not appeared yet. At the very least I can say that it didn't degrade the stability of this method.
Changelog