[bugfix] Restart k8s log stream on urllib3 failure #26760
Conversation
couple of small things inline and then this looks good!
There are actually several tests of execute_k8s_job fwiw: https://github.com/dagster-io/dagster/blob/master/integration_tests/test_suites/k8s-test-suite/tests/test_k8s_job_op.py#L1-L586
```
context.log.warning(
    f"urllib3.exceptions.ProtocolError. Pausing and will reconnect. {e}"
)
```
can this be `context.log.warning(f"urllib3.exceptions.ProtocolError. Pausing and will reconnect.", exc_info=True)` to provide the full stack trace?
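For illustration, a minimal sketch of what this suggestion could look like in context (the surrounding try/except and the `log_stream` name are assumptions, not the actual code in the PR):

```python
from urllib3.exceptions import ProtocolError

try:
    for line in log_stream:  # hypothetical iterator returned by watch.stream()
        context.log.info(line)
except ProtocolError:
    # exc_info=True attaches the active exception's stack trace to the warning,
    # so interpolating {e} into the message is no longer needed.
    context.log.warning(
        "urllib3.exceptions.ProtocolError. Pausing and will reconnect.",
        exc_info=True,
    )
```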
```
context.log.warning(
    f"urllib3.exceptions.ProtocolError. Pausing and will reconnect. {e}"
)
time.sleep(5)
```
the other timeouts in this file are all parameterized via env var. can this do the same? `time.sleep(int(os.getenv("DAGSTER_EXECUTE_K8S_JOB_WAIT_AFTER_STREAM_LOGS_FAILURE")))`
(apologies for the delay in response)
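A hedged sketch of the env-var pattern being suggested (the 5-second fallback and the standalone snippet shape are assumptions for illustration, not the actual diff):

```python
import os
import time

# Pause before reconnecting; configurable via env var, falling back to
# 5 seconds when the variable is unset (the default here is an assumption).
wait_seconds = int(
    os.getenv("DAGSTER_EXECUTE_K8S_JOB_WAIT_AFTER_STREAM_LOGS_FAILURE", "5")
)
time.sleep(wait_seconds)
```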
```
else:
except ProtocolError as e:
    context.log.warning(
        f"urllib3.exceptions.ProtocolError. Pausing and will reconnect. {e}"
```
I do have one question actually - will this cause the logs to start over from the beginning if some have already been output before this error has been raised?
You could also imagine saying that on any error reading the logs, including this one, it logs a warning with the stack trace explaining what the failure was, but doesn't fail the whole op (or try to start over) and just continues on, waiting for the pod to finish
It's hard to tell from the docs, but I just checked our logs and I can confirm that it creates duplicate logs. I can't find an easy way to fix that (we could maybe use `since_seconds`, but according to the docs it's relative to "now", so we'd need to be careful there). Still, I'd rather have duplicated logs than a failed op :)
As for your suggestion - are you suggesting that if it encounters this (or other) failures, it'll just give a warning saying that it failed getting the logs (so logs could be partial), and then wait for the job to complete? I guess it's an option too. If that's the preferred behavior, I can modify the code, just let me know what you prefer.
(also - woohoo! I just confirmed that this fix prevented some of our jobs from failing)
that's great that it fixed it! I think there are two things we could do (or both):
- the thing i suggested above where we just log the error and continue on any error, not just this specific protocol error
- adding some kind of retry limit here before it gives up. The only case i'm a little worried about here is if there is a repeated networking error of some sort for whatever reason and it just retries indefinitely
I have a mild preference for the first one just because it's simpler and I could imagine it being helpful for other transient issues, but I could go either way.
I agree - the first option makes more sense. It'll keep things simple for something that isn't very common and is hard to test/recreate. I'll update the PR
…t keep waiting for the pod to complete
Updated so it simply stops reading the logs. It's a much simpler change. Let me know if that's what you had in mind.
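For illustration, a hedged sketch of the behavior described here (names and structure are assumptions, not the actual diff):

```python
try:
    for line in log_stream:  # hypothetical iterator returned by watch.stream()
        context.log.info(line)
except Exception:
    # Any failure while reading logs is logged with its stack trace, but the op
    # is not failed and the stream is not restarted; execution continues and
    # simply keeps waiting for the pod to complete.
    context.log.warning(
        "Error while reading pod logs; continuing to wait for the pod to complete.",
        exc_info=True,
    )
```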
nice and simple, love it
Summary & Motivation
Should fix the bug described here - #26626
The `execute_k8s_job` method uses `watch.stream()` to stream logs from k8s pods. When the client enters a stale state, we should call `stream` again. See the bug report for more information.
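For illustration, a minimal sketch of the restart-on-failure approach described above (the function shape, names, and fixed pause are assumptions, not the actual implementation in the PR):

```python
import time

from kubernetes import client, watch
from urllib3.exceptions import ProtocolError


def stream_pod_logs(context, pod_name: str, namespace: str):
    """Sketch: restart watch.stream() when the underlying connection goes stale."""
    core_api = client.CoreV1Api()
    while True:
        try:
            log_stream = watch.Watch().stream(
                core_api.read_namespaced_pod_log,
                name=pod_name,
                namespace=namespace,
            )
            for line in log_stream:
                context.log.info(line)
            return  # stream ended normally; the pod finished writing logs
        except ProtocolError as e:
            context.log.warning(
                f"urllib3.exceptions.ProtocolError. Pausing and will reconnect. {e}"
            )
            time.sleep(5)  # fixed pause here; the review discussion parameterizes this
```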
How I Tested These Changes
I was unable to find a repeatable way to recreate the issue, and there are no existing tests for `execute_k8s_job`. I deployed a similar fix to our dev and prod environments, and the problem has not appeared yet. At the very least I can say that it didn't degrade the stability of this method.
Changelog