execute_k8s_job does not handle watch client stale state #26626
Comments
As a workaround, this seems to be working (again, hard to confirm because I can't easily recreate the issue). I just created a copy of the
Your comment on Slack mentioning that it fails after exactly 4 hours mirrors the symptoms from #21331. That issue hasn't been closed but perhaps it should be, @MattyKuzyk. The 4-hour limit was fixed in #24313 for that issue (pipes). Perhaps that solution can be repurposed and/or merged with the normal k8s executor behavior to fix it in both places.
Thanks @easontm, I'll take a look. It's a different bug though - Dagster consistently stops reading the logs after 4 hours (but doesn't fail). This bug occasionally causes the job to fail. It's good to know that
## Summary & Motivation

Should fix the bug described in #26626. The `execute_k8s_job` method uses `watch.stream()` to stream logs from k8s pods. When the client enters a stale state, we should call `stream` again. See the bug report for more information.

## How I Tested These Changes

I was unable to find a repeatable way to recreate the issue, and there are no existing tests for `execute_k8s_job`. I deployed a similar fix to our dev and prod environments, and the problem has not appeared yet. At the very least I can say that it didn't degrade the stability of this method.
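For illustration, here is a minimal sketch of that approach using the kubernetes Python client's `Watch` API. The function name, retry policy, sleep interval, and the exception types caught are assumptions for the example, not taken from Dagster's `execute_k8s_job`:

```python
# Minimal sketch, not Dagster's actual implementation: recreate the watch client
# and its stream on every retry so a stale connection is never reused.
import time

from kubernetes import client, watch
from kubernetes.client.rest import ApiException
from urllib3.exceptions import ProtocolError


def stream_pod_logs(pod_name: str, namespace: str, max_retries: int = 5) -> None:
    # Assumes kubeconfig or in-cluster config has already been loaded, e.g. via
    # kubernetes.config.load_kube_config() or load_incluster_config().
    core_api = client.CoreV1Api()

    retries = 0
    while True:
        # Create a fresh Watch and log stream on each attempt.
        pod_watch = watch.Watch()
        log_stream = pod_watch.stream(
            core_api.read_namespaced_pod_log,
            name=pod_name,
            namespace=namespace,
            follow=True,
        )
        try:
            for line in log_stream:
                print(line)
            return  # Stream ended normally (the pod finished).
        except (ProtocolError, ApiException) as e:
            retries += 1
            if retries > max_retries:
                raise
            print(f"Log stream interrupted ({e!r}); recreating watch, retry {retries}")
            time.sleep(5)
        finally:
            pod_watch.stop()
```

The key difference from a plain retry around `next(log_stream)` is that the `watch.Watch()` object itself is rebuilt inside the loop, so a stale client cannot keep failing every subsequent read.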
What's the issue?
Long calls to `execute_k8s_job` sometimes fail when reading the logs. The method has retries around `next(log_stream)`, but if the watch client enters a stale state, the code ends up failing. Example log:

I found similar issues reported in ansible-playbook, and the relevant issue in the kubernetes client. The solution is to move the watch client creation (`log_stream = watch.stream()`) into a loop as well. I'm trying it out in my repo and will post a PR with a fix after I confirm that it's working (or at least not introducing new issues).

What did you expect to happen?
The code shouldn't fail because of intermittent errors.
How to reproduce?
This is difficult to reproduce. It originates from the underlying k8s client and only happens very rarely (but often enough to fail long-running, expensive training jobs).
Dagster version
1.9.3
Deployment type
Dagster Helm chart
Deployment details
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
By submitting this issue, you agree to follow Dagster's Code of Conduct.