-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flush buffer in streaming interface before writing zip data #1161
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
We ran into a really obscure issue when working on ansible/receptor#683. I'll try to make this at least somewhat digestable. Due to a bug in Kubernetes, AWX can't currently run jobs longer than 4 hours when deployed into Kubernetes. More context on that in ansible/awx#11805 To address this issue, we needed a way to restart from a certain point in the logs. The only mechanism Kubernetes provides to do this is by passing "sinceTime" to the API endpoint for retrieving logs from a pod. Our patch in ansible/receptor#683 worked when we ran it locally, but in OpenShift, jobs errored when unpacking the zip stream at the end of the results of "ansible-runner worker". Upon further investigation this was because the timestamps of the last 2 lines were exactly the same: ``` 2022-11-09T00:07:46.851687621Z {"status": "successful", "runner_ident": "1"} 2022-11-08T23:07:58.648753832Z {"zipfile": 1330} 2022-11-08T23:07:58.648753832Z UEsDBBQAAAAIAPy4aFVGnUFkqQMAAIwK.... ``` After squinting at this code for a bit I noticed that we weren't flushing the buffer here like we do in the event_handler and other callbacks that are fired in streaming.py. The end. Ugh.
nitzmahone
approved these changes
Nov 9, 2022
Akasurde
approved these changes
Nov 9, 2022
This was referenced Nov 9, 2022
TheRealHaoLiu
added a commit
to TheRealHaoLiu/receptor
that referenced
this pull request
Nov 28, 2022
resume kube job log using kube log timestamp require fix for kubernetes/kubernetes#77603 in kubernetes/kubernetes#113481 fix is backported to the following kubernetes version - 1.23.14 - 1.24.8 - 1.25.4 required fix for ansible/ansible-runner#1161 fix is backported to the following ansible-runner version - 2.2.2 - 2.3.1 added `RECEPTOR_KUBE_SUPPORT_RECONNECT` with following options: - “enabled”: this option will use timestamp with the log and enable our new code path - “disabled”: this option will not use timestamp and use the original code path - “auto”: auto detect if it's appropriate to enable timestamp base on kube version Co-Authored-By: Shane McDonald <me@shanemcd.com> Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com>
TheRealHaoLiu
added a commit
to TheRealHaoLiu/receptor
that referenced
this pull request
Nov 28, 2022
resume kube job log using kube log timestamp require fix for kubernetes/kubernetes#77603 in kubernetes/kubernetes#113481 fix is backported to the following kubernetes version - 1.23.14 - 1.24.8 - 1.25.4 required fix for ansible/ansible-runner#1161 fix is backported to the following ansible-runner version - 2.2.2 - 2.3.1 added `RECEPTOR_KUBE_SUPPORT_RECONNECT` with following options: - “enabled”: this option will use timestamp with the log and enable our new code path - “disabled”: this option will not use timestamp and use the original code path - “auto”: auto detect if it's appropriate to enable timestamp base on kube version Co-Authored-By: Shane McDonald <me@shanemcd.com> Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com> Signed-off-by: Hao Liu <haoli@redhat.com>
shanemcd
added a commit
to ansible/receptor
that referenced
this pull request
Dec 8, 2022
resume kube job log using kube log timestamp require fix for kubernetes/kubernetes#77603 in kubernetes/kubernetes#113481 fix is backported to the following kubernetes version - 1.23.14 - 1.24.8 - 1.25.4 required fix for ansible/ansible-runner#1161 fix is backported to the following ansible-runner version - 2.2.2 - 2.3.1 added `RECEPTOR_KUBE_SUPPORT_RECONNECT` with following options: - “enabled”: this option will use timestamp with the log and enable our new code path - “disabled”: this option will not use timestamp and use the original code path - “auto”: auto detect if it's appropriate to enable timestamp base on kube version Co-Authored-By: Shane McDonald <me@shanemcd.com> Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com> Signed-off-by: Hao Liu <haoli@redhat.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We ran into a really obscure issue when working on ansible/receptor#683.
I'll try to make this at least somewhat digestable.
Due to a bug in Kubernetes, AWX can't currently run jobs longer than 4 hours when deployed into Kubernetes. More context on that in ansible/awx#11805
To address this issue, we needed a way to restart from a certain point in the logs. The only mechanism Kubernetes provides to do this is by passing "sinceTime" to the API endpoint for retrieving logs from a pod.
Our patch in ansible/receptor#683 worked when we ran it locally, but in OpenShift, jobs errored when unpacking the zip stream at the end of the results of "ansible-runner worker". Upon further investigation this was because the timestamps of the last 2 lines were exactly the same:
After squinting at this code for a bit I noticed that we weren't flushing the buffer here like we do in the event_handler and other callbacks that are fired in streaming.py. The end. Ugh.