Flush buffer in streaming interface before writing zip data #1161

Merged
shanemcd merged 1 commit into ansible:devel on Nov 9, 2022

Conversation

shanemcd (Member) commented Nov 9, 2022

We ran into a really obscure issue when working on ansible/receptor#683.

I'll try to make this at least somewhat digestible.

Due to a bug in Kubernetes, AWX currently cannot run jobs longer than 4 hours when deployed on Kubernetes. There is more context on that in ansible/awx#11805.

To address this issue, we needed a way to resume reading the logs from a certain point. The only mechanism Kubernetes provides for this is the "sinceTime" parameter on the API endpoint for retrieving logs from a pod.
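For illustration only, here is a minimal sketch of resuming a pod's log stream from a point in time via the raw Kubernetes pod-log endpoint. The API server URL, token path, namespace, and pod name below are placeholders; receptor's actual implementation is in Go (client-go) and looks nothing like this snippet.

```python
# Minimal sketch (not receptor's code): resume a pod's log stream from a
# point in time by passing sinceTime to the pod log endpoint.
import requests

API_SERVER = "https://kubernetes.default.svc"  # placeholder
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
NAMESPACE, POD = "awx", "automation-job-42"    # placeholders

with open(TOKEN_PATH) as f:
    token = f.read().strip()

resp = requests.get(
    f"{API_SERVER}/api/v1/namespaces/{NAMESPACE}/pods/{POD}/log",
    headers={"Authorization": f"Bearer {token}"},
    params={
        "follow": "true",
        "timestamps": "true",                 # prefix each line with an RFC3339 timestamp
        "sinceTime": "2022-11-08T23:07:58Z",  # resume from the last timestamp we saw
    },
    verify=CA_PATH,
    stream=True,
)
for line in resp.iter_lines():
    print(line.decode())
```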

Our patch in ansible/receptor#683 worked when we ran it locally, but in OpenShift, jobs errored when unpacking the zip stream at the end of the output of `ansible-runner worker`. Upon further investigation, this was because the timestamps of the last 2 lines were exactly the same:

```
2022-11-09T00:07:46.851687621Z {"status": "successful", "runner_ident": "1"}
2022-11-08T23:07:58.648753832Z {"zipfile": 1330}
2022-11-08T23:07:58.648753832Z UEsDBBQAAAAIAPy4aFVGnUFkqQMAAIwK....
```
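(Purely as a hypothetical illustration of why a shared timestamp is a problem when resuming with sinceTime: any guard that skips lines at or before the last timestamp it already processed will also drop a new line that happens to share that exact timestamp. Receptor's real logic is in Go; the snippet below is not it.)

```python
# Hypothetical illustration: a naive "skip anything at or before the last
# timestamp we processed" guard also discards a *new* line that shares it.
last_seen = "2022-11-08T23:07:58.648753832Z"  # timestamp of the {"zipfile": 1330} line

resumed_lines = [
    ("2022-11-08T23:07:58.648753832Z", '{"zipfile": 1330}'),            # re-delivered boundary line
    ("2022-11-08T23:07:58.648753832Z", "UEsDBBQAAAAIAPy4aFVGnUFk...."),  # new zip data, same stamp
]

kept = [line for ts, line in resumed_lines if ts > last_seen]
print(kept)  # [] -- the zip data is silently dropped and the archive cannot be unpacked
```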

After squinting at this code for a bit, I noticed that we weren't flushing the buffer here like we do in the event_handler and the other callbacks fired in streaming.py, so the `{"zipfile": ...}` header and the zip data presumably went out in a single write and were stamped with the same timestamp. The end. Ugh.
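The change itself is essentially one added flush. A minimal sketch of the pattern, not the literal diff (the helper name and the end-of-stream marker below are illustrative):

```python
import json

# Sketch of the streaming pattern this PR fixes: flush buffered lines before
# the raw zip payload so the JSON header and the binary data are emitted as
# separate writes (and therefore get distinct log timestamps).
def send_zipfile(output, zip_data: bytes) -> None:
    output.write(json.dumps({"zipfile": len(zip_data)}).encode("utf-8") + b"\n")
    output.flush()  # <-- the missing flush
    output.write(zip_data)
    output.write(b'{"eof": true}\n')  # illustrative end-of-stream marker
    output.flush()
```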

shanemcd requested a review from a team as a code owner on November 9, 2022 02:24
shanemcd merged commit 9d0ce96 into ansible:devel on Nov 9, 2022
TheRealHaoLiu added a commit to TheRealHaoLiu/receptor that referenced this pull request Nov 28, 2022
resume kube job log using kube log timestamp

Requires the fix for kubernetes/kubernetes#77603
delivered in kubernetes/kubernetes#113481.

That fix is backported to the following Kubernetes versions:
- 1.23.14
- 1.24.8
- 1.25.4

Also requires the fix from ansible/ansible-runner#1161.

That fix is backported to the following ansible-runner versions:
- 2.2.2
- 2.3.1

Added `RECEPTOR_KUBE_SUPPORT_RECONNECT` with the following options:

- “enabled”: use timestamps with the log and enable the new code path
- “disabled”: do not use timestamps; use the original code path
- “auto”: auto-detect whether it is appropriate to enable timestamps based on the kube version

Co-Authored-By: Shane McDonald <me@shanemcd.com>
Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com>
shanemcd added a commit to ansible/receptor that referenced this pull request Dec 8, 2022
resume kube job log using kube log timestamp

Requires the fix for kubernetes/kubernetes#77603
delivered in kubernetes/kubernetes#113481.

That fix is backported to the following Kubernetes versions:
- 1.23.14
- 1.24.8
- 1.25.4

Also requires the fix from ansible/ansible-runner#1161.

That fix is backported to the following ansible-runner versions:
- 2.2.2
- 2.3.1

Added `RECEPTOR_KUBE_SUPPORT_RECONNECT` with the following options:

- “enabled”: use timestamps with the log and enable the new code path
- “disabled”: do not use timestamps; use the original code path
- “auto”: auto-detect whether it is appropriate to enable timestamps based on the kube version (see the sketch after this commit message)

Co-Authored-By: Shane McDonald <me@shanemcd.com>
Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com>
Signed-off-by: Hao Liu <haoli@redhat.com>
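As a rough illustration of the kind of version gate the “auto” option implies (this is not receptor's actual Go implementation; the function and the 1.26+ assumption below are only a sketch based on the backport versions listed above):

```python
# Hypothetical sketch: only trust kube log timestamps for reconnect when the
# cluster runs a release carrying the kubernetes/kubernetes#113481 backport.
FIRST_PATCHED = {(1, 23): 14, (1, 24): 8, (1, 25): 4}  # (major, minor) -> first patched patch level

def timestamps_reliable(major: int, minor: int, patch: int) -> bool:
    if (major, minor) in FIRST_PATCHED:
        return patch >= FIRST_PATCHED[(major, minor)]
    return (major, minor) > (1, 25)  # assume 1.26+ includes the fix

assert timestamps_reliable(1, 24, 8)
assert not timestamps_reliable(1, 23, 13)
```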