Flush buffer in streaming interface before writing zip data #1161

Merged
shanemcd merged 1 commit into ansible:devel on Nov 9, 2022

Conversation

shanemcd (Member) commented Nov 9, 2022

We ran into a really obscure issue when working on ansible/receptor#683.

I'll try to make this at least somewhat digestible.

Due to a bug in Kubernetes, AWX currently cannot run jobs longer than 4 hours when deployed on Kubernetes. There is more context on that in ansible/awx#11805.

To address this issue, we needed a way to resume reading the logs from a certain point. The only mechanism Kubernetes provides for this is the "sinceTime" parameter on the API endpoint for retrieving logs from a pod.
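For illustration only, here is a minimal sketch of resuming a pod's log stream from a point in time via the raw Kubernetes pod-log endpoint. The API server URL, token path, namespace, and pod name below are placeholders; receptor's actual implementation is in Go (client-go) and looks nothing like this snippet.

```python
# Minimal sketch (not receptor's code): resume a pod's log stream from a
# point in time by passing sinceTime to the pod log endpoint.
import requests

API_SERVER = "https://kubernetes.default.svc"  # placeholder
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
NAMESPACE, POD = "awx", "automation-job-42"    # placeholders

with open(TOKEN_PATH) as f:
    token = f.read().strip()

resp = requests.get(
    f"{API_SERVER}/api/v1/namespaces/{NAMESPACE}/pods/{POD}/log",
    headers={"Authorization": f"Bearer {token}"},
    params={
        "follow": "true",
        "timestamps": "true",                 # prefix each line with an RFC3339 timestamp
        "sinceTime": "2022-11-08T23:07:58Z",  # resume from the last timestamp we saw
    },
    verify=CA_PATH,
    stream=True,
)
for line in resp.iter_lines():
    print(line.decode())
```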

Our patch in ansible/receptor#683 worked when we ran it locally, but in OpenShift, jobs errored when unpacking the zip stream at the end of the output of `ansible-runner worker`. Upon further investigation, this was because the timestamps of the last 2 lines were exactly the same:

```
2022-11-09T00:07:46.851687621Z {"status": "successful", "runner_ident": "1"}
2022-11-08T23:07:58.648753832Z {"zipfile": 1330}
2022-11-08T23:07:58.648753832Z UEsDBBQAAAAIAPy4aFVGnUFkqQMAAIwK....
```
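(Purely as a hypothetical illustration of why a shared timestamp is a problem when resuming with sinceTime: any guard that skips lines at or before the last timestamp it already processed will also drop a new line that happens to share that exact timestamp. Receptor's real logic is in Go; the snippet below is not it.)

```python
# Hypothetical illustration: a naive "skip anything at or before the last
# timestamp we processed" guard also discards a *new* line that shares it.
last_seen = "2022-11-08T23:07:58.648753832Z"  # timestamp of the {"zipfile": 1330} line

resumed_lines = [
    ("2022-11-08T23:07:58.648753832Z", '{"zipfile": 1330}'),            # re-delivered boundary line
    ("2022-11-08T23:07:58.648753832Z", "UEsDBBQAAAAIAPy4aFVGnUFk...."),  # new zip data, same stamp
]

kept = [line for ts, line in resumed_lines if ts > last_seen]
print(kept)  # [] -- the zip data is silently dropped and the archive cannot be unpacked
```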

After squinting at this code for a bit, I noticed that we weren't flushing the buffer here like we do in the event_handler and the other callbacks fired in streaming.py, so the `{"zipfile": ...}` header and the zip data presumably went out in a single write and were stamped with the same timestamp. The end. Ugh.
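The change itself is essentially one added flush. A minimal sketch of the pattern, not the literal diff (the helper name and the end-of-stream marker below are illustrative):

```python
import json

# Sketch of the streaming pattern this PR fixes: flush buffered lines before
# the raw zip payload so the JSON header and the binary data are emitted as
# separate writes (and therefore get distinct log timestamps).
def send_zipfile(output, zip_data: bytes) -> None:
    output.write(json.dumps({"zipfile": len(zip_data)}).encode("utf-8") + b"\n")
    output.flush()  # <-- the missing flush
    output.write(zip_data)
    output.write(b'{"eof": true}\n')  # illustrative end-of-stream marker
    output.flush()
```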

shanemcd requested a review from a team as a code owner on November 9, 2022 02:24
shanemcd merged commit 9d0ce96 into ansible:devel on Nov 9, 2022
TheRealHaoLiu added a commit to TheRealHaoLiu/receptor that referenced this pull request Nov 28, 2022
resume kube job log using kube log timestamp

Requires the fix for kubernetes/kubernetes#77603
delivered in kubernetes/kubernetes#113481.

That fix is backported to the following Kubernetes versions:
- 1.23.14
- 1.24.8
- 1.25.4

Also requires the fix from ansible/ansible-runner#1161.

That fix is backported to the following ansible-runner versions:
- 2.2.2
- 2.3.1

Added `RECEPTOR_KUBE_SUPPORT_RECONNECT` with the following options:

- “enabled”: use timestamps with the log and enable the new code path
- “disabled”: do not use timestamps; use the original code path
- “auto”: auto-detect whether it is appropriate to enable timestamps based on the kube version

Co-Authored-By: Shane McDonald <me@shanemcd.com>
Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com>
shanemcd added a commit to ansible/receptor that referenced this pull request Dec 8, 2022
resume kube job log using kube log timestamp

Requires the fix for kubernetes/kubernetes#77603
delivered in kubernetes/kubernetes#113481.

That fix is backported to the following Kubernetes versions:
- 1.23.14
- 1.24.8
- 1.25.4

Also requires the fix from ansible/ansible-runner#1161.

That fix is backported to the following ansible-runner versions:
- 2.2.2
- 2.3.1

Added `RECEPTOR_KUBE_SUPPORT_RECONNECT` with the following options:

- “enabled”: use timestamps with the log and enable the new code path
- “disabled”: do not use timestamps; use the original code path
- “auto”: auto-detect whether it is appropriate to enable timestamps based on the kube version (see the sketch after this commit message)

Co-Authored-By: Shane McDonald <me@shanemcd.com>
Co-Authored-By: Seth Foster <fosterseth@users.noreply.github.com>
Signed-off-by: Hao Liu <haoli@redhat.com>
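As a rough illustration of the kind of version gate the “auto” option implies (this is not receptor's actual Go implementation; the function and the 1.26+ assumption below are only a sketch based on the backport versions listed above):

```python
# Hypothetical sketch: only trust kube log timestamps for reconnect when the
# cluster runs a release carrying the kubernetes/kubernetes#113481 backport.
FIRST_PATCHED = {(1, 23): 14, (1, 24): 8, (1, 25): 4}  # (major, minor) -> first patched patch level

def timestamps_reliable(major: int, minor: int, patch: int) -> bool:
    if (major, minor) in FIRST_PATCHED:
        return patch >= FIRST_PATCHED[(major, minor)]
    return (major, minor) > (1, 25)  # assume 1.26+ includes the fix

assert timestamps_reliable(1, 24, 8)
assert not timestamps_reliable(1, 23, 13)
```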