Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send HTTP Hijack headers after successful attach #7451

Merged
merged 1 commit into from
Aug 27, 2020

Conversation

mheon
Copy link
Member

@mheon mheon commented Aug 25, 2020

Our previous flow was to perform a hijack before passing a connection into Libpod, and then Libpod would attach to the container's attach socket and begin forwarding traffic.

A problem emerges: we write the attach header as soon as the attach complete. As soon as we write the header, the client assumes that all is ready, and sends a Start request. This Start may be processed before we successfully finish attaching, causing us to lose output.

The solution is to handle hijacking inside Libpod. Unfortunately, this requires a downright extensive refactor of the Attach and HTTP Exec StartAndAttach code. I think the result is an improvement in some places (a lot more errors will be handled with a proper HTTP error code, before the hijack occurs) but other parts, like the relocation of printing container logs, are just bad. Still, we need this fixed now to get CI back into good shape...

Fixes #7195

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 25, 2020
@mheon
Copy link
Member Author

mheon commented Aug 25, 2020

@baude @edsantiago Bear witness to my sorrow.

I think this probably fixes the #7195 flake but I have not performed anything more than trivial tests yet.

@mheon mheon force-pushed the fix_7195 branch 2 times, most recently from d839aa9 to b67ee9d Compare August 25, 2020 21:14
@mheon
Copy link
Member Author

mheon commented Aug 25, 2020

This appears to have made things worse. My interruption of their delicate dance has offended the attach functions and they will work no longer.

I will attempt to appease them tomorrow.

@mheon mheon force-pushed the fix_7195 branch 2 times, most recently from d662739 to 199158d Compare August 26, 2020 17:41
@mheon
Copy link
Member Author

mheon commented Aug 26, 2020

I think this is going to go green now.

@mheon
Copy link
Member Author

mheon commented Aug 26, 2020

Apparently I broke exec exit codes here - it looks like a race where it's querying for exec status before the exec session is marked as exited.

@mheon
Copy link
Member Author

mheon commented Aug 26, 2020

Alright, I know what's going on (connection is being closed earlier than when it used to be closed, thus causing the client to query for exec session status a lot earlier than it used to, before we have time to save it to disk). Still need to figure out how to actually fix it, the code is not structured in a way that is conducive to closing the connection later.

@mheon mheon force-pushed the fix_7195 branch 2 times, most recently from 6db6b01 to 7d65478 Compare August 27, 2020 14:36
@mheon
Copy link
Member Author

mheon commented Aug 27, 2020

Failed to stop: Error getting access token for service account: Remote host terminated the handshake

Second time I've seen this...

@mheon
Copy link
Member Author

mheon commented Aug 27, 2020

[+0280s] # error opening file /sys/fs/cgroup//system.slice/crun-buildah-buildah175193391.scope/container/cgroup.freeze: No such file or directory

Flakes in our flake fixes...

@edsantiago
Copy link
Member

At least that one usually passes on rerun

@mheon
Copy link
Member Author

mheon commented Aug 27, 2020

Nevermind, exec tests look red still. Damn it.

Will look more after lunch.

@mheon
Copy link
Member Author

mheon commented Aug 27, 2020

Re-pushed again. I think this might be it.

Our previous flow was to perform a hijack before passing a
connection into Libpod, and then Libpod would attach to the
container's attach socket and begin forwarding traffic.

A problem emerges: we write the attach header as soon as the
attach complete. As soon as we write the header, the client
assumes that all is ready, and sends a Start request. This Start
may be processed *before* we successfully finish attaching,
causing us to lose output.

The solution is to handle hijacking inside Libpod. Unfortunately,
this requires a downright extensive refactor of the Attach and
HTTP Exec StartAndAttach code. I think the result is an
improvement in some places (a lot more errors will be handled
with a proper HTTP error code, before the hijack occurs) but
other parts, like the relocation of printing container logs, are
just *bad*. Still, we need this fixed now to get CI back into
good shape...

Fixes containers#7195

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
@mheon
Copy link
Member Author

mheon commented Aug 27, 2020

Network flake in the docs job

@mheon
Copy link
Member Author

mheon commented Aug 27, 2020

@rhatdan @baude @giuseppe @TomSweeneyRedHat @QiWang19 PTAL and merge, please. Fixes the flake keeping CI red.

@baude
Copy link
Member

baude commented Aug 27, 2020

LGTM

1 similar comment
@QiWang19
Copy link
Contributor

LGTM

@edsantiago
Copy link
Member

Looks green to me. I am supremely unqualified to review the code, but the results talk to me and I'm about to take all my PRs, rebase them, and see how they go. So...

/lgtm

Thank you @mheon. This was a nasty one.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 27, 2020
@edsantiago
Copy link
Member

Oh mergebot... yoo-hoo...

@edsantiago edsantiago merged commit b13af45 into containers:master Aug 27, 2020
edsantiago added a commit to edsantiago/libpod that referenced this pull request Sep 28, 2020
 - pause test: enable when rootless + cgroups v2
   (was previously disabled for all rootless)

 - run --pull: now works with podman-remote
   (in containers#7647, thank you @jwhonce)

 - various other run/volumes tests: try reenabling
   It looks like containers#7195 was fixed (by containers#7451? I'm not
   sure if I'm reading the conversation correctly).
   Anyway, remove all the skip()s on 7195. Only time
   will tell if it's really fixed)

Also:

 - new test for podman image tree --whatrequires
   (because TIL). Doesn't work with podman-remote.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 24, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI flake: podman-remote: no output from container (and "does not exist in database"?)
5 participants