
podman build: stream dropped, unexpected failure #10154

Closed
edsantiago opened this issue Apr 27, 2021 · 19 comments · Fixed by #11067

Comments

@edsantiago
Member

Looks like an unintended consequence of #10034. I'm seeing a lot of these failures in my buildah-bud-tests-under-podman-remote testing (#9887):

# podman-remote build ....
STEP 1: FROM <some image>
STEP 2: something-something
Error: stream dropped, unexpected failure
[rc=125]

Logs, all from fedora34-beta:

Four of the five failures are in tests that expect rc=125; but one of them is in a test that expects rc=0.

This is based on master @ 476c76f as of yesterday afternoon (Apr 26)

@edsantiago added the flakes, kind/bug, and remote labels on Apr 27, 2021
@edsantiago
Member Author

Two more failures in my next rerun (plus one more flake which I'm just going to stick my head in the sand about)

@edsantiago
Member Author

Still happening

@edsantiago
Member Author

Update: although it mostly happens in the bud with copy-from with bad from flag in Dockerfile with --layers test, not always: the "third attempt" and "my next rerun" links, above, include failures in different tests.

This bug makes it impossible to merge #9887

@edsantiago
Member Author

Another one, in bud-http-Dockerfile test

@edsantiago
Member Author

edsantiago commented May 6, 2021

Here's a reproducer (assumes podman system service --timeout=0 running):

$ cat Containerfile
FROM php:7.2
COPY --from foo:bar a b
$ while :;do x=$(../bin/podman-remote build --force-rm=false -t foo . 2>&1); if [[ $x =~ stream ]]; then echo $x; break;fi;done
STEP 1: FROM php:7.2
STEP 2: COPY --from foo:bar a b
Error: stream dropped, unexpected failure

Might take 3 minutes, might take 15. The --force-rm=false seems to be necessary: without it, I couldn't get a failure in over an hour of trying.

[UPDATE: nothing useful on server end, with --log-level=debug]

DEBU[0137] Looking up image "docker.io/library/php:7.2" in local containers storage
DEBU[0137] Trying "docker.io/library/php:7.2" ...
DEBU[0137] parsed reference into "[overlay@/home/esm/.local/share/containers/storage+/run/user/14904/containers:overlay.mount_program=/usr/bin/fuse-overlayfs]@9f290ac43498b3f88a0c8bbc8c326bcfb92e9eaee31a41b76c62db5414e3b40c"
DEBU[0137] Found image "docker.io/library/php:7.2" as "docker.io/library/php:7.2" in local containers storage
DEBU[0137] exporting opaque data as blob "sha256:9f290ac43498b3f88a0c8bbc8c326bcfb92e9eaee31a41b76c62db5414e3b40c"
DEBU[0137] overlay: mount_data=,lowerdir=/home/esm/.local/share/containers/storage/overlay/l/Z25QK7JWCCCI2LGQKSQWN5DLYS:/home/esm/.local/share/containers/storage/overlay/l/3KJMQ6QRBIB4NEAVTJ2QMGHCGA:/home/esm/.local/share/containers/storage/overlay/l/WXQHE7MA3EJBL7L5Y6FTYVY4MO:/home/esm/.local/share/containers/storage/overlay/l/X3QXENVNBOW4N2QZZKQXCNRPQP:/home/esm/.local/share/containers/storage/overlay/l/WSMNCT6K47XYPITF624FUPB244:/home/esm/.local/share/containers/storage/overlay/l/R3UZYI2F25XGZ7KUP6OQ7A5W55:/home/esm/.local/share/containers/storage/overlay/l/OQU6ZA27BRJDMR3VX7KLNTNLP4:/home/esm/.local/share/containers/storage/overlay/l/NAA4ECEPNPVZG5TK4KZJ6BXV5V:/home/esm/.local/share/containers/storage/overlay/l/M4WO5JWYVVR5N7NZYHCQRA5GYO:/home/esm/.local/share/containers/storage/overlay/l/OVRAN6ZO46TRP47XNXKI52R5N4,upperdir=/home/esm/.local/share/containers/storage/overlay/6038c42d7e8a4d65cdc4e9cf85b6905d21f3bfafdd33b19f900ee17d19476a71/diff,workdir=/home/esm/.local/share/containers/storage/overlay/6038c42d7e8a4d65cdc4e9cf85b6905d21f3bfafdd33b19f900ee17d19476a71/work,volatile,context="system_u:object_r:container_file_t:s0:c173,c180"
DEBU[0137] Container ID: d571c30c093343872f0a6838f8a2fe86c90a9ac9b4f63659e2417c9c0737e43a
DEBU[0137] Parsed Step: {Env:[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin PHPIZE_DEPS=autoconf            dpkg-dev                file   g++              gcc             libc-dev                make            pkg-config              re2c PHP_INI_DIR=/usr/local/etc/php PHP_CFLAGS=-fstack-protector-strong -fpic -fpie -O2 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 PHP_CPPFLAGS=-fstack-protector-strong -fpic -fpie -O2 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 PHP_LDFLAGS=-Wl,-O1 -pie GPG_KEYS=1729F83938DA44E27BA0F4D3DBDB397470D12172 B1B44D8F021E4E2D6021E995DC9FF8D3EE5AF27F PHP_VERSION=7.2.34 PHP_URL=https://www.php.net/distributions/php-7.2.34.tar.xz PHP_ASC_URL=https://www.php.net/distributions/php-7.2.34.tar.xz.asc PHP_SHA256=409e11bc6a2c18707dfc44bc61c820ddfd81e17481470f3405ee7822d8379903] Command:copy Args:[foo:bar a b] Flags:[--from] Attrs:map[] Message:COPY --from foo:bar a b Original:COPY --from foo:bar a b}
DEBU[0137] APIHandler(d5739e1a-6860-4ff7-b67a-d35800f0810f) -- POST /v3.2.0/libpod/build?dockerfile=Containerfile&isolation=1&jobs=1&layers=1&networkmode=0&nsoptions=%5B%7B%22Name%22%3A%22user%22%2C%22Host%22%3Atrue%2C%22Path%22%3A%22%22%7D%2C%7B%22Name%22%3A%22network%22%2C%22Host%22%3Atrue%2C%22Path%22%3A%22%22%7D%5D&output=foo&outputformat=application%2Fvnd.oci.image.manifest.v1%2Bjson&pullpolicy=missing&rm=1&seccomp=%2Fusr%2Fshare%2Fcontainers%2Fseccomp.json&shmsize=67108864&t=foo END
DEBU[0137] IdleTracker 0xc000122090:idle 2m+0h/626t connection(s)
DEBU[0137] IdleTracker 0xc000122090:closed 2m+0h/626t connection(s)
DEBU[0137] IdleTracker 0xc000010048:closed 1m+0h/626t connection(s)

@edsantiago
Member Author

Update: #10295 did not, alas, fix the problem. Here is a failed CI run from my buildah-bud-under-podman-remote PR. I was also able to reproduce the problem locally on my laptop, using the reproducer above. (I added buildah rm -a &>/dev/null to the loop, because otherwise I end up with gajillions of stale containers).

Curiously... I ran the server with --log-level=info, expecting to see Client disconnect reported, but no luck. Actually, it's worse than that: the message appears, but in a run that succeeded (over a minute before the failed run). I have no idea what that could mean.

Hint: if you run with a verbose log-level, redirect server output to a logfile in shm. With output going to my rxvt, I wasn't able to reproduce the failure.

@edsantiago
Member Author

@jwhonce, @rhatdan, I know nothing at all about Go streams ... but is EOF really considered an error? Is it possible for EOF to trigger in conjunction with a successful read? And, more to the point, WDYT of:

--- a/pkg/bindings/images/build.go
+++ b/pkg/bindings/images/build.go
@@ -353,7 +353,7 @@ func Build(ctx context.Context, containerFiles []string, options entities.BuildO
                }
                if err := dec.Decode(&s); err != nil {
                        if errors.Is(err, io.EOF) {
-                               if mErr == nil && id == "" {
+                               if mErr == nil && s.Error == "" && id == "" {
                                        mErr = errors.New("stream dropped, unexpected failure")
                                }
                                break

With that, I've been unable to trigger the error on my laptop, and it even passed in CI, which has not happened in weeks. I don't know what unintended consequences that might have, though.
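
For reference, here is a simplified, self-contained sketch of the client-side decode loop under discussion. It is not the real code: the loop in pkg/bindings/images/build.go also handles context cancellation, stdout copying, and image-id parsing, and the message contents and the consume/main names below are illustrative only.

package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "io"
    "strings"
)

// consume mimics the shape of the bindings loop: each JSON message on the
// body carries either build output ("stream") or a failure ("error").
// Hitting io.EOF with neither an error nor an image id recorded is what
// gets reported as a dropped stream.
func consume(body io.Reader) (string, error) {
    dec := json.NewDecoder(body)
    var id string
    var mErr error
    for {
        var s struct {
            Stream string `json:"stream,omitempty"`
            Error  string `json:"error,omitempty"`
        }
        if err := dec.Decode(&s); err != nil {
            if errors.Is(err, io.EOF) {
                if mErr == nil && id == "" {
                    mErr = errors.New("stream dropped, unexpected failure")
                }
                break
            }
            return id, err
        }
        switch {
        case s.Error != "":
            mErr = errors.New(s.Error)
        case s.Stream != "":
            fmt.Print(s.Stream) // the real code also extracts the image id here
        }
    }
    return id, mErr
}

func main() {
    // A truncated stream: the server stopped before sending its error message.
    body := strings.NewReader(
        `{"stream":"STEP 1: FROM php:7.2\n"}` + "\n" +
            `{"stream":"STEP 2: COPY --from foo:bar a b\n"}` + "\n")
    _, err := consume(body)
    fmt.Println("Error:", err) // Error: stream dropped, unexpected failure
}

As for the question above: encoding/json's Decode returns io.EOF only when the input ends cleanly before the next value (a stream cut mid-value gives io.ErrUnexpectedEOF), so EOF never arrives together with a freshly decoded message.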

@edsantiago
Member Author

Okay... never mind. I tried the following:

diff --git a/pkg/bindings/images/build.go b/pkg/bindings/images/build.go
index f5e7c0c98..44434810d 100644
--- a/pkg/bindings/images/build.go
+++ b/pkg/bindings/images/build.go
@@ -354,6 +354,7 @@ func Build(ctx context.Context, containerFiles []string, options entities.BuildO
                if err := dec.Decode(&s); err != nil {
                        if errors.Is(err, io.EOF) {
                                if mErr == nil && id == "" {
+                                       logrus.Errorf("got here: s.Error=%q", s.Error)
                                        mErr = errors.New("stream dropped, unexpected failure")
                                }
                                break

...and ran my reproducer, which quickly failed with:

$ while :;do x=$(../bin/podman-remote build --force-rm=false -t foo . 2>&1); if [[ $x =~ stream ]]; then echo $x; beep;break;fi;buildah rm -a >/dev/null;done
STEP 1: FROM php:7.2
STEP 2: COPY --from foo:bar a b
time="2021-05-11T10:06:35-06:00" level=error msg="got here: s.Error=\"\""
Error: stream dropped, unexpected failure

Therefore: adding && s.Error == "", as I suggested in my previous comment, is a NOP. The fact that it worked then is pure coincidence. Never mind.
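
Here is a minimal, standalone demonstration (not podman code; the message content is made up) of the json.Decoder behaviour behind that logrus output: when Decode returns io.EOF it does not fill in the target, so s.Error at that point can only hold whatever an earlier, successfully decoded message put there.

package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "io"
    "strings"
)

func main() {
    // One ordinary stream message, then end of input -- no error message ever arrives.
    dec := json.NewDecoder(strings.NewReader(`{"stream":"STEP 2: COPY --from foo:bar a b\n"}`))

    var first, second struct {
        Stream string `json:"stream,omitempty"`
        Error  string `json:"error,omitempty"`
    }

    // First call decodes the only message; its Error field is empty.
    if err := dec.Decode(&first); err != nil {
        panic(err)
    }
    fmt.Printf("first message decoded, Error=%q\n", first.Error)

    // Second call finds nothing left: io.EOF, and 'second' is left untouched,
    // so checking second.Error inside the EOF branch can never change anything.
    err := dec.Decode(&second)
    fmt.Println(errors.Is(err, io.EOF), second.Error == "") // true true
}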

@edsantiago
Member Author

New record: seven failures in one CI run.

  • bud with --rm flag
  • bud-from-scratch
  • bud-from-scratch-iid
  • bud-http-Dockerfile
  • bud-http-context-with-Dockerfile
  • bud pull never
  • bud-builder-identity

@edsantiago
Member Author

Sigh. I had hoped that #10549 would fix this, but nope.

@edsantiago
Member Author

PR #9887 (buildah-bud tests under podman-remote) has been catching a large number of bugs in both podman and buildah. It would be really great if we could merge that and run it in CI instead of me running it manually, but we can't because of this flake. Is there any chance someone could take a fresh look?

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

Still happening

@edsantiago
Member Author

Oh well. I was slightly hopeful that #10916 would fix this, but nope.

@edsantiago
Member Author

Oh, this is really bad. We're now seeing this flake in regular CI (sys remote ubuntu-2104 root host).

@edsantiago
Member Author

...and, looking back through flake logs, another one also in ubuntu-2104 root.

@edsantiago
Member Author

And another one in our regular CI, this time ubuntu-2010.

@containers/podman-maintainers this is becoming a serious problem. I don't care if we can't merge #9887, but the flake is now affecting us in real-world CI.

@vrothberg
Member

I am still dizzy from my 2nd shot but bookmarked it. Will take a look at it with a fresh brain on Monday.

Thank you for the ping, @edsantiago!

@vrothberg
Member

Hit the flake also in #11049

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 28, 2021
Address a number of issues in the streaming logic in remote build, most
importantly an error in using buffered channels on the server side.

The pattern below does not guarantee that the channel is entirely read
before the context fires.

for {
	select {
		case <- bufferedChannel:
		...
		case <- ctx.Done():
		...
	}
}

Fixes: containers#10154
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
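
For context, a minimal, runnable sketch of the drain-before-return pattern the commit message is getting at; the channel name, message contents, and writer goroutine below are illustrative, and this is not necessarily the exact shape of the change in #11067.

package main

import (
    "context"
    "fmt"
)

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    msgChan := make(chan string, 8)

    // Writer: buffer a few messages, including the final error, then close the
    // channel. The context can fire while messages are still sitting in the buffer.
    go func() {
        defer close(msgChan)
        msgChan <- `{"stream":"STEP 1: FROM php:7.2"}`
        msgChan <- `{"stream":"STEP 2: COPY --from foo:bar a b"}`
        msgChan <- `{"error":"COPY failed (illustrative error text)"}`
        cancel()
    }()

    for {
        select {
        case m, ok := <-msgChan:
            if !ok {
                return
            }
            fmt.Println("got:", m)
        case <-ctx.Done():
            // Drain whatever the writer already buffered (it closes the channel
            // when done); without this, the buffered error message can be lost.
            for m := range msgChan {
                fmt.Println("drained:", m)
            }
            return
        }
    }
}
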
github-actions bot added the locked - please file new issue/PR label on Sep 21, 2023
github-actions bot locked as resolved and limited conversation to collaborators on Sep 21, 2023