Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

podman logs: missing output #18501

Closed
edsantiago opened this issue May 8, 2023 · 29 comments · Fixed by #21234
Closed

podman logs: missing output #18501

edsantiago opened this issue May 8, 2023 · 29 comments · Fixed by #21234
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@edsantiago
Copy link
Member

 [It] tail 800 lines
...
podman [options] logs --tail 800 CID
...
[FAILED] Expected <LONG STRING> to have length 800

Different CI runs on different days. root/rootless, f37/38, journald/json-file

  • fedora-37 : int podman fedora-37 root host sqlite
  • fedora-38 : int podman fedora-38 rootless host sqlite

Echoes of #14362, except these failures are on main and include the wait fix.

@edsantiago edsantiago added flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. labels May 8, 2023
@edsantiago
Copy link
Member Author

I wonder if this could be a different manifestation of the same bug? Basically, podman runs a systemd container, spins waiting for "Reached target multi-user.target", and test fails because the string never appears in podman logs. This is podman-remote, though, so it could be a different issue.

@Luap99
Copy link
Member

Luap99 commented May 8, 2023

I see some weird output formatting in the json-file case:

...
           line 424
line 425
         
           line 426
...

However I also see this in tests which pass so I am not sure if this is caused by ginkgo or the logformatter maybe?
I also counted the lines in the log and see 800 so why does the matcher say 799???

The journal case is more concerning, we are actually missing 261 lines in the log.


In any case each line also prints a \r because we start the container with -t should we drop that?

@edsantiago
Copy link
Member Author

As a general rule I like removing -t from tests; but ISTR that podman logs can behave differently with -t, and I don't know if there's a good reason for the -t in this test. (From a quick tig blame session, I don't think so, but I can't be sure). I think it's worth looking into removing -t here.

@Luap99
Copy link
Member

Luap99 commented May 8, 2023

yeah with -t we only have one stream instead of separate stdout/err streams. However for the 800 line test, which I added a long time ago, the point was to test logs --tail with a size greater than the default pagesize (4096 bytes usually) read size. #7232

@edsantiago edsantiago changed the title e2e: log tail 800 lines: flaking again, journald & json-file podman logs: missing output May 31, 2023
@edsantiago
Copy link
Member Author

Repurposing this issue as a gathering place for all "podman logs missing output" flakes.

@edsantiago
Copy link
Member Author

Hey, is anyone looking into this? It's a serious problem, hitting a lot of different tests (suggesting it's not a badly written test). I would hate for customers to start hitting this in the field.

  • debian-12 : int podman debian-12 root host sqlite
  • debian-12 : int podman debian-12 rootless host sqlite
  • debian-12 : int remote debian-12 root host sqlite [remote]
    • 05-05 15:57 in podman run container with systemd PID1
  • fedora-37 : int podman fedora-37 root host sqlite
  • fedora-37 : int remote fedora-37 root host sqlite [remote]
  • fedora-38 : int podman fedora-38 root host boltdb
  • fedora-38 : int podman fedora-38 root host sqlite
  • fedora-38 : int podman fedora-38 rootless host sqlite
  • rawhide : int podman rawhide root host sqlite
    • 05-27 09:52 in using container with container log-size: journald

@vrothberg
Copy link
Member

I am flooded with upstream issues and bugs at the moment but totally agree with you Ed.

@mheon, can we schedule a bug week soon-ish?

@vrothberg
Copy link
Member

It's bookmarked, so I will take a look when I have some cycles (no ETA).

@Luap99
Copy link
Member

Luap99 commented Jun 21, 2023

Some tests are definitely broken (missing wait or should not use run -d, i.e. podman pod logs -l)


The 800 lines thing is interesting because the output is screwed up with the \r bytes somehow: https://api.cirrus-ci.com/v1/artifact/task/6317242757939200/html/int-podman-debian-12-root-host-sqlite.log.html#t--tail-800-lines--json-file--1
But it does the same thing in tests that pass so I don't think that is the cause. And even more weird it actually shows the full 800 lines output in the log so why does ginkgo say it is only 799 lines?!!!!!!


This one is odd, looks like the linker is broken: https://api.cirrus-ci.com/v1/artifact/task/6365092854366208/html/int-podman-debian-12-rootless-host-sqlite.log.html#t--podman-remote-pod-logs-test--1
But then again other tests in this run pass so why does loading a shared lib flake with ENOENT????????

@edsantiago
Copy link
Member Author

Oh, I didn't look closely enough at that one. It's probably #17042, another scary-nasty flake. I've reclassified it.

edsantiago added a commit to edsantiago/libpod that referenced this issue Jun 21, 2023
A few tests were doing "podman run -d" + "podman logs".
This is racy. Remove the unnecessary "-d".

And, as long as we're mucking around in here:
 - remove the "-t" from the 800-lines test, so we get
   clean output without ^Ms
 - remove unnecessary "sh", "-c" from simple echo commands
 - add actual error-message checks to two places that
   were only checking exit status

Resolves one (not all) of the flakes tracked in containers#18501

Signed-off-by: Ed Santiago <santiago@redhat.com>
@edsantiago
Copy link
Member Author

Here's one from yesterday, 800-line test, journald. It's a humongous failure, expecting 800 lines, getting 551. Looking at the journal, search in-page for "line 551", we find:

Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 551
Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 552
Jun 26 16:46:47 cirrus-task-6755306202464256 systemd-journald[557]: Data hash table of /var/log/journal/eedc8417f8cb3f847c826e088dfd67ab/system.journal has a fill level at 75.0 (174763 of 233016 items, 50331648 file size, 287 bytes per hash table item), suggesting rotation.
Jun 26 16:46:47 cirrus-task-6755306202464256 systemd-journald[557]: /var/log/journal/eedc8417f8cb3f847c826e088dfd67ab/system.journal: Journal header limits reached or header out-of-date, rotating.
........
Jun 26 16:46:46 cirrus-task-6755306202464256 rsyslogd[666]: imjournal: journal files changed, reloading...  [v8.2210.0-4.fc38 try https://www.rsyslog.com/e/0 ]
Jun 26 16:46:46 cirrus-task-6755306202464256 eloquent_williamson[223648]: line 553
Jun 26 16:46:47 cirrus-task-6755306202464256 rsyslogd[666]: imjournal: journal files changed, reloading...  [v8.2210.0-4.fc38 try https://www.rsyslog.com/e/0 ]

Looks like a smoking gun to me.

@Luap99
Copy link
Member

Luap99 commented Jun 27, 2023

Yeah that is definitely suspicious and in fact we miss the logs after that line but there are also other test failures were we actually saw all lines in the log so I am not sure it is the only bug.

@edsantiago
Copy link
Member Author

The new logformatter adds direct links to journal, making it much easier to look into these. So like here's one of those 800-line failures, k8s-file (not journald), and it looks like all the lines are there (I didn't count) ... but journal shows a conmon error:

Jun 15 16:51:11 cirrus-task-6083635929939968 conmon[93450]: conmon 6a82177ee9e67f96f12c <nwarn>: Failed to open cgroups file: /sys/fs/cgroup/machine.slice/libpod-6a82177ee9e67f96f12c938169b73c16537188fa8d712e151d2f7a2505ee21aa.scope/container/memory.events
Jun 15 16:51:11 cirrus-task-6083635929939968 peaceful_bhaskara[93450]: line 488
Jun 15 16:51:11 cirrus-task-6083635929939968 conmon[93450]: conmon 6a82177ee9e67f96f12c <nwarn>: stdio_input read failed Input/output error

@Luap99
Copy link
Member

Luap99 commented Jun 27, 2023

Ok I guess the first step is to actually make ginkgo print the full matcher output so we can at least know which element is missing. Because AFAICS podman logs printed all lines we would expect.

Luap99 added a commit to Luap99/libpod that referenced this issue Jun 27, 2023
Sometimes this tests flakes but in the CI log I see all expected lines
printed but still for some reason the matcher fails.
Right now it will truncate the array so it is not possible to verify
what the matcher sees. Change this be removing the truncate limit for
this specific test only.

see containers#18501

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@edsantiago
Copy link
Member Author

Here's an interesting variation: remote f38 root. The interesting thing about it is journalctl being run, and not seeing any output (vs podman logs).

@edsantiago
Copy link
Member Author

So, it's not just journald: tests fail in json-file too: f38 rootless, rawhide rootless. We've seen json-file flakes here before, but most failures are journald. Two json flakes in one run seems unusual.

Does this help chase down the bug?

@edsantiago
Copy link
Member Author

I have a reproducer with json-file, nothing special needed, just a laptop (mine is f38):

$ nl="        <----- basically doublequote, enter, doublequote, enter
"
$ cr="^M"      <-------- control-V for quote-next-character, control-M for CR
$ while :;do cid=$(bin/podman run --log-driver json-file -dt quay.io/libpod/testimage:20221018 sh -c "echo podman;echo podman;echo podman");bin/podman wait $cid;logs=$(bin/podman logs --tail 2 $cid); test "$logs" = "podman${cr}${nl}podman${cr}" || break;bin/podman rm $cid;done
...
0
$ echo "$logs"
podman

$ echo "$logs" | cat -vET
podman^Mpodman$         <<<---------- Ooooh, isn't this interesting!
^M$
$ bin/podman logs $cid | cat -vET
podman^M$
podman^M$
podman^M$

@edsantiago
Copy link
Member Author

Another failure (it doesn't take long). More info:

$ echo "$logs"|cat -vET
^M$
podman^M$
$ bin/podman logs $cid | cat -vET
podman^M$
podman^M$
podman^M$
$ bin/podman logs --tail 2 $cid | cat -vET
^M$
podman^M$                        <<<<---- OK, at least it's consistent
$ cat -vET ~/.local/share/containers/storage/overlay-containers/$cid/userdata/ctr.log
2023-08-07T19:56:34.223758260-06:00 stdout F podman^M$
2023-08-07T19:56:34.223758260-06:00 stdout P podman$
2023-08-07T19:56:34.223816701-06:00 stdout F ^M$
2023-08-07T19:56:34.223816701-06:00 stdout F podman^M$

I doubt that this is connected in any way to the journald bug, since that one reproduces easily without podman or conmon anywhere (or even installed), but this is still a bug.

@Luap99
Copy link
Member

Luap99 commented Aug 8, 2023

Thanks for the reproducer, yes this seems to be a bug in our log reader.
Basically with --tail we read last N lines from the file which breaks when it contains partial (P) lines. If we read the file backwards we see two full (F) lines so it looks good but F basically only means append newline, the correct way would be to continue reading backwards until the next F line and we must keep all partial lines in between as well.

@edsantiago
Copy link
Member Author

Yes, precisely. Filed #19545.

edsantiago added a commit to edsantiago/libpod that referenced this issue Aug 9, 2023
...to reduce flakes.

Reason: journald makes no guarantees. Just because a systemd job
has finished, or podman has written+flushed log entries, doesn't
mean that journald will actually know about them:

   systemd/systemd#28650

Workaround: wrap some podman-logs tests inside Eventually()
so they will be retried when log == journald

This addresses, but does not close, containers#18501. That's a firehose,
with many more failures than I can possibly cross-reference.
I will leave it open, then keep monitoring missing-logs flakes
over time, and pick those off as they occur.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@edsantiago
Copy link
Member Author

Hi, this is still a big source of frustration. Any new insights?

Seen in: int+sys podman+remote debian-13+fedora-37+fedora-38 root+rootless host boltdb+sqlite

edsantiago added a commit to edsantiago/libpod that referenced this issue Jan 11, 2024
Add a wait_for_ready() to one kube-play test, to make sure
container output has made it to the journal.

Probably does not fix containers#18501, but I think it might fix its
most common presentation.

Signed-off-by: Ed Santiago <santiago@redhat.com>
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/podman that referenced this issue Feb 1, 2024
Add a wait_for_ready() to one kube-play test, to make sure
container output has made it to the journal.

Probably does not fix containers#18501, but I think it might fix its
most common presentation.

Signed-off-by: Ed Santiago <santiago@redhat.com>
mheon pushed a commit to mheon/libpod that referenced this issue Feb 1, 2024
Add a wait_for_ready() to one kube-play test, to make sure
container output has made it to the journal.

Probably does not fix containers#18501, but I think it might fix its
most common presentation.

Signed-off-by: Ed Santiago <santiago@redhat.com>
mheon pushed a commit to mheon/libpod that referenced this issue Feb 1, 2024
Add a wait_for_ready() to one kube-play test, to make sure
container output has made it to the journal.

Probably does not fix containers#18501, but I think it might fix its
most common presentation.

Signed-off-by: Ed Santiago <santiago@redhat.com>
mheon pushed a commit to mheon/libpod that referenced this issue Feb 2, 2024
Add a wait_for_ready() to one kube-play test, to make sure
container output has made it to the journal.

Probably does not fix containers#18501, but I think it might fix its
most common presentation.

Signed-off-by: Ed Santiago <santiago@redhat.com>
mheon pushed a commit to mheon/libpod that referenced this issue Feb 2, 2024
Add a wait_for_ready() to one kube-play test, to make sure
container output has made it to the journal.

Probably does not fix containers#18501, but I think it might fix its
most common presentation.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Apr 15, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants