machine: Volume ops test: statfs /private/tmp/ci/ginkgoNNN: no such file or directory #22569

Closed
edsantiago opened this issue May 1, 2024 · 16 comments · Fixed by #23118
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. machine macos MacOS (OSX) related

Comments

@edsantiago
Member

Not quite as frequent or annoying as #22551, but still causing wasted runs:

run basic podman commands
  Volume ops
....
  Trying to pull quay.io/libpod/alpine_nginx:latest...
  ...
  Writing manifest to image destination
  WARNING: image platform (linux/amd64) does not match the expected platform (linux/arm64)
  Error: statfs /private/tmp/ci/ginkgo630638030: no such file or directory
x x x x x x
machine-mac(5) podman(5) darwin(5) rootless(5) host(5) sqlite(5)
@edsantiago edsantiago added flakes Flakes from Continuous Integration macos MacOS (OSX) related machine labels May 1, 2024
@edsantiago
Member Author

@cevich is there any chance whatsoever that https://github.com/containers/podman/blob/c9644ebccf14309a77769cba00833cd139509e4a/contrib/cirrus/mac_cleanup.sh is getting invoked in the middle of a running CI job? I just can't understand this bug and am grasping at straws.

@Luap99
Member

Luap99 commented May 3, 2024

Unlikely; there is an extra level of indirection here, given that the dir is mounted in the machine VM. So maybe the machine mount failed silently?

@cevich
Member

cevich commented May 3, 2024

is there any chance whatsoever that

What Paul said. And the Macs are single-task/single-user. Is it possible the running VM really is x86_64 via some emulation, and/or is the pull command specifying --arch or --platform (just double-checking)?

Sorry, no hack/get_ci_vm.sh support here; that's just way too complex to do safely in this environment. But in case it helps and is supported (I never checked), the re-run in terminal may be an option (with cleanup temporarily disabled).

Otherwise, there is a way to isolate (for a few hours) one of the Macs and dedicate it to servicing a single PR. In that PR, the end-of-task cleanup could be disabled, so that a human may ssh in and check out the state of things. This is all manual, and a bit of a chore to pull off, but it's technically possible.

@Luap99
Member

Luap99 commented May 15, 2024

@edsantiago Not sure if you are testing machine in your no-flake-retry testing PR, but if you are, could you give this a go:

diff --git a/pkg/machine/apple/apple.go b/pkg/machine/apple/apple.go
index 93201407e..04db7638b 100644
--- a/pkg/machine/apple/apple.go
+++ b/pkg/machine/apple/apple.go
@@ -124,7 +124,7 @@ func GenerateSystemDFilesForVirtiofsMounts(mounts []machine.VirtIoFs) ([]ignitio
        mountPrep.Add("Service", "Type", "oneshot")
        mountPrep.Add("Service", "ExecStartPre", "chattr -i /")
        mountPrep.Add("Service", "ExecStart", "mkdir -p '%f'")
-       mountPrep.Add("Service", "ExecStopPost", "chattr +i /")
+       // mountPrep.Add("Service", "ExecStopPost", "chattr +i /")
 
        mountPrep.Add("Install", "WantedBy", "remote-fs.target")
        mountPrepFile, err := mountPrep.ToString()

@edsantiago
Member Author

Oops, no, I long ago disabled machine tests in #17831. I will look into reenabling this one.

FWIW here's the current flake list. I don't think there's any useful info in this list, i.e., I haven't seen any logs that look different or provide interesting new data, but am posting anyway.

x x x x x x
machine-mac(7) podman(7) darwin(7) rootless(7) host(7) sqlite(7)

@Luap99
Member

Luap99 commented May 15, 2024

The alternative is that I instrument the tests to do some checks. Basically, it would have to ssh into the machine VM and run systemctl status on all the mount units. I think the race here is the most likely cause.

One interesting data point would be the new machine init with volume test. If that one never fails, then I am sure this is a race caused by chattr -i and chattr +i running in parallel in different units. The reason: that test mounts only one path, so there cannot be a race, whereas the default volumes cover several paths, hence the chance for a race.


A friendly reminder that this issue had no activity for 30 days.

@Luap99
Member

Luap99 commented Jun 15, 2024

@edsantiago Any conclusions?

@edsantiago
Member Author

No. This isn't something I can look into, and our PR merge rate is too low these days; not many flakes to report.

@edsantiago
Member Author

However, I just ran my afternoon flake catchup, and here's a new one

@Luap99
Member

Luap99 commented Jun 19, 2024

I was mostly interested if you ever saw it in the no flake retry PR (edsantiago@28882ca)

However as I have a mac now I can try to reproduce locally and see where I go from there.

@edsantiago
Member Author

Still active

@Luap99
Member

Luap99 commented Jun 26, 2024

So I have been running this script all day without luck. So either my script is wrong or I was not able to reproduce.

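#!/bin/bash

set -e

# Loop until a failed unit shows up in the machine VM.
while :; do
    # Create 20 host directories to mount into the VM.
    dirs=()
    for i in {1..20}; do
        dir="$TMPDIR$i"
        dirs+=("$dir")
        mkdir -p "$dir"
    done

    # Build one --volume argument pair per directory.
    args=()
    for dir in "${dirs[@]}"; do
        args+=("--volume" "$dir:$dir")
    done

    podman machine init --now "${args[@]}"
    # With set -e, a missing directory or mount aborts the script here.
    podman machine ssh ls "${dirs[@]}"
    podman machine ssh mount | grep "$TMPDIR"
    # Stop and keep the machine around if any unit failed.
    podman machine ssh systemctl list-units --failed | grep fail && break

    podman machine rm -f
done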

I did manage to hit a quay.io flake though

Error: reading manifest sha256:a7775864b05f6402c7ca071446f8a50ce94e456e85c3dbe8d94b3a8bf2a2c81d in quay.io/podman/machine-os: authentication required

So my best bet is to try to instrument the CI tests to give out more logs when it happens.

@edsantiago
Member Author

I still suspect this is related to the weird Macintosh CI setup, and as such will only trigger in CI. But I have no evidence to back that up.

@Luap99
Member

Luap99 commented Jun 27, 2024

Well, my PR triggered it on the first try, so I can tell you now that the bug is in the VM:
https://api.cirrus-ci.com/v1/artifact/task/4925116956540928/html/machine-mac-podman-darwin-rootless-host-sqlite.log.html#t--run-basic-podman-commands-Volume-ops--1

@Luap99 Luap99 self-assigned this Jun 27, 2024
@Luap99
Member

Luap99 commented Jun 27, 2024

#23118 should contain a proper fix now

mheon pushed a commit to mheon/libpod that referenced this issue Jul 10, 2024
One problem on FCOS is that the root directory is immutable. As such, in
order to mount arbitrary paths from the host, we must make it mutable
again and create these directories on boot to be able to mount there.

The current logic was racy as it used one unit for each path and they
all did chattr -i /; mkdir -p $path; chattr +i / and systemd can run
these units in parallel. That means it was possible for another unit to
make / immutable before the unit could do the mkdir. I pointed this out
on the original PR[1] but we never followed up on it...

Now this change does several things. First, there is one unit that does
the chattr -i / (immutable-root-off.service). It is hooked into
remote-fs-pre.target, which means it is executed before the network
mounts (virtiofs) are done.

Then we have another unit that does chattr +i /
(immutable-root-on.service), which turns the immutable root back on after
remote-fs.target, which means all mounts are done at this point.

Additionally, the automount unit is removed because it does not add any
value for us and it was broken anyway, as it used the virtiofs tag as
path so systemd just ignored it.

[1] containers#20612 (comment)

Fixes containers#22569

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 26, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Sep 26, 2024

3 participants