-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: properly reload persistent snapshotter data and restart services #767
fix: properly reload persistent snapshotter data and restart services #767
Conversation
Signed-off-by: Justin Alvarez <alvajus@amazon.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redo how BuildKit/Stargz/SOCI are related to containerd using systemd's PartOf
Nice work! Thanks for the PR
Looks like the persistence fix had some unintended consequences :(. I'll keep looking into it |
Interesting, seems like the e2e test is failing spuriously? I tried to repro the basic scenario on my local machine, and it only occurs when I stop my VM before stopping the container. There may be some metadata we can clean up to handle this, but for now, starting it 3 times has consistently fixed it for me. Wondering if #759 fixes the e2e test failure. |
Although I'm not yet convinced that this is a real issue (probably just a bad test?) I did some more digging. This is also relevant for #219. After reading through the source code of the cni plugins (specifically, looking through every instance where the state is cleaned) I found what the lingering state is when you restart your machine while a container is running. We can add (maybe a more careful version) of this cleanup code:
the bridge dir holds information like:
where each file is empty and This workaround completely fixes the network issue, and we can add this command on boot/reboot (need to think further on a fix for lima/nerdctl in general). However, even after fixing that, there's an other issue where the filesystem of the container will be in a weird limbo state where the files "exist" but they have no content if the system is rebooted while the container is running. Luckily, this recent PR (containerd/containerd#9401) for containerd fixes this issue by forcing Linux to flush to disk. We just need to wait for nerdctl 2.0 to be adopted by Lima, or wait for the PR to get backported into containerd 1.7. |
Signed-off-by: Justin Alvarez <alvajus@amazon.com>
In case it helps, v0.5.0 of SOCI just released on Friday, which includes this fix. We'd still have to wait for the fix in stargz, though. |
Signed-off-by: Justin Alvarez <alvajus@amazon.com>
Signed-off-by: Justin Alvarez <alvajus@amazon.com>
Seems like my workaround of using systemd services that run at boot and on shutdown worked, and this has the added benefit of fixing #219 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job on the investigation! Do you think it could be useful to file this information away somewhere? Would be nice to put it in an internal kb or something.
Overriding since there are still lingering unrun old test cases on this PR. |
🤖 I have created a release *beep* *boop* --- ## [1.1.0](v1.0.1...v1.1.0) (2024-01-31) ### Build System or External Dependencies * **deps:** Bump github.com/aws/aws-sdk-go-v2 from 1.24.0 to 1.24.1 ([#749](#749)) ([c3ae967](c3ae967)) * **deps:** Bump github.com/containerd/containerd from 1.7.10 to 1.7.11 ([#742](#742)) ([7d1e250](7d1e250)) * **deps:** Bump github.com/docker/docker from 24.0.7+incompatible to 25.0.1+incompatible ([#772](#772)) ([b16f6ae](b16f6ae)) * **deps:** Bump github.com/lima-vm/lima from 0.19.0 to 0.20.0 ([#769](#769)) ([7f0c86e](7f0c86e)) * **deps:** Bump github.com/onsi/ginkgo/v2 from 2.13.2 to 2.14.0 ([#758](#758)) ([7d3a7c8](7d3a7c8)) * **deps:** Bump github.com/onsi/gomega from 1.30.0 to 1.31.1 ([#768](#768)) ([889abf8](889abf8)) * **deps:** Bump github.com/runfinch/common-tests from 0.7.11 to 0.7.12 ([#761](#761)) ([bb17a96](bb17a96)) * **deps:** Bump github.com/shirou/gopsutil/v3 from 3.23.11 to 3.23.12 ([#744](#744)) ([eb55877](eb55877)) * **deps:** Bump golang.org/x/crypto from 0.16.0 to 0.18.0 ([#751](#751)) ([fc434ac](fc434ac)) * **deps:** Bump golang.org/x/image from 0.0.0-20210220032944-ac19c3e999fb to 0.10.0 ([#752](#752)) ([9a08b45](9a08b45)) * **deps:** Bump golang.org/x/tools from 0.16.0 to 0.16.1 ([#734](#734)) ([efecfca](efecfca)) * **deps:** Bump golang.org/x/tools from 0.16.1 to 0.17.0 ([#757](#757)) ([89623da](89623da)) * **deps:** Bump submodules and dependencies ([#733](#733)) ([8b2d8cd](8b2d8cd)) ### Experimental * make finch work on windows with wsl2 ([#649](#649)) ([31cdc41](31cdc41)) ### Features * upgrade Windows support to "feature" ([#778](#778)) ([63894d1](63894d1)) ### Bug Fixes * properly reload persistent snapshotter data and restart services ([#767](#767)) ([700cb92](700cb92)) * temporarily switch to our own nerdctl-full bundle with patched runc and buildkit ([#783](#783)) ([f677e2e](f677e2e)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Issue #, if available: re-verified #412
Unfortunately, neither of them are deployed yet, so I've implemented a hacky workaround for now. After this change, an image/container can be pull/run, the VM can be restarted, and then the container can be re-started again.
Description of changes:
PartOf
SIGTERM
is used to kill the snapshotter services for nowTesting done:
manual testing
I've reviewed the guidance in CONTRIBUTING.md
License Acceptance
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.