fix: properly reload persistent snapshotter data and restart services #767

pendo324 · 2024-01-19T00:18:42Z

Issue #, if available: re-verified #412

Through extensive e2e test debugging, I noticed that soci and stargz snapshotters weren't persisting data as expected. After debugging, I found some context in these two PRs:
- Keep directories when SIGINT sent to daemon awslabs/soci-snapshotter#881
- Keep directories when SIGINT sent to daemon containerd/stargz-snapshotter#1526
  Unfortunately, neither of them are deployed yet, so I've implemented a hacky workaround for now. After this change, an image/container can be pull/run, the VM can be restarted, and then the container can be re-started again.

Description of changes:

Redo how BuildKit/Stargz/SOCI are related to containerd using systemd's PartOf
- this ensures that all of these services are restarted when containerd is restarted, which the lack of has caused errors in the past
Create some missing directories that might throw errors in cloud-init
Ensure that SIGTERM is used to kill the snapshotter services for now

Testing done:

manual testing
I've reviewed the guidance in CONTRIBUTING.md

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: Justin Alvarez <alvajus@amazon.com>

weikequ

Redo how BuildKit/Stargz/SOCI are related to containerd using systemd's PartOf

Nice work! Thanks for the PR

pendo324 · 2024-01-19T01:17:40Z

Looks like the persistence fix had some unintended consequences :(. I'll keep looking into it

pendo324 · 2024-01-20T00:38:55Z

Interesting, seems like the e2e test is failing spuriously? I tried to repro the basic scenario on my local machine, and it only occurs when I stop my VM before stopping the container. There may be some metadata we can clean up to handle this, but for now, starting it 3 times has consistently fixed it for me.

Wondering if #759 fixes the e2e test failure.

pendo324 · 2024-01-20T02:04:14Z

Although I'm not yet convinced that this is a real issue (probably just a bad test?) I did some more digging. This is also relevant for #219.

After reading through the source code of the cni plugins (specifically, looking through every instance where the state is cleaned) I found what the lingering state is when you restart your machine while a container is running.

We can add (maybe a more careful version) of this cleanup code:

sudo rm /var/lib/cni/networks/bridge/**
sudo rm /var/lib/cni/results/bridge-finch-*

the bridge dir holds information like:

$ sudo ls /var/lib/cni/networks/bridge/
10.4.0.2            last_reserved_ip.0  lock

where each file is empty and 10.4.0.2 represents the address of the container on the (default) nerdctl0 bridge network.

This workaround completely fixes the network issue, and we can add this command on boot/reboot (need to think further on a fix for lima/nerdctl in general).

However, even after fixing that, there's an other issue where the filesystem of the container will be in a weird limbo state where the files "exist" but they have no content if the system is rebooted while the container is running.

Luckily, this recent PR (containerd/containerd#9401) for containerd fixes this issue by forcing Linux to flush to disk. We just need to wait for nerdctl 2.0 to be adopted by Lima, or wait for the PR to get backported into containerd 1.7.

Signed-off-by: Justin Alvarez <alvajus@amazon.com>

sondavidb · 2024-01-22T18:40:52Z

In case it helps, v0.5.0 of SOCI just released on Friday, which includes this fix. We'd still have to wait for the fix in stargz, though.

Signed-off-by: Justin Alvarez <alvajus@amazon.com>

pendo324 · 2024-01-23T01:01:12Z

Seems like my workaround of using systemd services that run at boot and on shutdown worked, and this has the added benefit of fixing #219

weikequ

Nice job on the investigation! Do you think it could be useful to file this information away somewhere? Would be nice to put it in an internal kb or something.

finch.windows.yaml

pendo324 · 2024-01-29T16:43:08Z

Overriding since there are still lingering unrun old test cases on this PR.

🤖 I have created a release *beep* *boop* --- ## [1.1.0](v1.0.1...v1.1.0) (2024-01-31) ### Build System or External Dependencies * **deps:** Bump github.com/aws/aws-sdk-go-v2 from 1.24.0 to 1.24.1 ([#749](#749)) ([c3ae967](c3ae967)) * **deps:** Bump github.com/containerd/containerd from 1.7.10 to 1.7.11 ([#742](#742)) ([7d1e250](7d1e250)) * **deps:** Bump github.com/docker/docker from 24.0.7+incompatible to 25.0.1+incompatible ([#772](#772)) ([b16f6ae](b16f6ae)) * **deps:** Bump github.com/lima-vm/lima from 0.19.0 to 0.20.0 ([#769](#769)) ([7f0c86e](7f0c86e)) * **deps:** Bump github.com/onsi/ginkgo/v2 from 2.13.2 to 2.14.0 ([#758](#758)) ([7d3a7c8](7d3a7c8)) * **deps:** Bump github.com/onsi/gomega from 1.30.0 to 1.31.1 ([#768](#768)) ([889abf8](889abf8)) * **deps:** Bump github.com/runfinch/common-tests from 0.7.11 to 0.7.12 ([#761](#761)) ([bb17a96](bb17a96)) * **deps:** Bump github.com/shirou/gopsutil/v3 from 3.23.11 to 3.23.12 ([#744](#744)) ([eb55877](eb55877)) * **deps:** Bump golang.org/x/crypto from 0.16.0 to 0.18.0 ([#751](#751)) ([fc434ac](fc434ac)) * **deps:** Bump golang.org/x/image from 0.0.0-20210220032944-ac19c3e999fb to 0.10.0 ([#752](#752)) ([9a08b45](9a08b45)) * **deps:** Bump golang.org/x/tools from 0.16.0 to 0.16.1 ([#734](#734)) ([efecfca](efecfca)) * **deps:** Bump golang.org/x/tools from 0.16.1 to 0.17.0 ([#757](#757)) ([89623da](89623da)) * **deps:** Bump submodules and dependencies ([#733](#733)) ([8b2d8cd](8b2d8cd)) ### Experimental * make finch work on windows with wsl2 ([#649](#649)) ([31cdc41](31cdc41)) ### Features * upgrade Windows support to "feature" ([#778](#778)) ([63894d1](63894d1)) ### Bug Fixes * properly reload persistent snapshotter data and restart services ([#767](#767)) ([700cb92](700cb92)) * temporarily switch to our own nerdctl-full bundle with patched runc and buildkit ([#783](#783)) ([f677e2e](f677e2e)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

fix: properly reload persistent snapshotter data and restart services

a5e674c

Signed-off-by: Justin Alvarez <alvajus@amazon.com>

pendo324 added the bug Something isn't working label Jan 19, 2024

pendo324 requested a review from weikequ January 19, 2024 00:18

pendo324 self-assigned this Jan 19, 2024

weikequ previously approved these changes Jan 19, 2024

View reviewed changes

Merge branch 'main' into fix-services-and-persistence

fc8c7d0

Signed-off-by: Justin Alvarez <alvajus@amazon.com>

cleanup on boot and shutdown

4b1b1e6

Signed-off-by: Justin Alvarez <alvajus@amazon.com>

pendo324 dismissed weikequ’s stale review via 4b1b1e6 January 23, 2024 00:34

also add services to windows yaml

f148a9a

Signed-off-by: Justin Alvarez <alvajus@amazon.com>

pendo324 requested a review from weikequ January 23, 2024 19:59

weikequ approved these changes Jan 23, 2024

View reviewed changes

finch.windows.yaml Show resolved Hide resolved

pendo324 merged commit 700cb92 into runfinch:main Jan 29, 2024
21 checks passed

github-actions bot mentioned this pull request Jan 29, 2024

chore(main): release 1.1.0 #748

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: properly reload persistent snapshotter data and restart services #767

fix: properly reload persistent snapshotter data and restart services #767

pendo324 commented Jan 19, 2024

weikequ left a comment

pendo324 commented Jan 19, 2024

pendo324 commented Jan 20, 2024

pendo324 commented Jan 20, 2024 •

edited

Loading

sondavidb commented Jan 22, 2024 •

edited

Loading

pendo324 commented Jan 23, 2024

weikequ left a comment

pendo324 commented Jan 29, 2024

fix: properly reload persistent snapshotter data and restart services #767

fix: properly reload persistent snapshotter data and restart services #767

Conversation

pendo324 commented Jan 19, 2024

License Acceptance

weikequ left a comment

Choose a reason for hiding this comment

pendo324 commented Jan 19, 2024

pendo324 commented Jan 20, 2024

pendo324 commented Jan 20, 2024 • edited Loading

sondavidb commented Jan 22, 2024 • edited Loading

pendo324 commented Jan 23, 2024

weikequ left a comment

Choose a reason for hiding this comment

pendo324 commented Jan 29, 2024

pendo324 commented Jan 20, 2024 •

edited

Loading

sondavidb commented Jan 22, 2024 •

edited

Loading