Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: properly reload persistent snapshotter data and restart services #767

Merged
merged 4 commits into from
Jan 29, 2024

Conversation

pendo324
Copy link
Member

Issue #, if available: re-verified #412

Description of changes:

  • Redo how BuildKit/Stargz/SOCI are related to containerd using systemd's PartOf
    • this ensures that all of these services are restarted when containerd is restarted, which the lack of has caused errors in the past
  • Create some missing directories that might throw errors in cloud-init
  • Ensure that SIGTERM is used to kill the snapshotter services for now

Testing done:

  • manual testing

  • I've reviewed the guidance in CONTRIBUTING.md

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: Justin Alvarez <alvajus@amazon.com>
@pendo324 pendo324 added the bug Something isn't working label Jan 19, 2024
@pendo324 pendo324 requested a review from weikequ January 19, 2024 00:18
@pendo324 pendo324 self-assigned this Jan 19, 2024
weikequ
weikequ previously approved these changes Jan 19, 2024
Copy link
Contributor

@weikequ weikequ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redo how BuildKit/Stargz/SOCI are related to containerd using systemd's PartOf

Nice work! Thanks for the PR

@pendo324
Copy link
Member Author

Looks like the persistence fix had some unintended consequences :(. I'll keep looking into it

@pendo324
Copy link
Member Author

Interesting, seems like the e2e test is failing spuriously? I tried to repro the basic scenario on my local machine, and it only occurs when I stop my VM before stopping the container. There may be some metadata we can clean up to handle this, but for now, starting it 3 times has consistently fixed it for me.

Wondering if #759 fixes the e2e test failure.

@pendo324
Copy link
Member Author

pendo324 commented Jan 20, 2024

Although I'm not yet convinced that this is a real issue (probably just a bad test?) I did some more digging. This is also relevant for #219.

After reading through the source code of the cni plugins (specifically, looking through every instance where the state is cleaned) I found what the lingering state is when you restart your machine while a container is running.

We can add (maybe a more careful version) of this cleanup code:

sudo rm /var/lib/cni/networks/bridge/**
sudo rm /var/lib/cni/results/bridge-finch-*

the bridge dir holds information like:

$ sudo ls /var/lib/cni/networks/bridge/
10.4.0.2            last_reserved_ip.0  lock

where each file is empty and 10.4.0.2 represents the address of the container on the (default) nerdctl0 bridge network.

This workaround completely fixes the network issue, and we can add this command on boot/reboot (need to think further on a fix for lima/nerdctl in general).

However, even after fixing that, there's an other issue where the filesystem of the container will be in a weird limbo state where the files "exist" but they have no content if the system is rebooted while the container is running.

Luckily, this recent PR (containerd/containerd#9401) for containerd fixes this issue by forcing Linux to flush to disk. We just need to wait for nerdctl 2.0 to be adopted by Lima, or wait for the PR to get backported into containerd 1.7.

Signed-off-by: Justin Alvarez <alvajus@amazon.com>
@sondavidb
Copy link

sondavidb commented Jan 22, 2024

In case it helps, v0.5.0 of SOCI just released on Friday, which includes this fix. We'd still have to wait for the fix in stargz, though.

Signed-off-by: Justin Alvarez <alvajus@amazon.com>
Signed-off-by: Justin Alvarez <alvajus@amazon.com>
@pendo324
Copy link
Member Author

Seems like my workaround of using systemd services that run at boot and on shutdown worked, and this has the added benefit of fixing #219

@pendo324 pendo324 requested a review from weikequ January 23, 2024 19:59
Copy link
Contributor

@weikequ weikequ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job on the investigation! Do you think it could be useful to file this information away somewhere? Would be nice to put it in an internal kb or something.

finch.windows.yaml Show resolved Hide resolved
@pendo324
Copy link
Member Author

Overriding since there are still lingering unrun old test cases on this PR.

@pendo324 pendo324 merged commit 700cb92 into runfinch:main Jan 29, 2024
21 checks passed
mharwani pushed a commit that referenced this pull request Jan 31, 2024
🤖 I have created a release *beep* *boop*
---


## [1.1.0](v1.0.1...v1.1.0)
(2024-01-31)


### Build System or External Dependencies

* **deps:** Bump github.com/aws/aws-sdk-go-v2 from 1.24.0 to 1.24.1
([#749](#749))
([c3ae967](c3ae967))
* **deps:** Bump github.com/containerd/containerd from 1.7.10 to 1.7.11
([#742](#742))
([7d1e250](7d1e250))
* **deps:** Bump github.com/docker/docker from 24.0.7+incompatible to
25.0.1+incompatible
([#772](#772))
([b16f6ae](b16f6ae))
* **deps:** Bump github.com/lima-vm/lima from 0.19.0 to 0.20.0
([#769](#769))
([7f0c86e](7f0c86e))
* **deps:** Bump github.com/onsi/ginkgo/v2 from 2.13.2 to 2.14.0
([#758](#758))
([7d3a7c8](7d3a7c8))
* **deps:** Bump github.com/onsi/gomega from 1.30.0 to 1.31.1
([#768](#768))
([889abf8](889abf8))
* **deps:** Bump github.com/runfinch/common-tests from 0.7.11 to 0.7.12
([#761](#761))
([bb17a96](bb17a96))
* **deps:** Bump github.com/shirou/gopsutil/v3 from 3.23.11 to 3.23.12
([#744](#744))
([eb55877](eb55877))
* **deps:** Bump golang.org/x/crypto from 0.16.0 to 0.18.0
([#751](#751))
([fc434ac](fc434ac))
* **deps:** Bump golang.org/x/image from
0.0.0-20210220032944-ac19c3e999fb to 0.10.0
([#752](#752))
([9a08b45](9a08b45))
* **deps:** Bump golang.org/x/tools from 0.16.0 to 0.16.1
([#734](#734))
([efecfca](efecfca))
* **deps:** Bump golang.org/x/tools from 0.16.1 to 0.17.0
([#757](#757))
([89623da](89623da))
* **deps:** Bump submodules and dependencies
([#733](#733))
([8b2d8cd](8b2d8cd))


### Experimental

* make finch work on windows with wsl2
([#649](#649))
([31cdc41](31cdc41))


### Features

* upgrade Windows support to "feature"
([#778](#778))
([63894d1](63894d1))


### Bug Fixes

* properly reload persistent snapshotter data and restart services
([#767](#767))
([700cb92](700cb92))
* temporarily switch to our own nerdctl-full bundle with patched runc
and buildkit ([#783](#783))
([f677e2e](f677e2e))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants