Bump gVisor to 20230911.0 #684
Conversation
The gVisor containerd quick start guide mentions the following lines in the configuration file:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

We don't have them in our configuration file — should we?
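For reference, a minimal sketch of the runtime entries the current quick start describes, written as an append to a containerd config (the target path below is an assumption for illustration; vHive's gvisor-containerd daemon may read its config from a different location):

```sh
# Sketch of the runtime entries from the gVisor containerd quick start.
# NOTE: the config path is an assumption; adjust it to the file that the
# vHive setup scripts actually write for the gvisor-containerd daemon.
sudo tee -a /etc/containerd/config.toml > /dev/null <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
EOF
```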
Force-pushed from 83f6dc1 to 83e20f6 (compare).
Force-pushed from 83e20f6 to 8dee0b1 (compare).
Did you try running it using the quickstart guide? Also, make sure you pass gvisor; the default is set to firecracker.
@anshalshukla Yes, I did; everything works.
@CuriousGeorgiy One reviewer at a time, please. @leokondrashov, can you do this?
@CuriousGeorgiy I don't know; can you please check whether our config is up to date? Please note that vHive has two containerd daemons co-running: one for containers and one for MicroVMs (gVisor VMs in this case).
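To make the two-daemon layout concrete, here is a minimal sketch of how one might confirm both daemons are running (the stock socket path is the containerd default; the second socket path comes from the gvisor-containerd config shown further down in this PR; adjust if your setup differs):

```sh
# List all running containerd daemons.
ps -eo pid,args | grep '[c]ontainerd'

# Stock containerd (pods and containers) typically listens on /run/containerd/containerd.sock;
# the second daemon used for function sandboxes listens on /run/gvisor-containerd/containerd.sock.
sudo ctr --address /run/containerd/containerd.sock version
sudo ctr --address /run/gvisor-containerd/containerd.sock version
```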
I guess @alexandrinapanfil was following this now-outdated guide when bringing gVisor to vHive:
I will try adding the lines from the newer guide.
I tried the PR on two xl170 nodes (master + 1 worker), using the multi-node setup from the quickstart guide and adding gvisor as the runtime to the necessary scripts:
As a result, the deployment fails: Istio pods are 0/1 ready, with an error in the log. I'm not sure whether this is an issue with the update or whether it happened before with gVisor deployment in general. @CuriousGeorgiy, what initialization process did you use for testing?
@leokondrashov I used a single-node setup, i.e.,
Force-pushed from 8dee0b1 to 9eff9e1 (compare).
Please double-check with 2 nodes as well when Georgiy finishes upgrading the config.
It's a typo; I didn't mean this line.
As I mentioned, the Istio installation does not succeed, with no other errors. It fails with the same error you describe in #557.
As I mentioned, this problem is reproducible on a single node (both @anshalshukla and @leokondrashov managed to reproduce it).
@@ -5,6 +5,8 @@ state = "/run/gvisor-containerd"
   address = "/run/gvisor-containerd/containerd.sock"
 [plugins."io.containerd.runtime.v1.linux"]
   shim_debug = true
+[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
+  runtime_type = "io.containerd.runc.v1"
As non-function components should run in regular runc containers, I am not sure what these 2 lines are for. I'd try removing them.
Also, please check that only functions, and nothing else, run in the gVisor sandbox (top on the nodes would do to spot gVisor processes).
> As non-function components should run in regular runc containers, I am not sure what these 2 lines are for. I'd try removing them.

Removing these lines does not help (actually, I added these lines to try to make it work).

> Also, please check that only functions, and nothing else, run in the gVisor sandbox (top on the nodes would do to spot gVisor processes).

Not sure what you mean, but I filtered out the processes by gvisor, and only gvisor-containerd is running:
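For anyone repeating this check, a minimal sketch of such filtering (standard ps/grep, not necessarily the exact command used above):

```sh
# Show gVisor-related processes; runsc sandbox processes would appear here
# if any workload were actually running under gVisor.
ps -eo pid,comm,args | grep -E '[r]unsc|[g]visor'
```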
@CuriousGeorgiy If the Istio setup fails, this is our target for debugging. Please proceed with the steps I mentioned above.
IMHO, this should be debugged in the scope of #557, since it's obviously not related to gVisor. Since the single-node setup works, we should focus on debugging gVisor in the scope of this PR.
What particular steps?
I have analyzed the log difference between the main branch and the PR branch during the following steps.

main branch:

git clone https://github.com/vhive-serverless/vhive.git
cd vhive
mkdir -p /tmp/vhive-logs
./scripts/cloudlab/setup_node.sh gvisor > >(tee -a /tmp/vhive-logs/setup_node.stdout) 2> >(tee -a /tmp/vhive-logs/setup_node.stderr >&2)
GITHUB_VHIVE_ARGS="-dbg" ./scripts/cloudlab/start_onenode_vhive_cluster.sh gvisor > >(tee -a /tmp/vhive-logs/start_onenode_vhive_cluster.stdout) 2> >(tee -a /tmp/vhive-logs/start_onenode_vhive_cluster.stderr >&2)
source /etc/profile && pushd ./examples/deployer && go build && popd && ./examples/deployer/deployer

PR branch:

git clone https://github.com/CuriousGeorgiy/vhive.git
cd vhive
mkdir -p /tmp/vhive-logs
git checkout gh-683-upgrade-gvisor
./scripts/cloudlab/setup_node.sh gvisor > >(tee -a /tmp/vhive-logs/setup_node.stdout) 2> >(tee -a /tmp/vhive-logs/setup_node.stderr >&2)
GITHUB_VHIVE_ARGS="-dbg" ./scripts/cloudlab/start_onenode_vhive_cluster.sh gvisor > >(tee -a /tmp/vhive-logs/start_onenode_vhive_cluster.stdout) 2> >(tee -a /tmp/vhive-logs/start_onenode_vhive_cluster.stderr >&2)
source /etc/profile && pushd ./examples/deployer && go build && popd && ./examples/deployer/deployer

The setup and startup script logs are identical. The containerd logs on the main branch have additional log entries, the gvisor-containerd logs diverge at the tail, and the vHive orchestrator logs do not differ except for 'failed to start container' messages.

containerd main branch logs:

time="2023-03-10T02:14:27.031554014-07:00" level=info msg="StartContainer for \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\""
time="2023-03-10T02:14:27.148468619-07:00" level=info msg="StartContainer for \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\" returns successfully"
time="2023-03-10T02:15:28.074415426-07:00" level=info msg="StopContainer for \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\" with timeout 300 (s)"
time="2023-03-10T02:15:28.076026423-07:00" level=info msg="Stop container \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\" with signal terminated"
time="2023-03-10T02:15:58.081082392-07:00" level=info msg="StopContainer for \"3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262\" with timeout 300 (s)"
time="2023-03-10T02:15:58.081724337-07:00" level=info msg="Stop container \"3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262\" with signal terminated"
time="2023-03-10T02:15:58.115683927-07:00" level=info msg="shim disconnected" id=d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4
time="2023-03-10T02:15:58.115766496-07:00" level=warning msg="cleaning up after shim disconnected" id=d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4 namespace=k8s.io
time="2023-03-10T02:15:58.115789289-07:00" level=info msg="cleaning up dead shim"
time="2023-03-10T02:15:58.129310162-07:00" level=warning msg="cleanup warnings time=\"2023-03-10T02:15:58-07:00\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=32459 runtime=io.containerd.runc.v2\n"
time="2023-03-10T02:15:58.130505751-07:00" level=info msg="StopContainer for \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\" returns successfully"
time="2023-03-10T02:15:58.185056761-07:00" level=info msg="shim disconnected" id=3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262
time="2023-03-10T02:15:58.185153573-07:00" level=warning msg="cleaning up after shim disconnected" id=3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262 namespace=k8s.io
time="2023-03-10T02:15:58.185178769-07:00" level=info msg="cleaning up dead shim"
time="2023-03-10T02:15:58.200865016-07:00" level=warning msg="cleanup warnings time=\"2023-03-10T02:15:58-07:00\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=32489 runtime=io.containerd.runc.v2\n"
time="2023-03-10T02:15:58.201871987-07:00" level=info msg="StopContainer for \"3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262\" returns successfully"
time="2023-03-10T02:15:58.202904198-07:00" level=info msg="StopPodSandbox for \"01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e\""
time="2023-03-10T02:15:58.203032786-07:00" level=info msg="Container to stop \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\" must be in running or unknown state, current state \"CONTAINER_EXITED\""
time="2023-03-10T02:15:58.203068032-07:00" level=info msg="Container to stop \"3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262\" must be in running or unknown state, current state \"CONTAINER_EXITED\""
time="2023-03-10T02:15:58.297232468-07:00" level=info msg="shim disconnected" id=01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e
time="2023-03-10T02:15:58.297287578-07:00" level=warning msg="cleaning up after shim disconnected" id=01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e namespace=k8s.io
time="2023-03-10T02:15:58.297301377-07:00" level=info msg="cleaning up dead shim"
time="2023-03-10T02:15:58.308435103-07:00" level=warning msg="cleanup warnings time=\"2023-03-10T02:15:58-07:00\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=32528 runtime=io.containerd.runc.v2\n"
time="2023-03-10T02:15:58.517985042-07:00" level=info msg="TearDown network for sandbox \"01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e\" successfully"
time="2023-03-10T02:15:58.518050557-07:00" level=info msg="StopPodSandbox for \"01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e\" returns successfully"
time="2023-03-10T02:15:58.673958466-07:00" level=info msg="RemoveContainer for \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\""
time="2023-03-10T02:15:58.679129051-07:00" level=info msg="RemoveContainer for \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\" returns successfully"
time="2023-03-10T02:15:58.681979391-07:00" level=info msg="RemoveContainer for \"3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262\""
time="2023-03-10T02:15:58.687500857-07:00" level=info msg="RemoveContainer for \"3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262\" returns successfully"
time="2023-03-10T02:15:58.688453451-07:00" level=error msg="ContainerStatus for \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find container \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\": not found"
time="2023-03-10T02:15:58.689549044-07:00" level=error msg="ContainerStatus for \"3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find container \"3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262\": not found"
time="2023-03-10T02:15:58.690369091-07:00" level=error msg="ContainerStatus for \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find container \"d1f8136eac10d13a0c78476980cccaeede748072885b02828685983afa61cae4\": not found"
time="2023-03-10T02:15:58.691214088-07:00" level=error msg="ContainerStatus for \"3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find container \"3e766a54c6365c33c6317425a6321dc5f7d13f45c54b8dd9c9dfbf41b1a93262\": not found"
time="2023-03-10T02:16:38.381544370-07:00" level=info msg="StopPodSandbox for \"01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e\""
time="2023-03-10T02:16:38.394076537-07:00" level=info msg="TearDown network for sandbox \"01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e\" successfully"
time="2023-03-10T02:16:38.394129265-07:00" level=info msg="StopPodSandbox for \"01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e\" returns successfully"
time="2023-03-10T02:16:38.395241445-07:00" level=info msg="RemovePodSandbox for \"01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e\""
time="2023-03-10T02:16:38.395330196-07:00" level=info msg="Forcibly stopping sandbox \"01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e\""
time="2023-03-10T02:16:38.442629007-07:00" level=info msg="TearDown network for sandbox \"01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e\" successfully"
time="2023-03-10T02:16:38.447710333-07:00" level=info msg="RemovePodSandbox \"01c5c2568835c2a877993150ebdec21dd8920f36d5a694c12c7dffc2ceeaa23e\" returns successfully" vHive orchestrator gVisor PR branch logs:time="2023-03-20T03:43:39.094412638-06:00" level=error msg="failed to start container" error="context deadline exceeded" gVisor main branch logs:time="2023-03-10T02:03:37.628278567-07:00" level=debug msg="registering ttrpc server"
time="2023-03-10T02:03:37.628379841-07:00" level=debug msg="serving api on socket" socket="[inherited from parent]"
time="2023-03-10T02:03:37.628409199-07:00" level=info msg="starting signal loop" namespace=default path=/run/gvisor-containerd/io.containerd.runtime.v2.task/default/1 pid=31710
time="2023-03-10T02:03:37.631737846-07:00" level=debug msg="Executing: [runsc --root=/run/containerd/runsc/default --log=/run/gvisor-containerd/io.containerd.runtime.v2.task/default/1/log.json --log-format=json create --bundle /run/gvisor-containerd/io.containerd.runtime.v2.task/default/1 --pid-file /run/gvisor-containerd/io.containerd.runtime.v2.task/default/1/init.pid 1]" gVisor PR branch logs:time="2023-03-10T02:14:24.306573191-07:00" level=info msg="starting signal loop" namespace=default path=/run/gvisor-containerd/io.containerd.runtime.v2.task/default/1 pid=31401
time="2023-03-10T02:14:24.309960086-07:00" level=error msg="Can't get pod uid" error="no sandbox log path annotation" |
Force-pushed from 46256e2 to 11d62f1 (compare).
scripts/install_stock.sh (outdated)
@@ -37,7 +37,7 @@ wget --continue --quiet https://github.com/opencontainers/runc/releases/download
 mv runc.amd64 runc
 sudo install -D -m0755 runc /usr/local/sbin/runc

-wget --continue --quiet https://storage.googleapis.com/gvisor/releases/release/20210622/x86_64/runsc
+wget --continue --quiet https://storage.googleapis.com/gvisor/releases/release/20221026/x86_64/runsc
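For context, the upstream gVisor install instructions fetch a pinned release in roughly this way (a sketch; the checksum verification step is not part of install_stock.sh and the release string here only mirrors the one in the diff above):

```sh
# Download a pinned runsc release plus the containerd shim and verify checksums.
ARCH=$(uname -m)
URL=https://storage.googleapis.com/gvisor/releases/release/20221026/${ARCH}
wget --continue --quiet ${URL}/runsc ${URL}/runsc.sha512 \
  ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
sha512sum -c runsc.sha512 -c containerd-shim-runsc-v1.sha512
chmod a+rx runsc containerd-shim-runsc-v1
sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin
```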
@CuriousGeorgiy why didn't you take a newer version of gVisor from 2023?
Originally, I took the latest version — it didn't work. So I tried using older versions, but it didn't help.
I've revisited the comments here and found that the issue that was blocking the PR closely resembles the issue solved by #785. I think that might be the issue here as well, since the problem was not caused by a specific runtime but rather by the network bridges vHive has created.
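If stale networking is suspected, a quick and purely illustrative way to look for leftover bridges on a node (bridge names depend on the CNI and on what vHive created):

```sh
# List bridge interfaces and their state, then the routes attached to them.
ip -br link show type bridge
ip route show
```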
Force-pushed from 3ee09d6 to 3a53554 (compare).
@leokondrashov Rebased the branch; let's see if the gVisor CRI workflow passes.
Force-pushed from 3a53554 to 6710a77 (compare).
@ustiugov This PR does not affect Firecracker, so I guess we can ignore the Firecracker CRI workflow failure.
Does our CI test the multi-node setup? I don't think so, so it should be checked manually in addition to the tests.
@leokondrashov I tried doing a multi-node setup in CloudLab, but the Istio setup fails for me, as always...

- Processing resources for Istio core.
✔ Istio core installed
- Processing resources for Istiod.
- Processing resources for Istiod. Waiting for Deployment/istio-system/istiod
✔ Istiod installed
- Processing resources for Ingress gateways.
- Processing resources for Ingress gateways. Waiting for Deployment/istio-system/cluster-local-gateway, Deployment/istio-system/istio-ingressgateway
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
Deployment/istio-system/cluster-local-gateway (containers with unready status: [istio-proxy])
Deployment/istio-system/istio-ingressgateway (containers with unready status: [istio-proxy])
- Pruning removed resources
Error: failed to install manifests: errors occurred during operation

So I guess it's up to you what to do with this PR.
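In case someone picks this up later, a sketch of the usual way to dig into the unready istio-proxy containers (standard kubectl commands; the deployment names come from the output above):

```sh
# Inspect the gateway pods that never become ready.
kubectl -n istio-system get pods -o wide
kubectl -n istio-system describe deployment cluster-local-gateway
kubectl -n istio-system describe deployment istio-ingressgateway
# Check the istio-proxy container logs for readiness-probe or xDS errors.
kubectl -n istio-system logs deployment/istio-ingressgateway -c istio-proxy --tail=100
```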
* Bump gVisor containerd shim and runsc to 20230911.0 release.
* Bump gvisor-containerd version to 1.6.2.

Closes vhive-serverless#683

Signed-off-by: Georgiy Lebedev <lebedev.gk@phystech.edu>
Force-pushed from 6710a77 to ebae907 (compare).
I agree with Georgiy; we can merge.
If there are going to be issues with the multi-node setup, please raise an issue. The Istio failure should not be related to gVisor...
Summary
Closes #683
Implementation Notes ⚒️
External Dependencies 🍀
Breaking API Changes ⚠️