Strange resource busy issue on simple buildx using buildkit #3812
PS this bit might be important too.. we don't seem to be using containerd:
This does not matter. The mount code is in the containerd libraries that buildkit imports; actually running the containerd daemon is optional. This usually means that some process has a file open under that mount path.
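For anyone else hitting this, standard tools like fuser and lsof can show which process is holding the path busy. A minimal sketch, assuming the mount path is taken from the error message:

# Hypothetical path copied from the "resource busy" error
MOUNT_PATH=/tmp/containerd-mount123456789
# Show processes with open files on the mounted filesystem
fuser -vm "$MOUNT_PATH"
# List the open files themselves (can be slow on large trees)
lsof +D "$MOUNT_PATH"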
thanks @tonistiigi! I will investigate and let you know. It's not impossible that we have services scanning the system for security, but I will bring back more details.
hi @tonistiigi! it's pretty strange.. we don't really have anything scanning this system (at least not the inner container where buildkit is run). This is an example of lsof output, running in 1s intervals during the build command: there's only one entry of /tmp/containerd-* there. I'm not sure; files are used very briefly and I didn't catch possible culprits. But what's most interesting is that after the buildx command is run and fails, we still cannot unmount or delete the directory, and there's nothing showing open under /tmp/
please note this is a kubernetes environment, so it goes like this: do you think it's possible that outer layers are making the resource busy? I am not very sure what is being mounted and from where
thanks for the assistance! EDIT: example fuser:
one strange note that might add to the investigation.. if I try using strace -f to follow the buildkitd process and collect details as the error is triggered, I cannot reproduce the issue (I haven't been able to in a few attempts). This could perhaps indicate that this is some sort of race condition that is coincidentally resolved by strace's added delay here and there. EDIT: I was later able to capture the error.
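For reference, a capture along those lines could look roughly like this (a sketch; the syscall filter and output path are illustrative, not what was actually used):

# Attach to the running buildkitd and trace the unmount/cleanup path
strace -f -tt -e trace=umount2,unlinkat,rmdir \
  -p "$(pidof buildkitd)" -o /tmp/buildkitd-strace.log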
Maybe this is something with mount propagation. Can you make sure that
Can you also check whether there is anything inside that path, and what files those are?
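For anyone following along, mount propagation can be inspected with standard tooling, something like this (a sketch; /var/lib/buildkit is just the default state directory and may differ in your setup):

# Show the propagation mode of the relevant mount points
findmnt -o TARGET,PROPAGATION /tmp
findmnt -o TARGET,PROPAGATION /var/lib/buildkit
# Or inspect the raw mount table; "shared:" entries propagate into other namespaces
grep -E 'shared|master' /proc/self/mountinfo | head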
hi @tonistiigi! thanks for reaching back. /tmp wasn't a tmpfs filesystem, but that didn't seem to fix it (I want to say it helped, but I'm not sure). Inside the buildkit container I tried setting it to tmpfs too; there's nothing inside the path.
however, /proc/1/mountinfo does provide a bit more detail:
okay, it looks like your first hunch was correct! We are still going to test it thoroughly tomorrow, but it looks like a security scan running permanently on the worker node is causing this (looking at the mountinfo output, we could trace the /tmp/containerd-mount* dir back to the worker node). In case we do need this scanner running, would there be a way to configure buildkit to wait more gracefully for this? thanks a lot!
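For anyone else trying the same thing, a mount like that can typically be traced back on the worker node with something like the following (the pattern and the PID step are illustrative):

# On the worker node: find every process whose mount namespace still sees the directory
grep containerd-mount /proc/*/mountinfo 2>/dev/null
# For any PID that shows up, see what it actually is
ps -o pid,comm,args -p <PID>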
okay, reporting back after tests.. it looks like it was not the scanner (we turned it off and it didn't help; this scanner also apparently only acts as a sort of strace on system calls; it's called falco, btw). The tests we are performing are as follows:
Hello @tonistiigi, did you manage to reproduce the problem?
We're also experiencing this issue, also running buildkit pods on EKS running AL2. Restarting the builders helps temporarily.
@nuzayets check this workaround.
Thanks for the link. You seem to be running with the default builders on the machine running the buildx executable? So, this sidesteps the issue entirely. We run a multi-platform build natively on our cluster, not in the CI environment's runners; our builders are buildkit pods running on both x86 and ARM nodes. So, switching to using the default builder in the CI environment's runner is not a workaround for us. We have far more storage & compute on the cluster than in CI, and cross-arch builds have been problematic for us in the past, not to mention time on the CI runner is a lot more expensive than time on our cluster.
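For context, a native multi-node setup like that is typically created with the buildx kubernetes driver, roughly as follows (a sketch; the builder name, namespace, replica counts and node selectors are assumptions, not details from this thread):

# Builder backed by buildkit pods on amd64 nodes
docker buildx create --name cluster-builder --driver kubernetes \
  --driver-opt namespace=buildkit --driver-opt replicas=2 \
  --driver-opt nodeselector="kubernetes.io/arch=amd64" \
  --platform linux/amd64
# Append a second node group on arm64 workers for native multi-platform builds
docker buildx create --append --name cluster-builder --driver kubernetes \
  --driver-opt namespace=buildkit --driver-opt replicas=2 \
  --driver-opt nodeselector="kubernetes.io/arch=arm64" \
  --platform linux/arm64
docker buildx use cluster-builder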
I see. Well, I hope buildkit's devs can help you with this issue, then.
Having the same error when using the moby/buildkit:master-rootless docker image in our self-hosted CI env:
Did a few more tests last week. Seemed to be more common when building using daemonless mode, but did happen using buildkitd as well. |
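For reference, "daemonless" here means invoking buildctl through the wrapper shipped in the buildkit images rather than talking to a long-running buildkitd; a minimal sketch, with the registry name as a placeholder:

# Inside moby/buildkit:master-rootless, without a long-running buildkitd
buildctl-daemonless.sh build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=image,name=registry.example.com/app:test,push=true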
I am replacing DIND with buildkit and in the initial tests I have hit this bug, and it's easily reproducible. The buildkit container logs:
Anyone from the team able to take a look at this?
We are also hitting this issue on 9/10 builds now: EKS 1.24, GitLab Runners in k8s, v0.11.1 (b4df08551fb12eb2ed17d3d9c3977ca0a903415a). This used to be a non-issue for us until the last few weeks. What logs can we provide to help triage?
+1, we are also using the GitLab CI K8s executor on EKS 1.24. Same issue as @adawalli above.
Alright, ran a few more tests today. Was able to replicate with moby/buildkit:master-rootless. Also tried using the latest Ubuntu-flavored EKS worker node for EKS 1.27, which actually gave the worst results.
@tonistiigi What can we give you to help shine light on this issue? This is starting to break non-multiarch builds for us now. FWIW, our build steps in GitLab are fairly straightforward:
.docker-build:
  extends:
    - .docker-dind
  before_script:
    - !reference [.docker-prep, before_script]
    - docker context create kamino-context
    - >
      docker buildx create \
        --driver docker-container \
        --use \
        --name kamino \
        --bootstrap \
        kamino-context
    - docker buildx inspect
  script:
    - >
      docker buildx build \
        -f ${CONTEXT}/${DOCKERFILE} \
        --cache-from ${CI_REGISTRY_IMAGE} \
        --cache-to ${CI_REGISTRY_IMAGE} \
        -t ${CI_REGISTRY_IMAGE}:${DOCKER_TAG} \
        ${CONTEXT}
sh-4.2$ uname -a
Linux <REDACTED>nvpro 5.4.247-162.350.amzn2.x86_64 #1 SMP Tue Jun 27 22:03:59 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
FWIW, it still fails with caching "disabled", e.g.:
  script:
    - >
      docker buildx build \
        --no-cache \
        -f ${CONTEXT}/${DOCKERFILE} \
        -t ${CI_REGISTRY_IMAGE}:${DOCKER_TAG} \
        --push \
        ${CONTEXT}
A reproducer would help. Looks like this is likely related to the specific environment, or maybe to rootless.
We are not running rootless. We are running on a vanilla AWS EKS cluster with the only caveat being that we route through a direct connect. Are there no logs we can provide that would be of interest to you? |
What daemonsets are folks running here? I found that the issue has 100% disappeared for me by removing datadog agent - which definitely spies on the container mounts. I am going to continue testing to make sure this isn't built on hopium, but take a critical look at your daemonsets. |
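A quick way to take stock of what's running on every node (trivial, but it's the first thing to check):

# Agents that inspect container mounts (monitoring, security scanners) usually run as daemonsets
kubectl get daemonsets --all-namespaces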
I am using datadog also! |
So, not definitive proof yet, but I ran 10 concurrent builds twice in a row without the datadog daemonset and saw no errors.
We were failing 10/10 times before this change, but I need to try a few hundred times and on more nodes to be confident.
I have now run over 1000 jobs (20 pipelines of 50 concurrent builds) and the issue has not popped up a single time. As soon as I re-enable the datadog agent, I fail 9/10 (or more) times. I have a ticket open with DD to find out how to disable snooping on the container layers. I will update my comment here when I find something out.
The only reason I haven't raised an issue in their GitHub is that I haven't figured out exactly which integration (e.g., cri vs. containerd vs. k8s vs. docker) is causing the issue. I will do that at some point if I don't make any headway with support, though. Keep me posted if the issue appears to be resolved on your end with further testing 🙏
Gotcha. Yea, same: if you find anything out, let me know.
Well, in our scenario we didn't have datadog, but we do have a trivy security scan. However, when we were running our tests, we disabled the security scans and the problems still happened.
@sergiomacedo do you have any other daemon sets that might be monitoring container mounts? |
@adawalli Narrowed it down a bit today. Turned off "Enable Universal Service Monitoring" on the datadog agents and had no issues building; turned it back on and the error reappeared.
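If it helps anyone else, with the Datadog Helm chart that toggle appears to correspond to the service-monitoring value; a sketch, assuming the release/chart names and the datadog.serviceMonitoring.enabled key (double-check against your chart version):

# Assumed Helm value for Universal Service Monitoring; verify against your chart's values.yaml
helm upgrade datadog-agent datadog/datadog \
  --reuse-values \
  --set datadog.serviceMonitoring.enabled=false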
trivy and falco-security. However, as I said, they were disabled during the tests to narrow down the problem. I can't reproduce it anymore because we dropped buildkit altogether and just replicated the build commands on our own.
I am reproducing your results. I think it's time to take this to a datadog GitHub issue. I will begin that conversation and link back here.
Hi! We are trying to run docker buildx using the GitHub Actions action, and we sometimes encounter this "resource busy" issue when trying to unmount the /tmp/containerd-mount***** folder that is produced during the build.
We investigated the code a bit and found that the folder is actually created and mounted by code in the containerd project, but it seems to be triggered from buildkit, so we're not sure what the culprit is.
The problem comes and goes, but we were able to simulate it by running a lot of load on a node and then trying to run the build manually on it.
Here's an example of how the issue manifests:
This problem does not seem related to the RUN mount command itself, because it could be anything there. It seems rather related to the extraction that takes place beforehand.
Sometimes the build goes through if I run it again, and then I use
docker buildx prune -a -f
to remove the cache and hit the error again. This is very strange; we only started seeing this around 23/03/2023, but we couldn't pinpoint it to any version upgrade, because we tried locking all versions to an earlier date when this error did not occur, and it didn't help.
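In shell terms, that retry-and-prune cycle looks roughly like this (a sketch; the image tag is a placeholder, and the extra load was generated separately on the node):

# Rebuild in a loop; pruning between successful runs forces layer extraction to happen again,
# which is where the unmount error shows up. The loop stops as soon as a build fails.
while docker buildx build -t example/app:test . ; do
  docker buildx prune -a -f
done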
These are the versions we are using:
This is running on an AWS instance which runs as a kubernetes node.
Linux REDACTED 5.4.209-116.367.amzn2.x86_64 #1 SMP Wed Aug 31 00:09:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Buildkit is run in a container within a DIND container in a pod on that node.
Docker container:
Linux REDACTED 5.4.209-116.367.amzn2.x86_64 #1 SMP Wed Aug 31 00:09:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Buildkit image: moby/buildkit:buildx-stable-1 "buildkitd --debug"
Here are some logs showing the behavior for a single failed buildx command:
error.log
And here is the Dockerfile used: Dockerfile.txt
Please let me know if there's anything else that would help.
Thanks!