fix(remount): relocate libraries along with their symlinks #255

maxbrunet · 2024-06-28T22:37:57Z

This PR adds:

Look up the library directory; its path differs on Debian-based distros (the initial envbuilder image is not Debian-based, but the final image may be)
Find the symlinks pointing to mounts in the library directory
Temporarily move the library symlinks pointing to mounts to the magic directory while relocating the mounts
Look up the new library directory (in case it has changed)
Move back the library symlinks and mounts to the new library directory
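
For readers who want a feel for the first two steps, here is a minimal sketch, not the code from this PR. It assumes Debian-based images keep libraries under /usr/lib/<multiarch triple> and other images under /usr/lib64, and that the set of mounted library paths is already known. The helper names (libDirFor, resolveLink, symlinksTo) and the libexample.so.1.0 file name are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// libDirFor returns the conventional shared-library directory.
// Assumption: Debian-based images use /usr/lib/<multiarch>, others /usr/lib64.
func libDirFor(debian bool, multiarch string) string {
	if debian {
		return "/usr/lib/" + multiarch
	}
	return "/usr/lib64"
}

// resolveLink turns a symlink destination into a cleaned absolute path,
// interpreting relative destinations against the directory holding the link.
func resolveLink(dir, dest string) string {
	if !filepath.IsAbs(dest) {
		dest = filepath.Join(dir, dest)
	}
	return filepath.Clean(dest)
}

// symlinksTo lists the symlinks directly inside dir whose destination
// is one of the given mount paths.
func symlinksTo(dir string, mounts map[string]bool) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var links []string
	for _, e := range entries {
		if e.Type()&os.ModeSymlink == 0 {
			continue // not a symlink
		}
		p := filepath.Join(dir, e.Name())
		dest, err := os.Readlink(p)
		if err != nil {
			return nil, err
		}
		if mounts[resolveLink(dir, dest)] {
			links = append(links, p)
		}
	}
	return links, nil
}

func main() {
	// Assumption: the presence of /etc/debian_version marks a Debian-based image.
	_, statErr := os.Stat("/etc/debian_version")
	dir := libDirFor(statErr == nil, "x86_64-linux-gnu")

	// Hypothetical mounted library; a real list would come from the mount table.
	mounts := map[string]bool{filepath.Join(dir, "libexample.so.1.0"): true}

	links, err := symlinksTo(dir, mounts)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("symlinks pointing to mounts:", links)
}
```

The lookup is run twice in the flow above (before and after the image is built) because the result can change once the final image's filesystem is in place.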

After that the container should behave like a regular container created by the NVIDIA container runtime. Of course/unfortunately, the process of mounting/unmounting requires GPU containers to run with privileges:

Appropriate privilege (Linux: the CAP_SYS_ADMIN capability) is required to mount/umount filesystems.

https://www.man7.org/linux/man-pages/man2/mount.2.html
https://www.man7.org/linux/man-pages/man2/umount.2.html
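
To check up front whether a container has the capability the man pages refer to, something like the following sketch could be used. It is not part of this PR. It reads the effective capability mask (the CapEff line) from /proc/self/status and tests bit 21, which is CAP_SYS_ADMIN on Linux. The helper names hasCap and effectiveCaps are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capSysAdmin is the bit index of CAP_SYS_ADMIN in the Linux capability sets.
const capSysAdmin = 21

// hasCap reports whether the given bit is set in a hex capability mask
// such as the value of the CapEff line in /proc/<pid>/status.
func hasCap(capEffHex string, bit uint) (bool, error) {
	mask, err := strconv.ParseUint(strings.TrimSpace(capEffHex), 16, 64)
	if err != nil {
		return false, err
	}
	return mask&(1<<bit) != 0, nil
}

// effectiveCaps extracts the CapEff value from a /proc status file.
func effectiveCaps(statusPath string) (string, error) {
	f, err := os.Open(statusPath)
	if err != nil {
		return "", err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "CapEff:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "CapEff:")), nil
		}
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("CapEff not found in %s", statusPath)
}

func main() {
	caps, err := effectiveCaps("/proc/self/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	ok, err := hasCap(caps, capSysAdmin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("CAP_SYS_ADMIN present:", ok)
}
```

Having the capability is necessary but may not be sufficient: seccomp or AppArmor profiles can still block mount and umount calls.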

The logic is not generalized to arbitrary symlinks or directories; for now it only aims to provide compatibility with the NVIDIA container runtime.

More context can be found in this comment: #143 (comment)

Tested with the following images:

docker.io/library/debian:bookworm
docker.io/library/fedora:40
nvcr.io/nvidia/pytorch:24.05-py3

Closes #143

internal/ebutil/libs.go

internal/ebutil/remount.go

internal/ebutil/libs_ppc64le.go

johnstcn

Thank you for this contribution!

I validated this fix on a Fedora 40 system with the NVIDIA runtime etc. installed (so no /usr/lib/x86_64-linux-gnu), using both Docker (v27.0.2) and K3s (v1.29.6).

(Edit: for posterity, also verified working on an AL2 EKS cluster.)

In both cases, I was able to successfully build the image nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2, run nvidia-smi and /tmp/vectorAdd.

The only things I would like to see changed are:

More commenting for future readers/code-spelunkers
Remove references to PPC arch; I don't know if we will ever support this.

There may also be a similar workaround needed for AMD/Vulkan cards, but this can be tested separately.

mtojek

👍 👍

(cherry picked from commit 46a78fb)

adrianmlops · 2025-04-10T13:50:43Z

After that the container should behave like a regular container created by the NVIDIA container runtime. Of course/unfortunately, the process of mounting/unmounting requires GPU containers to run with privileges:

Is there any way to get this working without requiring privileged mode?

I’ve been experimenting with the envbuilder build process using NVIDIA GPU images on EKS, and so far, the only reliable way I’ve found to make the build succeed is by setting the container as privileged. Unfortunately, this introduces a significant issue: the container ends up seeing all GPUs on the host, even when only a single GPU is requested. This breaks GPU isolation and becomes a real problem in multi-user environments where proper resource separation is critical.

Here’s the security_context I’m currently using:

security_context {
  run_as_user = 0
  privileged  = true
}

Setting privileged = true bypasses Kubernetes’ standard GPU isolation via the device plugin, but I haven’t yet found a reliable alternative that allows the build to succeed.

Has anyone found a way to:

Avoid privileged mode while still building successfully with GPU images?
Or isolate GPU access even when privileged is required?

Any insights would be greatly appreciated!

fix(remount): relocate libraries along with their symlinks

773d748

johnstcn requested review from johnstcn and mtojek July 1, 2024 11:16

johnstcn assigned maxbrunet Jul 1, 2024

mtojek reviewed Jul 1, 2024

johnstcn reviewed Jul 1, 2024

internal/ebutil/libs.go

johnstcn approved these changes Jul 1, 2024

Address review comments

07d0f1c

maxbrunet force-pushed the fix/remount/relocate-libs branch from 0997cbd to 07d0f1c Compare July 1, 2024 16:40

mtojek approved these changes Jul 1, 2024

johnstcn merged commit 46a78fb into coder:main Jul 2, 2024
4 checks passed

maxbrunet deleted the fix/remount/relocate-libs branch July 2, 2024 18:30

johnstcn pushed a commit that referenced this pull request Jul 5, 2024

fix(remount): relocate libraries along with their symlinks (#255)

fc11458

(cherry picked from commit 46a78fb)
