
cuda-dev #158

Closed
wants to merge 8 commits into from

Conversation

woensug-choi
Contributor

The rocker cuda PR does not include the executables of the CUDA libraries, because that PR only copied the apt-get installs from Nvidia cudagl's base Dockerfile. For our purposes, all of those executables are required to compile the .cu source code of the multibeam sonar plugin.

I believe the --cuda flag without the CUDA executables is still useful for cases that do not require compiling .cu source files, so I've made a new --cuda-dev flag that includes everything required.

I've tested the functionality with the Dave project's Multibeam Sonar Plugin.
https://github.com/Field-Robotics-Lab/DAVE
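
For context, a typical invocation of the new flag would presumably look something like this (a hedged sketch assuming rocker's usual form of rocker [extensions] image [command]; the image and trailing command here are only placeholders):

# hypothetical usage of the flag added in this PR
rocker --cuda-dev ubuntu:20.04 bash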

@woensug-choi
Contributor Author

@tfoote I hope to see this merged, or to receive comments if you think it's not ready :)

Collaborator

@tfoote tfoote left a comment

Overall this looks like a good direction.

For this to go in it definitely needs some unit tests, and I'd like a better story about how the different layers of base, runtime, and development can be used together, i.e. get this working with #126. The lack of tests and validation of that cooperation is why I haven't pushed #126 further forward.

It should also not collide or conflict with the generic --nvidia option.


# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
Collaborator

These options here need to mutate/extend the driver capability, not just overwrite it; e.g. it could be used for graphics too.

Contributor Author

Changed to

ENV NVIDIA_VISIBLE_DEVICES ${NVIDIA_VISIBLE_DEVICES:-all}
ENV NVIDIA_DRIVER_CAPABILITIES ${NVIDIA_DRIVER_CAPABILITIES:+$NVIDIA_DRIVER_CAPABILITIES,}compute,utility
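
The :- and :+ forms are standard shell-style parameter expansions (Dockerfile ENV substitution understands them as well), so capabilities set by another extension such as --nvidia are kept rather than overwritten. A quick illustration of how the second line resolves:

# with the variable unset, only the new capabilities are emitted
unset NVIDIA_DRIVER_CAPABILITIES
echo "${NVIDIA_DRIVER_CAPABILITIES:+$NVIDIA_DRIVER_CAPABILITIES,}compute,utility"
# -> compute,utility

# with the variable already set (e.g. to graphics), it is extended instead
NVIDIA_DRIVER_CAPABILITIES=graphics
echo "${NVIDIA_DRIVER_CAPABILITIES:+$NVIDIA_DRIVER_CAPABILITIES,}compute,utility"
# -> graphics,compute,utility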


ENV CUDA_VERSION 11.2.1

ENV NV_CUDA_LIB_VERSION "11.2.1-1"
Collaborator

Having all of these versions embedded seems very fragile. Is there a way to do this more generically? What's the upgrade path, and how can we let people adjust the version, looking forward to future releases and other platforms?

Contributor Author

The latest version gets updated constantly. It's now at 11.6.0 (https://gitlab.com/nvidia/container-images/cuda/blob/master/dist/11.6.0/ubuntu2004/base/Dockerfile)

How could we exploit https://gitlab.com/nvidia/container-images/cuda/-/blob/master/manifests/cuda.yaml? If we could load this during the build and have the user type the wanted CUDA version (e.g. 11.6.0), it could be much more general.
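
One hedged sketch of that idea, assuming the dist/<version>/<ubuntu release>/base/Dockerfile layout linked above stays stable: resolve NVIDIA's own pins at snippet-generation time from a user-supplied CUDA version instead of hard-coding them.

# hypothetical: CUDA_VERSION and UBUNTU_RELEASE would come from a user option
CUDA_VERSION=11.6.0
UBUNTU_RELEASE=ubuntu2004
curl -sL "https://gitlab.com/nvidia/container-images/cuda/-/raw/master/dist/${CUDA_VERSION}/${UBUNTU_RELEASE}/base/Dockerfile" \
    | grep '^ENV '    # reuse NVIDIA's own ENV pins (CUDA_VERSION, NV_CUDA_LIB_VERSION, ...)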

Collaborator

As you mention, the latest gets updated frequently. How can we use the latest version without hard-coding all these versions in our codebase? If they're changing the version constantly, do we have to check that it works and update our codebase too? How do we get notified and know what to update when? Will people's usage break without updates? Will we break users if we update too much? How do our users know/declare what versions they get/want?

@woensug-choi
Contributor Author

What kinds of unit tests do we need? I've been using this branch with a Gazebo plugin that requires CUDA compilation.

What do you mean by the layer story? Both the devel and runtime images are based on the base image.

@tfoote
Collaborator

tfoote commented Mar 8, 2022

For tests, we need tests that provide coverage of the code paths. They need to run examples that exercise those code paths and the CUDA capabilities, so that we know it's still working when we deploy it. We have 84% code coverage on the non-nvidia stuff; if you run it locally with the nvidia tests it's higher. But your content has zero unit tests.

For the story, we need to be able to explain to users: if you want this feature, do this; if you want that feature, do that; and if you want this and that, do they work together? In particular, since this is taking the capabilities of #126 and going further, it should probably be done as a series of layers that first enable the base and then install the specific tools, so that you have core nvidia, then extend it for the runtime, and then extend that for dev tools, instead of having each plugin potentially installing different, separate versions of the CUDA tools.
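
A rough sketch of that layering, with hypothetical package names (these are not the actual rocker snippets):

# layer 1 (--nvidia): expose the GPU and the needed driver capabilities
ENV NVIDIA_VISIBLE_DEVICES ${NVIDIA_VISIBLE_DEVICES:-all}
ENV NVIDIA_DRIVER_CAPABILITIES ${NVIDIA_DRIVER_CAPABILITIES:+$NVIDIA_DRIVER_CAPABILITIES,}compute,utility

# layer 2 (runtime): CUDA runtime libraries on top of the base
RUN apt-get update && apt-get install -y --no-install-recommends \
    cuda-runtime-11-2 \
    && rm -rf /var/lib/apt/lists/*

# layer 3 (development): compiler and headers on top of the runtime
RUN apt-get update && apt-get install -y --no-install-recommends \
    cuda nvidia-cuda-dev \
    && rm -rf /var/lib/apt/lists/*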

@tfoote tfoote added the nvidia (Connected to the nvidia extension) label Mar 9, 2022
@woensug-choi
Contributor Author

Hmm... Very much unfamiliar with the code coverage concept. Let me get back to you when I understand what it is about :D

@tfoote
Collaborator

tfoote commented Mar 17, 2022

There's an overview of how to run the tests and you'll see reports here: https://github.com/osrf/rocker#testing

@woensug-choi
Contributor Author

RUN apt-get update && apt-get install -y --no-install-recommends \
    cuda nvidia-cuda-dev \
    && rm -rf /var/lib/apt/lists/*

@tfoote I've made a change to use the package manager. It now avoids specifying versions for dependent packages. I've tested it in WSL Ubuntu 20.04, and it should also work for a generic Ubuntu installation.

Collaborator

@tfoote tfoote left a comment

I've spent a few hours converging #182 and this into the branch for #126, following the installation instructions from https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network as closely as possible.

There seems to be an issue with installing cuda and nvidia-cuda-dev simultaneously, but I've run out of time to iterate on that right now, so I've pushed it into the draft.

@@ -0,0 +1,23 @@
RUN apt-get update && apt-get install -y --no-install-recommends \
Collaborator

I also found that nvidia-cuda-dev is available all the way back to bionic (https://packages.ubuntu.com/bionic/nvidia-cuda-dev), so by default we shouldn't necessarily use the NVIDIA packages but instead leverage the system ones.

Contributor Author

I've tried it with and without the nvidia-cuda-dev installation block; both worked for me. There's no need to include nvidia-cuda-dev for pre-caching purposes.

@tfoote
Collaborator

tfoote commented Jul 13, 2022

@woensug-choi I think I've merged all of this into #126; could you verify that it meets your needs now?

Note: I changed the argument to --cuda

@woensug-choi
Contributor Author

@tfoote I confirm that everything I need is merged into #126. Closing this PR.

@tfoote
Collaborator

tfoote commented Jul 14, 2022

Great, thanks for verifying.

@woensug-choi woensug-choi deleted the cuda-dev branch November 30, 2022 22:58
Labels
nvidia (Connected to the nvidia extension)