Enable GPUs on the Carbon Plan Azure hub #930
Comments
I have updated the top comment with next steps. I'm the only one with direct access to the cluster so I'll have to do the terraform part. |
After we changed some network settings to allow AzureFile to be used as an NFS server in #887, I was no longer able to edit the cluster due to the below error (also documented in #890)
@yuvipanda @GeorgianaElena did you ever fix this on the UToronto cluster? I'm blocked moving forward on this issue until I regain access to the infrastructure via terraform |
Temporary fix and link to upstream work that will provide the actual fix here: #890 (comment) |
Added an extra TODO item here around installing GPU drivers. See #931 (comment) |
What are the steps we need to take to give others access? |
Probably ask @jhamman |
We can set others up with access. I just need names / email addresses. |
@jhamman could you add everybody defined in the open infrastructure team here: https://team-compass.2i2c.org/en/latest/about/team.html#open-infrastructure-team ? |
✅ |
Status update from team meeting

This is currently blocked - we are waiting to hear back from Azure on quota increases for certain GPU types. Once we hear back from them we can give a shot at resolving this again. |
Update from our side. It seems like Azure has approved a quota increase for us under the

Update 2: We now have quota for the following families:

Standard NC Family - 48 vCPUs |
@jhamman looks like it's not enough quota still. I tried to manually scale the nodepool out in the Azure portal, and got this (and of course, the text wasn't copy-pasteable).

So I went to the URL there and tried to ask for new nodes, and got this after a few minutes of 'processing'.

Looking at the quota page, it looks like there aren't any quotas? @jhamman can you confirm the quotas are present in the right region (West Europe I think)? |
I also just changed the node machine type to |
Hmmm, the console still says |
Thank you! |
We could have a session about this if you want Sarah, I'd like to not pick it up alone though --- I'm lost in the Azure / AKS side of things overall. I have quite a good grasp on how I think the k8s parts should be done, and have done it on Amazon/Google clouds before. Overall, my understanding and strategy to debug the situation can be summarized like this I think.

My understanding summarized

Debugging strategy
|
I am now able to get a pod up and running on the GPU node, but it does not seem that the daemonset is working, since the GPU is not available from within the server (I used TensorFlow because |
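For reference, a minimal smoke-test pod along these lines can help separate driver/scheduling problems from image problems: if it schedules and `nvidia-smi` prints the GPU, the device plugin and drivers are fine and the remaining issue is in the user image. The image tag and the `sku=gpu` toleration below are assumptions and may need adjusting (or dropping) for this cluster.

```yaml
# Hypothetical smoke-test pod: request one GPU from the device plugin and run
# nvidia-smi. If this prints the GPU, drivers + device plugin are fine and the
# remaining problem is in the user image/environment.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  # Assumed taint on the GPU node pool (as in the AKS docs) - adjust or drop
  # to match how the node pool was actually created.
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  containers:
    - name: cuda
      image: nvidia/cuda:11.0.3-base-ubuntu20.04  # any CUDA base image should do
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```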
Overall I'm confused about having only one daemonset, as I think there may be a need for two separate daemonsets - device plugin + driver installer. Perhaps the issue is that you only have a device plugin daemonset, but not a driver installer, or similar?
|
That is most likely correct - I've never done this before so I don't know. These were the docs we were looking at, which only suggested one daemonset https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#manually-install-the-nvidia-device-plugin |
@sgibson91 yeah, I assume they just look at part of what's needed for full functionality for pods in a k8s cluster to use them with tools like tensorflow etc. There are some links here about the driver installer part I think: #932 (comment) |
In https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#manually-install-the-nvidia-device-plugin they say...
I'm not sure what's done is actually to install the relevant driver... Maybe... Maybe not... If so, wouldn't it be reasonable to see an indication of what driver is to be installed somewhere at least? Hmmm, well, they provide an end-to-end example... Well... I guess it comes down to understanding the AKS device plugin - it may serve multiple purposes, while Amazon/Google have opted to have two separate daemonsets (where the driver installed is standardized, I believe) with dedicated purposes. |
It seems that this may answer some questions about the drivers - they seem to be pre-installed? See https://github.com/Azure/aks-engine/blob/master/docs/topics/gpu.md#using-gpus-with-kubernetes |
New summary of situation

While on Amazon/Google clouds we have used two separate daemonsets, AKS automatically installs a fixed nvidia-driver version on nodes with a GPU attached, and this version seems to be pinned here. Due to that, I believe the issue in the image below posted earlier by @sgibson91 is related to the software installation of TensorFlow and its dependencies. Software dependencies for TensorFlow are very tricky: they depend on cudatoolkit, which depends on the nvidia-driver, etc...

Installing tensorflow and dependencies

This is a somewhat outdated example which may not be relevant as versions may not match etc, but could be helpful for adjusting what's to be installed in the image.

```
# TensorFlow
# ref: https://www.tensorflow.org/install/gpu
#
# For dependency inspection:
# conda search --info cudatoolkit
# conda search --info cudnn
# conda search --info tensorflow
#
# 1. Tensorflow
# - tensorflow=2.1.* is installed by pip
# - require CUDA Toolkit 10.1.*
# - require cuDNN 7.6.*
# 2. CUDA Toolkit
# - cudatoolkit=10.1.* is installed by conda
# - 10.1 require NVIDIA Driver >= 418.39, and 10.2 require NVIDIA Driver >= 440.33
# ref: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility
# 3. cuDNN
# - cudnn=7.6.* is installed by conda
# - require cudatoolkit >= 10.2, according to conda specs
#
cudatoolkit=11.5.* \
cudnn=8.2.* \
# tensorflow==2.7.* \
# installed via pip because it's weird how it behaves via conda
```

Making sure software binaries/libraries are findable

By setting `LD_LIBRARY_PATH` (and `PATH`) to where the drivers end up, the software can find them. You can inspect the folder `/usr/local/nvidia` on a node to see what is actually installed there. Here are some notes I made in a Dockerfile when setting things up to function on a GKE cluster.

```
# GPUs on GCP: NVIDIA Drivers are installed into /usr/local/nvidia by the
# nvidia-driver-installer daemonset on nodes as they start up. The crux is that
# software needs to be aware about the drivers and their location, and this can
# be tricky to configure before we have installed them within this Dockerfile.
# ref: https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
#
# `ldconfig` the CLI can inspect paths from /etc/ld.so.conf (and
# /etc/ld.so.conf.d/*) and summarize its findings in /etc/ld.so.cache, but since
# these folders are empty at the moment that would need to run after we
# installed the files.
# ref: https://linux.die.net/man/8/ldconfig
#
# LD_LIBRARY_PATH is an environment variable that can be set to the planned location
# of the drivers though, but we need to be careful to not lose the environment
# variable in a switch from a root user to a jovyan user!
# ref: https://github.com/jupyter/docker-stacks/blob/dc9744740e128ad7ca7c235d5f54791883c2ea69/base-notebook/start.sh#L102
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib64:${LD_LIBRARY_PATH}
ENV PATH=/usr/local/nvidia/bin:${PATH}
```

WARNING: |
Joe and his team manage their own image and said to use the pangeo ml notebook one until they had developed their own GPU-enabled one |
And if I understand this correctly, then cudatoolkit should be installed: https://github.com/pangeo-data/pangeo-docker-images/blob/66015a25e0029dc95d6367349d91b74e844a9bc0/ml-notebook/environment.yml#L7 |
I ran
|
pangeo/ml-notebook seems to have... I would guess that if this image is used as a base image, the issue is environment variables. This may be an addition to the image that solves the issues.
|
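If rebuilding the image is awkward, another option (not discussed above, just an alternative) would be to inject the variable through the hub chart config instead, assuming the standard zero-to-jupyterhub `singleuser.extraEnv` setting and the top-level `jupyterhub:` key used when the chart is a dependency. The path mirrors the GKE Dockerfile notes earlier in the thread and only helps if the driver files actually show up there on the AKS nodes; extending `PATH` this way is less straightforward, since `extraEnv` replaces rather than appends, so that part is better kept in the image.

```yaml
# Hypothetical alternative to rebuilding the image: inject the variable via the
# zero-to-jupyterhub chart config. The path assumes drivers are exposed under
# /usr/local/nvidia, which still has to be verified on the AKS nodes.
jupyterhub:
  singleuser:
    extraEnv:
      LD_LIBRARY_PATH: "/usr/local/nvidia/lib64"
```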
What I did
|
Was something within /usr/local/nvidia? It seems there weren't driver-related files exposed at /usr/local/nvidia/lib64 at least - as no such directory seems to be around - hmmm... |
Okay hmm... I think we are forced to learn more details.
|
This command didn't turn up anything 😕
|
This commit resolved the issue - we weren't requesting any GPU resources from the pod. @yuvipanda got |
@jhamman after a lot of back and forth and confusion (mostly from me 😅 ) this is now deployed! 🎉 🚀 The GPU nodes are still running the pangeo/ml-notebook image and I'm also not sure if this snippet limits us to 1 GPU when there are four available. If you'd like either of those things changed, just let us know (I am on holiday next week but anyone on the engineering team should be able to make those changes). |
Wieeee nice work @sgibson91 and @yuvipanda!!!
Pods won't be able to share GPU resources like they can share CPU resources etc, so you need to specify exactly how many GPUs you want for the pod. If you expect multiple users to want to start and each have access to one GPU, then it could make sense to have a machine with many GPUs where each user gets one GPU. But if you only have the occasional GPU user, that user should probably get all the GPUs of the machine being provisioned. |
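So if a single user should get the whole machine, the profile has to ask for every GPU explicitly. A minimal sketch, assuming the standard kubespawner `extra_resource_limits` mechanism and the four-GPU count mentioned above:

```yaml
# Hypothetical kubespawner override handing one user every GPU on the node.
# GPUs are whole-unit extended resources exposed by the device plugin, so the
# count must be an integer and cannot be oversubscribed.
kubespawner_override:
  extra_resource_limits:
    nvidia.com/gpu: "4"
```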
Good to know, thank you! The request was for one user per pod, so I think it's a safe assumption to make all GPUs available to the pod? We can always wait for Joe's confirmation :) |
I am closing this as we haven't had an update and work has been completed to the best of the spec provided. |
Description of problem and opportunity to address it
Joe with the Carbon Plan hub asked if we could enable GPUs on their infrastructure. Here are some guidelines that they provided:
Implementation guide and constraints
Their profile list is here:
infrastructure/config/hubs/carbonplan.cluster.yaml (lines 90 to 101 in 943cae3)
Though this might require some change on the cluster itself to enable these machines first? I believe that @sgibson91 set this one up so maybe she could advise on what needs to happen.
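For orientation, a GPU entry in that profile list might look roughly like the sketch below; the display name, node selector label, and memory/GPU numbers are placeholders, and the real entry should follow the structure of the existing entries in the file.

```yaml
# Hypothetical profileList entry for a GPU server. The node pool label value
# and the memory/GPU numbers are placeholders that need to match whatever GPU
# node pool is added to the AKS cluster.
- display_name: "GPU server"
  description: "Machine with an NVIDIA GPU attached"
  kubespawner_override:
    node_selector:
      kubernetes.azure.com/agentpool: nbgpu  # placeholder node pool name
    extra_resource_limits:
      nvidia.com/gpu: "1"
    mem_guarantee: 48G
    mem_limit: 48G
```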
Updates and ongoing work