
Enable GPUs on the Carbon Plan Azure hub #930

Closed · 3 tasks done
choldgraf opened this issue Jan 15, 2022 · 41 comments

@choldgraf (Member) commented Jan 15, 2022

Description of problem and opportunity to address it

Joe with the Carbon Plan hub asked if we could enable GPUs on their infrastructure. Here are some guidelines that they provided:

  • Ideal timeline: Operational in the next 2 weeks
  • Constraints: 1 T4 GPU per user pod. The Standard_NC16as_T4_v3 looks like it would work well for us.

Implementation guide and constraints

Their profile list is here:

profileList:
  # The mem-guarantees are here so k8s doesn't schedule other pods
  # on these nodes.
  - display_name: "Small: r5.large"
    description: "~2 CPU, ~15G RAM"
    kubespawner_override:
      # Explicitly unset mem_limit, so it overrides the default memory limit we set in
      # basehub/values.yaml
      mem_limit: null
      mem_guarantee: 12G
      node_selector:
        node.kubernetes.io/instance-type: r5.large

Though this might require some change on the cluster itself to enable these machines first? I believe that @sgibson91 set this one up so maybe she could advise on what needs to happen.
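For the GPU machines themselves, a sketch of what an analogous profile entry might look like is below. The instance-type label value, memory figures, and description are assumptions based on the constraints above (Standard_NC16as_T4_v3, one T4 per user pod) rather than a tested configuration; extra_resource_limits is the kubespawner option that maps onto the pod's resources.limits.

  - display_name: "GPU: Standard_NC16as_T4_v3"
    description: "~16 CPU, ~110G RAM, 1 NVIDIA T4 GPU"   # rough figures, not verified
    kubespawner_override:
      mem_limit: null
      mem_guarantee: 100G          # assumed guarantee, leaving headroom for system pods
      node_selector:
        node.kubernetes.io/instance-type: Standard_NC16as_T4_v3
      extra_resource_limits:
        nvidia.com/gpu: "1"        # one T4 per user pod, per the constraint above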

Updates and ongoing work

@choldgraf choldgraf changed the title Enable GPUs on the Carbon Plan hub Enable GPUs on the Carbon Plan Azure hub Jan 15, 2022
@sgibson91 (Member)

I have updated the top comment with next steps. I'm the only one with direct access to the cluster so I'll have to do the terraform part.

@sgibson91 (Member)

After we changed some network settings to allow AzureFile to be used as an NFS server in #887, I was no longer able to edit the cluster due to the below error (also documented in #890)

$ tf plan -var-file=projects/carbonplan.tfvars -out=carbonplan -refresh-only

[snip]

Error: shares.Client#GetProperties: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailure" Message="This request is not authorized to perform this operation.\nRequestId:a7c9006d-d01a-004f-5e8d-0be45f000000\nTime:2022-01-17T10:34:16.3962556Z"

with azurerm_storage_share.homes,
on storage.tf line 21, in resource "azurerm_storage_share" "homes":
21: resource "azurerm_storage_share" "homes" {

@yuvipanda @GeorgianaElena did you ever fix this on the UToronto cluster?

I'm blocked moving forward on this issue until I regain access to the infrastructure via terraform

@sgibson91 (Member) commented Jan 17, 2022

Temporary fix and link to upstream work that will provide the actual fix here: #890 (comment)

@sgibson91 (Member)

Added an extra TODO item here around installing GPU drivers. See #931 (comment)

@choldgraf (Member, Author)

I'm the only one with direct access to the cluster so I'll have to do the terraform part.

What are the steps we need to take to give others access?

@sgibson91 (Member) commented Jan 18, 2022

Probably ask @jhamman

@jhamman commented Jan 18, 2022

We can set others up with access. I just need names / email addresses.

@choldgraf (Member, Author) commented Jan 18, 2022

@jhamman could you add everybody defined in the open infrastructure team here: https://team-compass.2i2c.org/en/latest/about/team.html#open-infrastructure-team ?

@jhamman commented Jan 19, 2022

@jhamman could you add everybody defined in the open infrastructure team

@damianavila damianavila moved this from In progress to Blocked in DEPRECATED Engineering and Product Backlog Feb 2, 2022
@choldgraf (Member, Author)

Status update from team meeting

This is currently blocked - we are waiting to hear back from Azure on quota increases for certain GPU types. Once we hear back from them, we can take another shot at resolving this.

@jhamman commented Feb 2, 2022

Update from our side. It seems like Azure has approved a quota increase for us under the Standard NC Family, so we could try adding Standard_NC6, which comes with an NVIDIA Tesla K80.

Update 2: We now have quota for the following families:

Standard NC Family - 48 vCPUs
Standard NCSv3 Family - 64 vCPUs
Standard NC Promo Family - 48 vCPUs

@yuvipanda (Member)

@jhamman looks like there still isn't enough quota. I tried to manually scale the nodepool out in the Azure portal, and got this (and of course, the text wasn't copy-pasteable):

[screenshot: Azure portal error when scaling out the GPU node pool]

So I went to the URL there and tried to ask for new nodes, and got this after a few minutes of 'processing'.

[screenshot: request failing after a few minutes of 'processing']

Looking at the quota page, it looks like there aren't any quotas?

[screenshot: quota page showing no GPU quotas]

@jhamman can you confirm the quotas are present in the right region (West Europe, I think)?

@sgibson91 (Member)

I also just changed the node machine type to Standard_NC24s_v3 which I thought we'd have access to from #930 (comment)

@sgibson91 (Member)

Hmmm, the console still says Standard_NC16as_T4_v3 and I don't think we have quota for the AS family

@sgibson91 (Member)

@sgibson91 I use nvidia-smi in the commandline to test it.

Thank you!

@consideRatio (Contributor)

We could have a session about this if you want, Sarah - I'd like to not pick it up alone though --- I'm lost in Azure / AKS overall.

I have a quite good grasp of how I think the k8s parts should be done, and have done it on Amazon/Google clouds before. Overall, my understanding and my strategy for debugging the situation can be summarized like this, I think.

My understanding summarized

  • A k8s daemonset resource is supposed to schedule one pod on each node that matches the daemonset's scheduling criteria.
  • We need two separate daemonsets to prepare the GPU nodes, and both need to function properly.
  1. The device plugin daemonset (I think it makes nodes not-ready until the GPU is attached and made available to pods).
  2. The driver installer daemonset (I think it tolerates running on non-ready nodes with a GPU attached, and makes the node ready after it has installed a driver).
  3. Normal pods need to request nvidia.com/gpu resources (then they automatically get a toleration for being on a GPU node as well).

Debugging strategy

  1. Verify that by adding a node from a GPU node pool, the device plugin daemonset will get a pod started on that node, and that the logs of that pod seem to indicate that it's all good.
  2. Verify that the driver installer daemonset schedules a pod on the GPU node after the device plugin daemonset pod has done its job, and that the driver installer pod reports success.
    1. If this fails, there may very well be an issue related to which driver version is installed and whether the GPU supports that driver version. Typically, you want a modern driver version to increase compatibility with software like cudatoolkit, tensorflow, etc., but you must use an old enough driver to be compatible with the GPU. The daemonset installing the driver can be configured with which driver version to install, I think.
  3. Verify that a pod with a request for nvidia.com/gpu can schedule on the node, and when started, can run nvidia-smi successfully from a terminal (a minimal test pod sketch follows this list).
  4. The final part would be to consider autoscaling of a GPU node by letting the node pool with GPUs scale up dynamically. If autoscaling fails:
    1. Look into the pod's associated events via kubectl describe pod ... - did the cluster autoscaler report it wanted to scale up or not?
    2. Look into the cluster-autoscaler-status configmap in the kube-system namespace - does it want to scale up the relevant node pool?
    3. Look into the cloud provider web console's logs - is the scale-up event triggered by the cluster-autoscaler failing for some reason?
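A minimal sketch of the kind of throwaway pod that could be used for step 3 above - the pod name and image tag here are assumptions for illustration, not something taken from this cluster:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                                # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.4.3-base-ubuntu20.04      # assumed public CUDA base image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                           # this request is what makes the scheduler place it on a GPU node

If the pod schedules and nvidia-smi prints a device table, the node, driver, and device plugin are all working together.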

@sgibson91 (Member)

I am now able to get a pod up and running on the GPU node, but it does not seem that the daemonset is working, since the GPU is not available from within the server.

[screenshot: TensorFlow inside the user server reporting that no GPU is available]

(I used TensorFlow because nvidia-smi was not installed in the pangeo ml-notebook image, and I didn't want to waste time trying to figure out how to force a pod to be created on a specific node with a different image.)

@consideRatio (Contributor)

the daemonset

Overall I'm confused about having only one daemonset, as I think there may be a need for two separate daemonsets - device plugin + driver installer. Perhaps the issue is that you only have a device plugin daemonset, but not a driver installer, or similar?

nvidia-smi is something I believe gets installed by the nvidia driver installer daemonset, so its absence is also an indication that the driver isn't installed but should be.

@sgibson91 (Member) commented Feb 8, 2022

Overall I'm confused about having only one daemonset, as I think there may be a need for two separate deamonset - device plugin + driver installer. Perhaps the issue is that you only have a device plugin daemonset, but not a driver installer, or similar?

That is most likely correct - I've never done this before so I don't know.

These were the docs we were looking at, which only suggest one daemonset: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#manually-install-the-nvidia-device-plugin
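For reference, the manifest those docs describe is roughly of this shape - a sketch from memory, so the namespace, labels, and image tag are assumptions and the linked page should be treated as authoritative:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system                  # assumed namespace
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        # allow scheduling on GPU nodes tainted with nvidia.com/gpu
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-device-plugin-ctr
          image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:v0.14.1   # tag is an assumption
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

As far as I understand, the device plugin itself only advertises nvidia.com/gpu to the kubelet rather than installing drivers, which is part of the confusion discussed below.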

@consideRatio (Contributor)

@sgibson91 yeah, I assume they just cover part of what's needed for pods in a k8s cluster to get full functionality with tools like tensorflow etc.

There are some links here about the driver installer part, I think: #932 (comment)

@consideRatio (Contributor) commented Feb 8, 2022

In https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#manually-install-the-nvidia-device-plugin they say...

Alternatively, you can deploy a DaemonSet for the NVIDIA device plugin. This DaemonSet runs a pod on each node to provide the required drivers for the GPUs.

I'm not sure that what it does is actually install the relevant driver... Maybe... Maybe not... If so, wouldn't it be reasonable to at least see an indication somewhere of which driver is to be installed?

Hmmm, well, they provide an end-to-end example... I guess it comes down to understanding the AKS device plugin - it may serve multiple purposes, while Amazon/Google have opted to have two separate daemonsets with dedicated purposes (where the driver installer is standardized, I believe).

@consideRatio (Contributor)

It seems that this may answer some questions about the drivers - they seem to be pre-installed? See https://github.com/Azure/aks-engine/blob/master/docs/topics/gpu.md#using-gpus-with-kubernetes

@consideRatio (Contributor) commented Feb 8, 2022

New summary of situation

While on Amazon/Google clouds we have used two separate daemonsets, AKS automatically installs a fixed nvidia-driver version on nodes with a GPU attached, and this version seems to be pinned here.

Due to that, I believe the issue in the screenshot below, posted earlier by @sgibson91, is related to the software installation of tensorflow and its dependencies.

[screenshot: TensorFlow failing to detect a GPU]

Software dependencies for tensorflow are very tricky. They depend on cudatoolkit, which depends on the nvidia-driver, etc...

Installing tensorflow and dependencies

This is a somewhat outdated example which may not be relevant as versions may not match etc., but it could be helpful for adjusting what's to be installed in the image.

        # TensorFlow
            # ref: https://www.tensorflow.org/install/gpu
            #
            # For dependency inspection:
            # conda search --info cudatoolkit
            # conda search --info cudnn
            # conda search --info tensorflow
            #
            # 1. Tensorflow
            #    - tensorflow=2.1.* is installed by pip
            #    - require CUDA Toolkit 10.1.*
            #    - require cuDNN 7.6.*
            # 2. CUDA Toolkit
            #    - cudatoolkit=10.1.* is installed by conda
            #    - 10.1 require NVIDIA Driver >= 418.39, and 10.2 require NVIDIA Driver >= 440.33
            #      ref: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility
            # 3. cuDNN
            #    - cudnn=7.6.* is installed by conda
            #    - require cudatoolkit >= 10.2, according to conda specs
            #
            cudatoolkit=11.5.* \
            cudnn=8.2.* \
            # tensorflow==2.7.* \
                # installed via pip because its weird how it behaves via conda

Making sure software binaries/libraries are findable

By setting LD_LIBRARY_PATH and PATH you can make sure that the installed drivers are accessible for use by cudatoolkit, which I believe is used by tensorflow, and which is what I think may be causing the error.

You can inspect the folder /usr/local/nvidia for indications of whether the drivers have been installed and exposed successfully to the user container. I believe you should then find nvidia-smi in the bin subfolder therein.

Here are some notes I made in a Dockerfile when setting things up to function on a GKE cluster.

# GPUs on GCP NVIDIA Drivers are installed into /usr/local/nvidia by the
# nvidia-driver-installer daemonset on nodes as they start up. The crux is that
# software needs to be aware about the drivers and their location, and this can
# be tricky to configure before we have installed them within this Dockerfile.
# ref: https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
#
# The `ldconfig` CLI can inspect paths from /etc/ld.so.conf (and
# /etc/ld.so.conf.d/*) and summarize its findings in /etc/ld.so.cache, but since
# these folders are empty at the moment that would need to run after we
# installed the files.
# ref: https://linux.die.net/man/8/ldconfig
#
# LD_LIBRARY_PATH is an environment variable that can be set to the planned location
# of the drivers though, but we need to be careful not to lose the environment
# variable in a switch from a root user to a jovyan user!
# ref: https://github.com/jupyter/docker-stacks/blob/dc9744740e128ad7ca7c235d5f54791883c2ea69/base-notebook/start.sh#L102
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib64:${LD_LIBRARY_PATH}
ENV PATH=/usr/local/nvidia/bin:${PATH}

WARNING: LD_-prefixed environment variables will have trouble being retained if you use sudo or switch from a root user to a normal user etc. But, assuming we start our container directly as the intended user and don't switch users, this won't be an issue... For reference, this is a DEEP rabbit hole and I explored it further here: jupyter/docker-stacks#1052

@sgibson91 (Member)

Joe and his team manage their own image, and said to use the pangeo ml-notebook one until they had developed their own GPU-enabled one.

@sgibson91 (Member)

I ran kubectl describe node against a GPU node, and we should have 4 GPUs:

Capacity:
  attachable-volumes-azure-disk:  32
  cpu:                            24
  ephemeral-storage:              203070420Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         462318116Ki
  nvidia.com/gpu:                 4
  pods:                           110

@consideRatio (Contributor)

pangeo/ml-notebook seems to have...


I would guess that, if this image is used as a base image, the issue is environment variables. This may be an addition to the image that solves the issue:

ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib64:${LD_LIBRARY_PATH}
ENV PATH=/usr/local/nvidia/bin:${PATH}

@sgibson91 (Member)

What I did

  1. Built and pushed a new Docker image whose Dockerfile looks like this:

     FROM pangeo/ml-notebook:master

     ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib64:${LD_LIBRARY_PATH}
     ENV PATH=/usr/local/nvidia/bin:${PATH}

  2. Updated the profile's Kubespawner override to pull the new image
  3. Reran the previous code block, yielding the same result

[screenshot: TensorFlow still reporting that no GPU is available]

@consideRatio (Contributor) commented Feb 9, 2022

Was something within /usr/local/nvidia?

It seems there weren't any driver-related files exposed at /usr/local/nvidia/lib64 at least - no such directory seems to be around - hmmm...

@sgibson91 (Member)

[screenshot: directory listing showing no nvidia folder under /usr/local]

Yeah, the nvidia folder isn't there 😕

@consideRatio (Contributor)

Okay hmm...

I think we are forced to learn more details.

  1. What makes the /usr/local/nvidia folder be created/mounted at all in GKE/EKS?
    I think this is what's failing, and this is what's crucial to solve. We need our k8s container to have access to the drivers, not just have the drivers installed on the nodes automatically. Perhaps we need to provide a read-only hostPath volume to our containers, mounting it from the k8s node at /usr/local/nvidia to the same path in the container (a sketch of such a mount follows this list).
  2. What do the files in /usr/local/nvidia represent? Are they the nvidia drivers and various related libraries that may be relevant for use by tensorflow, cudatoolkit, or similar?
    It seems the installed nvidia drivers are what's found there.
  3. libcuda.so.1 is a file we need to make findable, as can be declared via LD_LIBRARY_PATH.
    • Is it something we would get by installing an nvidia driver, or something we get by installing cudatoolkit etc.?
      It seems associated with the nvidia driver, based on googling.
    • Is it expected to be found in /usr/local/nvidia/lib64 or some other location?
      It depends on the driver installation, I think. What driver installation process is used on the AKS nodes, and would it be accessible in a container thanks to the device plugin or not? Looking at this code (I'm not sure if it's relevant to look at), it indicates that it would be exposed at /usr/local/nvidia.
    • Is it findable in the container currently, as can be tested with find / -iname "libcuda.so.1" -print 2>/dev/null?
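If the hostPath idea from point 1 were tried, a minimal sketch using the z2jh singleuser.storage options might look like the below. The node-side path is an assumption carried over from the GKE convention, and it's not verified that AKS actually places anything at /usr/local/nvidia on its nodes:

singleuser:
  storage:
    extraVolumes:
      - name: nvidia-driver-root
        hostPath:
          path: /usr/local/nvidia      # assumed node-side location, per the GKE convention
    extraVolumeMounts:
      - name: nvidia-driver-root
        mountPath: /usr/local/nvidia   # same path inside the user container
        readOnly: true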

@sgibson91 (Member)

  3. Is it findable in the container currently, as can be tested with find / -iname "libcuda.so.1" -print 2>/dev/null?

This command didn't turn up anything 😕

(notebook) jovyan@jupyter-sgibson91:~$ find / -iname "libcuda.so.1" -print 2>/dev/null
(notebook) jovyan@jupyter-sgibson91:~$ 

@sgibson91 (Member)

This commit resolved the issue - we weren't requesting any GPU resources from the pod. @yuvipanda got nvidia-smi working here
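For anyone reading along, the missing piece was a GPU resource request on the user pod; in kubespawner terms that is along these lines (a sketch with an illustrative display name, not the exact commit):

profileList:
  - display_name: "GPU server"            # illustrative name
    kubespawner_override:
      image: pangeo/ml-notebook:master
      extra_resource_limits:
        nvidia.com/gpu: "1"               # without this, the pod gets no GPU even when scheduled on a GPU node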

@sgibson91 (Member)

@jhamman after a lot of back and forth and confusion (mostly from me 😅 ) this is now deployed! 🎉 🚀 The GPU nodes are still running the pangeo/ml-notebook image, and I'm also not sure if this snippet limits us to 1 GPU when there are four available. If you'd like either of those things changed, just let us know (I am on holiday next week, but anyone on the engineering team should be able to make those changes).

@consideRatio (Contributor)

Wieeee nice work @sgibson91 and @yuvipanda!!!

I'm also not sure if this snippet limits us to 1 GPU when there are four available

Pods won't be able to share GPU resources like they can share CPU resources etc., so you need to specify exactly how many GPUs you want for the pod. If you expect multiple users to want access to one GPU each, then it could make sense to have a machine with many GPUs where each user gets one GPU. But if you only have the occasional GPU user, that user should probably get all the GPUs of the machine being provisioned.
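Under that reasoning, handing a single user the whole node (the 4 GPUs reported on the Standard_NC24s_v3 machines above) would just mean raising the limit - a sketch, assuming one user pod per GPU node:

kubespawner_override:
  extra_resource_limits:
    nvidia.com/gpu: "4"   # give all four GPUs on the node to the one user pod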

@sgibson91 (Member)

Pods won't be able to share GPU resources like they can share CPU resources etc, so, you need to specify exactly how many GPUs you want for the pod.

Good to know, thank you! The request was for one user per pod, so I think it's a safe assumption to make all GPUs available to the pod? We can always wait for Joe's confirmation :)

@sgibson91 (Member)

I am closing this as we haven't had an update and work has been completed to the best of the spec provided.

Repository owner moved this from In Progress ⚡ to Done 🎉 in Sprint Board Mar 2, 2022
@choldgraf choldgraf moved this from In progress to Complete in DEPRECATED Engineering and Product Backlog Mar 11, 2022