
Enable GPUs on the Carbon Plan Azure hub #930

Closed · 3 tasks done
choldgraf opened this issue Jan 15, 2022 · 41 comments

@choldgraf (Member) commented Jan 15, 2022

Description of problem and opportunity to address it

Joe with the Carbon Plan hub asked if we could enable GPUs on their infrastructure. Here are some guidelines that they provided:

  • Ideal timeline: Operational in the next 2 weeks
  • Constraints: 1 T4 GPU per user pod. The Standard_NC16as_T4_v3 looks like it would work well for us.

Implementation guide and constraints

Their profile list is here:

profileList:
  # The mem-guarantees are here so k8s doesn't schedule other pods
  # on these nodes.
  - display_name: "Small: r5.large"
    description: "~2 CPU, ~15G RAM"
    kubespawner_override:
      # Explicitly unset mem_limit, so it overrides the default memory limit we set in
      # basehub/values.yaml
      mem_limit: null
      mem_guarantee: 12G
      node_selector:
        node.kubernetes.io/instance-type: r5.large

Though this might require some change on the cluster itself to enable these machines first? I believe that @sgibson91 set this one up so maybe she could advise on what needs to happen.
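For the GPU machines themselves, a sketch of what an analogous profile entry might look like is below. The instance-type label value, memory figures, and description are assumptions based on the constraints above (Standard_NC16as_T4_v3, one T4 per user pod) rather than a tested configuration; extra_resource_limits is the kubespawner option that maps onto the pod's resources.limits.

  - display_name: "GPU: Standard_NC16as_T4_v3"
    description: "~16 CPU, ~110G RAM, 1 NVIDIA T4 GPU"   # rough figures, not verified
    kubespawner_override:
      mem_limit: null
      mem_guarantee: 100G          # assumed guarantee, leaving headroom for system pods
      node_selector:
        node.kubernetes.io/instance-type: Standard_NC16as_T4_v3
      extra_resource_limits:
        nvidia.com/gpu: "1"        # one T4 per user pod, per the constraint above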

Updates and ongoing work

@choldgraf choldgraf changed the title Enable GPUs on the Carbon Plan hub Enable GPUs on the Carbon Plan Azure hub Jan 15, 2022
@sgibson91 (Member)

I have updated the top comment with next steps. I'm the only one with direct access to the cluster so I'll have to do the terraform part.

@sgibson91 (Member)

After we changed some network settings to allow AzureFile to be used as an NFS server in #887, I was no longer able to edit the cluster due to the below error (also documented in #890)

$ tf plan -var-file=projects/carbonplan.tfvars -out=carbonplan -refresh-only

[snip]

Error: shares.Client#GetProperties: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailure" Message="This request is not authorized to perform this operation.\nRequestId:a7c9006d-d01a-004f-5e8d-0be45f000000\nTime:2022-01-17T10:34:16.3962556Z"

with azurerm_storage_share.homes,
on storage.tf line 21, in resource "azurerm_storage_share" "homes":
21: resource "azurerm_storage_share" "homes" {

@yuvipanda @GeorgianaElena did you ever fix this on the UToronto cluster?

I'm blocked moving forward on this issue until I regain access to the infrastructure via terraform

@sgibson91 (Member) commented Jan 17, 2022

Temporary fix and link to upstream work that will provide the actual fix here: #890 (comment)

@sgibson91 (Member)

Added an extra TODO item here around installing GPU drivers. See #931 (comment)

@choldgraf (Member, Author)

I'm the only one with direct access to the cluster so I'll have to do the terraform part.

What are the steps we need to take to give others access?

@sgibson91 (Member) commented Jan 18, 2022

Probably ask @jhamman

@jhamman commented Jan 18, 2022

We can set others up with access. I just need names / email addresses.

@choldgraf (Member, Author) commented Jan 18, 2022

@jhamman could you add everybody defined in the open infrastructure team here: https://team-compass.2i2c.org/en/latest/about/team.html#open-infrastructure-team ?

@jhamman commented Jan 19, 2022

@jhamman could you add everybody defined in the open infrastructure team

@damianavila damianavila moved this from In progress to Blocked in DEPRECATED Engineering and Product Backlog Feb 2, 2022
@choldgraf (Member, Author)

Status update from team meeting

This is currently blocked - we are waiting to hear back from Azure on quota increases for certain GPU types. Once we hear back from them, we can take another shot at resolving this.

@jhamman commented Feb 2, 2022

Update from our side. It seems like Azure has approved a quota increase for us under the Standard NC Family, so we could try adding Standard_NC6, which comes with an NVIDIA Tesla K80.

Update 2: We now have quota for the following families:

Standard NC Family - 48 vCPUs
Standard NCSv3 Family - 64 vCPUs
Standard NC Promo Family - 48 vCPUs

@yuvipanda (Member)

@jhamman looks like there still isn't enough quota. I tried to manually scale the nodepool out in the Azure portal, and got this (and of course, the text wasn't copy-pasteable):

[screenshot: Azure portal error when scaling out the GPU node pool]

So I went to the URL there and tried to ask for new nodes, and got this after a few minutes of 'processing'.

[screenshot: request failing after a few minutes of 'processing']

Looking at the quota page, it looks like there aren't any quotas?

[screenshot: quota page showing no GPU quotas]

@jhamman can you confirm the quotas are present in the right region (West Europe, I think)?

@sgibson91 (Member)

I also just changed the node machine type to Standard_NC24s_v3 which I thought we'd have access to from #930 (comment)

@sgibson91 (Member)

Hmmm, the console still says Standard_NC16as_T4_v3 and I don't think we have quota for the AS family

@sgibson91 (Member)

@sgibson91 I use nvidia-smi in the commandline to test it.

Thank you!

@consideRatio (Contributor)

We could have a session about this if you want, Sarah - I'd like to not pick it up alone though --- I'm lost in Azure / AKS overall.

I have a quite good grasp of how I think the k8s parts should be done, and have done it on Amazon/Google clouds before. Overall, my understanding and my strategy for debugging the situation can be summarized like this, I think.

My understanding summarized

  • A k8s daemonset resource is supposed to schedule one pod on each node that matches the daemonset's scheduling criteria.
  • We need two separate daemonsets to prepare the GPU nodes, and both need to function properly.
  1. The device plugin daemonset (I think it makes nodes not-ready until the GPU is attached and made available to pods).
  2. The driver installer daemonset (I think it tolerates running on non-ready nodes with a GPU attached, and makes the node ready after it has installed a driver).
  3. Normal pods need to request nvidia.com/gpu resources (then they automatically get a toleration for being on a GPU node as well).

Debugging strategy

  1. Verify that by adding a node from a GPU node pool, the device plugin daemonset will get a pod started on that node, and that the logs of that pod seem to indicate that it's all good.
  2. Verify that the driver installer daemonset schedules a pod on the GPU node after the device plugin daemonset pod has done its job, and that the driver installer pod reports success.
    1. If this fails, there may very well be an issue related to which driver version is installed and whether the GPU supports that driver version. Typically, you want a modern driver version to increase compatibility with software like cudatoolkit, tensorflow, etc., but you must use an old enough driver to be compatible with the GPU. The daemonset installing the driver can be configured with which driver version to install, I think.
  3. Verify that a pod with a request for nvidia.com/gpu can schedule on the node, and when started, can run nvidia-smi successfully from a terminal (a minimal test pod sketch follows this list).
  4. The final part would be to consider autoscaling of a GPU node by letting the node pool with GPUs scale up dynamically. If autoscaling fails:
    1. Look into the pod's associated events via kubectl describe pod ... - did the cluster autoscaler report it wanted to scale up or not?
    2. Look into the cluster-autoscaler-status configmap in the kube-system namespace - does it want to scale up the relevant node pool?
    3. Look into the cloud provider web console's logs - is the scale-up event triggered by the cluster-autoscaler failing for some reason?
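A minimal sketch of the kind of throwaway pod that could be used for step 3 above - the pod name and image tag here are assumptions for illustration, not something taken from this cluster:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                                # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.4.3-base-ubuntu20.04      # assumed public CUDA base image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                           # this request is what makes the scheduler place it on a GPU node

If the pod schedules and nvidia-smi prints a device table, the node, driver, and device plugin are all working together.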

@sgibson91 (Member)

I am now able to get a pod up and running on the GPU node, but it does not seem that the daemonset is working, since the GPU is not available from within the server.

[screenshot: TensorFlow inside the user server reporting that no GPU is available]

(I used TensorFlow because nvidia-smi was not installed in the pangeo ml-notebook image, and I didn't want to waste time trying to figure out how to force a pod to be created on a specific node with a different image.)

@consideRatio (Contributor)

the daemonset

Overall I'm confused about having only one daemonset, as I think there may be a need for two separate daemonsets - device plugin + driver installer. Perhaps the issue is that you only have a device plugin daemonset, but not a driver installer, or similar?

nvidia-smi is something I believe gets installed by the nvidia driver installer daemonset, so its absence is also an indication that the driver isn't installed but should be.

@sgibson91 (Member) commented Feb 8, 2022

Overall I'm confused about having only one daemonset, as I think there may be a need for two separate deamonset - device plugin + driver installer. Perhaps the issue is that you only have a device plugin daemonset, but not a driver installer, or similar?

That is most likely correct - I've never done this before so I don't know.

These were the docs we were looking at, which only suggest one daemonset: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#manually-install-the-nvidia-device-plugin
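For reference, the manifest those docs describe is roughly of this shape - a sketch from memory, so the namespace, labels, and image tag are assumptions and the linked page should be treated as authoritative:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system                  # assumed namespace
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        # allow scheduling on GPU nodes tainted with nvidia.com/gpu
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: nvidia-device-plugin-ctr
          image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:v0.14.1   # tag is an assumption
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

As far as I understand, the device plugin itself only advertises nvidia.com/gpu to the kubelet rather than installing drivers, which is part of the confusion discussed below.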

@consideRatio (Contributor)

@sgibson91 yeah, I assume they just cover part of what's needed for pods in a k8s cluster to get full functionality with tools like tensorflow etc.

There are some links here about the driver installer part, I think: #932 (comment)

@consideRatio (Contributor) commented Feb 8, 2022

In https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#manually-install-the-nvidia-device-plugin they say...

Alternatively, you can deploy a DaemonSet for the NVIDIA device plugin. This DaemonSet runs a pod on each node to provide the required drivers for the GPUs.

I'm not sure that what it does is actually install the relevant driver... Maybe... Maybe not... If so, wouldn't it be reasonable to at least see an indication somewhere of which driver is to be installed?

Hmmm, well, they provide an end-to-end example... I guess it comes down to understanding the AKS device plugin - it may serve multiple purposes, while Amazon/Google have opted to have two separate daemonsets with dedicated purposes (where the driver installer is standardized, I believe).

@consideRatio (Contributor)

It seems that this may answer some questions about the drivers - they seem to be pre-installed? See https://github.com/Azure/aks-engine/blob/master/docs/topics/gpu.md#using-gpus-with-kubernetes

@consideRatio (Contributor) commented Feb 8, 2022

New summary of situation

While on Amazon/Google clouds we have used two separate daemonsets, AKS automatically installs a fixed nvidia-driver version on nodes with a GPU attached, and this version seems to be pinned here.

Due to that, I believe the issue in the screenshot below, posted earlier by @sgibson91, is related to the software installation of tensorflow and its dependencies.

[screenshot: TensorFlow failing to detect a GPU]

Software dependencies for tensorflow are very tricky. They depend on cudatoolkit, which depends on the nvidia-driver, etc...

Installing tensorflow and dependencies

This is a somewhat outdated example which may not be relevant as versions may not match etc., but it could be helpful for adjusting what's to be installed in the image.

        # TensorFlow
            # ref: https://www.tensorflow.org/install/gpu
            #
            # For dependency inspection:
            # conda search --info cudatoolkit
            # conda search --info cudnn
            # conda search --info tensorflow
            #
            # 1. Tensorflow
            #    - tensorflow=2.1.* is installed by pip
            #    - require CUDA Toolkit 10.1.*
            #    - require cuDNN 7.6.*
            # 2. CUDA Toolkit
            #    - cudatoolkit=10.1.* is installed by conda
            #    - 10.1 require NVIDIA Driver >= 418.39, and 10.2 require NVIDIA Driver >= 440.33
            #      ref: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility
            # 3. cuDNN
            #    - cudnn=7.6.* is installed by conda
            #    - require cudatoolkit >= 10.2, according to conda specs
            #
            cudatoolkit=11.5.* \
            cudnn=8.2.* \
            # tensorflow==2.7.* \
                # installed via pip because its weird how it behaves via conda

Making sure software binaries/libraries are findable

By setting LD_LIBRARY_PATH and PATH you can make sure that the installed drivers are accessible for use by cudatoolkit, which I believe is used by tensorflow, and which is what I think may be causing the error.

You can inspect the folder /usr/local/nvidia for indications of whether the drivers have been installed and exposed successfully to the user container. I believe you should then find nvidia-smi in the bin subfolder therein.

Here are some notes I made in a Dockerfile when setting things up to function on a GKE cluster.

# GPUs on GCP NVIDIA Drivers are installed into /usr/local/nvidia by the
# nvidia-driver-installer daemonset on nodes as they start up. The crux is that
# software needs to be aware about the drivers and their location, and this can
# be tricky to configure before we have installed them within this Dockerfile.
# ref: https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
#
# The `ldconfig` CLI can inspect paths from /etc/ld.so.conf (and
# /etc/ld.so.conf.d/*) and summarize its findings in /etc/ld.so.cache, but since
# these folders are empty at the moment that would need to run after we
# installed the files.
# ref: https://linux.die.net/man/8/ldconfig
#
# LD_LIBRARY_PATH is an environment variable that can be set to the planned location
# of the drivers though, but we need to be careful not to lose the environment
# variable in a switch from a root user to a jovyan user!
# ref: https://github.com/jupyter/docker-stacks/blob/dc9744740e128ad7ca7c235d5f54791883c2ea69/base-notebook/start.sh#L102
ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib64:${LD_LIBRARY_PATH}
ENV PATH=/usr/local/nvidia/bin:${PATH}

WARNING: LD_-prefixed environment variables will have trouble being retained if you use sudo or switch from a root user to a normal user etc. But, assuming we start our container directly as the intended user and don't switch users, this won't be an issue... For reference, this is a DEEP rabbit hole and I explored it further here: jupyter/docker-stacks#1052

@sgibson91 (Member)

Joe and his team manage their own image, and said to use the pangeo ml-notebook one until they had developed their own GPU-enabled one.

@sgibson91 (Member)

I ran kubectl describe node against a GPU node, and we should have 4 GPUs:

Capacity:
  attachable-volumes-azure-disk:  32
  cpu:                            24
  ephemeral-storage:              203070420Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         462318116Ki
  nvidia.com/gpu:                 4
  pods:                           110

@consideRatio (Contributor)

pangeo/ml-notebook seems to have...


I would guess that, if this image is used as a base image, the issue is environment variables. This may be an addition to the image that solves the issue:

ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib64:${LD_LIBRARY_PATH}
ENV PATH=/usr/local/nvidia/bin:${PATH}

@sgibson91 (Member)

What I did

  1. Built and pushed a new Docker image whose Dockerfile looks like this:

     FROM pangeo/ml-notebook:master

     ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib64:${LD_LIBRARY_PATH}
     ENV PATH=/usr/local/nvidia/bin:${PATH}

  2. Updated the profile's Kubespawner override to pull the new image
  3. Reran the previous code block, yielding the same result

[screenshot: TensorFlow still reporting that no GPU is available]

@consideRatio (Contributor) commented Feb 9, 2022

Was something within /usr/local/nvidia?

It seems there weren't any driver-related files exposed at /usr/local/nvidia/lib64 at least - no such directory seems to be around - hmmm...

@sgibson91 (Member)

[screenshot: directory listing showing no nvidia folder under /usr/local]

Yeah, the nvidia folder isn't there 😕

@consideRatio (Contributor)

Okay hmm...

I think we are forced to learn more details.

  1. What makes the /usr/local/nvidia folder be created/mounted at all in GKE/EKS?
    I think this is what's failing, and this is what's crucial to solve. We need our k8s container to have access to the drivers, not just have the drivers installed on the nodes automatically. Perhaps we need to provide a read-only hostPath volume to our containers, mounting it from the k8s node at /usr/local/nvidia to the same path in the container (a sketch of such a mount follows this list).
  2. What do the files in /usr/local/nvidia represent? Are they the nvidia drivers and various related libraries that may be relevant for use by tensorflow, cudatoolkit, or similar?
    It seems the installed nvidia drivers are what's found there.
  3. libcuda.so.1 is a file we need to make findable, as can be declared via LD_LIBRARY_PATH.
    • Is it something we would get by installing an nvidia driver, or something we get by installing cudatoolkit etc.?
      It seems associated with the nvidia driver, based on googling.
    • Is it expected to be found in /usr/local/nvidia/lib64 or some other location?
      It depends on the driver installation, I think. What driver installation process is used on the AKS nodes, and would it be accessible in a container thanks to the device plugin or not? Looking at this code (I'm not sure if it's relevant to look at), it indicates that it would be exposed at /usr/local/nvidia.
    • Is it findable in the container currently, as can be tested with find / -iname "libcuda.so.1" -print 2>/dev/null?
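If the hostPath idea from point 1 were tried, a minimal sketch using the z2jh singleuser.storage options might look like the below. The node-side path is an assumption carried over from the GKE convention, and it's not verified that AKS actually places anything at /usr/local/nvidia on its nodes:

singleuser:
  storage:
    extraVolumes:
      - name: nvidia-driver-root
        hostPath:
          path: /usr/local/nvidia      # assumed node-side location, per the GKE convention
    extraVolumeMounts:
      - name: nvidia-driver-root
        mountPath: /usr/local/nvidia   # same path inside the user container
        readOnly: true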

@sgibson91 (Member)

  3. Is it findable in the container currently, as can be tested with find / -iname "libcuda.so.1" -print 2>/dev/null?

This command didn't turn up anything 😕

(notebook) jovyan@jupyter-sgibson91:~$ find / -iname "libcuda.so.1" -print 2>/dev/null
(notebook) jovyan@jupyter-sgibson91:~$ 

@sgibson91 (Member)

This commit resolved the issue - we weren't requesting any GPU resources from the pod. @yuvipanda got nvidia-smi working here
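For anyone reading along, the missing piece was a GPU resource request on the user pod; in kubespawner terms that is along these lines (a sketch with an illustrative display name, not the exact commit):

profileList:
  - display_name: "GPU server"            # illustrative name
    kubespawner_override:
      image: pangeo/ml-notebook:master
      extra_resource_limits:
        nvidia.com/gpu: "1"               # without this, the pod gets no GPU even when scheduled on a GPU node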

@sgibson91 (Member)

@jhamman after a lot of back and forth and confusion (mostly from me 😅 ) this is now deployed! 🎉 🚀 The GPU nodes are still running the pangeo/ml-notebook image, and I'm also not sure if this snippet limits us to 1 GPU when there are four available. If you'd like either of those things changed, just let us know (I am on holiday next week, but anyone on the engineering team should be able to make those changes).

@consideRatio (Contributor)

Wieeee nice work @sgibson91 and @yuvipanda!!!

I'm also not sure if this snippet limits us to 1 GPU when there are four available

Pods won't be able to share GPU resources like they can share CPU resources etc., so you need to specify exactly how many GPUs you want for the pod. If you expect multiple users to want access to one GPU each, then it could make sense to have a machine with many GPUs where each user gets one GPU. But if you only have the occasional GPU user, that user should probably get all the GPUs of the machine being provisioned.
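Under that reasoning, handing a single user the whole node (the 4 GPUs reported on the Standard_NC24s_v3 machines above) would just mean raising the limit - a sketch, assuming one user pod per GPU node:

kubespawner_override:
  extra_resource_limits:
    nvidia.com/gpu: "4"   # give all four GPUs on the node to the one user pod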

@sgibson91 (Member)

Pods won't be able to share GPU resources like they can share CPU resources etc, so, you need to specify exactly how many GPUs you want for the pod.

Good to know, thank you! The request was for one user per pod, so I think it's a safe assumption to make all GPUs available to the pod? We can always wait for Joe's confirmation :)

@sgibson91 (Member)

I am closing this as we haven't had an update and work has been completed to the best of the spec provided.

Repository owner moved this from In Progress ⚡ to Done 🎉 in Sprint Board Mar 2, 2022
@choldgraf choldgraf moved this from In progress to Complete in DEPRECATED Engineering and Product Backlog Mar 11, 2022