feat: Enable ability to build GPU drivers during image build
This addition also creates a new S3 additional_component that can be used for other S3-related interactions.
NVIDIA drivers can optionally be installed using the added role. Because NVIDIA does not make the GRID drivers publicly available, this role requires an S3 endpoint, as that is probably the option most readily available to most users.
Users can use a variety of tools to create an S3 endpoint, be it AWS, Cloudflare, MinIO, or one of the many other options. With this in mind, this option seems the most logical. It also allows for an endpoint that can be secured, so the license agreement with NVIDIA is not broken by making the driver public.
Users should store their .run driver file and .tok file on the S3 endpoint. The gridd.conf will be generated based on the feature flag passed in.
drew-viles committed Jun 5, 2023
1 parent c46443b commit 5a25506
Showing 9 changed files with 237 additions and 0 deletions.
7 changes: 7 additions & 0 deletions docs/book/src/capi/capi.md
@@ -181,6 +181,13 @@ PACKER_VAR_FILES=proxy.json make build-node-ova-local-photon-3
"additional_executables": "true",
"additional_executables_destination_path": "/path/to/dest",
"additional_executables_list": "http://path/to/exec1,http://path/to/exec2",
"additional_s3": "true",
"additional_s3_endpoint": "https://path-to-s3-endpoint",
"additional_s3_access": "S3_ACCESS_KEY",
"additional_s3_secret": "S3_SECRET_KEY",
"additional_s3_bucket": "some-bucket",
"additional_s3_object": "path/to/object",
"additional_s3_destination_path": "/path/to/dest",
"additional_registry_images": "true",
"additional_registry_images_list": "plndr/kube-vip:0.3.4,plndr/kube-vip:0.3.3",
"additional_url_images": "true",
1 change: 1 addition & 0 deletions images/capi/Makefile
@@ -107,6 +107,7 @@ deps-openstack:
hack/ensure-ansible.sh
hack/ensure-packer.sh
hack/ensure-goss.sh
hack/ensure-s3.sh

.PHONY: deps-qemu
deps-qemu: ## Installs/checks dependencies for QEMU builds
@@ -21,3 +21,6 @@
- import_tasks: url.yml
  when: additional_url_images | bool

- import_tasks: s3.yml
  when: additional_s3 | bool

24 changes: 24 additions & 0 deletions images/capi/ansible/roles/load_additional_components/tasks/s3.yml
@@ -0,0 +1,24 @@
# Copyright 2023 The Kubernetes Authors.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
- name: Download additional from S3
  amazon.aws.s3_object:
    endpoint_url: "{{ additional_s3_endpoint }}"
    access_key: "{{ additional_s3_access }}"
    secret_key: "{{ additional_s3_secret }}"
    bucket: "{{ additional_s3_bucket }}"
    object: "{{ additional_s3_object }}"
    dest: "{{ additional_s3_destination_path }}"
    mode: get
    ceph: "{{ additional_s3_ceph }}"
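
Before a build, it can be useful to confirm that the object is reachable with the same endpoint, bucket and key that the task above will use. The sketch below is not part of the commit. It assumes the AWS CLI is installed on the build machine, and it only prints the equivalent `aws s3 cp` command rather than running it. The function name `s3_get_cmd` and the example values are hypothetical; credentials would come from the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of image-builder: prints the AWS CLI command
# that fetches the same object the Ansible task downloads.
# Arguments mirror additional_s3_endpoint, additional_s3_bucket,
# additional_s3_object and additional_s3_destination_path.
s3_get_cmd() {
  local endpoint="$1" bucket="$2" object="$3" dest="$4"
  printf 'aws s3 cp s3://%s/%s %s --endpoint-url %s\n' \
    "${bucket}" "${object}" "${dest}" "${endpoint}"
}
```

For example, `s3_get_cmd https://path-to-s3-endpoint some-bucket path/to/object /path/to/dest` prints a command that can be reviewed and then run by hand.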
26 changes: 26 additions & 0 deletions images/capi/ansible/roles/nvidia/README.md
@@ -0,0 +1,26 @@
# NVIDIA GPU driver installation

To install the NVIDIA GPU driver as part of the image build process, you must have a `.run` file and `.tok` file from NVIDIA ready and available from an S3 endpoint.

Then all you need to do is reference those files in your packer file.

An example of the fields you need is shown below. Review the values and change them where required.

```json
{
  "ansible_user_vars": "nvidia_s3_url=https://s3-endpoint nvidia_bucket=nvidia nvidia_bucket_access=ACCESS_KEY nvidia_bucket_secret=SECRET_KEY nvidia_installer_location=NVIDIA-Linux-x86_64-525.85.05-grid.run nvidia_tok_location=client_configuration_token.tok gridd_feature_type=4",
  "node_custom_roles_pre": "nvidia"
}

```

The role must be installed via the `node_custom_roles_pre` option to avoid a known issue: if a dist-upgrade installs a new kernel,
the driver won't work with it when the image is booted. This is because the DKMS hook doesn't run when the driver
is installed after the kernel. To get around this, we install the driver first.

The `nvidia` custom role uses the `s3` tasks of the `load_additional_components` role to fetch the required items from an S3 endpoint.

An S3 endpoint is required because NVIDIA will soon (July 2023) no longer support an internal licensing server hosted by a customer.

As a result, they now require a `.tok` file to be available for licensing via their cloud services.
This file contains sensitive information and is unique to the company/license to which it is provided.
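
The fields from the README example can be assembled into a var file with a small script. This is a sketch rather than part of the commit: the function name `nvidia_packer_vars` and the environment variable names `NVIDIA_BUCKET_ACCESS` and `NVIDIA_BUCKET_SECRET` are hypothetical. It also sets `nvidia_ceph=false`, on the assumption that the role defines no default for the `nvidia_ceph` variable that its tasks reference.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of image-builder: prints a Packer var file
# containing the fields from the README example.
# Credentials are read from NVIDIA_BUCKET_ACCESS / NVIDIA_BUCKET_SECRET
# (assumed names) so they are not typed on the command line.
nvidia_packer_vars() {
  local endpoint="$1" bucket="$2" installer="$3" token="$4" feature_type="$5"
  local user_vars="nvidia_s3_url=${endpoint} nvidia_bucket=${bucket}"
  user_vars+=" nvidia_bucket_access=${NVIDIA_BUCKET_ACCESS:?must be set}"
  user_vars+=" nvidia_bucket_secret=${NVIDIA_BUCKET_SECRET:?must be set}"
  user_vars+=" nvidia_installer_location=${installer}"
  user_vars+=" nvidia_tok_location=${token}"
  user_vars+=" gridd_feature_type=${feature_type}"
  # The role's tasks reference nvidia_ceph, which the README example omits;
  # setting it to false here is an assumption.
  user_vars+=" nvidia_ceph=false"
  cat <<EOF
{
  "ansible_user_vars": "${user_vars}",
  "node_custom_roles_pre": "nvidia"
}
EOF
}
```

Redirecting the output to a file (for example `nvidia.json`) gives something that can be passed through `PACKER_VAR_FILES`. Because the file contains the bucket credentials, it should stay out of version control.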
120 changes: 120 additions & 0 deletions images/capi/ansible/roles/nvidia/tasks/main.yml
@@ -0,0 +1,120 @@
# Copyright 2023 The Kubernetes Authors.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

- name: unload nouveau
  modprobe:
    name: nouveau
    state: absent
  ignore_errors: true

- name: Add NVIDIA package signing key
  ansible.builtin.apt_key:
    url: https://nvidia.github.io/libnvidia-container/gpgkey
  when: ansible_distribution == "Debian"

- name: perform a cache update
  apt:
    force_apt_get: True
    update_cache: True
  register: apt_lock_status
  until: apt_lock_status is not failed
  retries: 5
  delay: 10
  when: ansible_distribution == "Debian"

- name: Install packages for interacting with s3 endpoint & building NVIDIA driver kernel module
  become: true
  ansible.builtin.apt:
    pkg:
      - python3-boto3
      - python3-botocore
      - build-essential
      - wget
      - dkms
  when: ansible_distribution == "Debian"

- name: Make /etc/nvidia/ClientConfigToken directory
  become: true
  file:
    path: /etc/nvidia/ClientConfigToken
    state: directory
    owner: root
    group: root
    mode: 0755

- name: Download NVIDIA License Token
  ansible.builtin.include_role:
    name: load_additional_components
  vars:
    additional_s3: true
    additional_s3_endpoint: "{{ nvidia_s3_url }}"
    additional_s3_access: "{{ nvidia_bucket_access }}"
    additional_s3_secret: "{{ nvidia_bucket_secret }}"
    additional_s3_bucket: "{{ nvidia_bucket }}"
    additional_s3_ceph: "{{ nvidia_ceph }}"
    additional_s3_object: "{{ nvidia_tok_location }}"
    additional_s3_destination_path: /etc/nvidia/ClientConfigToken/client_configuration_token.tok

- name: Set Permissions of NVIDIA License Token
  file:
    path: /etc/nvidia/ClientConfigToken/client_configuration_token.tok
    state: file
    owner: root
    group: root
    mode: 0744

- name: Create GRIDD licensing config
  become: true
  template:
    src: templates/gridd.conf.j2
    dest: /etc/nvidia/gridd.conf
    mode: 0644

- name: Download NVIDIA driver
  ansible.builtin.include_role:
    name: load_additional_components
  vars:
    additional_s3: true
    additional_s3_endpoint: "{{ nvidia_s3_url }}"
    additional_s3_access: "{{ nvidia_bucket_access }}"
    additional_s3_secret: "{{ nvidia_bucket_secret }}"
    additional_s3_bucket: "{{ nvidia_bucket }}"
    additional_s3_ceph: "{{ nvidia_ceph }}"
    additional_s3_object: "{{ nvidia_installer_location }}"
    additional_s3_destination_path: /tmp/NVIDIA-Linux-gridd.run

- name: Set Permissions of NVIDIA driver
  file:
    path: /tmp/NVIDIA-Linux-gridd.run
    state: file
    owner: root
    group: root
    mode: 0755

- name: Install NVIDIA driver
  become: true
  ansible.builtin.command:
    cmd: "/tmp/NVIDIA-Linux-gridd.run -s --dkms --no-cc-version-check"

- name: Cleanup packages for interacting with s3 endpoint
  become: true
  ansible.builtin.apt:
    state: absent
    purge: true
    pkg:
      - python3-boto3
      - python3-botocore
  when: ansible_distribution == "Debian"
15 changes: 15 additions & 0 deletions images/capi/ansible/roles/nvidia/templates/gridd.conf.j2
@@ -0,0 +1,15 @@
# Copyright 2023 The Kubernetes Authors.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FeatureType={{ gridd_feature_type }}
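
To see what the template produces for a given `gridd_feature_type`, the rendering can be imitated in a few lines of shell. This is only an illustration: the real file is rendered by Ansible's `template` module and also carries the license header, and the function name `render_gridd_conf` and its input check are hypothetical additions.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of image-builder: prints the single setting
# that gridd.conf.j2 renders for a given gridd_feature_type value.
render_gridd_conf() {
  local feature_type="${1:-}"
  # Reject empty or non-numeric input (an added safeguard; the template
  # itself performs no validation).
  case "${feature_type}" in
    ''|*[!0-9]*)
      echo "gridd_feature_type must be a non-negative integer" >&2
      return 1
      ;;
  esac
  printf 'FeatureType=%s\n' "${feature_type}"
}
```

With the README's example value, `render_gridd_conf 4` prints `FeatureType=4`, which is the line that ends up in `/etc/nvidia/gridd.conf`.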
33 changes: 33 additions & 0 deletions images/capi/hack/ensure-s3.sh
@@ -0,0 +1,33 @@
#!/usr/bin/env bash

# Copyright 2023 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -o errexit
set -o nounset
set -o pipefail

[[ -n ${DEBUG:-} ]] && set -o xtrace

source hack/utils.sh

# Change directories to the parent directory of the one in which this
# script is located.
cd "$(dirname "${BASH_SOURCE[0]}")/.."

# Disable pip's version check and root user warning
export PIP_DISABLE_PIP_VERSION_CHECK=1 PIP_ROOT_USER_ACTION=ignore

# S3 interaction requires the following galaxy collection
ansible-galaxy collection install amazon.aws
8 changes: 8 additions & 0 deletions images/capi/packer/config/additional_components.json
@@ -4,6 +4,14 @@
"additional_executables_list": "",
"additional_registry_images": "false",
"additional_registry_images_list": "",
"additional_s3": "false",
"additional_s3_access": "",
"additional_s3_bucket": "",
"additional_s3_ceph": "false",
"additional_s3_destination_path": "",
"additional_s3_endpoint": "",
"additional_s3_object": "",
"additional_s3_secret": "",
"additional_url_images": "false",
"additional_url_images_list": "",
"load_additional_components": "false"