This chapter introduces you to GPU technology, from installing drivers to understanding both the hardware and software components. We'll install the drivers, run some GPU-based tasks, and explore the key concepts of GPU architecture.
Before we start, ensure you have an Ubuntu 22.04 server or desktop with an NVIDIA GPU. We're using a Paperspace Core VM (www.paperspace.com), but any similar setup, including your local machine, will work. Just make sure you have SSH access to the VM.
Paperspace offers both a UI and a CLI for machine creation. Choose a server with the vanilla Ubuntu 22.04 image; this setup will be used through chapter 3. From chapter 4 (Kubernetes) onward, we'll switch to a new VM setup.
Keep in mind that using these VMs incurs a cost to your account (e.g., $0.76/hr for an A4000). When you power off the machine, you only pay for the storage.
After connecting to your VM, verify the presence of NVIDIA cards.
Checking NVIDIA Cards
# Blank is expected as we are starting from a clean image
paperspace@psxgrv5u6:~$ sudo dpkg -l | grep -i nvidia
# Blank is expected as we are starting from a clean image
paperspace@psxgrv5u6:~$ sudo lsmod | grep -i nvidia
# Should display the GPU hardware
paperspace@psxgrv5u6:~$ sudo lspci | grep -i nvidia
00:05.0 VGA compatible controller: NVIDIA Corporation GA104GL [RTX A4000] (rev a1)
00:06.0 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
paperspace@psxgrv5u6:~$
In this section, we'll install the necessary drivers and toolkits for GPU operation.
A correct driver installation is key for optimal GPU performance. Ubuntu provides a straightforward process:
- Update Your System: Ensure your system is up to date.
  sudo apt update
  DEBIAN_FRONTEND=noninteractive sudo apt upgrade -y
- Install Recommended NVIDIA Drivers: Identify and install the recommended drivers. The --gpgpu flag stands for general-purpose computing on GPUs.
  ubuntu-drivers list --gpgpu
  ubuntu-drivers list --gpgpu --recommended
  sudo ubuntu-drivers install --gpgpu
- Reboot Your System: Reboot to load the new drivers.
  sudo reboot
- Install NVIDIA Utilities: Install utilities such as nvidia-smi, matching the version with your GPU driver.
  apt search nvidia-utils-535
  sudo apt install nvidia-utils-535-server
- Verify the Driver Installation: Check the installation with nvidia-smi. It should display the GPU details and driver version.
Example Output
paperspace@psq929f0nprl:~$ nvidia-smi
[Tuesday, December 26, 2023]
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A4000 Off | 00000000:00:05.0 Off | Off |
| 33% 55C P0 38W / 140W | 1MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
paperspace@psq929f0nprl:~$
Interpretation of the columns:
- GPU: This column indicates the index number of each GPU in your system. It's useful when specifying a GPU for specific tasks or monitoring.
- Name: Displays the model name of the GPU, useful for identifying the type of NVIDIA GPU installed.
- Persistence-M: Indicates if the persistence mode is enabled or disabled. When enabled, it helps maintain consistent GPU performance by keeping the GPU initialized even when not in use.
- Bus-Id: Shows the unique identifier of the GPU on the system bus, useful for system-level management and configuration.
- Disp.A: Indicates whether the GPU is connected to a display.
- Volatile Uncorr. ECC: Displays Error Correction Code (ECC) memory information. ECC memory can detect and correct some types of memory errors, which is crucial for computational accuracy.
- Fan: Shows the fan speed as a percentage of its maximum speed. This is important for cooling and operational efficiency.
- Temp: Indicates the current temperature of the GPU.
- Perf: Represents the current performance state of the GPU, usually ranging from P0 (maximum performance) to P8 (minimum performance).
- Pwr:Usage/Cap: Displays the current power usage and the power limit (cap) for the GPU.
- Memory-Usage: Shows the amount of GPU memory currently in use and the total GPU memory available.
- GPU-Util: Indicates the percentage of GPU computational cores being used. High utilization means the GPU is being heavily used.
- Compute M.: Shows the current compute mode of the GPU. Different modes determine how the GPU can be used (e.g., for graphics or compute tasks).
- MIG M.: Stands for Multi-Instance GPU Mode. If supported and enabled, this mode allows a single GPU to be split into multiple instances, each with its own resources like memory and cores, for separate tasks or users.
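If you want to pull a few of these fields programmatically rather than parsing the full table, nvidia-smi has a query mode (the field names are listed by nvidia-smi --help-query-gpu). Below is a minimal sketch in Python; it only assumes that nvidia-smi is on the PATH.
#!/usr/bin/python3
# Query selected GPU fields via nvidia-smi instead of parsing the full table.
import subprocess

fields = "name,driver_version,temperature.gpu,utilization.gpu,memory.used,memory.total"
result = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
# Example output: NVIDIA RTX A4000, 535.129.03, 55, 0 %, 1 MiB, 16376 MiB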
CUDA is a bridge for executing parallel computations on the GPU. It runs primarily on the CPU and manages the execution of code on the GPU. The runtime and compiler interpret, compile, and schedule the execution of kernels, which are the portions of code running on the GPU. This enables CUDA to use the CPU for orchestration while leveraging the GPU's parallel processing power.
CUDA is crucial for GPU application development. While containers often bundle CUDA libraries, we'll install it directly on our VM. Ensure compatibility between the CUDA library, hardware, and driver. Download CUDA from NVIDIA's website, choosing the version that matches your system and driver.
Installation Steps
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-ubuntu2204-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3
Post-installation, ensure CUDA's inclusion in your PATH and verify its installation.
Post-Installation Steps
export PATH=/usr/local/cuda-12.3/bin${PATH:+:${PATH}}
nvcc --version
After setting up the GPU, let's run some example codes to witness the power of GPU computing.
We will compare large matrix multiplications on the CPU and the GPU (10,000x10,000 on the CPU, 25,000x25,000 on the GPU), utilizing CUDA cores and tensor cores for the GPU computation.
PyTorch is required to run the matrix multiplication code. Install it using Python3 and pip:
paperspace@ps3nfhuzbif3:~$ which python
paperspace@ps3nfhuzbif3:~$ which python3
/usr/bin/python3
paperspace@ps3nfhuzbif3:~$ python3 --version
Python 3.10.12
paperspace@ps3nfhuzbif3:~$ sudo apt install python3-pip
paperspace@ps3nfhuzbif3:~$ pip install torch torchvision torchaudio
# Add /home/paperspace/.local/bin to $PATH in .bashrc, if using paperspace
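Before running the benchmark, it's worth a quick sanity check that the PyTorch you just installed can actually see the GPU. A minimal check (run the lines in a python3 shell or save them as a script):
#!/usr/bin/python3
# Confirm that PyTorch detects the CUDA driver and the GPU.
import torch

print(torch.__version__)                 # installed PyTorch version
print(torch.cuda.is_available())         # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0)) # e.g. NVIDIA RTX A4000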
Save the following script as mat_mul.py (also available in the Scripts_Artifacts folder) and run it using Python 3.
Matrix Multiplication Script
#!/usr/bin/python3
import torch
import time
#Size of the matrices. You need to adjust the sizes based on your GPU.
#The arithmetic may not be this simple in reality (other processes running, reserved memory, etc.),
#but the matrix sizes give an idea of how much GPU RAM you need,
#e.g. 2 input and 1 output matrices, with each element an FP32 or FP16 value.
N_cpu = 10000
N_cudacore = 25000
N_tensorcore = 25000
#Computation on CPU
#Create random matrices. These are 32-bit floating point
A = torch.rand(N_cpu, N_cpu)
B = torch.rand(N_cpu, N_cpu)
#Perform matrix multiplication on CPU
start_time = time.time()
result_cpu = torch.matmul(A, B)
time_cpu = time.time() - start_time
print(f"Matrices dimension: {N_cpu}, Time taken on CPU: {time_cpu} seconds")
#This is executed on CUDA cores on the GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    #torch.cuda.empty_cache()
    A_gpu = torch.rand(N_cudacore, N_cudacore, device=device)
    B_gpu = torch.rand(N_cudacore, N_cudacore, device=device)
    # Perform matrix multiplication on GPU
    start_time = time.time()
    result_gpu = torch.matmul(A_gpu, B_gpu)
    result_cpu_from_gpu = result_gpu.cpu()  # copying the result back forces the GPU work to finish
    time_gpu = time.time() - start_time
    print(f"Matrices dimension: {N_cudacore}, Time taken on CUDA cores: {time_gpu} seconds")
else:
    print("GPU not available.")

#PyTorch uses tensor cores on the GPU for FP16 computation
if torch.cuda.is_available():
    device = torch.device('cuda')
    # Create random matrices in half precision (FP16). This can fail because PyTorch must allocate
    # GPU memory for these matrices, leaving less memory available for the multiplication itself.
    A_fp16 = torch.rand(N_tensorcore, N_tensorcore, device=device, dtype=torch.half)
    B_fp16 = torch.rand(N_tensorcore, N_tensorcore, device=device, dtype=torch.half)
    # Alternative: create random matrices in standard precision (FP32) on the CPU. This can get really slow, as the computation is done on the CPU.
    #A_cpu = torch.rand(N_tensorcore, N_tensorcore, dtype=torch.float32)
    #B_cpu = torch.rand(N_tensorcore, N_tensorcore, dtype=torch.float32)
    # Convert to half precision (FP16) and move to GPU
    #A_fp16 = A_cpu.half().to(device)
    #B_fp16 = B_cpu.half().to(device)
    # Perform matrix multiplication on Tensor Cores
    start_time = time.time()
    result_fp16 = torch.matmul(A_fp16, B_fp16).float()  # Convert the FP16 result to FP32 for comparison
    time_fp16 = time.time() - start_time
    print(f"Matrices dimension: {N_tensorcore}, Time taken on Tensor Cores (FP16): {time_fp16} seconds")
else:
    print("GPU not available.")
When experimenting with matrix multiplication on the GPU, you'll notice a remarkable speedup compared to the CPU, particularly when the matrix dimensions exceed a certain threshold, such as 10,000. This illustrates the GPU's efficiency in handling large-scale parallel computations.
The NVIDIA A4000 GPU comes with 16GB of GPU RAM. When dealing with large matrices, say of size 40,000, the memory requirement exceeds the available GPU RAM. For instance, a 40,000-size matrix needs approximately 17.88 GB of RAM (calculated as 40000 x 40000 x 4 x 3 Bytes / (1024^3)). If you try to process a matrix size that significantly exceeds the GPU's memory capacity (e.g., 50,000), the program will fail due to insufficient memory.
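The estimate above is easy to reproduce. The snippet below is a rough back-of-the-envelope calculation that ignores PyTorch's own overhead and anything else resident on the GPU:
# Two FP32 input matrices plus one FP32 output matrix at N = 40000.
N = 40000
bytes_needed = N * N * 4 * 3            # elements x 4 bytes per FP32 x 3 matrices
print(bytes_needed / 1024**3)           # ~17.88 GiB, above the A4000's 16 GiB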
Interestingly, PyTorch manages to handle a matrix size of 40,000 efficiently, thanks to its intelligent memory management techniques like streaming computation and memory swapping. However, these techniques have their limits.
Efficiency of Tensor Cores
Tensor cores, integral to modern NVIDIA GPUs, are optimized for half-precision (FP16) computations, using half the memory compared to standard precision (FP32). But it's important to note that even though tensor cores are more memory-efficient, doubling the matrix size doesn't directly double the feasible computation size. This is because the number of elements in a matrix increases fourfold when you double its dimensions.
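To see the fourfold effect concretely, here is the same back-of-the-envelope estimate for FP16 (2 bytes per element) at two matrix sizes; again, the numbers are rough and ignore runtime overhead:
# Memory needed for 2 inputs + 1 output in FP16 (2 bytes per element).
def gib(n, bytes_per_element=2):
    return 3 * n * n * bytes_per_element / 1024**3

print(gib(25000))   # ~3.5 GiB
print(gib(50000))   # ~14.0 GiB -- doubling N quadruples the memory, it does not double it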
Tensor cores perform a specialized operation per GPU clock cycle known as fused multiply-add (FMA). This operation involves multiplying two 4x4 FP16 matrices and adding the result to a 4x4 FP32 accumulator matrix. This capability makes tensor cores particularly effective for neural network computations, where such operations are frequent.
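At the PyTorch level you never write the 4x4 FMA yourself, but the shape of the operation, multiply in FP16 and accumulate into FP32, can be written out explicitly. The tiny example below is illustrative only; a 4x4 problem is far too small to exercise the hardware meaningfully.
import torch

# D = A x B + C with FP16 inputs and an FP32 accumulator -- the structure of a tensor-core FMA.
A = torch.rand(4, 4, device="cuda", dtype=torch.half)
B = torch.rand(4, 4, device="cuda", dtype=torch.half)
C = torch.rand(4, 4, device="cuda", dtype=torch.float32)
D = C + (A @ B).float()   # multiply in half precision, accumulate in single precision
print(D.dtype)            # torch.float32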
Execution Results
paperspace@ps3nfhuzbif3:~$ ./mat_mul.py
Matrices dimension: 10000, Time taken on CPU: 2.8191654682159424 seconds
Matrices dimension: 25000, Time taken on CUDA cores: 4.454557657241821 seconds
Matrices dimension: 25000, Time taken on Tensor Cores (FP16): 0.6860084533691406 seconds
paperspace@ps3nfhuzbif3:~$
paperspace@ps3nfhuzbif3:~$ ./mat_mul.py
Matrices dimension: 10000, Time taken on CPU: 2.7513859272003174 seconds
Matrices dimension: 25000, Time taken on CUDA cores: 4.448623895645142 seconds
Matrices dimension: 30000, Time taken on Tensor Cores (FP16): 1.8367786407470703 seconds
paperspace@ps3nfhuzbif3:~$
GPUs are sophisticated systems-on-chips designed for accelerated computing. Unlike CPUs, GPUs consist of thousands of simpler processors, each equipped with its own set of registers, dedicated memory, and cache. They do not run an operating system themselves but are controlled through drivers and tools installed on the host OS, such as Linux or Windows.
User applications, especially those with parallelizable components, can significantly benefit from GPU acceleration. This is evident in tasks like matrix multiplication, where GPUs can perform computations much faster than traditional CPUs. In the realm of software development, this acceleration is achieved through specialized libraries like cuDNN, optimized for neural network operations, which in turn utilize CUDA (Compute Unified Device Architecture).
Neural networks represent a rapidly expanding area of GPU application. From 2013 to 2020, the size of neural networks has grown exponentially, often by as much as 75x annually. Given this growth, many neural network models exceed the capacity of a single GPU. To address this, both software and hardware solutions have been developed. On the software side, tools like distributed TensorFlow, PyTorch Distributed Data Parallel (DDP), and Ray are used. For hardware, technologies like NVLink, which provides high-speed point-to-point links between NVIDIA GPUs, and NVSwitch, which lets many GPUs in a system communicate over NVLink at full bandwidth, are becoming standard in large-scale model training.
Over time, GPUs have seen significant advancements to enhance their efficiency in neural network training and inferencing. These include innovations like 16-bit computation and 8-bit integer-based inferencing. NVIDIA, in particular, has been at the forefront, evolving its GPUs to support these advanced capabilities directly on the chip.
In contrast to conventional CPUs, which are constantly active, GPUs are designed to be idle unless specifically called upon by an application. This idleness poses a challenge, especially in environments with multiple GPUs that might not be in constant use. To address this, NVIDIA has developed features for time-sharing or multi-tenancy, allowing multiple users or applications to share GPU resources effectively, ensuring that these powerful resources are utilized efficiently and economically.
Much like their CPU counterparts, GPUs undergo constant evolution in their micro-architecture. This evolution aims to enhance their computational power and efficiency. Key areas of advancement typically include:
- Increased Compute Capacity: This is achieved by integrating more transistors within the same surface area, allowing for greater processing power without increasing the size of the chip.
- Enhanced Memory Capabilities: Modern GPUs feature high-speed and increased memory capacity, crucial for handling larger datasets and more complex computations, especially prevalent in machine learning and graphics processing.
- Improved Connectivity: Advancements in high-speed buses, like PCIe, enhance the communication speed between the GPU and the host system. Additionally, better interconnect technologies between GPUs (like NVLink and NVSwitch) facilitate efficient data transfer in multi-GPU setups, which is essential for scaling up computational tasks.
- Specialized Computing Units: The introduction and expansion of tensor cores and support for 8-bit inferencing optimize GPUs for specific tasks like neural network computing. These specialized cores are tailored to accelerate the types of calculations most commonly used in modern AI algorithms.
In the context of NVIDIA, the last three micro-architectures – Ampere, Ada Lovelace, and Hopper – represent significant milestones in GPU development.
- Ampere Architecture: The GeForce RTX 3090, RTX A4000, and A100 fall under this architecture.
- Ada Lovelace Architecture: RTX 4090.
- Hopper Architecture: The recently released H100.
For a deeper understanding of these architectures, one can explore their specific features and innovations on resources like NVIDIA's microarchitecture Wikipedia page.
Understanding the specifications of your GPU is crucial for optimizing its performance and aligning it with your computational needs.
For an overview of your GPU's capabilities, you can refer to its product specification sheet. For instance, the NVIDIA RTX A4000's specifications are available in its datasheet, which provides comprehensive information about its architecture, memory, performance capabilities, and more.
An alternative and more hands-on way to understand your GPU's specifications is by querying the device directly using CUDA's deviceQuery
program. This approach gives you real-time, specific information about your GPU's configuration and capabilities.
- Clone the CUDA Samples Repository: Clone the NVIDIA CUDA samples repository, which contains the deviceQuery program.
  git clone https://github.com/NVIDIA/cuda-samples.git
- Navigate to the deviceQuery Directory:
  cd cuda-samples/Samples/1_Utilities/deviceQueryDrv
- Compile the deviceQuery Program: Compile the program using the make command.
  make
- Run deviceQuery: Execute the deviceQuery program to get detailed information about your GPU.
  ./deviceQueryDrv
This will display various details about the GPU, including:
- CUDA driver version.
- CUDA capability (major/minor version number).
- Memory specifications (total global memory, memory clock rate, memory bus width).
- Processor details (number of multiprocessors, CUDA cores per multiprocessor).
- Clock rates, texture dimension sizes, and other technical specifics.
For instance, with the NVIDIA RTX A4000, you'll see information like the total number of CUDA cores, GPU clock rate, and memory specifics.
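If you already have PyTorch installed, you can read a subset of the same information without compiling anything, via torch.cuda.get_device_properties. A small sketch:
#!/usr/bin/python3
# Query a few device properties through PyTorch instead of deviceQuery.
import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                            # e.g. NVIDIA RTX A4000
print(f"compute capability {props.major}.{props.minor}")
print(f"{props.total_memory / 1024**3:.1f} GiB global memory")
print(f"{props.multi_processor_count} SMs")  # 48 on the A4000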
Understanding the computational power of a GPU like the NVIDIA A4000 requires delving into its core specifications and performance metrics. Let's explore how to calculate the theoretical Floating Point Operations Per Second (FLOPS) for the A4000, a key indicator of its performance capabilities.
The NVIDIA A4000 features 48 Streaming Multiprocessors (SMs), each with 128 CUDA cores, amounting to a total of 6144 CUDA cores. These cores can be thought of as simple processors, similar to Arithmetic Logic Units (ALUs) in CPUs.
- GPU Clock Rate: The A4000 has a maximum clock rate of 1560 MHz.
- FLOPS per CUDA Core: Each CUDA core is capable of performing 2 FLOPS (floating-point operations) per clock cycle.
Thus, the maximum theoretical FLOPS for the A4000 can be calculated as:
Total CUDA Cores x Clock Rate x FLOPS per Core
= 6144 Cores x 1.56 GHz x 2
= 19169 GigaFLOPS
= 19.2 TeraFLOPS
This gives us the maximum single-precision performance of the CUDA cores. For the specifications, you can refer to the A4000 Datasheet.
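The same arithmetic in a few lines of Python, using the core count and boost clock from above:
# Theoretical FP32 throughput of the A4000's CUDA cores.
cuda_cores = 48 * 128            # 48 SMs x 128 CUDA cores per SM = 6144
clock_ghz = 1.56                 # boost clock
flops_per_core_per_clock = 2     # one fused multiply-add counts as 2 FLOPS
print(cuda_cores * clock_ghz * flops_per_core_per_clock / 1000)   # ~19.2 TFLOPS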
Apart from CUDA cores, the A4000's Ampere architecture also includes tensor cores, which are specialized for certain types of computations like those used in neural networks.
We can take another approach to arrive at the benchmark numbers in the datasheet for both CUDA cores and tensor cores. From figure 2 of NVIDIA's GPU Architecture Fundamentals documentation, we can use the benchmark numbers per streaming multiprocessor (SM).
- Performance per SM: According to NVIDIA's documentation, each SM in the Ampere architecture can achieve:
  - 256 FP16 FLOPS per clock with CUDA cores.
  - 2048 FP16 FLOPS per clock with tensor cores.
Therefore, the overall performance in TFLOPS for both CUDA and Tensor cores can be calculated as follows:
- CUDA Cores Performance: (256 FP16 FLOPS x 48 SMs x 1.56 GHz) / 1000 = 19.2 TFLOPS
- Tensor Cores Performance: (2048 FP16 FLOPS x 48 SMs x 1.56 GHz) / 1000 = 153.3 TFLOPS
These calculations align with the performance benchmarks provided in the A4000's datasheet and offer insights into how the GPU's FLOPS are estimated.
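As a cross-check, the per-SM numbers scale out to the whole chip like this:
# Per-SM FP16 FLOPS per clock, scaled by SM count and clock rate.
sms = 48
clock_ghz = 1.56
print(256 * sms * clock_ghz / 1000)    # CUDA cores:   ~19.2 TFLOPS FP16
print(2048 * sms * clock_ghz / 1000)   # Tensor cores: ~153.3 TFLOPS FP16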
While a typical CPU comprises several cores, each capable of handling multiple threads, GPUs are designed with a different objective in mind. GPUs are optimized for SIMD (Single Instruction, Multiple Data) operations, enabling them to execute thousands of parallel tasks efficiently. This capability is central to their architecture and the CUDA ecosystem.
A GPU is organized into Streaming Multiprocessors (SMs), with each SM containing a number of CUDA cores. For instance, the NVIDIA A4000 GPU has 48 SMs, each with 128 CUDA cores. These CUDA cores are analogous to Arithmetic Logic Units (ALUs) in CPUs and are the primary units of computation on the GPU.
- Shared Memory and L2 Cache: Each SM includes shared memory, which is accessible to all CUDA cores within that SM. Additionally, the entire GPU board features an L2 cache and global GPU memory, facilitating data storage and retrieval for ongoing computations.
An interesting aspect of GPU architecture is the variable allocation of registers to threads. SMs possess a register file, from which registers are dynamically allocated to threads based on their computational requirements. This flexibility allows for the effective management of memory resources during different tasks.
Tensor cores, another critical component of modern GPUs, are specialized for certain types of computations such as those found in neural networks and machine learning models. For example, in the NVIDIA H100 GPU, tensor cores are optimized for efficiently executing operations essential for training and inferencing of transformer models (like Large Language Models).
- Efficient Task Switching: When a thread executing on a CUDA core requires a matrix multiply-add (MMA) operation, it can be seamlessly transferred to a tensor core. The tensor core can perform the entire MMA operation in one cycle, with the thread's register state preserved in the register file, thus eliminating the need for data movement.
Unlike CPUs, GPUs contain thousands of cores to handle the massively parallel nature of their tasks. Consequently, the unit of scheduling in GPUs is a 'warp', typically consisting of 32 threads. This approach ensures that large numbers of threads can be managed and executed concurrently, which is crucial given the parallel processing nature of GPU tasks.
- Thread vs. Core: It's important to distinguish between a thread and a core. A thread is a sequence of programmed instructions, whereas a core is the physical unit executing these instructions. The GPU scheduler efficiently manages the concurrent execution of multiple threads across the available cores, accommodating the varying demands of different computational tasks.
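To get a feel for the scale, consider the 25,000 x 25,000 multiplication from earlier under the simplest possible mapping of one thread per output element. Real kernels tile the work across threads differently, so this is only an order-of-magnitude sketch:
# Warps needed for a 25000 x 25000 matmul with one thread per output element.
N = 25000
warp_size = 32                      # threads per warp on NVIDIA GPUs
threads = N * N                     # 625,000,000 output elements
print(threads // warp_size)         # ~19.5 million warps for the scheduler to juggle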
CUDA (Compute Unified Device Architecture) acts as the pivotal controller for GPU-based computing. It fulfills several critical functions:
- Low-Level API Provisioning: CUDA offers a set of low-level APIs for developers and frameworks, facilitating direct interaction with the GPU's hardware capabilities.
- Scheduling at the SM Level: It handles the scheduling of computational tasks (or kernels) across the various Streaming Multiprocessors (SMs) within the GPU.
- GPU Management and Maintenance: CUDA is responsible for the overall management and maintenance of the GPU, ensuring efficient operation and resource allocation.
From a software development perspective, the primary focus is on how programs execute on the GPU:
- Kernel Definition: Developers use high-level frameworks like PyTorch or TensorFlow to define kernels, which are functions intended to be executed on the GPU. These kernels are written using abstractions that allow developers to focus on the computational aspects without needing to manage low-level hardware interactions.
- Compilation and Translation: The framework's compiler translates this high-level kernel code into lower-level CUDA code, tailoring it to the GPU's architecture and instruction set. This process includes determining the optimal organization of blocks and threads based on the computational workload.
- Data Transfer: To facilitate GPU computations, the framework manages the transfer of necessary data from the CPU's memory to the GPU's memory.
- Kernel Launching: The framework then launches the compiled kernel using CUDA API calls, specifying parameters such as grid size, block size, and the amount of shared memory required for the task (see the sketch after this list).
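At the framework level, the data transfer and kernel launch steps are just a few lines of PyTorch: the .to(device) calls are the host-to-GPU data transfer, and the matmul call is what ultimately becomes a CUDA kernel launch. A minimal sketch (the 4096 matrix size is just an example):
#!/usr/bin/python3
# Explicit data transfer followed by a kernel launch through PyTorch.
import torch, time

device = torch.device('cuda')
A = torch.rand(4096, 4096)          # created in host (CPU) memory
B = torch.rand(4096, 4096)

A_gpu = A.to(device)                # data transfer: host memory -> GPU memory
B_gpu = B.to(device)

start = time.time()
C_gpu = torch.matmul(A_gpu, B_gpu)  # the framework launches a CUDA kernel; the call returns immediately
torch.cuda.synchronize()            # kernel launches are asynchronous; wait for the GPU to finish
print(f"{time.time() - start:.4f} seconds")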
Once the kernel is launched, CUDA takes over on the GPU side:
- Block Assignment: The CUDA scheduler assigns computational blocks to the available SMs. This decision is based on factors like the current workload on each SM, the availability of resources (such as CUDA cores, tensor cores, and shared memory), and the locality of the required data.
- Warp Scheduling: Within each SM, a warp scheduler manages the execution of threads. Warps, which are groups of threads, are assigned to available execution units (CUDA cores or tensor cores). The scheduler is responsible for handling context switching between threads and ensuring that each instruction is executed correctly.
- Tensor Core Utilization: For operations suitable for tensor cores (like mixed-precision computations), the warp scheduler identifies and assigns eligible threads to these specialized cores. This includes managing the efficient transfer of data between CUDA cores and tensor cores, optimizing the overall computation process (see the mixed-precision sketch below).
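From PyTorch you don't address tensor cores directly; you give the framework permission to use reduced precision and it dispatches tensor-core kernels where they apply. A minimal sketch using automatic mixed precision (torch.autocast); whether tensor cores are actually used depends on the GPU and the kernel the backend selects, and the 8192 size is only illustrative:
#!/usr/bin/python3
# Let PyTorch run eligible operations in FP16 so the backend can use tensor cores.
import torch

device = torch.device('cuda')
A = torch.rand(8192, 8192, device=device)   # FP32 inputs
B = torch.rand(8192, 8192, device=device)

with torch.autocast(device_type='cuda', dtype=torch.float16):
    C = torch.matmul(A, B)                  # run in FP16 under autocast

print(C.dtype)                              # torch.float16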
In this chapter, we covered the fundamental aspects of GPUs, from their architecture and parallel processing capabilities to the role of CUDA in managing GPU tasks. Through the matrix multiplication example, we've seen the GPU's power in handling parallelizable tasks. We also reviewed the evolution of GPU technology and its impact on computing efficiency. This foundation sets the stage for understanding and leveraging GPUs in Kubernetes.