Better NVIDIA Pascal GPU Support #827

shintaro-iwasaki · 2022-10-31T17:06:30Z

This PR clarifies which features are supported on P100 via its tests, though Pascal is not officially and fully supported by Triton.

What this PR does

Skip unsupported tests on P100.
- Atomic RMW
- tl.dot() (perhaps not all patterns, but basically most tl.dot() tests do not work on P100).
Add an explicit error if shared memory size >= 64K on P100.
- Otherwise it causes Invalid CUDA argument error at cuLaunchKernel(), but this error is not very straightforward to understand. Instead of this generic CUDA argument error, this PR makes Triton show an error during codegen when sm < 70. This check happens in C/C++ so won't add an overhead in Triton's Python runtime.
- 3 tests (see below) are currently failing, but these are not marked as skipped because any codegen update in the future can change the kernel size of the other tests.
This change won't affect Triton-MLIR. Hopefully Triton-MLIR's generic tl.dot() implementation would support P100.

Importantly, Triton passed all the other tests on P100. Though this support is not official, it is great for, for example, PyTorch's TorchDynamo/Inductor, which can use Triton (without tl.dot()) for its backend (pytorch/pytorch#93528).

Results on P100 (Google Cloud)

$ pytest test/unit
...
================================================================================== short test summary info ==================================================================================
FAILED test/unit/language/test_core.py::test_reduce2d[argmin-float32-shape99-1] - RuntimeError: Device does not support shared memory of 65536bytes
FAILED test/unit/language/test_core.py::test_reduce2d[argmax-float32-shape113-1] - RuntimeError: Device does not support shared memory of 65536bytes
FAILED test/unit/language/test_core.py::test_permute[float32-shape5-perm5] - RuntimeError: Device does not support shared memory of 67584bytes
================================================================== 3 failed, 3824 passed, 952 skipped in 470.90s (0:07:50) ==================================================================

Environment Details (collapsed)

VM details (Google Cloud)

https://cloud.google.com/

# You need a paid account (free trial does not cover GPUs)
Google Cloud -> New Project -> Compute-Engine -> VM Instance
Machine:
GPU: NVIDIA Tesla P100 x 1
CPU: 2 vCPUs, 7.5GB memory
Boot disk:
  OS: Ubuntu 18.04 LTS
  Disk: 40GB (cannot build Triton on the default 10GB disk)
- When I tried, about $1.2 per hour.
- US instances were full when I tried.  I used Asia or Australia.
- Needed a paid account (GPU is not covered by free trial)
- Needed quota request for any GPU instance (by default, no GPU instance is allowed).  Needed to wait an hour for approval

Reproducer

## 1. Install CUDA and a driver
# Update the apt key (https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/)
sudo apt-key del 7fa2af80
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-keyring_1.0-1_all.deb
# Download CUDA as instructed
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
# Are you using P100?
nvidia-smi | grep "Tesla P100"

## 2. Setup the build environment
sudo apt update
sudo apt install -y build-essential wget git libz-dev
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh -b -p $(pwd)/anaconda3
eval "$($(pwd)/anaconda3/bin/conda shell.bash hook)"
conda create -y --name triton_base
conda activate triton_base
conda install -y cmake setuptools

## 3. Build Triton
git clone https://github.com/openai/triton.git
cd triton/python
pip3 install -e '.[tests]'

## 4. Test
pytest test/unit

Environment

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    25W / 250W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

shintaro-iwasaki · 2022-11-02T17:59:59Z

Would someone merge this PR if it looks okay? I don't have write access.

This PR clarifies which features are supported on P100 via its tests, though Pascal is not officially and fully supported by Triton. ## What this PR does - Skip unsupported tests on P100. - Atomic RMW - `tl.dot()` (perhaps not all patterns, but basically most `tl.dot()` tests do not work on P100). - Add an explicit error if shared memory size >= 64K on P100. - Otherwise it causes `Invalid CUDA argument` error at `cuLaunchKernel()`, but this error is not very straightforward to understand. Instead of this generic CUDA argument error, this PR makes Triton show an error during codegen when `sm < 70`. This check happens in C/C++ so won't add an overhead in Triton's Python runtime. - 3 tests (see below) are currently failing, but these are not marked as skipped because any codegen update in the future can change the kernel size of the other tests. - This change won't affect Triton-MLIR. Hopefully Triton-MLIR's generic `tl.dot()` implementation would support P100. Importantly, Triton passed all the other tests on P100. Though this support is not official, it is great for, for example, PyTorch's TorchDynamo/Inductor, which can use Triton (without `tl.dot()`) for its backend (https://github.com/pytorch/torchdynamo/issues/1591). ### Results on P100 (Google Cloud) ```sh $ pytest test/unit ... ================================================================================== short test summary info ================================================================================== FAILED test/unit/language/test_core.py::test_reduce2d[argmin-float32-shape99-1] - RuntimeError: Device does not support shared memory of 65536bytes FAILED test/unit/language/test_core.py::test_reduce2d[argmax-float32-shape113-1] - RuntimeError: Device does not support shared memory of 65536bytes FAILED test/unit/language/test_core.py::test_permute[float32-shape5-perm5] - RuntimeError: Device does not support shared memory of 67584bytes ================================================================== 3 failed, 3824 passed, 952 skipped in 470.90s (0:07:50) ================================================================== ``` <details><summary> <b>Environment Details (collapsed)</b></summary> <p> ### VM details (Google Cloud) https://cloud.google.com/ ``` # You need a paid account (free trial does not cover GPUs) Google Cloud -> New Project -> Compute-Engine -> VM Instance Machine: GPU: NVIDIA Tesla P100 x 1 CPU: 2 vCPUs, 7.5GB memory Boot disk: OS: Ubuntu 18.04 LTS Disk: 40GB (cannot build Triton on the default 10GB disk) - When I tried, about $1.2 per hour. - US instances were full when I tried. I used Asia or Australia. - Needed a paid account (GPU is not covered by free trial) - Needed quota request for any GPU instance (by default, no GPU instance is allowed). Needed to wait an hour for approval ``` ### Reproducer ```sh ## 1. Install CUDA and a driver # Update the apt key (https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/) sudo apt-key del 7fa2af80 wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-keyring_1.0-1_all.deb # Download CUDA as instructed wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600 sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /" sudo apt-get update sudo apt-get -y install cuda # Are you using P100? nvidia-smi | grep "Tesla P100" ## 2. Setup the build environment sudo apt update sudo apt install -y build-essential wget git libz-dev wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh bash Anaconda3-2022.05-Linux-x86_64.sh -b -p $(pwd)/anaconda3 eval "$($(pwd)/anaconda3/bin/conda shell.bash hook)" conda create -y --name triton_base conda activate triton_base conda install -y cmake setuptools ## 3. Build Triton git clone https://github.com/openai/triton.git cd triton/python pip3 install -e '.[tests]' ## 4. Test pytest test/unit ``` ### Environment ```sh $ nvidia-smi +-----------------------------------------------------------------------------+ | NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla P100-PCIE... On | 00000000:00:04.0 Off | 0 | | N/A 36C P0 25W / 250W | 0MiB / 16384MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ ``` </p></details>

Resolves triton-lang#827 Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>

shintaro-iwasaki mentioned this pull request Oct 31, 2022

[TorchInductor] Add support for Pascal GPUs (P100, GTX 1080, etc) pytorch/pytorch#93528

Closed

shintaro-iwasaki force-pushed the siwasaki/pr/test_p100 branch from 5080836 to 9829792 Compare October 31, 2022 18:48

ptillet approved these changes Nov 1, 2022

View reviewed changes

Jokeren mentioned this pull request Nov 1, 2022

[Triton-MLIR][Frontend] Interface fixes for libdevice #829

Merged

shintaro-iwasaki added 3 commits November 2, 2022 09:43

[TESTS] Skip test_atomic_rmw on P100

9ece25b

[TESTS] Skip tests with tl.dot() on P100

923ea7e

[CODEGEN] Show an error if shared memory size is too large on P100

5968702

shintaro-iwasaki force-pushed the siwasaki/pr/test_p100 branch from 296903a to 5968702 Compare November 2, 2022 16:44

Jokeren merged commit 77bc518 into triton-lang:master Nov 3, 2022

ptillet added a commit that referenced this pull request Dec 20, 2022

Better NVIDIA Pascal GPU Support (#827)

d43dfea

ZzEeKkAa pushed a commit to ZzEeKkAa/triton that referenced this pull request Aug 5, 2024

Support shuffle for bool values. (triton-lang#828)

5a91b86

Resolves triton-lang#827 Signed-off-by: Ilya Enkovich <ilya.enkovich@intel.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better NVIDIA Pascal GPU Support #827

Better NVIDIA Pascal GPU Support #827

shintaro-iwasaki commented Oct 31, 2022 •

edited

Loading

shintaro-iwasaki commented Nov 2, 2022

Better NVIDIA Pascal GPU Support #827

Better NVIDIA Pascal GPU Support #827

Conversation

shintaro-iwasaki commented Oct 31, 2022 • edited Loading

What this PR does

Results on P100 (Google Cloud)

VM details (Google Cloud)

Reproducer

Environment

shintaro-iwasaki commented Nov 2, 2022

shintaro-iwasaki commented Oct 31, 2022 •

edited

Loading