Build zoom backend #7

Closed
wants to merge 77 commits into from
77 commits
14e709f
initial impl w/ empty tensor
stephen-youn Jun 13, 2024
ba0e00e
add setup.py and hook to device unit tests
stephen-youn Jun 13, 2024
6450ba4
update some impls
stephen-youn Jun 14, 2024
80a2762
caching allocator, rng, guard, events
123epsilon Jun 21, 2024
08d7fdf
linker bugfixes
123epsilon Jun 21, 2024
6b5f78b
register hooks properly
123epsilon Jun 24, 2024
5924e75
add copy op
123epsilon Jun 26, 2024
acc00ef
initial JIT infra for HIP kernels
123epsilon Jun 27, 2024
88710f3
renaming gpu headers
123epsilon Jun 28, 2024
e8a9444
impl abs kernel
123epsilon Jun 28, 2024
fa57b0a
add more unary kernels
123epsilon Jul 1, 2024
a7bd747
fill kernel
123epsilon Jul 2, 2024
ada6861
impl distribution kernels
123epsilon Jul 3, 2024
13dcc0e
cleanup distribution files + add test helper script
123epsilon Jul 3, 2024
fcbe9da
add storage and view manipulation kernels
123epsilon Jul 5, 2024
2e29a9c
init blas kernels
123epsilon Jul 10, 2024
cea0cc8
blas kernel checks
123epsilon Jul 10, 2024
2ada398
renaming blas kernels
123epsilon Jul 10, 2024
a14e444
add more view kernels
123epsilon Jul 11, 2024
32016cd
update test script to track test failures
123epsilon Jul 11, 2024
c48f947
check in hipify shotgun
123epsilon Jul 11, 2024
0a9c10a
add various indexing kernels
123epsilon Jul 11, 2024
555659d
more indexing kernels
123epsilon Jul 12, 2024
754e629
add masked select and scatter kernels
123epsilon Jul 12, 2024
cf7af68
add equals kernels
123epsilon Jul 12, 2024
36c251e
move zoom backend in-tree
123epsilon Jul 13, 2024
c1db8c0
fix some build errors
123epsilon Jul 15, 2024
5b1ac21
fix more build issues
123epsilon Jul 16, 2024
2a51c4b
more in-tree setup
123epsilon Jul 16, 2024
b2c8082
implement codegen for structured/ufunc kernels for zoom backend
123epsilon Jul 18, 2024
2516cf9
impl structured generation for existing ops
123epsilon Jul 19, 2024
fe7b061
add reduction op kernels
123epsilon Jul 22, 2024
0a4b238
add binary op kernels
123epsilon Jul 22, 2024
6fdd791
add more shape kernels + randperm kernels
123epsilon Jul 22, 2024
b4669b8
add unpooling and nonzero kernels
123epsilon Jul 22, 2024
625fd00
add scan and elementwise kernels
123epsilon Jul 22, 2024
d025428
fix some dispatch stubs
123epsilon Jul 23, 2024
2bdb970
fix stub generation for PU1
123epsilon Jul 23, 2024
4dca1e6
add compare kernels
123epsilon Jul 23, 2024
00b1721
more unary + multinomial kernels
123epsilon Jul 23, 2024
b0e30f7
add distance and foreach kernels
123epsilon Jul 23, 2024
afb1b36
add activation kernels
123epsilon Jul 23, 2024
12156d8
add adaptive pool and AMP kernels
123epsilon Jul 23, 2024
d7f1032
add some polynomial kernels
123epsilon Jul 24, 2024
a48a87d
complex, conv2d kernels
123epsilon Jul 24, 2024
2fd3520
add depthwise conv, maxpool, dropout, embedding kernels
123epsilon Jul 24, 2024
5344965
fused optim, grid sample, and hermite kernels
123epsilon Jul 26, 2024
4a10624
edit compiler message for fused sgd
123epsilon Jul 26, 2024
14fa31c
loss kernels, more polynomials, cutlass header stubs
123epsilon Jul 26, 2024
59362f3
norm, reflection, rrelu kernels
123epsilon Jul 26, 2024
c34d984
add zoom to some non-deterministic tests
123epsilon Jul 29, 2024
d46ad44
add more operators - sorting, comparison, shape, and unary
123epsilon Jul 29, 2024
2305755
add linalg operators with hipsolver and hipblas, unique kernels, upsa…
123epsilon Jul 30, 2024
6c56d5f
add python module infra
123epsilon Jul 31, 2024
c84e248
more python infra, get all applicable tests running
123epsilon Aug 1, 2024
2e45079
update setup notes, edit version guard in fused sgd kernel
123epsilon Aug 1, 2024
3ce3bfc
clean copy kernel, reroute to deterministic index_copy, disable gesvd…
123epsilon Aug 2, 2024
90e2130
add some zoom rerouting logic for nondeterministic operators and tests
123epsilon Aug 3, 2024
ea1456f
more misc test fixes, rerouting nondeterministic ops through decompos…
123epsilon Aug 6, 2024
d25a78c
add dispatch for autocast kernels
123epsilon Aug 6, 2024
98a3b20
fix indexing logic with scalar index_put, add is_zoom to torch Tensor
123epsilon Aug 6, 2024
14fd1d5
catch all errors in test script
123epsilon Aug 6, 2024
3ad5551
add some determinism checks
123epsilon Aug 7, 2024
be4531c
various op dispatch fixes
123epsilon Aug 8, 2024
19f8223
add sparse operators - device tests at 100%
123epsilon Sep 5, 2024
1e1f2af
hiprtc minimal test (fails due to SIGBUS)
123epsilon Nov 20, 2024
1560c0e
dot+vdot HIP jit kernels + some misc native_functions mappings for op…
123epsilon Nov 27, 2024
0322861
fix some failing op tests
123epsilon Nov 29, 2024
afec805
expose current_stream to python
123epsilon Dec 8, 2024
768efe2
fix stream init python side
123epsilon Dec 10, 2024
080e7a3
add instructions for running triton llama, separate build docs
123epsilon Dec 10, 2024
7e0fa56
add build
makslevental Dec 16, 2024
99a54d4
add GH workflow
makslevental Dec 17, 2024
9e08eb9
compile without hipblas etc
makslevental Dec 17, 2024
7f95e5e
use rocm docker
makslevental Dec 17, 2024
1d36920
Trigger Build
makslevental Dec 17, 2024
503b87b
audit wheel
makslevental Dec 17, 2024
124 changes: 124 additions & 0 deletions .github/workflows/build_zoom_backend.yml
@@ -0,0 +1,124 @@
name: "Build PyTorch"

on:
workflow_dispatch:
inputs:
force_debug_with_tmate:
type: boolean
description: 'Run the build with tmate session'
required: false
default: false
debug_with_tmate:
type: boolean
description: 'Run the build with a tmate session ONLY in case of failure'
required: false
default: false
pull_request:
push:
branches:
- main

concurrency:
group: ${{ github.workflow }}-${{ github.event.number || github.sha }}
cancel-in-progress: true

jobs:
build:

strategy:
fail-fast: false
matrix:
include:
- name: "ubuntu-22.04"
runs-on: "mi300"
# container: "rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0"
# runs-on: "nod-ai-shared-cpubuilder-manylinux-x86_64"

runs-on: ${{ matrix.runs-on }}

name: ${{ matrix.name }}

env:
CACHE_DIR: ${{ github.workspace }}/.container-cache
# either the PR number or `branch-N` where N always increments
CACHE_KEY: linux-build-test-cpp-asserts-manylinux-v2-${{ format('{0}-{1}', github.ref_name, github.run_number) }}

defaults:
run:
shell: bash

permissions:
id-token: write
contents: write

container:
image: ${{ matrix.container }}

steps:
- name: "Check out repository"
uses: actions/checkout@v4.2.2
with:
submodules: true

- name: Enable cache
uses: actions/cache/restore@v3
with:
path: ${{ env.CACHE_DIR }}
key: ${{ env.CACHE_KEY }}
restore-keys: linux-build-test-cpp-

- name: "Build PyTorch"
id: build
run: |

export CCACHE_DIR="${{ env.CACHE_DIR }}"
export CMAKE_C_COMPILER_LAUNCHER=ccache
export CMAKE_CXX_COMPILER_LAUNCHER=ccache
export CCACHE_SLOPPINESS=include_file_ctime,include_file_mtime,time_macros

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
./build.sh

- name: "Audit"
id: audit
run: |

sudo apt install patchelf
source venv/bin/activate
pip install auditwheel
auditwheel repair -w dist --plat manylinux_2_39_x86_64 dist/torch*

- name: Save cache
uses: actions/cache/save@v3
if: ${{ !cancelled() }}
with:
path: ${{ env.CACHE_DIR }}
key: ${{ env.CACHE_KEY }}

- name: Upload artifacts
uses: actions/upload-artifact@v4
with:
name: ${{ matrix.name }}_artifact
path: dist
if-no-files-found: warn

- name: Release current commit
uses: ncipollo/release-action@v1.12.0
with:
artifacts: "dist/torch*.whl"
token: "${{ secrets.GITHUB_TOKEN }}"
tag: "latest"
name: "latest"
removeArtifacts: false
allowUpdates: true
replacesArtifacts: true
makeLatest: true

- name: "Setup tmate session"
if: ${{ (failure() && inputs.debug_with_tmate) || inputs.force_debug_with_tmate }}
uses: mxschmitt/action-tmate@v3.18
with:
limit-access-to-actor: true
install-dependencies: ${{ startsWith(matrix.runs-on, 'macos') || startsWith(matrix.runs-on, 'windows') }}
16 changes: 12 additions & 4 deletions BUILD.bazel
@@ -8,8 +8,8 @@ load("@pytorch//tools/rules:cu.bzl", "cu_library")
load("@pytorch//tools/config:defs.bzl", "if_cuda")
load("@pytorch//:aten.bzl", "generate_aten", "intern_build_aten_ops")
load(":build.bzl", "GENERATED_AUTOGRAD_CPP", "GENERATED_AUTOGRAD_PYTHON", "define_targets")
load(":build_variables.bzl", "jit_core_sources", "lazy_tensor_ts_sources", "libtorch_core_sources", "libtorch_cuda_sources", "libtorch_distributed_sources", "libtorch_extra_sources", "libtorch_python_core_sources", "torch_cpp_srcs", "libtorch_python_cuda_sources", "libtorch_python_distributed_sources")
load(":ufunc_defs.bzl", "aten_ufunc_generated_cpu_kernel_sources", "aten_ufunc_generated_cpu_sources", "aten_ufunc_generated_cuda_sources")
load(":build_variables.bzl", "jit_core_sources", "lazy_tensor_ts_sources", "libtorch_core_sources", "libtorch_cuda_sources", "libtorch_distributed_sources", "libtorch_extra_sources", "libtorch_python_core_sources", "torch_cpp_srcs", "libtorch_python_cuda_sources", "libtorch_python_zoom_sources", "libtorch_python_distributed_sources")
load(":ufunc_defs.bzl", "aten_ufunc_generated_cpu_kernel_sources", "aten_ufunc_generated_cpu_sources", "aten_ufunc_generated_cuda_sources", "aten_ufunc_generated_zoom_sources")
load("//:tools/bazel.bzl", "rules")

define_targets(rules = rules)
@@ -104,15 +104,23 @@ generated_cuda_cpp = [
"aten/src/ATen/RegisterSparseCsrCUDA.cpp",
]

generated_zoom_cpp = [
"aten/src/ATen/ZoomFunctions.h",
"aten/src/ATen/ZoomFunctions_inl.h",
"aten/src/ATen/RegisterPrivateUse1.cpp",
]

generate_aten(
name = "generated_aten_cpp",
srcs = aten_generation_srcs,
outs = (
generated_cpu_cpp +
generated_cuda_cpp +
generated_zoom_cpp +
aten_ufunc_generated_cpu_sources("aten/src/ATen/{}") +
aten_ufunc_generated_cpu_kernel_sources("aten/src/ATen/{}") +
aten_ufunc_generated_cuda_sources("aten/src/ATen/{}") + [
aten_ufunc_generated_cuda_sources("aten/src/ATen/{}") +
aten_ufunc_generated_zoom_sources("aten/src/ATen/{}") + [
"aten/src/ATen/Declarations.yaml",
]
),
@@ -888,7 +896,7 @@ cc_library(
name = "torch_python",
srcs = libtorch_python_core_sources
+ if_cuda(libtorch_python_cuda_sources)
+ if_cuda(libtorch_python_distributed_sources)
+ if_cuda(libtorch_python_distributed_sources)
+ GENERATED_AUTOGRAD_PYTHON,
hdrs = glob([
"torch/csrc/generic/*.cpp",
90 changes: 90 additions & 0 deletions BuildingZoom.md
@@ -0,0 +1,90 @@
# Setup Python Env

To start out, we just need to follow the normal procedure to build PyTorch from source. For convenience I've included these steps here:

```bash
conda create -n nod-pytorch python==3.10
conda activate nod-pytorch
conda install cmake ninja
pip install -r requirements.txt
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py develop
```

# CMake

Using the `USE_ZOOM` flag with CMake enables building with HIP for ROCm without requiring any of the "HIPify" scripts. This will include the HIP libraries and populate `torch.version.hip` appropriately. The flag is NOT yet wired into the `setup.py` script, so for now it needs to be added manually via `cmake` or `ccmake`.

You'll need to set the `ROCM_PATH` and `HIP_ROOT_DIR` environment variables appropriately; by default on Linux these are `/opt/rocm/` and `/opt/rocm/hip`, respectively.

```bash
cd build/
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_PATH=/opt/rocm
export HIP_ROOT_DIR=/opt/rocm/hip
cmake -DUSE_ZOOM=ON ..
cmake --build . --target install
```

# Running PyTorch with Zoom

Programs using the zoom backend must be prefaced with this stub until we register a proper dispatch key in PyTorch:

```python
import torch
import torch.zoom
torch.utils.rename_privateuse1_backend('zoom')
torch.utils.generate_methods_for_privateuse1_backend(unsupported_dtype=None)
```
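
Once that stub has run, zoom tensors can be created through the usual factory functions. A minimal sketch, assuming the zoom build is installed and a supported GPU is visible (the ops available depend on which kernels have been implemented):

```python
import torch
import torch.zoom
torch.utils.rename_privateuse1_backend('zoom')
torch.utils.generate_methods_for_privateuse1_backend(unsupported_dtype=None)

# Allocate on the zoom device and run a couple of simple ops there.
x = torch.randn(4, 4, device='zoom')
y = torch.ones_like(x)
z = (x + y).abs()

print(z.device)   # e.g. zoom:0
print(z.cpu())    # copy back to the host to inspect values
```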

# Installing Triton

Since mainline Triton currently treats ROCm as if it's masquerading as `torch.cuda`, we need a custom installation:

```bash
git clone https://github.com/123epsilon/triton.git
cd triton/
git checkout zoom
pip install pybind11
pip install python/
```

# Running Llama 3 with Triton using Liger kernels and HuggingFace

```bash
pip install liger-kernel
```

```python
# pytorch/zoom_extension/examples/ligerllama.py
import torch
from transformers import AutoTokenizer
from liger_kernel.transformers import AutoLigerKernelForCausalLM
from time import perf_counter as pf
torch.utils.rename_privateuse1_backend('zoom')

# Set up the model and tokenizer
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoLigerKernelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="zoom"
)

# Function to generate text
def generate_text(prompt, max_length=30):
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=max_length)
return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
prompt = "Hey, how are you doing today?"
s = pf()
response = generate_text(prompt)
e = pf()
print(f"Prompt: {prompt}")
print(f"Response: {response}")

print(f"{e-s} seconds")
```

1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -203,6 +203,7 @@ option(USE_CPP_CODE_COVERAGE "Compile C/C++ with code coverage flags" OFF)
option(USE_COLORIZE_OUTPUT "Colorize output during compilation" ON)
option(USE_ASAN "Use Address+Undefined Sanitizers" OFF)
option(USE_TSAN "Use Thread Sanitizer" OFF)
option(USE_ZOOM "Use ZOOM HIP Backend" OFF)
option(USE_CUDA "Use CUDA" ON)
cmake_dependent_option(
USE_XPU "Use XPU. Only available on Linux." ON
19 changes: 19 additions & 0 deletions CUDA.md
@@ -0,0 +1,19 @@
# Context
A [Context](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#context) represents all the relevant state required on an accelerator in order to instantiate and perform tasks. A context includes data, variables, conditions, and more, which together define the environment in which the provided tasks are executed. Commands such as launching a kernel on a GPU are executed within a context. Once a context is destroyed, CUDA cleans up all the resources associated with it. Therefore, pointers originating from different contexts reference distinct address spaces (memory locations). Contexts are managed in a stack, and each host (CPU) thread scheduling tasks has its own stack of contexts. Contexts can be exchanged between host threads. For instance, popping `ctx` from HostA and pushing it onto HostB forces operations executed from HostB to run in `ctx`, while HostA operates under the previous context in its stack.

The context utilized for a device by the runtime API is the device's primary context. From the perspective of the runtime API a device and its primary context are synonymous.

# Module
Modules are dynamically loadable packages akin to DLLs or shared libraries. They include symbols, functions, and global variables that users can call. Modules maintain a module scope to avoid namespace collisions with other concurrently loaded modules.

# Hooks
Inheriting from `AcceleratorHooksInterface`, Hook implementations in PyTorch provide a generic interface through which host (CPU) code can query and set properties for the provided accelerators.

# CUDAStream
A stream is a structure that accepts events and commands into a FIFO queue and executes them in order; it can be thought of as a queue or pipeline for scheduling tasks on an accelerator. Spinning up multiple concurrent streams can enable task parallelism, for instance when we have multiple devices. In that case, each stream is uniquely associated with a device, and queueing tasks to a stream will execute them on that device. Really, streams are specific to a context, which is in turn specific to a device. Streams have an associated integer priority; lower values are considered "high priority" by the accelerator's scheduling algorithm.

CUDAStream abstracts the concept of a CUDA stream (`cudaStream_t`); it maintains several pools of streams to reduce the overhead associated with common stream operations such as creation and destruction. Each device maintains three lazily initialized pools of streams: the first pool contains the default stream, the second contains low-priority streams, and the third contains high-priority streams. Despite the fact that each thread in principle has its own "current stream," this stream pool is global across threads, so many host threads can potentially dispatch kernels and synchronize on the same stream. Synchronization can have [different meanings](https://leimao.github.io/blog/CUDA-Default-Stream/) depending on whether we are synchronizing to the legacy stream or via per-thread streams.
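
The same ideas are visible from Python through the public `torch.cuda` stream API. A minimal sketch, assuming a CUDA (or ROCm) build of PyTorch with at least one visible device:

```python
import torch

assert torch.cuda.is_available()

# Streams are handed out from the per-device pools described above;
# priority -1 requests a "high priority" stream (lower value = higher priority).
high_prio = torch.cuda.Stream(priority=-1)
default = torch.cuda.current_stream()

x = torch.randn(1024, 1024, device="cuda")

with torch.cuda.stream(high_prio):
    # Work queued inside this block goes onto `high_prio`, not the default stream.
    y = x @ x

# Make the default stream wait for the matmul before anything queued on it uses `y`.
default.wait_stream(high_prio)
z = y.sum()

torch.cuda.synchronize()  # block the host until all work on the device is done
print(z.item())
```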

# CUDACachingAllocator
https://cs.stackexchange.com/questions/143650/difference-between-caching-and-slab-allocator
https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html
51 changes: 51 additions & 0 deletions ZoomNotes.md
@@ -0,0 +1,51 @@
# Running Device Type Tests
Set up the environment using `env.sh`. You may have to edit its variables for your own checkout; `TORCH_TEST_DEVICES` should point to `zoom_extension/test/pytorch_test_base.py`.

Then you can run `test.sh` to run the PyTorch device test suite. The script produces a few output artifacts: `test.log`, a verbose log of the `unittest` output from the test suite; `zoom_unimplemented_operators.log`, which lists the operators not yet implemented in the zoom backend along with the frequency with which each was called in the test suite; and `zoom_test_errors.log`, a list of test failures (i.e. `AssertionError`s) encountered in the test suite.

The unimplemented operator log should not be considered exhaustive as additional operator failures may occur once the offending operator is implemented. This is just meant to be a tool to drive development.

# HIP Library Dependencies
Since these run on ROCm, each also implies a dependency on its `roc*` equivalent (e.g. hipBLAS requires rocBLAS):

* HIP - runtime, dtypes
* hipBLAS
* hipBLASLt
* hipRand
* hipSparse
* hipFFT
* rccl - TODO: add this in lieu of NCCL functionality
* hipThrust
* hipCub
* hipSolver

# HIPBlasLt

This is temporarily disabled via the macro `DISABLE_HIPBLASLT` in `ZoomContextLight.h`; we can re-enable it by undef'ing that macro. This means that right now `scaledgemm` and `intmm` don't work, but we can implement hipBLAS versions of them and/or just enable hipBLASLt.

# JITerator Notes:
https://dev-discuss.pytorch.org/t/keeping-pytorchs-ops-maintainable-the-jiterator/468


# Zoom JIT
Kernels are run via hiprtc and use a template specifier `scalar_t` which is filled in by `zoom_generate_code`. JIT functions are in `ATen/zoom/jit/jit_utils.*`. Kernels need to be defined with `extern "C"` to prevent name mangling, otherwise we can't retrieve our kernel properly at launch time with `hipModuleGetFunction`. See `ATen/native/zoom/Blas.cpp:dot_hip` for an example implementation.
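
To make the `extern "C"` requirement concrete, here is a hypothetical sketch of what a specialized kernel template could look like, written as a Python string for illustration; the kernel name, placeholder syntax, and substitution step are assumptions — the real templates and codegen live in `ATen/zoom/jit/jit_utils.*` and `zoom_generate_code`.

```python
from string import Template

# Hypothetical template: extern "C" keeps the symbol unmangled so it can later be
# looked up by name (e.g. hipModuleGetFunction) after hiprtc compiles the source.
ABS_KERNEL_TEMPLATE = Template(r"""
extern "C" __global__ void abs_kernel(const ${scalar_t}* in, ${scalar_t}* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = in[i] < 0 ? -in[i] : in[i];
  }
}
""")

# The scalar_t placeholder is filled in per dtype before compilation.
source_for_float = ABS_KERNEL_TEMPLATE.substitute(scalar_t="float")
print(source_for_float)
```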

## Testing Operators on Zoom
See `test/test_ops.py`, `test_numpy_ref` and `test_compare_cpu`.
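
As a rough sketch of how one might drive those against the zoom device type (assuming the device tests are hooked in via `TORCH_TEST_DEVICES` as described above; the exact test selection is an assumption):

```python
# Hypothetical invocation of the reference/consistency tests from Python.
import pytest

pytest.main([
    "test/test_ops.py",
    "-k", "test_numpy_ref or test_compare_cpu",
    "-x",   # stop at the first failure while iterating on a kernel
    "-v",
])
```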

TODO List:

- Add RCCL
- Determine rocBLAS determinism requirements as far as config and versions (necessary to throw determinism errors when appropriate)

Note on error in test suite: `RuntimeError: t.use_count() <= 1`
This error is thrown in the `test_parallel_cow_materialize_error` test in the torch device type tests because
of many parallel references being held on the same tensor. This will only throw in debug mode. I think we can ignore this since
this same error is thrown on the CPU backend in debug mode, and passes in release.

Note on the error in `test_grad_scaling_state_dict`: it occurs in the instance check `isinstance(s1._scale, torch.FloatTensor)`
because, despite their datatypes being equal, the PU1 dispatch key is a mismatch with the CPU dispatch key of the `FloatTensor` class.
These tensor types are deprecated anyway, and the rest of the test works, so we can just ignore it - if we want to, we can add a
`torch.zoom.FloatTensor` (though this is a deprecated design pattern and likely frowned upon). The real fix is to refactor the instance check; see `python_tensor.cpp:Tensor_instancecheck`.

For now, I've added a macro in `Allocator.h` that registers a functor to retrieve the `ZoomCachingAllocator` for us, since we're currently implemented as an external backend (i.e. using the PU1 dispatch key). Once we're in the main repo, we can replace it with the proper logic for retrieving the allocator for the Zoom backend.