SAXPY finished
MathiasMagnus committed Aug 23, 2023
1 parent 3e02a2e commit e08b984
Showing 1 changed file with 75 additions and 17 deletions.
92 changes: 75 additions & 17 deletions docs/tutorials/saxpy.md
@@ -37,11 +37,11 @@ cause and design a fix.
## Your First Piece of HIP Code

First, let's take the "Hello, World!" of GPGPU: SAXPY. The name comes from
math, a vector equation at that: `a * x + y = z` where `a ∈ ℝ` is a scalar and
`x,y,z ∈ 𝕍` are vector quantities of some large dimensionality. (Allow us to
omit defining what vectors and vector spaces are in the mathematical sense.)
From a practical perspective we can compute this using a single `for` loop over
3 arrays.
math, a vector equation at that: {math}`a\cdot x+y=z` where
{math}`a\in\mathbb{R}` is a scalar and {math}`x,y,z\in\mathbb{V}` are vector
quantities of some large dimensionality. (Our vector space is defined over the
set of reals.) From a practical perspective we can compute this using a single
`for` loop over 3 arrays.

```c++
for (int i = 0 ; i < N ; ++i)
@@ -50,8 +50,8 @@ for (int i = 0 ; i < N ; ++i)

In linear algebra libraries, such as BLAS (Basic Linear Algebra Subprograms) this
operation is defined as AXPY "A times X Plus Y". The "S" comes from
single-precision, meaning that every element of our array are `float`s (our
vector space was defined over the set of reals).
single-precision, meaning that every element of our array are `float`s. (We
choose IEEE 754: binary32 arithmetic as the representation of our algebra.)

To get quickly off the ground, we'll take an off-the-shelf piece of code, the set
of [HIP samples from GitHub](https://github.com/amd/rocm-examples/). Assuming
@@ -90,7 +90,7 @@ device is `0`, which is equivalent to calling `hipSetDevice(0)`.

Once our data has been dispatched, we can launch our calculation on the device.

```hip
```cu
__global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
{
...
@@ -122,9 +122,9 @@ First let's discuss the signature of the offloaded function:
constructors. Where would that logic execute? On the host? On the device?)_
Pointer arguments are pointers to device memory, one typically backed by
VRAM.
- We said that we'll be computing `a * x + y = z`, however we only pass two
pointers to the function. We'll be canonically reusing one of the inputs as
outputs.
- We said that we'll be computing {math}`a\cdot x+y=z`, however we only pass
two pointers to the function. We'll be canonically reusing one of the inputs
as outputs.

There's quite a lot to unpack already. How is this function launched from the
host? Using a language extension, the so-called triple chevron syntax. Inside
@@ -138,7 +138,7 @@ the angle brackets we can provide the following:
Following the triple chevron is ordinary function argument passing. Now let's
take a look at how the kernel is implemented.

```hip
```cu
__global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
{
// Compute the current thread's index in the grid.
@@ -160,7 +160,7 @@ __global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const

Retrieval of the result from the device is done much like its dispatch:

```hip
```cu
HIP_CHECK(hipMemcpy(y.data(), d_y, size_bytes, hipMemcpyDeviceToHost));
```

@@ -671,6 +671,14 @@ major.minor: 8.6
major.minor: 7.0
```

```{tip}
Next to the `nvcc` executable is another tool called `__nvcc_device_query`
which simply prints the SM Architecture numbers to standard out as a comma
separated list of numbers. The naming of this utility suggests it's not a user
facing executable but is used by `nvcc` to determine what devices are in the
system at hand.
```

:::
:::{tab-item} Windows & AMD
:sync: windows-amd
@@ -712,15 +720,23 @@ major.minor: 8.6
major.minor: 7.0
```

```{tip}
Next to the `nvcc` executable is another tool called `__nvcc_device_query.exe`
which simply prints the SM Architecture numbers to standard out as a comma
separated list of numbers. The naming of this utility suggests it's not a user
facing executable but is used by `nvcc` to determine what devices are in the
system at hand.
```

:::
::::

Now that we know which versions of graphics IP our devices use, we can
recompile our program with said parameters.

::::{tab-set}
:::{tab-item} Linux
:sync: linux
:::{tab-item} Linux & AMD
:sync: linux-amd

```bash
amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --offload-arch=gfx906:sramecc+:xnack-
@@ -735,8 +751,29 @@ First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
```

:::
:::{tab-item} Windows
:sync: windows
:::{tab-item} Linux & NVIDIA
:sync: linux-nvidia

```bash
nvcc ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -I /opt/rocm/include -O2 -x cu -arch=sm_70,sm_86
```

```{tip}
If you want to portably target the development machine which is compiling, you
may specify `-arch=native` instead.
```

Now our sample will surely run.

```none
./saxpy
Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
```

:::
:::{tab-item} Windows & AMD
:sync: windows-amd

```pwsh
clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2 --offload-arch=gfx1032 --offload-arch=gfx1035
@@ -750,5 +787,26 @@ Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
```

:::
:::{tab-item} Windows & NVIDIA
:sync: windows-nvidia

```pwsh
nvcc .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I ${env:HIP_PATH}include -I .\Common -O2 -x cu -arch=sm_70,sm_86
```

```{tip}
If you want to portably target the development machine which is compiling, you
may specify `-arch=native` instead.
```

Now our sample will surely run.

```none
.\saxpy.exe
Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
```

:::
::::
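
For readers who want to sanity-check the printed results without a GPU, the single `for` loop described in the tutorial text can be sketched on the host as below. This is only an illustrative reference, not code from the sample: the names `saxpy_reference` and `saxpy_sample` are hypothetical, and the input values (`a = 2`, `x = 1, 2, 3, ...`, `y` filled with `1`) are an assumption chosen because they reproduce the "First 10 elements" line shown in the output above. Like the kernel signature in the diff, it reuses `y` as the output.

```c++
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side reference (hypothetical helper): y[i] = a * x[i] + y[i].
// The result overwrites y, mirroring the two-pointer kernel signature.
void saxpy_reference(const float a, const std::vector<float>& x, std::vector<float>& y)
{
    const std::size_t size = std::min(x.size(), y.size());
    for (std::size_t i = 0; i < size; ++i)
        y[i] = a * x[i] + y[i];
}

// Builds assumed inputs (not taken from the sample's source) and runs the
// reference loop over them: a = 2, x = 1, 2, 3, ..., y = 1, 1, 1, ...
std::vector<float> saxpy_sample(const std::size_t size)
{
    const float a = 2.f;
    std::vector<float> x(size);
    std::iota(x.begin(), x.end(), 1.f);
    std::vector<float> y(size, 1.f);
    saxpy_reference(a, x, y);
    return y;
}
```

Calling `saxpy_sample(10)` and comparing the returned values against the first ten elements copied back from the device is one way to confirm that a recompiled binary computes what the tutorial expects.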
