- HIP: Heterogenous Interface for portability
- Rocm vs HIP: Rocm is the big open-source stack which encomposes HIP among other things
- Compute Unit ~ executes SIMD on GPU.
- CPU is latency oriented. GPU is throughput oriented (total workload)
- MI300X:
- CUs : 304
- Streaming processors = CU * 64 = 19,456
- Mem : 192 GB
- Host & Device maintain their own memory separately
- hipGetDeviceCount(int* count) returns no. of available "compute devices". - Some AMD GPUs (like those in laptop) are currently not considered "compute devices"
- hipGetDeviceProperties(hipDeviceProp_t* prop, int deviceId)
- hipMalloc(void** ptr, size_t size)
- the ptr is defined on the cpu but the address returned is GPU address
- hipFree(void* ptr)
- the ptr should be pointing to GPU address. Don't use this api to free CPU memory
- hipMemset(void* dst, int value, size_t sizeBytes) : useful for zeroing out device mem
- Error handling
- HIP APIs return hipError_t
- cont char* hipGetErrorString(hipError_t hipError)
- hipError_t hipGetLastError(void) : this resets the error status for the next call.
- hipError_t hipPeekAtLastError(void) : this does Not reset the error status
- __ global __ functions have to return
void
- __ global __ void simplekernel(float *a){ ... }
- However, device functions can return other stuff:
- __ device __ float otherKernel(float *a){ ... }
- blockDim.x <= 1024
-
GPUs --> Many CUs
-
Each CU:
- 4 Vector ALUs (each having 64 lanes i.e. 64 threads). 1 wavefront matches to 1 VALU
- 1 Scalar ALU
- LDS (local data share): a bank of shared memory 64k, 160k
- Threads within CU can use LDS
-
Each Thread Block will run on 1 CU
-
LDS can only be accessed by threads in that particular CU
-
Global Memory
-
- hipMalloc & hipFree goes here
-
accesable by all threads
-
Shared Memory (LDS)
- 2 types: Static && Dynamic
- Static:
__shared__ float sharedBuff[SIZE]
- Dynamic
-
Local (private) memory
__syncthreads()
__threadfence()
__threadfence_block()
- Atomics:
- Many threads writing to same place
=======
- First hipLaunchKernel takes a long time: Some potential reasons is loading the library for the first time (could be some MBs of size) + stream creation after the loading.
- This could be handled by launching a dummy kernel first.
- hipMemcpy is blocking/sync (i.e. CPU will wait before moving to the next task). You can use hipMemcpyAsync for non-blocking/async call (i.e. CPU will move to execute the next task).