Describe the bug
When running a simple CUDA vector addition kernel, the expected number of floating-point additions equals the number of elements in the vector (N = 500 in the example below). LIKWID reports a different count, while Nvidia Nsight Compute gives the correct result for the same metric.
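The expectation above can be checked with a bit of launch arithmetic. The following is only a sketch, assuming the launch configuration of the reproducer (N = 500, 128 threads per block), exactly one addition per thread that passes the `i < N` guard, and a counter that counts per thread. `LaunchShape` and `launchShape` are hypothetical names introduced for illustration; they are not part of LIKWID or CUDA.

```cpp
#include <cstddef>

// Hypothetical helper type: summarizes a 1D kernel launch.
struct LaunchShape {
    std::size_t blocks;           // blocksPerGrid in the reproducer
    std::size_t launchedThreads;  // blocks * threadsPerBlock
    std::size_t activeThreads;    // threads that pass the `i < N` guard
};

// Same rounding-up division as the reproducer; assumes threadsPerBlock > 0.
constexpr LaunchShape launchShape(std::size_t n, std::size_t threadsPerBlock) {
    const std::size_t blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    const std::size_t launched = blocks * threadsPerBlock;
    // launched >= n always holds, so exactly n threads do real work.
    return LaunchShape{blocks, launched, n};
}

// Compile-time checks for the configuration used in this report.
static_assert(launchShape(500, 128).blocks == 4, "4 blocks expected");
static_assert(launchShape(500, 128).launchedThreads == 512, "512 threads launched");
static_assert(launchShape(500, 128).activeThreads == 500, "500 additions expected");
```

Under these assumptions, 4 blocks of 128 threads are launched, 12 threads are masked out by the guard, and 500 additions are expected.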
To Reproduce
The following is the vectorAdd.cu source code (without error checking to simplify it):
#include <likwid-marker.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C,
                          const int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main(void) {
    LIKWID_NVMARKER_INIT;
    LIKWID_NVMARKER_REGISTER("vecAdd");

    const int N = 500;
    size_t size = N * sizeof(float);

    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    for (int i = 0; i < N; ++i) {
        h_A[i] = 1.0f;
        h_B[i] = 2.0f;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 128;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    LIKWID_NVMARKER_START("vecAdd");
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize();
    LIKWID_NVMARKER_STOP("vecAdd");

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);

    LIKWID_NVMARKER_CLOSE;
    return 0;
}
The above can be easily compiled using nvcc as follows:
Thanks for the output. It seems that, despite the call to cudaDeviceSynchronize() before LIKWID_NVMARKER_STOP, the counting has not finished at that point. In LIKWID_NVMARKER_CLOSE, the library finally stops counting (the last lines of the debugging output, after nvmon_markerStopRegion) and still reads 166 fadd instructions, although the GPU should be idle once LIKWID_NVMARKER_STOP has executed. I have to investigate what's going on there. Thanks for the test code.
The resulting vectorAdd binary is then executed with:
likwid-perfctr -W FLOPS_SP -m ./vectorAdd
This produces the following output:
Running using Nvidia Nsight Compute:
To Reproduce with a LIKWID command