
CUDA not available or Failed to Launch Kernels (error code invalid argument) #283

Open
kings177 opened this issue May 17, 2024 · 15 comments · May be fixed by #413
Labels
enhancement New feature or request

Comments

@kings177
Member

kings177 commented May 17, 2024

References HigherOrderCO/Bend#320 by rubenjr0, with the following contents:

"
Hello! I think I've encountered a bug. When running this example from the readme:

def sum(depth, x):
  switch depth:
    case 0:
      return x
    case _:
      fst = sum(depth-1, x*2+0) # adds the fst half
      snd = sum(depth-1, x*2+1) # adds the snd half
      return fst + snd
    
def main():
  return sum(30, 0)

The output is 0. I've tried bend run, bend run-c, and bend gen-cu (bend run-cu says CUDA is not available, so I manually compile the generated code with nvcc).

The output on my machine when running sum(24, 0) is 8388608, but equivalent Haskell and Python programs return 140737479966720. The results start to diverge when depth >= 13.

I was wondering what could be causing these issues, both the incorrect result when depth >= 13 and the result of 0 when depth >= 25.

My computer specs:

OS: Pop_OS 22.04
CPU: AMD Ryzen 5 2600x (12 cores)
GPU: NVIDIA RTX 4060 ti (16GB)
"

@jzinno

jzinno commented May 17, 2024

Hey, I saw on HigherOrderCO/Bend#320 that you've run into the Failed to launch kernels (error code invalid argument)! issue. Adding some more info here:

Tested on a Ryzen 3600XT & 2070 Super.

This appears to be related to the following in hvm.cu:

// Local Net
const u32 L_NODE_LEN = 0x2000;
const u32 L_VARS_LEN = 0x2000;
struct LNet {
  Pair node_buf[L_NODE_LEN];
  Port vars_buf[L_VARS_LEN];
};

It was suggested this might be related to shared memory, so I added a check to debug:

  // Check max shared memory size
  int maxSharedMem;
  cudaDeviceGetAttribute(&maxSharedMem, cudaDevAttrMaxSharedMemoryPerBlock, 0);
  printf("Max shared memory per block: %d bytes\n", maxSharedMem);

  // Configures Shared Memory Size
  if (sizeof(LNet) <= maxSharedMem)
  {
    cudaFuncSetAttribute(evaluator, cudaFuncAttributeMaxDynamicSharedMemorySize, sizeof(LNet));
  }
  else
  {
    fprintf(stderr, "Error: LNet size (%zu bytes) exceeds max shared memory per block (%d bytes)\n", sizeof(LNet), maxSharedMem);
    exit(EXIT_FAILURE);
  }

  // Configures Shared Memory Size
  // cudaFuncSetAttribute(evaluator, cudaFuncAttributeMaxDynamicSharedMemorySize, sizeof(LNet));
╰─ ./a.out
Max shared memory per block: 49152 bytes
Error: LNet size (98304 bytes) exceeds max shared memory per block (49152 bytes)

Changed the LNet definition to the following (halving both lengths brings sizeof(LNet) from 98304 down to 49152 bytes, which fits within the per-block limit):

// Local Net
const u32 L_NODE_LEN = 0x1000;
const u32 L_VARS_LEN = 0x1000;
struct LNet {
  Pair node_buf[L_NODE_LEN];
  Port vars_buf[L_VARS_LEN];
};

Recompiled, and running on the GPU works well!

╰─ hyperfine './a.out' #cuda
Benchmark 1: ./a.out
  Time (mean ± σ):     914.8 ms ±  36.1 ms    [User: 429.0 ms, System: 57.4 ms]
  Range (min … max):   886.7 ms … 985.7 ms    10 runs

╰─ ./a.out
Result: 8388608
- ITRS: 401921647
- LEAK: 19502207
- TIME: 0.50s
- MIPS: 809.96

╰─ hyperfine './b.out' #C
Benchmark 1: ./b.out
  Time (mean ± σ):     18.806 s ± 13.777 s    [User: 67.170 s, System: 82.483 s]
  Range (min … max):    7.533 s … 41.807 s    10 runs

╰─ ./b.out
Result: 8388608
- ITRS: 402653167
- TIME: 7.74s
- MIPS: 52.03

Perhaps cudaDevAttrMaxSharedMemoryPerBlock could be inspected to set this appropriately and automatically across more GPU architectures.
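
For illustration, a sketch of that idea (this assumes the existing evaluator kernel and LNet struct; cudaDevAttrMaxSharedMemoryPerBlockOptin reports the larger opt-in limit available on newer architectures):

  // Query both the default and opt-in per-block shared memory limits,
  // then only opt in when LNet actually fits on this device.
  int device = 0;
  int maxDefault = 0, maxOptin = 0;
  cudaDeviceGetAttribute(&maxDefault, cudaDevAttrMaxSharedMemoryPerBlock, device);
  cudaDeviceGetAttribute(&maxOptin, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
  printf("Shared memory per block: %d bytes default, %d bytes opt-in\n",
         maxDefault, maxOptin);

  if (sizeof(LNet) <= (size_t)maxOptin) {
    // Raise the dynamic shared memory limit for the kernel where supported.
    cudaFuncSetAttribute(evaluator, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)sizeof(LNet));
  } else {
    fprintf(stderr,
            "LNet (%zu bytes) exceeds this GPU's opt-in limit (%d bytes); "
            "rebuild with smaller L_NODE_LEN/L_VARS_LEN.\n",
            sizeof(LNet), maxOptin);
    exit(EXIT_FAILURE);
  }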

@kings177
Member Author

Oh nice, thank you. We will be working on making HVM adaptable to multiple GPUs; the current iteration was only developed with the 4090 in mind.

@kings177 kings177 reopened this May 17, 2024
@kings177
Member Author

I'll leave this open as a reminder.

@Epicguru

Hi, I've seen that HigherOrderCO/Bend#342 has been marked as a duplicate of this, so I'll comment here. I'm using WSL on Windows 11 with the CUDA toolkit installed and verified working with other WSL and Docker programs, but I'm getting the same CUDA not available! message.

My GPU is a RTX 4090.

I don't necessarily see how this issue is a duplicate of the other, given that this issue was originally closed with the explanation that it was only developed with a 4090 in mind, but that is the same GPU I'm running (unless it's vendor specific somehow?).

In any case, an exciting project. Keep up the good work!

@MrCreeps

MrCreeps commented May 17, 2024

Might as well add that I'm also having the issue described in HigherOrderCO/Bend#342. I'm using WSL (Ubuntu 22.04.03 LTS) on Windows 11 23H2.

Output from nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Output from nvidia-smi

Fri May 17 17:13:17 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 551.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   50C    P8             49W /  400W |     915MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A       390      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

I've already tested that nvcc works by making a simple CUDA test program.

Edit:
The issue being:

Errors:
Error reading result from hvm. Output :
CUDA not available!

when trying to run a .bend file with the run-cu command.

@jzinno

jzinno commented May 17, 2024

AFAIK, CUDA not available on WSL can be resolved by following:

https://docs.nvidia.com/cuda/wsl-user-guide/index.html

Make sure the CUDA paths are also set in your shell's config. You can verify with:

which nvcc

Adding the CUDA libs to LD_LIBRARY_PATH is probably also worth doing for good measure. Not sure how important that is here; I'm on mobile at the moment, so I can't reproduce right now.

If CUDA not available persists, try reinstalling Bend and HVM after making sure CUDA works independently:

cargo +nightly install hvm --force
cargo +nightly install bend-lang --force

Force reinstall, because I believe CUDA availability is checked as part of the build script.

@Andreesian

(Quoting @jzinno's suggestion above about setting up the CUDA paths and force-reinstalling HVM and Bend.)

This worked for me after trying everything else.

@Epicguru

Can also confirm that forcing a reinstall fixed the issue for me, although I did not do anything else; CUDA was already added to the path.

@VictorTaelin
Member

Since HVM-CUDA has been hardcoded for the RTX 4090, older GPUs (which have half the shared memory size) will not work. That's an oversight. I'll refactor that hardcoded number to be dynamic instead, and properly query the available L1 cache size.

@pema99

pema99 commented May 18, 2024

I guess that clears things up. I was in the previous issue, facing the "invalid argument" error. I'm using a 2070 Super.

@0m3rF

0m3rF commented May 19, 2024

Facing the same "invalid argument" issue on WSL2 Ubuntu, using a 2080. I can run Python with Numba, but not bend run-cu.

@kings177 kings177 changed the title CUDA not available when running gen-cu CUDA not available or Failed to Launch Kernels May 20, 2024
@kings177
Member Author

So, as a summary of the problems:

When it comes to CUDA not available, the cause can vary, but for most people it comes down to the CUDA paths not being set correctly.

When it comes to Failed to launch kernels, that stems from the fact that the shared memory size is currently hardcoded to fit GPUs with 96KB of per-block shared memory; we plan to release a dynamic version of this soon.
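
Until the dynamic version lands, one possible stopgap (a sketch only, using a hypothetical L_LEN macro rather than the planned fix) is to make the buffer lengths overridable at compile time, so the local net can be sized per GPU when invoking nvcc:

// Hypothetical compile-time override; defaults to the current 4090-sized buffers.
//   nvcc -DL_LEN=0x1000 ...   for GPUs limited to 48 KB of per-block shared memory
//   nvcc -DL_LEN=0x2000 ...   for GPUs whose opt-in limit covers the full 96 KB
#ifndef L_LEN
#define L_LEN 0x2000
#endif

// Local Net
const u32 L_NODE_LEN = L_LEN;
const u32 L_VARS_LEN = L_LEN;
struct LNet {
  Pair node_buf[L_NODE_LEN];
  Port vars_buf[L_VARS_LEN];
};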

@RileySeaburg

(Quoting @VictorTaelin's comment above about refactoring the hardcoded shared memory size.)

This would be great. I'm running an older NVIDIA GPU and getting the same error after reinstalling.

Using WSL

riley@Virtual-Desktop-1:~/programming/bend$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
riley@Virtual-Desktop-1:~/programming/bend$ nvidia-smi
Tue May 21 12:47:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.103                Driver Version: 537.13       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 ...    On  | 00000000:01:00.0 Off |                  N/A |
| 40%   33C    P8              14W / 250W |   1101MiB /  8192MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       439      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

@shevisj

shevisj commented May 21, 2024

Same issue running on a 2080 Ti. Commenting here to follow up once dynamic shared memory support is added.

@kings177 kings177 changed the title CUDA not available or Failed to Launch Kernels CUDA not available or Failed to Launch Kernels (Error code invalid argument) May 29, 2024
@kings177 kings177 changed the title CUDA not available or Failed to Launch Kernels (Error code invalid argument) CUDA not available or Failed to Launch Kernels (error code invalid argument) May 29, 2024
@speedfox-uk

Getting the same issue. Here is my nvidia-smi output:

speedfox@london:~/src/Bend/examples$ nvidia-smi
Tue Jun 11 22:21:19 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2060         On | 00000000:01:00.0 Off |                  N/A |
|  0%   43C    P8                7W / 184W|    135MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3879      G   /usr/lib/xorg/Xorg                          101MiB |
|    0   N/A  N/A      4029      G   /usr/bin/gnome-shell                         10MiB |
|    0   N/A  N/A      6336      G   ...seed-version=20240607-130129.053000       21MiB |
+---------------------------------------------------------------------------------------+
