
Bitonic Sort fails on CUDA with (error code: an illegal memory access was encountered) #314

Open
developedby opened this issue May 20, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@developedby
Member

Originally from HigherOrderCO/Bend#364 by user @ethanbarry

Description

When I run the compiled CUDA bitonic sorter example (linked in the README) I get this error:

Failed to launch kernels (error code an illegal memory access was encountered)!

To Reproduce

Steps to reproduce the behavior:

1. bend gen-cu sorter.bend > sorter.cu
2. nvcc sorter.cu -o sorter
3. prime-run ./sorter (launches it on the dGPU; an Arch Linux script)
4. Error received.

Expected behavior

The program runs on the GPU.
Desktop (please complete the following information):

OS: Linux (Arch 6.9.1-arch1-1)
CPU: Intel i7-11800H
GPU: RTX 3050 Ti Mobile
GPU Driver: Nvidia open kernel modules v550.78
CUDA release 12.4, V12.4.131

Additional context

The program runs using the C codegen backend, but with the CUDA backend, it seems to fail regardless of what I do. If anyone is curious about the prime-run command, it's really just a script that forces the dGPU to handle a task - nothing fancy.
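For anyone curious, such a wrapper can be sketched as below. This mirrors what Arch's nvidia-prime package ships: a handful of NVIDIA render-offload environment variables set before executing the command (the exact variable set on a given system may differ):

```shell
# Sketch of a prime-run-style wrapper: export the standard NVIDIA PRIME
# render-offload variables, then run the given command on the dGPU.
prime_run() {
  __NV_PRIME_RENDER_OFFLOAD=1 \
  __VK_LAYER_NV_optimus=NVIDIA_only \
  __GLX_VENDOR_LIBRARY_NAME=nvidia \
  "$@"
}

# Demo: the child process sees the offload variables.
prime_run sh -c 'echo "offload=$__NV_PRIME_RENDER_OFFLOAD vendor=$__GLX_VENDOR_LIBRARY_NAME"'
# prints: offload=1 vendor=nvidia
```

This is why setting the same variables in the environment beforehand (as a later commenter does) is equivalent to invoking prime-run.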

@NotCyberLemon

I came here from the main Bend repo, from the issue "Bitonic Sort example failed with GPU kernel error".

I too am having a kernel memory issue:

$ ./sorter # The same as prime-run due to environment variables already being set.
| Failed to launch kernels (error code an illegal memory access was encountered)!

I am also running a mobile GPU, and that is where I am getting this issue.

Some GPU properties and info from exec:

--- General Information for device 0 ---
Name: NVIDIA GeForce RTX 3060 Laptop GPU
Compute capability: 8.6
Clock rate: 1425000
Device copy overlap: Enabled
Kernel execution timeout: Enabled

--- Memory Information for device 0 ---
Total global memory: 5996544000
Total constant memory: 65536
Max memory pitch: 2147483647
Texture alignment: 512

--- MP Information for device 0 ---
Multiprocessor count: 30
Shared memory per MP: 49152
Registers per MP: 65536
Threads in warp: 32
Max threads per block: 1024
Max thread dimensions: (1024, 1024, 64)
Max grid dimensions: (2147483647, 65535, 65535)

--- Memory Allocation Test ---
Memory allocation successful!

Specs:

OS: Arch Linux x86_64
Kernel: 6.9.1-zen1-1-zen
GPU: NVIDIA GeForce RTX 3060 Mobile / Max-Q
GPU Driver: nvidia-open-dkms 550.78-4
CUDA Version: 12.4.1-4

On top of that, when running it through bend run-cu ./sorter, it seems to run indefinitely; after a while of testing, I have been unable to find the cause or what the execution is getting caught on.

@2lian

2lian commented May 22, 2024

I had the same issue. I cloned HVM and changed the LNet setting according to #283, but the current repo (v2.0.14) does not work with Bend, and I do not know where v2.0.13 (the one Bend uses) is.

I never used cargo so excuse me if I am doing some black magic here, but this is how I fixed it for bend:

mkdir ~/hvmtmp
cd ~/hvmtmp
cargo init
cargo add hvm@=2.0.13
cargo vendor vendor
cd vendor/hvm

You are now inside the source of hvm V2.0.13.

Open src/hvm.cu and, around line 334, reduce L_NODE_LEN and L_VARS_LEN; do not reduce them too much, though. These values work on my GTX 1080 Ti:

// Local Net
const u32 L_NODE_LEN = 0x2000/4;
const u32 L_VARS_LEN = 0x2000/4;
struct LNet {
  Pair node_buf[L_NODE_LEN];
  Port vars_buf[L_VARS_LEN];
};

Now go back to the hvm v2.0.13 source you downloaded and install it:

cd ~/hvmtmp/vendor/hvm
cargo +nightly install --path .

This should work; you can now delete ~/hvmtmp.

@VictorTaelin
Member

I wonder why they needed /4 there - 0x1000 should be safe for every architecture, shouldn't it? AFAIK all devices support 48KB shared memory. Perhaps this is using a little bit more, due to the other shared structures?

@2lian

2lian commented May 23, 2024

I wonder why they needed /4 there - 0x1000 should be safe for every architecture

To report more about this, on my GTX 1080Ti (using WSL2, cuda toolkit 12.3), I have tried:

  • 0x2000
  • 0x1000 = 0x2000/2
  • 0x2000/3
  • 0x2000/4
  • 0x0500
  • 0x0100

Only 0x2000/4 and 0x0500 worked.

@gladmo

gladmo commented May 23, 2024

None of those values worked for me, on my GTX 1050 Ti.

OS: CentOS Linux release 7.9.2009 (Core)
CPU: Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
GPU: GTX 1050 Ti
GPU Driver: Nvidia open kernel modules v550.78
CUDA release 12.4, V12.4.131
$ nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1050 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   58C    P8             N/A /   72W |       2MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

test example:

$ time bend run-c sorter.bend
Result: 16646144
bend run-c sorter.bend  47.63s user 0.32s system 435% cpu 11.001 total

$ time bend run-cu sorter.bend
Errors:
1.Failed to parse result from HVM.
Output from HVM was:
"Failed to launch kernels. Error code: an illegal memory access was encountered.\n""exit status: 1"""

bend run-cu sorter.bend  0.03s user 0.06s system 89% cpu 0.097 total

@TimotejFasiang

Did anyone manage to find L_NODE_LEN and L_VARS_LEN values that work for other GPUs?


7 participants