Using Warp with Python threading #224
Hi @BernardoCovas, thanks for the report - indeed our CPU execution does not currently support multi-threaded execution. The main reason for this is the use of a static variable (https://github.com/NVIDIA/warp/blob/main/warp/codegen.py#L2401) to hold the thread index during kernel grid iteration. I think we could easily avoid this by updating the CPU path to use a simple code generation approach that mirrors the way the CUDA code works. I will add a task to look into this.
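To make the failure mode concrete: a sketch (not Warp's actual code) of why a single static/module-level thread-index variable breaks under concurrent launches, and how per-thread storage avoids it. The `launch` function and `_tls` name here are hypothetical stand-ins for the grid-iteration loop mmacklin describes.

```python
import threading

# Hypothetical sketch: the buggy path would store the current grid index in
# one shared static variable, so two concurrent CPU launches would stomp on
# each other's index. Using threading.local gives each launching thread its
# own counter, mirroring what the CUDA path gets for free per-launch.

_tls = threading.local()  # per-thread "tid" slot, replacing the shared static

def launch(n, record):
    # grid iteration: each launch walks its own 0..n-1 index sequence
    for i in range(n):
        _tls.tid = i          # would be a shared static in the buggy path
        record.append(_tls.tid)

results = {}
threads = []
for name, n in [("a", 1000), ("b", 500)]:
    results[name] = []
    t = threading.Thread(target=launch, args=(n, results[name]))
    threads.append(t)
    t.start()
for t in threads:
    t.join()

# With thread-local storage, each launch sees exactly its own index range,
# no matter how the two threads interleave.
assert results["a"] == list(range(1000))
assert results["b"] == list(range(500))
```

With a shared static in place of `_tls`, the two loops would overwrite each other's index mid-iteration, producing wrong or duplicated indices.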
Hi @mmacklin
Hey @BernardoCovas, are you able to provide a self-contained reproducible example for this issue? I'd like to make sure your original issue is fixed.
Also, is this using the new GIL-free Python?
Somehow, and quite embarrassingly so, since posting this I have not been able to reproduce the issue. It used to happen when some Warp processing was done before the main algorithm, inside a thread, while other threads were executing. I re-tested with an older version (1.0.0), and somehow it also did not happen. As soon as I come across it again, I will reach out. In the meantime, here is an example of what should cause the issue, as I initially understood it:

```python
import typing as T

import warp as wp
import numpy as np

wp.init()

vec3f_1d: T.TypeAlias = wp.array1d(dtype=wp.vec3f)  # type: ignore

@wp.func
def fn_compute(
        v0: wp.vec3f,
        v1: wp.vec3f):
    # keep busy
    for _ in range(10):
        v2 = wp.cross(v0, v1)
    return v2

@wp.kernel
def kn_compute(
        verts0: vec3f_1d,
        verts1: vec3f_1d,
        out: vec3f_1d):
    index = wp.tid()
    v0 = verts0[index]
    v1 = verts1[index]
    out[index] = fn_compute(v0, v1)

def test_device(i: int, N: int, device: str):
    verts0 = np.random.uniform(0, 1, (N, 3)).astype(np.float32)
    verts1 = np.random.uniform(0, 1, (N, 3)).astype(np.float32)
    res = np.cross(verts0, verts1).astype(np.float32)
    # try multiple times
    for _ in range(100):
        out = wp.empty((N,), dtype=wp.vec3f, device=device)
        verts0_ = wp.from_numpy(verts0, dtype=wp.vec3f, device=device)
        verts1_ = wp.from_numpy(verts1, dtype=wp.vec3f, device=device)
        wp.launch(
            kn_compute,
            dim=(N,),
            device=device,
            inputs=[verts0_, verts1_],
            outputs=[out])
        out = out.numpy()
        close = np.isclose(res, out, atol=1e-7)
        close_1d = np.all(close, axis=-1)
        not_close_1d = ~close_1d
        not_close_1d, = np.nonzero(not_close_1d)
        assert np.all(close_1d), f'{out[not_close_1d[0]], res[not_close_1d[0]]}'
    return i, N

if __name__ == '__main__':
    from concurrent import futures
    import random

    thread_pool = futures.ThreadPoolExecutor()

    N_MIN = 1
    N_MAX = 1000000
    M = 100

    DEVICE = 'cuda:0'
    DEVICE = 'cpu'

    # NOTE:(b.covas): load kernels
    test_device(0, 1, DEVICE)

    # NOTE:(b.covas): run with different dimensions
    fs = {thread_pool.submit(test_device, i, random.randint(N_MIN, N_MAX), DEVICE) for i in range(M)}
    for j, f in enumerate(futures.as_completed(fs)):
        i, n = f.result()
        print(f'{j}/{M}: completed run {i} with {n} elements')
```

The reason for trying multiple times was to create a situation in which two different threads are iterating through different-sized kernels, where I expected a segfault or a wrong result. Not only does this work, but all CPUs are busy. I'm confused, as this was a somewhat long-standing issue that forced the use of CUDA devices only.
Also, @shi-eric, this is using standard Python 3.10.
Thanks for the helpful example @BernardoCovas! I was also having problems getting my mental picture of the issue to materialize in an example program to verify my changes. Even after building a GIL-free Python 3.13 to test this issue with a
Hi @shi-eric, thank you for the help! So maybe I am not understanding the current approach. If a static variable is used, does that mean Warp never releases the GIL while executing kernels? If it did release it, two simultaneous kernels would (wrongly) use and increment the same variable. Is this right? I am asking because, looking at htop, all CPUs seem to be performing work, yet somehow the result is still verified.
Take this with a grain of salt because, as I said in my previous comment, there are still some things that tell me I don't have the full picture yet. My understanding of the problem is that because a static variable is used in Warp, this gives the wrong behavior when more than one CPU kernel is executed at the same time. However, my guess is that the Python GIL prevents more than one Warp CPU kernel from running at the same time when using a single Python process and multiple threads. When running on a special CPython that supports disabling the GIL (an upcoming option for Python 3.13), multiple Warp CPU kernels can execute concurrently under thread-based parallelism, and we should see an issue with thread indexing because of the use of the static variable. Warp does not have any code that releases/reacquires the GIL, but NumPy does. Maybe you see a speedup from multithreading despite the GIL because of this?
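One way to sanity-check the guess that NumPy (rather than Warp) is what keeps all cores busy: run a pure-NumPy workload, similar to the `np.cross` calls in the example above, through a thread pool. This is a sketch under the assumption that the large NumPy call releases the GIL internally; the result is deterministic either way, while the multi-core utilization is the thing to watch in htop.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Sketch: NumPy drops the GIL inside many of its C loops, so several of
# these np.cross calls can overlap on different cores even under the
# standard (non-free-threaded) CPython.

rng = np.random.default_rng(0)
a = rng.random((200_000, 3)).astype(np.float32)
b = rng.random((200_000, 3)).astype(np.float32)

expected = np.cross(a, b)

with ThreadPoolExecutor(max_workers=4) as pool:
    outs = list(pool.map(lambda _: np.cross(a, b), range(8)))

# Threading does not affect correctness; any multi-core activity comes
# from the GIL being released inside the NumPy call itself.
assert all(np.array_equal(o, expected) for o in outs)
```

If the cores light up here too, that supports the idea that the observed parallelism in the Warp test program comes from the NumPy portions, not from concurrent kernel execution.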
Yes, that is likely it. I use NumPy quite a bit, and I also release the GIL basically everywhere in my own codebase, so I use the thread pool precisely for the speedup.
After printing the thread index at the beginning and end of the kernel, I can confirm that no two kernels are executing simultaneously, as one always starts and ends before any other begins. @shi-eric

```python
import typing as T

import warp as wp
import numpy as np

wp.init()

vec3f_1d: T.TypeAlias = wp.array1d(dtype=wp.vec3f)  # type: ignore

@wp.func
def fn_compute(
        v0: wp.vec3f,
        v1: wp.vec3f):
    # keep busy
    for _ in range(1000):
        v2 = wp.cross(v0, v1)
    return v2

@wp.kernel
def kn_compute(
        verts0: vec3f_1d,
        verts1: vec3f_1d,
        out: vec3f_1d):
    index = wp.tid()
    v0 = verts0[index]
    v1 = verts1[index]
    wp.printf('start thread %d\n', index)
    out[index] = fn_compute(v0, v1)
    wp.printf('end thread %d\n', index)

def test_device(i: int, N: int, device: str):
    verts0 = np.random.uniform(0, 1, (N, 3)).astype(np.float32)
    verts1 = np.random.uniform(0, 1, (N, 3)).astype(np.float32)
    res = np.cross(verts0, verts1).astype(np.float32)
    # try multiple times
    for _ in range(100):
        out = wp.empty((N,), dtype=wp.vec3f, device=device)
        verts0_ = wp.from_numpy(verts0, dtype=wp.vec3f, device=device)
        verts1_ = wp.from_numpy(verts1, dtype=wp.vec3f, device=device)
        wp.launch(
            kn_compute,
            dim=(N,),
            device=device,
            inputs=[verts0_, verts1_],
            outputs=[out])
        out = out.numpy()
        close = np.isclose(res, out, atol=1e-7)
        close_1d = np.all(close, axis=-1)
        not_close_1d = ~close_1d
        not_close_1d, = np.nonzero(not_close_1d)
        assert np.all(close), f'{out[not_close_1d[0]], res[not_close_1d[0]]}'
    return i, N

if __name__ == '__main__':
    from concurrent import futures
    import random

    thread_pool = futures.ThreadPoolExecutor()

    N_MIN = 1
    N_MAX = 1000000
    M = 100

    DEVICE = 'cuda:0'
    DEVICE = 'cpu'

    # NOTE:(b.covas): load kernels
    test_device(0, 1, DEVICE)

    # NOTE:(b.covas): run with different dimensions
    fs = {thread_pool.submit(test_device, i, random.randint(N_MIN, N_MAX), DEVICE) for i in range(M)}
    for j, f in enumerate(futures.as_completed(fs)):
        i, n = f.result()
        print(f'{j}/{M}: completed run {i} with {n} elements')
```
Is it possible to use Warp with Python threading? I use Python's `ThreadPoolExecutor` frequently to parallelize bound C++ code, and started using Warp along with it. It seems to work with CUDA, as long as the kernels are loaded before thread execution (presumably because of kernel registration in global variables). However, CPU Warp code does not seem to work so smoothly: it frequently crashes. Will it be possible to use Warp code from independent Python threads? I presume it would also allow for parallel CPU Warp kernels (which currently seem to be sequential). Thank you in advance
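The "load kernels before thread execution" workaround described above follows a general pattern: do the one-time, non-thread-safe setup on the main thread, then fan work out to the pool. A minimal sketch with a plain dict standing in for Warp's global kernel registry (the `REGISTRY`, `setup`, and `run` names here are hypothetical, not Warp API):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for a global kernel registry that is only safe to
# mutate from one thread (e.g. module/kernel registration at first launch).
REGISTRY = {}

def setup():
    # one-time registration; main thread only, before any workers start
    REGISTRY["kernel"] = lambda x: x * x

def run(i):
    # workers only read the registry, which is safe once setup() is done
    return i, REGISTRY["kernel"](i)

setup()  # warm up before any worker thread touches the registry
with ThreadPoolExecutor() as pool:
    fs = [pool.submit(run, i) for i in range(5)]
    results = dict(f.result() for f in as_completed(fs))

assert results == {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
```

This mirrors the `test_device(0, 1, DEVICE)` warm-up call in the example programs above: the first launch triggers registration/compilation serially, and the threaded launches afterwards only read shared state.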