Performance evaluation #45
Apologies for missing those. Should I close this issue? |
Not at all! I just posted these, so we don't need to start the discussion from scratch. :) |
Perfect! |
Sorry, I didn't mean to kill this discussion. I will try to summarize the aspects that we had discussed to include in a performance evaluation.
I am sure this list is incomplete, so please feel free to add/delete items. Maybe @iangrooms also saved the computational section that was originally part of the JAMES paper. Question: Do we want to include the performance evaluation
I think 3. would maybe work best. |
Thanks for the writeup. I think your list is perfect, and I favor option 3. For me, the big question is the following. Most HPC system nodes have either ~24 CPU cores or 1 GPU. By using parallelism appropriately (e.g. parallelize across non-spatial dimensions), can the 24-core CPU speed get close to the 1 GPU speed? I find this problem fun and would like to work on it a bit. |
I know I volunteered to do this. I still hope to work on it. Will try to squeeze it in over the next few days. |
Ok, so I started the benchmarking exercise in a standalone repo: https://github.com/ocean-eddy-cpt/gcm-filters-benchmarks

The repo is organized as follows:

One central goal is to figure out how many CPUs you need to use to replicate the GPU performance.

### What I did so far

What I did was run the basic tutorial example on Casper with various numbers of CPUs and look at strong and weak scaling. I bypass questions about i/o performance by using random data. When x and y are contiguous (not chunked), the GCM filters approach is embarrassingly parallel, so in theory it should scale perfectly (runtime inversely proportional to the number of CPUs used).

What I found was a bit different. Here is the weak scaling (from the plot_results.ipynb notebook). I used a single node of Casper, varying the number of CPUs from 1 to 36, with various combinations of workers and threads. What the graph shows is decent scaling for low core counts, then a saturation at high core counts. As a reminder, this is weak scaling, which means that as we use more cores, we make the problem bigger. The fact that the throughput saturates at a relatively low rate (200 MB/s) indicates that scaling is breaking down for some reason. Instead, as we add more cores, we should keep processing data faster. It would be great to dig into this a bit deeper to understand why this is happening. Perhaps that is something that Dask experts @gjoseph92 and @jrbourbeau could help us debug?

A useful comparison point is simply taking the mean of the same array, which exposes Dask's own scaling properties. Although it is somewhat sub-linear, it is better than gcm-filters.

### Next steps

#### Compare to GPU performance

We should compare these single-node CPU results to the GPU results. Unfortunately I couldn't do this because I broke my cupy installation in my Casper python environment 😢. Ideally someone with a working cupy could rerun the basic benchmark with the two different types of Casper GPUs. Note that, with a single GPU, we can't really look at scaling proper, since we can't use multiple GPUs. But we can still locate the GPU performance on the same axis (MB/s) as the CPU performance. If we wanted to use a multi-GPU cluster, we could get into dask-cuda, but IMO this is overkill for the JOSS paper.

#### Look at distributed scaling on multiple nodes

It's a mystery why gcm-filters is not scaling linearly up to 36 cores on a single node. We should debug this and try to understand why. Perhaps it's related to memory usage? (There is no disk i/o in the example, so that can't be an issue.) We could also look at scaling across multiple different nodes. Maybe if we keep the core count low on each node (e.g. use only 4 cores per node), but spread over several nodes, we could get better scaling. We would need to figure out the best way to launch a multi-node Dask cluster on Casper or Cheyenne (probably Cheyenne). Pinging @andersy005, software engineer at CISL, for help with this.

#### Vary some of the array parameters

I used an array with chunks of shape …

My hope is that, by getting this started and giving a structure to the benchmark code, this is easy to hand off to other folks. Please feel free to make PRs to the gcm-filters-benchmarks repo, updating the notebook and/or adding more CSV results.

We are flying to Italy tonight, so I will be offline for a day or two while in transit. Will check back in later in the week. Thanks for your patience, as I have been very slow with my task.
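In case it's useful, here is a stripped-down sketch of the kind of measurement the notebook makes (random data, throughput in MB/s). The gcm-filters parameters below are placeholders and the argument names are from memory of the tutorial, so treat them as approximate:

```python
from time import perf_counter

import dask.array as dsa
import xarray as xr
import gcm_filters

# random data stands in for model output: contiguous in y/x, chunked in time
nt, ny, nx = 10, 1024, 1024
data = xr.DataArray(
    dsa.random.random((nt, ny, nx), chunks=(1, ny, nx)),
    dims=["time", "y", "x"],
)

# a simple fixed-factor filter on a regular grid (placeholder parameters)
filt = gcm_filters.Filter(
    filter_scale=4,
    dx_min=1,
    filter_shape=gcm_filters.FilterShape.GAUSSIAN,
    grid_type=gcm_filters.GridType.REGULAR,
)
filtered = filt.apply(data, dims=["y", "x"])

start = perf_counter()
filtered.mean().compute()  # a small reduction that forces the whole filter computation
elapsed = perf_counter() - start
print(f"throughput: {data.nbytes / 1e6 / elapsed:.1f} MB/s")
```
|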
Thanks @rabernat! I think @jrbourbeau will look into this soon. Two things that would help us investigate this faster:
|
Checking in here. It would be useful to get a performance report of one anecdotal run. That might help us to better understand the cause of the sub-linear scaling. I would strongly recommend doing this before investing time in running larger parameter sweeps. |
Sorry for the delayed response. We just transitioned to Italy and I have fallen behind on all notifications. @NoraLoose - since this is likely to take a while, do you think we should move forward with the JOSS submission without the detailed performance assessment? |
@rabernat I think we are not in a big rush. We still need to work on other things (like restructuring the documentation). So there may be time to get the performance improvement / assessment done by the time everything else is ready. |
I am having trouble with:

```python
from dask.distributed import performance_report  # import shown for completeness

with performance_report("unfiltered_mean_36_cores_4_workers.html"):
    unfiltered.data.mean().compute()
```

I am getting this error with both. I am using dask version 2021.05.01.
|
Can you try upgrading distributed to the latest version? |
I originally had to stick with that version because of a weird conflict with pynvml. However, after uninstalling pynvml, I was able to update to 2021.06.02 and the problem resolved. Performance report (36 cores spread over 4 workers): https://rawcdn.githack.com/ocean-eddy-cpt/gcm-filters-benchmarks/8b03462e777a052cb6f00f7f71a3f4320c4f2d8a/performance_reports/filtered_mean_36_cores_4_workers.html |
In contrast, here is the weak scaling report for a smaller cluster (9 cores) and proportionally smaller problem: https://rawcdn.githack.com/ocean-eddy-cpt/gcm-filters-benchmarks/1251584c735dff55ccd1633a3f07b8ee8b8daff3/performance_reports/filtered_mean_9_cores_1_workers.html Even though the chunk size is the same, the |
Just for due diligence, here is the report for the smaller cluster with the bigger array: https://rawcdn.githack.com/ocean-eddy-cpt/gcm-filters-benchmarks/6ebd46258c97e48f8ad5dad1dba5f32022b9c5c1/performance_reports/filtered_mean_9_cores_1_workers_bigger.html

The tasks are still running faster (8s). If this were GIL related, wouldn't it run faster using processes rather than threads? But that's not what we saw above. |
I don't have an immediate explanation for this. It's also odd to me that the worker profiles look almost exactly the same, just the times are larger with more processes. Can you try 1 worker, 36 processes for comparison? Also 36 processes with 1 thread each? Not sure it'll tell us that much, but I'm just curious. |
About 16s per task
Marginally faster, about 14 s per task |
And just for fun, here is a report with 4 workers, 1 thread each: https://rawcdn.githack.com/ocean-eddy-cpt/gcm-filters-benchmarks/2dfc4118ad20b7c903d4ab05672645b068737301/performance_reports/filtered_mean_4_cores_4_workers.html
I suppose another possibility is that the system is not actually giving me the 36 cores I am asking for. I will investigate this. |
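(One quick way to check this from inside the running Python session — a sketch, Linux-only; `sched_getaffinity` reports the cores the process is actually allowed to use:)

```python
import os
import multiprocessing

print("CPUs visible on the node:", multiprocessing.cpu_count())
print("CPUs available to this process:", len(os.sched_getaffinity(0)))
```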
Yeah, these reports certainly look like the problem is significant, consistent resource contention (versus, say, bad scheduling). The question is, what resource? If it's possible you don't have the cores you think, that would line up very well with what we're seeing. I do notice that the amount of available memory changes a lot between reports. Not that there's memory pressure, but just that it might indicate something is inconsistent about the system. |
@rabernat, I'm curious... at the beginning of the notebook you mention "Run on Casper full node (36 cores)", is this running on a compute node? If so, how are these resources being requested i.e. the PBS job submission? |
Nevermind :)... Found your most recent job:

```
Job ID  User  Queue  Nodes  NCPUs  NGPUs  Finish Time  Req Mem  Used Mem(GB)  Avg CPU (%)  Elapsed (h)  Job Name
509908  rpa   htc    1      36     0      06-30T12:09  360.0    37.5          8.4          4.02         jhcrbatch
```

Provided I am looking at the right job, the PBS report shows that your job has the requested 36 CPUs. However, it appears that these are underutilized (see the Avg CPU column above). Not sure this relates to the performance issues you are noticing in your benchmarks... |
Thanks for checking on this Anderson! I launched the job via https://jupyterhub.hpc.ucar.edu/ on the "Casper PBS Batch" option.
Since this is an interactive job (JupyterLab), I wouldn't read too much into the PBS job stats. Most of my walltime was spent typing code. More useful is the CPU usage pane from the dask performance report, which shows about 20% efficiency. We should be trying to hit 100% for compute-bound code such as this. |
I have been working on this today. What I did was step back a bit and strip away some of the layers of the problem. I have a new notebook - https://github.com/ocean-eddy-cpt/gcm-filters-benchmarks/blob/master/benchmark_kernels.ipynb - which just focuses on benchmarking the kernels: their raw speed (with `%timeit`) and their thread scaling (with `%ptime`).

Some relevant results:

I'm going to push forward a bit with numba, using this new, simpler benchmarking framework. |
I did some simple, self-contained numba experiments with reimplementing the regular Laplacian kernel:

```python
import numpy as np
from numba import stencil, njit

%load_ext ptime


@stencil
def regular_laplacian_numba_stencil(a):
    # Does not handle boundaries correctly
    return -4 * a[0, 0] + a[0, 1] + a[0, -1] + a[1, 0] + a[-1, 0]


# equivalent to @jit(nopython=True)
@njit
def regular_laplacian_numba_jit(a):
    # Does handle boundaries correctly
    ny, nx = a.shape
    out = np.zeros_like(a)
    for j in range(ny):
        for i in range(nx):
            out[j, i] = -4 * a[j, i] + a[j-1, i] + a[j+1, i] + a[j, i+1] + a[j, i-1]
    return out


shape = (1024, 1024)
data = np.random.rand(*shape)

# call everything once to trigger compilation
# verify results identical outside the boundary
np.testing.assert_allclose(
    regular_laplacian_numba_stencil(data)[1:-1, 1:-1],
    regular_laplacian_numba_jit(data)[1:-1, 1:-1]
)

%timeit -n10 regular_laplacian_numba_stencil(data)
%ptime -n4 regular_laplacian_numba_stencil(data)
%timeit -n10 regular_laplacian_numba_jit(data)
%ptime -n4 regular_laplacian_numba_jit(data)
```

**stencil results**

(not sure what is causing the error or whether it matters)

**njit results**
### Conclusions

I am truly puzzled why the numba functions don't parallelize well. I thought the whole point of …

It's very likely I have made a n00b mistake here, so thanks for your patience... |
Have you tried to JIT compile your `stencil` function? In this example, this additional step gives a speedup of a factor of 1000. |
Good catch Nora! Definitely did not read the documentation clearly enough 🤦. Redefining the stencil function as

```python
@stencil
def _regular_laplacian_numba_stencil(a):
    # Does not handle boundaries correctly
    return -4 * a[0, 0] + a[0, 1] + a[0, -1] + a[1, 0] + a[-1, 0]


@njit
def regular_laplacian_numba_stencil(a):
    return _regular_laplacian_numba_stencil(a)
```

definitely brings the stencil performance on par with the standalone njit performance.

**stencil + njit**

**njit standalone**
However, the core problem of negative parallel scaling remains! |
cc @seibert in case he or Siu want to get engaged |
(no expectation though) |
Final update for the day: boundary conditions. It turns out that my njit standalone version did not handle boundaries properly because numba performs no range checking on indexing. Here is some code that properly handles the wrap boundary conditions for both stencil and njit standalone.

```python
@stencil
def _regular_laplacian_numba_stencil(a):
    # Does not handle boundaries correctly
    return -4 * a[0, 0] + a[0, 1] + a[0, -1] + a[1, 0] + a[-1, 0]


@njit
def regular_laplacian_numba_stencil(a):
    return _regular_laplacian_numba_stencil(a)


@njit
def pad_array(a):
    ny, nx = a.shape
    padded = np.empty((ny + 2, nx + 2), a.dtype)
    padded[1:-1, 1:-1] = a
    padded[0, 1:-1] = a[-1, :]
    padded[-1, 1:-1] = a[0, :]
    padded[1:-1, 0] = a[:, -1]
    padded[1:-1, -1] = a[:, 0]
    return padded


@njit
def regular_laplacian_numba_stencil_fix_boundary(a):
    padded = pad_array(a)
    b = _regular_laplacian_numba_stencil(padded)
    return b[1:-1, 1:-1]


@njit
def regular_laplacian_numba_jit(a):
    # Does handle boundaries correctly
    ny, nx = a.shape
    out = np.zeros_like(a)
    for j in range(ny):
        for i in range(nx):
            # make sure boundaries are handled right
            ileft = (i - 1) % nx
            iright = (i + 1) % nx
            jleft = (j - 1) % ny
            jright = (j + 1) % ny
            out[j, i] = -4 * a[j, i] + a[jleft, i] + a[jright, i] + a[j, ileft] + a[j, iright]
    return out
```

Timing:

```python
%timeit -n10 regular_laplacian_numba_stencil(data)
%ptime -n4 regular_laplacian_numba_stencil(data)
# 2.99 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# For a 0.98X speedup across 4 threads

%timeit -n10 regular_laplacian_numba_stencil_fix_boundary(data)
%ptime -n4 regular_laplacian_numba_stencil_fix_boundary(data)
# 3.79 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# For a 0.77X speedup across 4 threads

%timeit -n10 regular_laplacian_numba_jit(data)
%ptime -n4 regular_laplacian_numba_jit(data)
# 7.13 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# For a 1.01X speedup across 4 threads
```
|
Have you tried `@njit(nogil=True)`? Also, I'd be curious how numba's `parallel=True` compares. |
This was probably the trick I was looking for! With `@njit(nogil=True)` on all the functions above, I get:

```
2.77 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.01 s
Total parallel time: 0.01 s
For a 1.33X speedup across 4 threads

5.14 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.02 s
Total parallel time: 0.01 s
For a 1.49X speedup across 4 threads

7.11 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.03 s
Total parallel time: 0.01 s
For a 3.35X speedup across 4 threads
```

Adding `parallel=True` just worked with the stencil functions. For `regular_laplacian_numba_jit`, I replaced the outer `j` loop with `prange`, and got the following timings:

```
1.97 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.01 s
Total parallel time: 0.01 s
For a 0.90X speedup across 4 threads

4.46 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.02 s
Total parallel time: 0.02 s
For a 0.85X speedup across 4 threads

4.01 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time: 0.01 s
Total parallel time: 0.01 s
For a 1.28X speedup across 4 threads
```

So `parallel=True` does seem to speed some things up, but at the expense of other layers of parallel scaling.

Any advice on where our parallelism would best be spent? We could either be using dask or numba to achieve single-machine parallel scaling. Using *both* would probably not be the right choice. Is there a best practice here? |
Whatever works :)
|
Great to confirm our theory that it was indeed the GIL. "3.35X speedup across 4 threads" sounds like the performance we were hoping for.

To note for the future, if you don't want to deal with …

```python
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor() as exc:
    for i in range(5000):
        fs = [exc.submit(regular_laplacian_numba_jit, data) for _ in range(4)]
        concurrent.futures.wait(fs)
```

and ran …

This is what we'd expect. I wasn't thinking "we should parallelize …"
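For the dask-vs-numba question, the two ways of spending the cores would look roughly like this (a sketch only; worker/thread counts are arbitrary and would need tuning for Casper):

```python
from dask.distributed import Client, LocalCluster

# Option A: dask owns all the threads. Kernels are @njit(nogil=True) but *not*
# parallel=True, so each task runs serially and dask's thread pool provides the scaling.
client = Client(LocalCluster(n_workers=1, threads_per_worker=36))

# Option B: numba owns the threads. Kernels use parallel=True/prange; keep each dask
# worker single-threaded so the two thread pools don't oversubscribe the node, and cap
# numba's pool per worker (e.g. via the NUMBA_NUM_THREADS environment variable).
# client = Client(LocalCluster(n_workers=4, threads_per_worker=1))
```
|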
I've tested this locally on macOS on intel, and Numba's parallelism seems ~50% faster than naive parallelism. However, you should test on a setup representative of what you'll actually run this on. There are many small details that affect this.
When it doesn't make a performance difference, I'd recommend fewer knobs to turn. So letting dask be the sole concurrency layer is simpler to reason about than layering a dask threadpool and a Numba threadpool. However, Numba may be faster, so the complexity is probably worth it.

First a sidenote: I believe the above code doesn't actually handle boundary conditions correctly. For example, if you change …

**Testing code:**

```python
import numpy as np
from numba import njit, prange, set_num_threads
import concurrent.futures
from time import perf_counter


@njit(nogil=True, cache=False)
def regular_laplacian_numba_jit_serial(a, out=None):
    # Does handle boundaries correctly
    ny, nx = a.shape
    last_j, last_i = ny - 1, nx - 1
    out = np.empty_like(a) if out is None else out
    for j in range(ny):
        for i in range(nx):
            out[j, i] = (
                -4 * a[j, i]
                + a[j - 1 if j != 0 else last_j, i]
                + a[j + 1 if j != last_j else 0, i]
                + a[j, i + 1 if i != last_i else 0]
                + a[j, i - 1 if i != 0 else last_i]
            )
    return out


@njit(nogil=True, parallel=True, cache=False)
def regular_laplacian_numba_jit_parallel(a, out=None):
    # Does handle boundaries correctly
    ny, nx = a.shape
    last_j, last_i = ny - 1, nx - 1
    out = np.empty_like(a) if out is None else out
    for j in prange(ny):
        for i in range(nx):
            out[j, i] = (
                -4 * a[j, i]
                + a[j - 1 if j != 0 else last_j, i]
                + a[j + 1 if j != last_j else 0, i]
                + a[j, i + 1 if i != last_i else 0]
                + a[j, i - 1 if i != 0 else last_i]
            )
    return out


shape = (1024, 1024)
data = np.random.rand(*shape)
out = np.empty_like(data)

# call everything once to trigger compilation
# verify results identical outside the boundary
np.testing.assert_allclose(
    regular_laplacian_numba_jit_serial(data, out=out),
    regular_laplacian_numba_jit_parallel(data, out=out),
)

n_repeats = 1000
concurrency = 4
n_ops = n_repeats * concurrency
data_copies = [data.copy() for _ in range(concurrency)]

# Naive parallelism, each thread operating on a separate copy
with concurrent.futures.ThreadPoolExecutor() as exc:
    start = perf_counter()
    for _ in range(n_repeats):
        fs = [
            exc.submit(regular_laplacian_numba_jit_serial, data, out=out)
            for data in data_copies
        ]
        concurrent.futures.wait(fs)
    elapsed = perf_counter() - start
    print(
        f"Serial in {concurrency} threads, {concurrency} data copies: {elapsed:.2f} sec, {n_ops / elapsed:.1f} ops/sec"
    )

# Naive parallelism, all threads on the same copy
with concurrent.futures.ThreadPoolExecutor() as exc:
    start = perf_counter()
    for _ in range(n_repeats):
        fs = [
            exc.submit(regular_laplacian_numba_jit_serial, data, out=out)
            for _ in range(concurrency)
        ]
        concurrent.futures.wait(fs)
    elapsed = perf_counter() - start
    print(
        f"Serial in {concurrency} threads, one data copy: {elapsed:.2f} sec, {n_ops / elapsed:.1f} ops/sec"
    )

# Numba parallelism in serial
set_num_threads(concurrency)
start = perf_counter()
for _ in range(n_ops):
    regular_laplacian_numba_jit_parallel(data, out=out)
elapsed = perf_counter() - start
print(f"Numba parallel: {elapsed:.2f} sec, {n_ops / elapsed:.1f} ops/sec")
```

The results I initially got were that whichever of these cases I ran first was the fastest. Commenting the other two out and running one at a time:

```
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 4.58 sec, 873.3 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, one data copy: 4.02 sec, 993.8 ops/sec
(env) gabe dev/dask-playground » python test.py
Numba parallel: 2.11 sec, 1895.6 ops/sec
```

vs all at once:

```
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 4.55 sec, 879.5 ops/sec
Serial in 4 threads, one data copy: 5.42 sec, 738.2 ops/sec
Numba parallel: 5.11 sec, 782.8 ops/sec
```

Notice "Numba parallel" takes 2.5x longer when run last versus when run alone. The order-dependence made me think caching/memory access patterns, so I tried running under jemalloc:

```
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 8.95 sec, 446.9 ops/sec
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, one data copy: 8.20 sec, 487.8 ops/sec
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Numba parallel: 12.48 sec, 320.6 ops/sec
```

```
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 8.62 sec, 464.0 ops/sec
Serial in 4 threads, one data copy: 8.15 sec, 490.6 ops/sec
Numba parallel: 12.85 sec, 311.3 ops/sec
```

On macOS, jemalloc removes the order-dependence effect. However, it also makes naive parallelism 2x slower and numba parallelism 6x slower. This seems to confirm that something related to memory is very important here. In particular, I guessed that switching memory allocators more affects memory allocation performance than memory access (though allocators certainly could have different strategies for optimizing cache performance, etc.).

Thinking about memory allocation, I noticed we were generating a new output array for each call with … Then I added an `out=` argument (as in the code above):

```
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 2.30 sec, 1740.4 ops/sec
Serial in 4 threads, one data copy: 2.19 sec, 1829.7 ops/sec
Numba parallel: 1.34 sec, 2977.4 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 2.56 sec, 1561.3 ops/sec
Serial in 4 threads, one data copy: 2.18 sec, 1835.8 ops/sec
Numba parallel: 1.37 sec, 2929.7 ops/sec
```

```
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 2.48 sec, 1613.6 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, one data copy: 2.26 sec, 1768.6 ops/sec
(env) gabe dev/dask-playground » python test.py
Numba parallel: 1.34 sec, 2980.7 ops/sec
```

In the end, this tells us pretty much what we already guessed from looking at the code: the operation itself is computationally very simple, so memory bandwidth is the limiting factor. Anything you can do to reduce new memory allocations and reuse existing arrays will have the biggest performance wins. |
This is exactly what I shared in #45 (comment) 😄
This is a really useful insight. It would be great to get some insight into the numba best practices for avoiding unnecessary memory allocation. For example, what's the right way to provide an …

In a similar vein, when using the stencil code path, I had to manually pad the array in order to deal with the boundary conditions:

```python
@njit
def regular_laplacian_numba_stencil_fix_boundary(a):
    padded = pad_array(a)
    b = _regular_laplacian_numba_stencil(padded)
    return b[1:-1, 1:-1]
```

This involves a copy of the whole array. Is there a better way? |
Aha! Sorry I missed that. I might suggest using conditionals (like `i - 1 if i != 0 else last_i` above) rather than the `%` modulo indexing, though.

I think the better way is exactly what both you and I did, with writing the loop explicitly in an `@njit` function. In general, this is a reversal of mindset from NumPy/Python. With NumPy, we are constantly altering the data to make it work with the functions we have available to us so we don't have to write for-loops. With Numba, we want to alter the data as little as possible, and write exactly the function we need to handle it as is. Any time you can use conditionals/logic to solve a problem instead of memory, it'll pretty much always be faster. |
Posting an update with some results from casper with 36 cores.

```python
import numpy as np
from numba import njit
from scipy.ndimage import uniform_filter


def numpy_laplacian(field):
    return (
        -4 * field
        + np.roll(field, -1, axis=-1)
        + np.roll(field, 1, axis=-1)
        + np.roll(field, -1, axis=-2)
        + np.roll(field, 1, axis=-2)
    )


@njit(nogil=True, cache=False)
def regular_laplacian_numba_jit_serial(a, out=None):
    # Does handle boundaries correctly
    ny, nx = a.shape
    last_j, last_i = ny - 1, nx - 1
    out = np.empty_like(a) if out is None else out
    for j in range(ny):
        for i in range(nx):
            out[j, i] = (
                -4 * a[j, i]
                + a[j - 1 if j != 0 else last_j, i]
                + a[j + 1 if j != last_j else 0, i]
                + a[j, i + 1 if i != last_i else 0]
                + a[j, i - 1 if i != 0 else last_i]
            )
    return out


shape = (2048, 2048)
data = np.random.rand(*shape)
regular_laplacian_numba_jit_serial(data)

print("numpy laplacian")
%timeit -n10 numpy_laplacian(data)
%ptime -n36 numpy_laplacian(data)
print("ndimage.uniform_filter")
%timeit -n10 uniform_filter(data)
%ptime -n36 uniform_filter(data)
print("numba laplacian")
%timeit -n10 regular_laplacian_numba_jit_serial(data)
%ptime -n36 regular_laplacian_numba_jit_serial(data)
```

Results:
So the numba version is about 5x faster and has slightly better parallel scaling than the numpy one, but not quite as good as ndimage. What we have to decide now is whether these performance improvements are worth refactoring the package to use numba. |
Thanks for the update, @rabernat!
To me, this performance improvement makes refactoring seem worthwhile. What do you think? |
The complicating factor is GPU support. If we go with numba, supporting GPUs becomes more complicated (but not impossible). Whereas now, we basically get cupy-based GPU support for free. It would be good to assess how difficult it would be to make the numba-cuda implementation of the kernels. I don't really know where to start with that. |
Did you happen to see the numba-cuda example at the end of https://examples.dask.org/applications/stencils-with-numba.html#GPU-Version?
|
Thanks for the reminder about that Matt, very helpful. Our current implementation requires only one function for each Laplacian. Those functions automatically select either cupy or numpy based on the input data, e.g. `gcm_filters/kernels.py` lines 108-109 (commit d5ea257).
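(Roughly this pattern — an illustrative sketch rather than the actual kernels.py code, and the helper name here is made up:)

```python
import numpy as np


def _get_array_module(field):
    # cupy arrays dispatch to cupy, everything else to numpy
    try:
        import cupy as cp

        if isinstance(field, cp.ndarray):
            return cp
    except ImportError:
        pass
    return np


def regular_laplacian(field):
    xp = _get_array_module(field)
    return (
        -4 * field
        + xp.roll(field, -1, axis=-1)
        + xp.roll(field, 1, axis=-1)
        + xp.roll(field, -1, axis=-2)
        + xp.roll(field, 1, axis=-2)
    )
```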
The Dask example brings to mind several questions.
I wonder if there are any examples out there of numba functions that are jit compiled to both CPU and GPU. |
Today yes, probably. GPUs and CPUs are different enough that different programming techniques are often needed. The counter-example is if you're doing something very simple, like something entirely vectorizable such that you can use `numba.vectorize`.
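(For instance — a toy sketch, unrelated to the Laplacian itself, which is a stencil rather than an elementwise op:)

```python
from numba import vectorize


# the same elementwise function compiled as a CPU ufunc...
@vectorize(["float64(float64, float64)"], target="cpu")
def scale_and_add(a, b):
    return 0.5 * a + b


# ...or as a CUDA ufunc, just by changing the target (requires a GPU + CUDA toolkit)
# @vectorize(["float64(float64, float64)"], target="cuda")
# def scale_and_add_gpu(a, b):
#     return 0.5 * a + b
```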
That's a great question, and I don't think there's a good general answer today, unfortunately. cc'ing @gmarkall from the numba/rapids team. I don't think that he'll have a general purpose answer for you, but we've spoken a bit about this before and I think that he'll appreciate seeing the context of where this comes up.

You can use if statements in numba.cuda, which is maybe how you would handle the boundary:

```python
if i < 5:
    out[i, j] = 0
```

Outside of things like numba.vectorize or guvectorize I don't think that this is really possible with the CUDA programming model. I would be happy to be wrong here though. |
What we need is more like this:

```python
out[j, i] = (
    -4 * a[j, i]
    + a[j - 1 if j != 0 else last_j, i]
    + a[j + 1 if j != last_j else 0, i]
    + a[j, i + 1 if i != last_i else 0]
    + a[j, i - 1 if i != 0 else last_i]
)
```

...where the i, j indexes are global. My concern is that the cuda kernel only has access to a local block (comparable to an MPI rank for the Fortran folks) and therefore can't trivially do wrap boundary conditions. I'm sure it's possible to do, just more complicated.
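If each thread can in fact index the whole global array, I imagine something like this untested sketch would do it (block/thread shapes picked arbitrarily):

```python
import numpy as np
from numba import cuda


@cuda.jit
def regular_laplacian_cuda(a, out):
    ny, nx = a.shape
    j, i = cuda.grid(2)  # global (row, column) index of this thread
    if j < ny and i < nx:
        out[j, i] = (
            -4.0 * a[j, i]
            + a[j - 1 if j != 0 else ny - 1, i]
            + a[j + 1 if j != ny - 1 else 0, i]
            + a[j, i + 1 if i != nx - 1 else 0]
            + a[j, i - 1 if i != 0 else nx - 1]
        )


data = np.random.rand(1024, 1024)
out = np.empty_like(data)
threads_per_block = (16, 16)
blocks = (
    (data.shape[0] + threads_per_block[0] - 1) // threads_per_block[0],
    (data.shape[1] + threads_per_block[1] - 1) // threads_per_block[1],
)
# numba transfers host arrays to the device automatically (with a performance warning);
# in practice we would pass device/cupy arrays instead
regular_laplacian_cuda[blocks, threads_per_block](data, out)
```
|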
My CUDA is rusty enough that I no longer trust myself to talk about its memory model, but I think that it would be worth verifying the assumption about lack of global access. |
@mrocklin is right -- each thread does have access to all of global memory. That being said, stencil kernels often benefit from the use of block-local "shared" memory (which numba exposes). This article covers how to use shared memory in a very similar use-case to the above (3-d finite difference with periodic boundary conditions): https://developer.nvidia.com/blog/finite-difference-methods-cuda-cc-part-1/ |
FWIW, I think we do not have to resolve this before submitting the JOSS paper. We can leave this issue open and continue to work on performance. |
The numba documentation just got expanded with a bunch of more detail on writing CUDA kernels: https://numba.readthedocs.io/en/latest/cuda/examples.html This should be very useful if we choose to go that route. |
JOSS asks for:

> If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

What should/can we do to document performance of the package?