Performance evaluation #45

Open
jbusecke opened this issue May 5, 2021 · 68 comments
Labels
documentation (Improvements or additions to documentation) · question (Further information is requested)

Comments

@jbusecke
Collaborator

jbusecke commented May 5, 2021

JOSS asks: "If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)"

What should/can we do to document the performance of the package?

@jbusecke added the documentation and question labels on May 5, 2021
@NoraLoose
Member

We started the discussion on performance evaluation here and here.

@jbusecke
Collaborator Author

jbusecke commented May 5, 2021

Apologies for missing those. Should I close this issue?

@NoraLoose
Member

Not at all! I just posted these, so we don't need to start the discussion from scratch. :)

@jbusecke
Collaborator Author

jbusecke commented May 5, 2021

Perfect!

@NoraLoose
Member

Sorry, I didn't mean to kill this discussion. I will try to summarize the aspects we had discussed for inclusion in a performance evaluation.

  • Compare performance of gcm-filters on CPUs and GPUs against Fortran implementations of spatial filters, e.g., the ones implemented by @iangrooms and @arthurBarthe. Here is a list compiled by @rabernat of how such a comparison could be done.
  • Look at strong and weak scaling, see @rabernat's comment here.
  • Compare computational speed when using increasingly complex Laplacians in gcm-filters, see for example the table here.
  • ...

I am sure this list is incomplete, so please feel free to add/delete items. Maybe @iangrooms also saved the computational section that was originally part of the JAMES paper.

Question: Do we want to include the performance evaluation

  1. in the JOSS paper,
  2. in the gcm-filters documentation,
  3. both (in some kind of combination)?

I think 3. would maybe work best.

@rabernat
Contributor

Thanks for the writeup. I think your list is perfect, and I favor option 3.

For me, the big question is the following. Most HPC system nodes have either ~24 CPU cores or 1 GPU. By using parallelism appropriately (e.g. parallelizing across non-spatial dimensions), can the 24-core CPU speed get close to the 1-GPU speed?

I find this problem fun and would like to work on it a bit.

@rabernat
Contributor

I know I volunteered to do this. I still hope to work on it. Will try to squeeze it in over the next few days.

@rabernat
Contributor

Ok, so I started the benchmarking exercise in a standalone repo:

https://github.com/ocean-eddy-cpt/gcm-filters-benchmarks

The repo is organized as follows:

One central goal is to figure out how many CPUs you need to use to replicate the GPU performance.

What I did so far

I ran the basic tutorial example on Casper with various numbers of CPUs and looked at strong and weak scaling. I bypassed questions about I/O performance by using random data. When x and y are contiguous (not chunked), the gcm-filters approach is embarrassingly parallel, so in theory it should scale perfectly (runtime inversely proportional to the number of CPUs used).
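For reference, here is a minimal sketch of that kind of benchmark setup (random data, x/y contiguous, reduced with a mean to force computation). It assumes the gcm-filters tutorial API (Filter, GridType.REGULAR_WITH_LAND, apply); the parameter values are illustrative, not the exact ones used in the benchmark repo.

import numpy as np
import xarray as xr
import dask.array as dsa
import gcm_filters

nt, ny, nx = 40, 1024, 1024
# random data, chunked only along time so x and y stay contiguous
data = dsa.random.random((nt, ny, nx), chunks=(10, ny, nx))
da = xr.DataArray(data, dims=["time", "y", "x"])
wet_mask = xr.DataArray(np.ones((ny, nx)), dims=["y", "x"])

filt = gcm_filters.Filter(
    filter_scale=4,
    dx_min=1,
    filter_shape=gcm_filters.FilterShape.GAUSSIAN,
    grid_type=gcm_filters.GridType.REGULAR_WITH_LAND,
    grid_vars={"wet_mask": wet_mask},
)
filtered = filt.apply(da, dims=["y", "x"])
filtered.mean().compute()  # forces the embarrassingly parallel filter tasks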

What I found was a bit different. Here is the weak scaling (from the plot_results.ipynb notebook). I used a single node of Casper, varying the number of CPUs from 1 to 36, with various combinations of workers and threads.

[figure: weak scaling of gcm-filters throughput (MB/s) vs. number of cores]

What the graph shows is decent scaling for low core counts, then a saturation at high core counts. As a reminder, this is weak scaling, which means that as we use more cores, we make the problem bigger. The fact that the throughput saturates at a relatively low rate (200 MB/s) indicates that scaling is breaking down for some reason. Instead, as we add more cores, we should keep processing data faster. It would be great to dig into this a bit deeper to understand why this is happening. Perhaps that is something that Dask experts @gjoseph92 and @jrbourbeau could help us debug?

A useful comparison point is simply taking the mean of the same array, which exposes Dask's own scaling properties:
[figure: weak scaling of a plain array mean vs. number of cores]

Although it is somewhat sub-linear, it is better than gcm-filters.

Next steps

Compare to GPU performance

We should compare these single-node CPU results to the GPU results. Unfortunately I couldn't do this because I broke my cupy installation in my Casper python environment 😢. Ideally someone with a working cupy could rerun the basic benchmark with the two different types of Casper GPUs. Note that, with a single GPU, we can't really look at scaling proper, since we can't use multiple GPUs. But we can still locate the GPU performance on the same axis (MB/s) as the CPU performance.

If we wanted to use a multi-GPU cluster, we could get into dask-cuda, but IMO this is overkill for the JOSS paper.

Look at distributed scaling on multiple nodes

It's a mystery why gcm-filters is not scaling linearly up to 36 cores on a single node. We should debug this and try to understand why. Perhaps it's related to memory usage? (There is no disk i/o in the example, so that can't be an issue.)

We could also look at scaling across multiple different nodes. Maybe if we keep the core count low on each node (e.g. use only 4 cores per node), but spread over several nodes, we could get better scaling. We would need to figure out the best way to launch a multi-node Dask cluster on Casper or Cheyenne (probably Cheyenne). Pinging @andersy005, software engineer at CISL, for help with this.

Vary some of the array parameters

I used an array with chunks of shape (10, 1024, 1024) and dtype float64. Some other things to try (a sketch of such a parameter sweep follows this list):

  • Use single precision
  • Use different chunk shape in the time dimension (e.g. 5 or 20 instead of 10)
  • Use different chunk shape in the x, y dimensions (e.g. 256 or 2048 instead of 1024)
  • Try different kernels. I just used REGULAR_WITH_LAND. Obviously other kernels are slower in absolute terms, but I hope that they have broadly similar scaling behavior, which is the focus here. This should be tested.
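A hedged sketch of how such a sweep might be scripted. The run_case helper and the MB/s metric are illustrative choices, not the benchmark repo's actual harness; the filter parameters follow the tutorial-style setup above.

from time import perf_counter
import itertools
import numpy as np
import xarray as xr
import dask.array as dsa
import gcm_filters

def run_case(dtype, tchunk, xysize, nt=40):
    # each chunk spans the full x/y domain, so x and y stay contiguous
    wet_mask = xr.DataArray(np.ones((xysize, xysize)), dims=["y", "x"])
    filt = gcm_filters.Filter(
        filter_scale=4,
        dx_min=1,
        filter_shape=gcm_filters.FilterShape.GAUSSIAN,
        grid_type=gcm_filters.GridType.REGULAR_WITH_LAND,
        grid_vars={"wet_mask": wet_mask},
    )
    data = dsa.random.random((nt, xysize, xysize), chunks=(tchunk, xysize, xysize)).astype(dtype)
    da = xr.DataArray(data, dims=["time", "y", "x"])
    start = perf_counter()
    filt.apply(da, dims=["y", "x"]).mean().compute()
    elapsed = perf_counter() - start
    print(f"{dtype}, chunks=({tchunk}, {xysize}, {xysize}): {data.nbytes / 1e6 / elapsed:.1f} MB/s")

for dtype, tchunk, xysize in itertools.product(
    ["float64", "float32"], [5, 10, 20], [256, 1024, 2048]
):
    run_case(dtype, tchunk, xysize)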

My hope is that, by getting this started and giving a structure to the benchmark code, this is easy to hand off to other folks. Please feel free to make PRs to the gcm-filters-benchmarks repo, updating the notebook and / or adding more CSV results to the data/ directory.

We are flying to Italy tonight, so I will be offline for a day or two while in transit. Will check back in later in the week.

Thanks for your patience, as I have been very slow with my task.

@gjoseph92

Thanks @rabernat! I think @jrbourbeau will look into this soon. Two things that would help us investigate this faster:

  1. Could you collect performance reports for some of the runs? (Shameless plug: you can now use coiled.performance_report to capture and host these for you, which saves the annoyance of uploading to a gist and getting the githack link.)
  2. Could you try letting threads_per_worker go up to 36? I'm curious if this could be a case of Co-assign neighboring tasks to neighboring workers dask/distributed#4892.
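For concreteness, the single-worker/36-thread configuration being asked about can be set up like this (a minimal sketch; the benchmark notebook may construct its cluster differently):

from dask.distributed import Client, LocalCluster

# one worker process, with all 36 cores as threads in that worker
cluster = LocalCluster(n_workers=1, threads_per_worker=36)
client = Client(cluster)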

@mrocklin

Checking in here. It would be useful to get a performance report of one anecdotal run. That might help us to better understand the cause of the sub-linear scaling. I would strongly recommend doing this before investing time in running larger parameter sweeps.

@rabernat
Contributor

Sorry for the delayed response. We just transitioned to Italy and I have fallen behind on all notifications.

@NoraLoose - since this is likely to take a while, do you think we should move forward with the JOSS submission without the detailed performance assessment?

@NoraLoose
Member

@rabernat I think we are not in a big rush. We still need to work on other things (like restructuring the documentation). So there may be time to get the performance improvement / assessment done by the time everything else is ready.

@rabernat
Contributor

rabernat commented Jun 30, 2021

I am having trouble with performance_report (both the coiled and dask.distributed versions).

from dask.distributed import performance_report  # same error with the coiled version

with performance_report("unfiltered_mean_36_cores_4_workers.html"):
    unfiltered.data.mean().compute()

I am getting this error with both. I am using dask version 2021.05.01

distributed.core - ERROR - Exception while handling op performance_report
Traceback (most recent call last):
  File "/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/core.py", line 498, in handle_comm
    result = await result
  File "/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/scheduler.py", line 6663, in performance_report
    sysmon = SystemMonitor(self, last_count=last_count, sizing_mode="stretch_both")
  File "/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/dashboard/components/shared.py", line 466, in __init__
    names = worker.monitor.range_query(start=last_count)
  File "/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/system_monitor.py", line 123, in range_query
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/system_monitor.py", line 123, in <dictcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/system_monitor.py", line 123, in <listcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
IndexError: deque index out of range
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-12-f054ab3a58e9> in <module>
      1 with performance_report("unfiltered_mean_36_cores_4_workers.html"):
----> 2     unfiltered.data.mean().compute()

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/client.py in __exit__(self, typ, value, traceback)
   4726         except Exception:
   4727             code = ""
-> 4728         get_client().sync(self.__aexit__, type, value, traceback, code=code)
   4729 
   4730 

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    851         else:
    852             return sync(
--> 853                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    854             )
    855 

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    352     if error[0]:
    353         typ, exc, tb = error[0]
--> 354         raise exc.with_traceback(tb)
    355     else:
    356         return result[0]

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/utils.py in f()
    335             if callback_timeout is not None:
    336                 future = asyncio.wait_for(future, callback_timeout)
--> 337             result[0] = yield future
    338         except Exception as exc:
    339             error[0] = sys.exc_info()

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/client.py in __aexit__(self, typ, value, traceback, code)
   4712                 code = ""
   4713         data = await get_client().scheduler.performance_report(
-> 4714             start=self.start, last_count=self.last_count, code=code
   4715         )
   4716         with open(self.filename, "w") as f:

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/core.py in send_recv_from_rpc(**kwargs)
    862             name, comm.name = comm.name, "ConnectionPool." + key
    863             try:
--> 864                 result = await send_recv(comm=comm, op=key, **kwargs)
    865             finally:
    866                 self.pool.reuse(self.addr, comm)

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/core.py in send_recv(comm, reply, serializers, deserializers, **kwargs)
    661         if comm.deserialize:
    662             typ, exc, tb = clean_exception(**response)
--> 663             raise exc.with_traceback(tb)
    664         else:
    665             raise Exception(response["text"])

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/core.py in handle_comm()
    496                             result = asyncio.ensure_future(result)
    497                             self._ongoing_coroutines.add(result)
--> 498                             result = await result
    499                     except (CommClosedError, CancelledError) as e:
    500                         if self.status == Status.running:

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/scheduler.py in performance_report()
   6661         from distributed.dashboard.components.shared import SystemMonitor
   6662 
-> 6663         sysmon = SystemMonitor(self, last_count=last_count, sizing_mode="stretch_both")
   6664         sysmon.update()
   6665 

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/dashboard/components/shared.py in __init__()
    464         self.last_count = 0
    465         if last_count is not None:
--> 466             names = worker.monitor.range_query(start=last_count)
    467             self.last_count = last_count
    468         self.source = ColumnDataSource({name: [] for name in names})

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/system_monitor.py in range_query()
    121         seq = [i for i in range(istart, len(self.cpu))]
    122 
--> 123         d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
    124         return d

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/system_monitor.py in <dictcomp>()
    121         seq = [i for i in range(istart, len(self.cpu))]
    122 
--> 123         d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
    124         return d

/glade/work/rpa/my_npl_clone/lib/python3.7/site-packages/distributed/system_monitor.py in <listcomp>()
    121         seq = [i for i in range(istart, len(self.cpu))]
    122 
--> 123         d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
    124         return d

IndexError: deque index out of range

@gjoseph92

Can you try upgrading distributed to the latest version?

@rabernat
Contributor

I originally had to stick with that version because of a weird conflict with pynvml. However, after uninstalling pynvml, I was able to update to 2021.06.02 and the problem resolved.

Performance report (36 cores spread over 4 workers): https://rawcdn.githack.com/ocean-eddy-cpt/gcm-filters-benchmarks/8b03462e777a052cb6f00f7f71a3f4320c4f2d8a/performance_reports/filtered_mean_36_cores_4_workers.html

@rabernat
Contributor

In contrast, here is the weak scaling report for a smaller cluster (9 cores) and proportionally smaller problem: https://rawcdn.githack.com/ocean-eddy-cpt/gcm-filters-benchmarks/1251584c735dff55ccd1633a3f07b8ee8b8daff3/performance_reports/filtered_mean_9_cores_1_workers.html

Even though the chunk size is the same, the filter_func tasks are running much faster (8 s vs. 14 s)! What could be the cause of that? I am puzzled. But this fully explains the failure to scale out linearly.

@rabernat
Contributor

Just for due diligence, here is the report for the smaller cluster with the bigger array: https://rawcdn.githack.com/ocean-eddy-cpt/gcm-filters-benchmarks/6ebd46258c97e48f8ad5dad1dba5f32022b9c5c1/performance_reports/filtered_mean_9_cores_1_workers_bigger.html

The tasks are still running faster (8s).

If this were GIL-related, wouldn't it run faster using processes rather than threads? But that's not what we saw above.

@gjoseph92

I don't have an immediate explanation for this. It's also odd to me that the worker profiles look almost exactly the same, just the times are larger with more processes.

Can you try 1 worker, 36 processes for comparison? Also 36 processes with 1 thread each? Not sure it'll tell us that much, but I'm just curious.

@rabernat
Contributor

And just for fun, here is a report with 4 workers, 1 thread each: https://rawcdn.githack.com/ocean-eddy-cpt/gcm-filters-benchmarks/2dfc4118ad20b7c903d4ab05672645b068737301/performance_reports/filtered_mean_4_cores_4_workers.html

filter_func clocks in at a reliable 6.7 s.

I suppose another possibility is that the system is not actually giving me the 36 cores I am asking for. I will investigate this.

@gjoseph92

Yeah, these reports certainly look like the problem is significant, consistent resource contention (versus, say, bad scheduling). The question is, what resource? If it's possible you don't have the cores you think, that would line up very well with what we're seeing.

I do notice that the amount of available memory changes a lot between reports. Not that there's memory pressure, but just that it might indicate something is inconsistent about the system.

@andersy005

andersy005 commented Jun 30, 2021

I suppose another possibility is that the system is not actually giving me the 36 cores I am asking for. I will investigate this.

@rabernat, I'm curious... at the beginning of the notebook you mention "Run on Casper full node (36 cores)", is this running on a compute node? If so, how are these resources being requested i.e. the PBS job submission?

@andersy005

andersy005 commented Jun 30, 2021

is this running on a compute node? If so, how are these resources being requested i.e. the PBS job submission?

Nevermind :)... Found your most recent job:

Job ID    User            Queue      Nodes NCPUs NGPUs Finish Time  Req Mem Used Mem(GB)  Avg CPU (%)  Elapsed (h) Job Name
509908    rpa             htc            1    36     0 06-30T12:09    360.0         37.5          8.4         4.02 jhcrbatch

If it's possible you don't have the cores you think, that would line up very well with what we're seeing.

Provided I am looking at the right job, the PBS report shows that your job has the requested 36 CPUs. However, it appears that these are underutilized (see the Avg CPU column above). Not sure whether this relates to the performance issues you are noticing in your benchmarks...

@rabernat
Contributor

rabernat commented Jul 1, 2021

Thanks for checking on this Anderson!

I launched the job via https://jupyterhub.hpc.ucar.edu/ on the "Casper PBS Batch" option.

...it appears that these are underutilized (see the Avg CPU column above).

Since this is an interactive job (JupyterLab), I wouldn't read too much into the PBS job stats. Most of my walltime was spent typing code. More useful is the CPU usage pane from the Dask performance report, which shows about 20% efficiency. We should be trying to hit 100% for compute-bound code such as this.

[figure: Dask performance report CPU-usage pane showing ~20% utilization]

@rabernat
Contributor

rabernat commented Jul 21, 2021

I have been working on this today. I stepped back a bit and stripped away some of the layers of the problem. I have a new notebook - https://github.com/ocean-eddy-cpt/gcm-filters-benchmarks/blob/master/benchmark_kernels.ipynb - which just focuses on benchmarking the kernels: their raw speed (with %timeit) and their single-machine scaling (with %ptime).

Some relevant results:

  • Filter speed is very nearly equal to kernel speed * n_steps_total. This means we can focus on the kernels themselves.
  • Kernel speed is actually not bad at all compared to the baseline scipy.ndimage.uniform_filter suggested by @mrocklin. Our RegularLaplacian takes about 14 ms on a 1024x1024 array while scipy.ndimage.uniform_filter takes 12 ms. The other kernels are slower due to their increased computational complexity, but not unreasonably so.
    [figure: kernel timing comparison]
  • However, the poor parallel scaling problem persists at the kernel level. scipy.ndimage.uniform_filter with a pre-allocated output array scales nearly perfectly on my 6-core laptop. Our kernels are stuck very close to a speedup factor of 1. This supports @gjoseph92's hypothesis that memory management issues are hurting our scalability.
    [figure: parallel speedup (%ptime) comparison]

I'm going to push forward a bit with numba, using this new, simpler, benchmarking framework.

@rabernat
Contributor

rabernat commented Jul 21, 2021

I did some simple, self-contained numba experiments with reimplementing the regular_laplacian kernel and am getting some puzzling results. This cell can easily be copy-pasted by anyone looking to give it a try.

import numpy as np
from numba import stencil, njit
%load_ext ptime

@stencil
def regular_laplacian_numba_stencil(a):
    # Does not handle boundaries correctly
    return -4 * a[0, 0] + a[0, 1] + a[0, -1] + a[1, 0] + a[-1, 0]

# equivalent to @jit(nopython=True)
@njit
def regular_laplacian_numba_jit(a):
    # Does handle boundaries correctly
    ny, nx = a.shape
    out = np.zeros_like(a)
    for j in range(ny):
        for i in range(nx):
            out[j, i] = -4 * a[j, i] + a[j-1, i] + a[j+1, i] + a[j, i+1] + a[j, i-1]
    return out

shape = (1024, 1024)
data = np.random.rand(*shape)

# call everything once to trigger compilation
# verify results identical outside the boundary
np.testing.assert_allclose(
    regular_laplacian_numba_stencil(data)[1:-1, 1:-1],
    regular_laplacian_numba_jit(data)[1:-1, 1:-1]
)

%timeit -n10 regular_laplacian_numba_stencil(data)
%ptime -n4 regular_laplacian_numba_stencil(data)

%timeit -n10 regular_laplacian_numba_jit(data)
%ptime -n4 regular_laplacian_numba_jit(data)

stencil results

255 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/ptime/__init__.py", line 14, in _exec
    exec(stmt, glob, local)
  File "<string>", line 1, in <module>
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/numba/stencils/stencil.py", line 762, in __call__
    (real_ret, typemap, calltypes) = self.get_return_type(array_types)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/numba/stencils/stencil.py", line 337, in get_return_type
    typemap, return_type, calltypes, _ = typed_passes.type_inference_stage(
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/numba/core/typed_passes.py", line 67, in type_inference_stage
    with typingctx.callstack.register(infer, interp.func_id, args):
  File "/srv/conda/envs/notebook/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/numba/core/typing/context.py", line 66, in register
    raise RuntimeError(msg)
RuntimeError: compiler re-entrant to the same function signature
Total serial time:   1.18 s
Total parallel time: 0.93 s
For a 1.26X speedup across 4 threads

(not sure what is causing the error or whether it matters)

njit results

2.53 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   0.01 s
Total parallel time: 0.01 s
For a 0.87X speedup across 4 threads

Conclusions

  • The @numba.stencil approach (recommended in Performance evaluation #45 (comment)) is about 20x slower than our current implementation (and scipy.ndimage.uniform_filter). It has no real parallel scalability.
  • The @numba.njit approach (which looks basically like the Fortran code we are trying to replicate! 😆 ) is very fast, about 5x faster than scipy.ndimage.uniform_filter. 🚀 However, it has a speedup factor < 1 as measured by %ptime. 😬 This was surprising.

I am truly puzzled why the numba functions don't parallelize well. I thought the whole point of @njit (or equivalently @jit(nopython=True)) was to release the GIL and enable these sorts of speedups. Would greatly appreciate if @gjoseph92 or @mrocklin could follow up, now that we have tried your suggestion, and help us understand the results.

It's very likely I have made a n00b mistake here, so thanks for your patience...

@NoraLoose
Member

NoraLoose commented Jul 21, 2021

Have you tried to JIT compile your regular_laplacian_numba_stencil function with numba?

@numba.njit
def regular_laplacian_numba_stencil_jit(a):
    return regular_laplacian_numba_stencil(a)

In this example, this additional step gives a factor-of-1000 speedup.

@rabernat
Contributor

Good catch Nora! Definitely did not read the documentation clearly enough 🤦 . Redefining the stencil function as

@stencil
def _regular_laplacian_numba_stencil(a):
    # Does not handle boundaries correctly
    return -4 * a[0, 0] + a[0, 1] + a[0, -1] + a[1, 0] + a[-1, 0]

@njit
def regular_laplacian_numba_stencil(a):
    return _regular_laplacian_numba_stencil(a)

Definitely brings the stencil performance on par with the standalone njit performance.

stencil + njit

2.38 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   0.01 s
Total parallel time: 0.01 s
For a 0.95X speedup across 4 threads

njit standalone

1.59 ms ± 35 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   0.01 s
Total parallel time: 0.01 s
For a 0.86X speedup across 4 threads

However, the core problem of negative parallel scaling remains!

@mrocklin

cc @seibert in case he or Siu want to get engaged

@mrocklin

(no expectation though)

@rabernat
Contributor

Final update for the day: boundary conditions. It turns out that my njit standalone version did not handle boundaries properly because numba performs no range checking on indexing.

Here is some code that properly handles the wrap boundary conditions for both stencil and njit standalone.

@stencil
def _regular_laplacian_numba_stencil(a):
    # Does not handle boundaries correctly
    return -4 * a[0, 0] + a[0, 1] + a[0, -1] + a[1, 0] + a[-1, 0]

@njit
def regular_laplacian_numba_stencil(a):
    return _regular_laplacian_numba_stencil(a)

@njit
def pad_array(a):
    ny, nx = a.shape
    padded = np.empty((ny+2, nx+2), a.dtype)
    padded[1:-1, 1:-1] = a
    padded[0, 1:-1] = a[-1, :]
    padded[-1, 1:-1] = a[0, :]
    padded[1:-1, 0] = a[:, -1,]
    padded[1:-1, -1] = a[:, 0]
    return padded

@njit
def regular_laplacian_numba_stencil_fix_boundary(a):
    padded = pad_array(a)
    b = _regular_laplacian_numba_stencil(padded)
    return b[1:-1, 1:-1]

@njit
def regular_laplacian_numba_jit(a):
    # Does handle boundaries correctly
    ny, nx = a.shape
    out = np.zeros_like(a)
    for j in range(ny):
        for i in range(nx):
            # make sure boundaries are handled right
            ileft = (i - 1) % nx
            iright = (i + 1) % nx
            jleft = (j - 1) % ny
            jright = (j + 1) % ny
            out[j, i] = -4 * a[j, i] + a[jleft, i] + a[jright, i] + a[j, ileft] + a[j, iright]
    return out

Timing

%timeit -n10 regular_laplacian_numba_stencil(data)
%ptime -n4 regular_laplacian_numba_stencil(data)
# 2.99 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# For a 0.98X speedup across 4 threads

%timeit -n10 regular_laplacian_numba_stencil_fix_boundary(data)
%ptime -n4 regular_laplacian_numba_stencil_fix_boundary(data)
# 3.79 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# For a 0.77X speedup across 4 threads

%timeit -n10 regular_laplacian_numba_jit(data)
%ptime -n4 regular_laplacian_numba_jit(data)
# 7.13 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# For a 1.01X speedup across 4 threads

@gjoseph92

Have you tried @njit(nogil=True)?

Also, I'd be curious how @njit(parallel=True) and @njit(parallel=False) compare under plain serial %timeit.
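A minimal sketch of that comparison, assuming a modulo-wrap Laplacian like the ones above. numba's prange falls back to an ordinary range when parallel=False, so one kernel body covers both cases; this is not the package's implementation, just a timing harness.

import numpy as np
from numba import njit, prange

def make_laplacian(parallel):
    @njit(nogil=True, parallel=parallel)
    def lap(a):
        ny, nx = a.shape
        out = np.empty_like(a)
        for j in prange(ny):  # behaves like a plain range when parallel=False
            for i in range(nx):
                out[j, i] = (
                    -4 * a[j, i]
                    + a[(j - 1) % ny, i] + a[(j + 1) % ny, i]
                    + a[j, (i - 1) % nx] + a[j, (i + 1) % nx]
                )
        return out
    return lap

lap_serial = make_laplacian(parallel=False)
lap_parallel = make_laplacian(parallel=True)

data = np.random.rand(1024, 1024)
lap_serial(data); lap_parallel(data)  # trigger compilation before timing

# In IPython:
# %timeit -n10 lap_serial(data)
# %timeit -n10 lap_parallel(data)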

@rabernat
Contributor

Have you tried @njit(nogil=True)?

This was probably the trick I was looking for! With @njit(nogil=True) on all the functions above, I get:

2.77 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   0.01 s
Total parallel time: 0.01 s
For a 1.33X speedup across 4 threads
5.14 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   0.02 s
Total parallel time: 0.01 s
For a 1.49X speedup across 4 threads
7.11 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   0.03 s
Total parallel time: 0.01 s
For a 3.35X speedup across 4 threads

Adding parallel=True just worked with the stencil functions. For regular_laplacian_numba_jit, I replaced the outer j loop with prange, and got the following timings

1.97 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   0.01 s
Total parallel time: 0.01 s
For a 0.90X speedup across 4 threads
4.46 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   0.02 s
Total parallel time: 0.02 s
For a 0.85X speedup across 4 threads
4.01 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   0.01 s
Total parallel time: 0.01 s
For a 1.28X speedup across 4 threads

So parallel does seem to speed some things up, but at the expense of other layers of parallel scaling.

Any advice on where our parallelism would best be spent? We could either be using dask or numba to achieve single-machine parallel scaling. Using both would probably not be the right choice. Is there a best practice here?

@mrocklin

mrocklin commented Jul 26, 2021 via email

@gjoseph92

This was probably the trick I was looking for!

Great to confirm our theory that it was indeed the GIL. "3.35X speedup across 4 threads" sounds like the performance we were hoping for.

To note for the future: if you don't want to deal with perf to run giltracer, I like running py-spy top -- python my_script.py and watching the GIL vs. active percentages as a quick way to get a sense of the impact the GIL is having. For example, I modified your script from #45 (comment) and replaced the %timeits with

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor() as exc:
    for i in range(5000):
        fs = [exc.submit(regular_laplacian_numba_jit, data) for _ in range(4)]
        concurrent.futures.wait(fs)

and ran sudo py-spy top -- python test.py. Without nogil=True I'd see something like GIL: 100.00%, Active: 107.00%, Threads: 5. With nogil=True I'd see GIL: 23.00%, Active: 314.00%, Threads: 5. The GIL: 100.00% gives a strong indication that the GIL is the problem (and indeed, it's still probably the thing keeping us from a 4x speedup across 4 threads), so knowing that you might invest the time to do more detailed profiling with giltracer.

So parallel does seem to speed some things up, but at the expense of other layers of parallel scaling.

This is what we'd expect. I wasn't thinking "we should parallelize parallel=True"; I wanted to look specifically at the performance of Numba's lower-level parallelism vs the naive parallelism of running the same serial Numba operation in 4 threads at once. I was curious if Numba was able to be smarter about parallelism (particularly memory access) in some way.

@gjoseph92

We could either be using dask or numba to achieve single-machine parallel scaling.

I've tested this locally on macOS on intel, and Numba's parallelism seems ~50% faster than naive parallelism. However, you should test on a setup representative of what you'll actually run this on. There are many small details that affect this.

Is there a best practice here?

When it doesn't make a performance difference, I'd recommend fewer knobs to turn. So letting dask be the sole concurrency layer is simpler to reason about than layering a dask threadpool and a Numba threadpool.

However, Numba may be faster, so the complexity is probably worth it.


First a sidenote: I believe the above code doesn't actually handle boundary conditions correctly. For example, if you change shape to 1025, it will segfault on my machine. At the last row, a[j + 1, i] will be an out-of-bounds index. I have a version below that correctly wraps around at the edges. What you want is basically a[(j - 1) % ny, i], but using a conditional is a bit faster in my tests, since % is a relatively expensive CPU operation.

Testing code
import numpy as np
from numba import njit, prange, set_num_threads
import concurrent.futures

from time import perf_counter


@njit(nogil=True, cache=False)
def regular_laplacian_numba_jit_serial(a, out=None):
    # Does handle boundaries correctly
    ny, nx = a.shape
    last_j, last_i = ny - 1, nx - 1
    out = np.empty_like(a) if out is None else out
    for j in range(ny):
        for i in range(nx):
            out[j, i] = (
                -4 * a[j, i]
                + a[j - 1 if j != 0 else last_j, i]
                + a[j + 1 if j != last_j else 0, i]
                + a[j, i + 1 if i != last_i else 0]
                + a[j, i - 1 if i != 0 else last_i]
            )
    return out


@njit(nogil=True, parallel=True, cache=False)
def regular_laplacian_numba_jit_parallel(a, out=None):
    # Does handle boundaries correctly
    ny, nx = a.shape
    last_j, last_i = ny - 1, nx - 1
    out = np.empty_like(a) if out is None else out
    for j in prange(ny):
        for i in range(nx):
            out[j, i] = (
                -4 * a[j, i]
                + a[j - 1 if j != 0 else last_j, i]
                + a[j + 1 if j != last_j else 0, i]
                + a[j, i + 1 if i != last_i else 0]
                + a[j, i - 1 if i != 0 else last_i]
            )
    return out


shape = (1024, 1024)
data = np.random.rand(*shape)
out = np.empty_like(data)

# call everything once to trigger compilation
# verify results identical outside the boundary
np.testing.assert_allclose(
    regular_laplacian_numba_jit_serial(data, out=out),
    regular_laplacian_numba_jit_parallel(data, out=out),
)

n_repeats = 1000
concurrency = 4
n_ops = n_repeats * concurrency
data_copies = [data.copy() for _ in range(concurrency)]

# Naive parallelism, each thread operating on a separate copy
with concurrent.futures.ThreadPoolExecutor() as exc:
    start = perf_counter()
    for _ in range(n_repeats):
        fs = [
            exc.submit(regular_laplacian_numba_jit_serial, data, out=out)
            for data in data_copies
        ]
        concurrent.futures.wait(fs)
    elapsed = perf_counter() - start
    print(
        f"Serial in {concurrency} threads, {concurrency} data copies: {elapsed:.2f} sec, {n_ops / elapsed:.1f} ops/sec"
    )

# Naive parallelism, all threads on the same copy
with concurrent.futures.ThreadPoolExecutor() as exc:
    start = perf_counter()
    for _ in range(n_repeats):
        fs = [
            exc.submit(regular_laplacian_numba_jit_serial, data, out=out)
            for _ in range(concurrency)
        ]
        concurrent.futures.wait(fs)
    elapsed = perf_counter() - start
    print(
        f"Serial in {concurrency} threads, one data copy: {elapsed:.2f} sec, {n_ops / elapsed:.1f} ops/sec"
    )

# Numba parallelism in serial
set_num_threads(concurrency)
start = perf_counter()
for _ in range(n_ops):
    regular_laplacian_numba_jit_parallel(data, out=out)
elapsed = perf_counter() - start
print(f"Numba parallel: {elapsed:.2f} sec, {n_ops / elapsed:.1f} ops/sec")

The results I initially got were that whichever of these cases I ran first was the fastest. Commenting the other two out and running one at a time:

(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 4.58 sec, 873.3 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, one data copy: 4.02 sec, 993.8 ops/sec
(env) gabe dev/dask-playground » python test.py
Numba parallel: 2.11 sec, 1895.6 ops/sec

vs all at once:

(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 4.55 sec, 879.5 ops/sec
Serial in 4 threads, one data copy: 5.42 sec, 738.2 ops/sec
Numba parallel: 5.11 sec, 782.8 ops/sec

Notice "Numba parallel" takes 2.5x longer when run last versus when run alone.

The order-dependence made me think caching/memory access patterns, so I tried running under jemalloc:

(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 8.95 sec, 446.9 ops/sec
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, one data copy: 8.20 sec, 487.8 ops/sec
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Numba parallel: 12.48 sec, 320.6 ops/sec
(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 8.62 sec, 464.0 ops/sec
Serial in 4 threads, one data copy: 8.15 sec, 490.6 ops/sec
Numba parallel: 12.85 sec, 311.3 ops/sec

On macOS, jemalloc removes the order-dependence effect. However, it also makes naive parallelism 2x slower and numba parallelism 6x slower. This seems to confirm that something related to memory is very important here. In particular, I guessed that switching memory allocators affects memory allocation performance more than memory access (though allocators certainly could have different strategies for optimizing cache performance, etc.).

Thinking about memory allocation, I noticed we were generating a new output array for each call with np.zeros_like(a). Just switching to np.empty_like boosted performance 25%-50%.

Then I added an out= argument, and re-used the same preallocated output array for every call. This gets naive parallelism to 2x our starting point, but interestingly didn't help Numba parallelism much compared to np.zeros_like. Most importantly, it mostly eliminated the order effect and the jemalloc slowdown.

(env) gabe dev/dask-playground » DYLD_INSERT_LIBRARIES=$(brew --prefix jemalloc)/lib/libjemalloc.dylib python test.py
Serial in 4 threads, 4 data copies: 2.30 sec, 1740.4 ops/sec
Serial in 4 threads, one data copy: 2.19 sec, 1829.7 ops/sec
Numba parallel: 1.34 sec, 2977.4 ops/sec
(env) gabe dev/dask-playground » python test.py                                                                      
Serial in 4 threads, 4 data copies: 2.56 sec, 1561.3 ops/sec
Serial in 4 threads, one data copy: 2.18 sec, 1835.8 ops/sec
Numba parallel: 1.37 sec, 2929.7 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, 4 data copies: 2.48 sec, 1613.6 ops/sec
(env) gabe dev/dask-playground » python test.py
Serial in 4 threads, one data copy: 2.26 sec, 1768.6 ops/sec
(env) gabe dev/dask-playground » python test.py
Numba parallel: 1.34 sec, 2980.7 ops/sec

In the end, this tells us pretty much what we already guessed from looking at the code: the operation itself is computationally very simple, so memory bandwidth is the limiting factor. Anything you can do to reduce new memory allocations and reuse existing arrays will have the biggest performance wins.

@rabernat
Contributor

I have a version below that correctly wraps around at the edges.

This is exactly what I shared in #45 (comment) 😄

the operation itself is computationally very simple, so memory bandwidth is the limiting factor.

This is a really useful insight. It would be great to get some insight into the numba best practices for avoiding unnecessary memory allocation. For example, what's the right way to provide an out= option for a numba function?

In a similar vein, when using the stencil code path, I had to manually pad the array in order to deal with the boundary conditions:

@njit
def regular_laplacian_numba_stencil_fix_boundary(a):
    padded = pad_array(a)
    b = _regular_laplacian_numba_stencil(padded)
    return b[1:-1, 1:-1]

This involves a copy of the whole array. Is there a better way?

@gjoseph92

This is exactly what I shared in #45 (comment) 😄

Aha! Sorry I missed that. I might suggest using conditionals like a[j - 1 if j != 0 else last_j, i] instead of (j - 1) % ny; IIRC it was a tiny bit faster. You should test though, that may also not be true.

I had to manually pad the array in order to deal with the boundary conditions...This involves a copy of the whole array. Is there a better way?

I think the better way is exactly what both you and I did: writing the loop explicitly in regular_laplacian_numba_jit. Stencil seems like a handy but limited tool. If it doesn't meet your needs, you'd be better off writing the code yourself rather than trying to adjust the input data to match how stencil thinks about it.

In general, this is a reversal of mindset from NumPy/Python. With NumPy, we are constantly altering the data to make it work with the functions we have available to us so we don't have to write for-loops. With Numba, we want to alter the data as little as possible, and write exactly the function we need to handle it as is. Any time you can use conditionals/logic to solve a problem instead of memory, it'll pretty much always be faster.

@rabernat
Contributor

rabernat commented Aug 11, 2021

Posting an update with some results from casper with 36 cores.

import numpy as np
from numba import njit
from scipy.ndimage import uniform_filter

def numpy_laplacian(field):
    return (
        -4 * field
        + np.roll(field, -1, axis=-1)
        + np.roll(field, 1, axis=-1)
        + np.roll(field, -1, axis=-2)
        + np.roll(field, 1, axis=-2)
    )
    
@njit(nogil=True, cache=False)
def regular_laplacian_numba_jit_serial(a, out=None):
    # Does handle boundaries correctly
    ny, nx = a.shape
    last_j, last_i = ny - 1, nx - 1
    out = np.empty_like(a) if out is None else out
    for j in range(ny):
        for i in range(nx):
            out[j, i] = (
                -4 * a[j, i]
                + a[j - 1 if j != 0 else last_j, i]
                + a[j + 1 if j != last_j else 0, i]
                + a[j, i + 1 if i != last_i else 0]
                + a[j, i - 1 if i != 0 else last_i]
            )
    return out

shape = (2048, 2048)
data = np.random.rand(*shape)
regular_laplacian_numba_jit_serial(data)

print("numpy laplacian")
%timeit -n10 numpy_laplacian(data)
%ptime -n36 numpy_laplacian(data)

print("ndimage.uniform_filter")
%timeit -n10 uniform_filter(data)
%ptime -n36 uniform_filter(data)

print("numba laplacian")
%timeit -n10 regular_laplacian_numba_jit_serial(data)
%ptime -n36 regular_laplacian_numba_jit_serial(data)

results

numpy laplacian
74 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   2.67 s
Total parallel time: 0.38 s
For a 6.94X speedup across 36 threads
ndimage.uniform_filter
106 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   3.80 s
Total parallel time: 0.29 s
For a 13.01X speedup across 36 threads
numba laplacian
15.6 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Total serial time:   0.57 s
Total parallel time: 0.05 s
For a 10.44X speedup across 36 threads

So the numba version is about 5x faster and has slightly better parallel scaling than the numpy one, but not quite as good as ndimage.

What we have to decide now is whether these performance improvements are worth refactoring the package to use numba.

@NoraLoose
Member

Thanks for the update, @rabernat!

So the numba version is about 5x faster and has slightly better parallel scaling than the numpy one, but not quite as good as ndimage.
What we have to decide now is whether these performance improvements are worth refactoring the package to use numba.

To me, this performance improvement makes refactoring seem worthwhile. What do you think?

@rabernat
Contributor

The complicating factor is GPU support. If we go with numba, supporting GPUs becomes more complicated (but not impossible). Whereas now, we basically get cupy-based GPU support for free.

It would be good to assess how difficult it would be to make the numba-cuda implementation of the kernels. I don't really know where to start with that.

@mrocklin

mrocklin commented Aug 17, 2021 via email

@rabernat
Contributor

Thanks for the reminder about that Matt, very helpful.

Our current implementation requires only one function for each Laplacian. Those functions automatically select either cupy or numpy based on the input data, e.g.:

def __call__(self, field: ArrayType):
    np = get_array_module(field)
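For context, a minimal sketch of how such a numpy/cupy dispatch helper can work (the actual gcm-filters helper may be implemented differently; cupy.get_array_module is a real cupy API, and the fallback covers CPU-only environments):

import numpy as np

def get_array_module(arr):
    # return cupy if `arr` lives on the GPU, otherwise numpy
    try:
        import cupy
        return cupy.get_array_module(arr)
    except ImportError:
        return np

Any kernel written purely in terms of the returned module (e.g. np.roll) then runs on either backend, which is why the cupy-based GPU support comes essentially for free.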

The Dask example brings to mind several questions.

  • Would we have to write a separate version of each function for CPU and GPU? We need to support both.
  • How do we choose threadsperblock and blockspergrid?
  • Would boundary conditions be hard to handle in numba.cuda.jit? Your example basically ignores the array boundary, but we would need to exchange data across the boundaries of the array in complex ways (starting with simple "wrap" boundary conditions but also more specialized things like the tripolar grid)

I wonder if there are any examples out there of numba functions that are jit compiled to both CPU and GPU.

@mrocklin

Would we have to write a separate version of each function for CPU and GPU? We need to support both.

Today yes, probably. GPUs and CPUs are different enough that different programming techniques are often needed. The counter-example is if you're doing something very simple, like something entirely vectorizable such that you can use numba.vectorize. I don't think that you're in this regime though.

How do we choose threadsperblock and blockspergrid?

That's a great question, and unfortunately I don't think there is a good general answer today. cc'ing @gmarkall from the numba/rapids team. I don't think that he'll have a general-purpose answer for you, but we've spoken a bit about this before and I think that he'll appreciate seeing the context of where this comes up.

Would boundary conditions be hard to handle in numba.cuda.jit? Your example basically ignores the array boundary, but we would need to exchange data across the boundaries of the array in complex ways (starting with simple "wrap" boundary conditions but also more specialized things like the tripolar grid)

You can use if statements in numba.cuda, which is maybe how you would handle the boundary.

if i < 5:
    out[i, j] = 0

I wonder if there are any examples out there of numba functions that are jit compiled to both CPU and GPU.

Outside of things like numba.vectorize or guvectorize I don't think that this is really possible with the CUDA programming model. I would be happy to be wrong here though.

@rabernat
Contributor

You can use if statements in numba.cuda, which is maybe how you would handle the boundary.

What we need is more like this

            out[j, i] = (
                -4 * a[j, i]
                + a[j - 1 if j != 0 else last_j, i]
                + a[j + 1 if j != last_j else 0, i]
                + a[j, i + 1 if i != last_i else 0]
                + a[j, i - 1 if i != 0 else last_i]
            )

...where the i, j indexes are global. My concern is that the cuda kernel only has access to a local block (comparable to an MPI rank for the Fortran folks) and therefore can't trivially do wrap boundary conditions. I'm sure it's possible to do, just more complicated.

@mrocklin

My CUDA is rusty enough that I no longer trust myself to talk about its memory model, but I think that it would be worth verifying the assumption about lack of global access.

@shwina

shwina commented Aug 23, 2021

@mrocklin is right -- each thread does have access to all of global memory. That being said, stencil kernels often benefit from the use of block-local "shared" memory (which numba exposes).

This article covers how to use shared memory in a very similar use-case to the above (3-d finite difference with periodic boundary conditions): https://developer.nvidia.com/blog/finite-difference-methods-cuda-cc-part-1/
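To make the global-memory point concrete, here is a hedged sketch of what a wrap-boundary Laplacian could look like with numba.cuda.jit: global indexing only, no shared-memory tiling, untuned launch parameters, and not the package's actual implementation.

import numpy as np
from numba import cuda

@cuda.jit
def regular_laplacian_cuda(a, out):
    ny, nx = a.shape
    j, i = cuda.grid(2)  # global indices; each thread can read all of `a`
    if j < ny and i < nx:
        jm = j - 1 if j != 0 else ny - 1
        jp = j + 1 if j != ny - 1 else 0
        im = i - 1 if i != 0 else nx - 1
        ip = i + 1 if i != nx - 1 else 0
        out[j, i] = -4.0 * a[j, i] + a[jm, i] + a[jp, i] + a[j, im] + a[j, ip]

data = np.random.rand(1024, 1024)
d_a = cuda.to_device(data)
d_out = cuda.device_array_like(d_a)

threadsperblock = (16, 16)  # a common starting point, not tuned
blockspergrid = (
    (data.shape[0] + threadsperblock[0] - 1) // threadsperblock[0],
    (data.shape[1] + threadsperblock[1] - 1) // threadsperblock[1],
)
regular_laplacian_cuda[blockspergrid, threadsperblock](d_a, d_out)
result = d_out.copy_to_host()

The wrap logic is just the conditional indexing from the CPU kernel; shared-memory tiling, as in the NVIDIA article above, would be an optimization on top of this.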

@rabernat
Contributor

FWIW, I think we do not have to resolve this before submitting the JOSS paper. We can leave this issue open and continue to work on performance.

@rabernat
Contributor

The numba documentation just got expanded with a bunch of more detail on writing CUDA kernels: https://numba.readthedocs.io/en/latest/cuda/examples.html

This should be very useful if we choose to go that route.
