xclim.sdba - Comparison for different multithreading configuration #4
Comments
I use [...]. However, I see from your numbers that [...]. I don't quite understand the relationship between # workers / # threads / OMP, so maybe I'm wrong.
Since dask is also multithreaded/multiprocess, it might cause some multiplication of the OMP setting?
Some guidance here maybe: https://docs.dask.org/en/latest/array-best-practices.html
Dask devs seem to suggest setting OMP_NUM_THREADS=1.
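For reference, a minimal sketch of pinning those thread-pool variables (OMP_NUM_THREADS plus the MKL/OPENBLAS equivalents mentioned further down) before the numerical libraries spin up their pools; the variable names come from the Dask best-practices page, the rest is illustrative:

```python
import os

# Pin the native thread pools (OpenMP, MKL, OpenBLAS) to one thread each,
# so parallelism comes only from dask workers/threads. Set these before
# numpy/scipy are imported, otherwise the pools are already sized.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402  (imported after the env vars on purpose)
import xarray as xr  # noqa: E402
```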
I had 2 workers and 4 threads per worker during my tests, and I reached more than 3000% CPU on 2 python processes. I would have to test it, but that could suggest that each worker might be allowed to reach `threads_per_worker × OMP_NUM_THREADS` threads.
Yes @RondeauG, I believe that this is what happens. OMP_NUM_THREADS is per thread, so 4 dask threads with OMP=12 means a maximum of 48 threads!
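A quick back-of-the-envelope check of that multiplication, using the worker/thread counts from this thread (the OMP value of 12 is taken from the example above and is an assumption for the test runs):

```python
# Upper bound on simultaneously runnable threads when OMP_NUM_THREADS
# applies to each dask worker thread independently.
n_workers = 2
threads_per_worker = 4
omp_num_threads = 12  # assumed, from the "OMP=12" example above

per_worker = threads_per_worker * omp_num_threads  # 48 threads per worker
total = n_workers * per_worker                     # 96 threads overall
print(per_worker, total)
```

A bound of ~96 runnable threads is consistent with seeing more than 3000% CPU spread over the 2 worker processes.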
Also, I expected a greater difference between 2 workers × 4 threads and 1 worker × 8 threads... 1 worker does use (slightly) less memory, but the 2×4 setup seems somewhat faster.
Not totally sure if all of these apply to our servers/config, but there are MKL and OPENBLAS settings as well in the dask best-practices guidance.
They might apply for other numpy processes, but [...]
After a few tests on my end with xclim.sdba, I seem to get better results (and reasonable CPU use) with:
[settings block not preserved]
rather than:
[settings block not preserved]
My guess is that for non-dask-optimized tasks, [...]
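Since the two settings blocks above were lost, here is a purely hypothetical sketch of two such configurations: a client with few dask threads and OMP_NUM_THREADS left above 1, versus one with more dask threads and OMP_NUM_THREADS pinned to 1. The specific numbers are illustrative, not the ones from the original comment:

```python
import os
from dask.distributed import Client

# Variant A (illustrative): few dask threads, let OpenMP use a couple of cores.
os.environ["OMP_NUM_THREADS"] = "2"
client = Client(n_workers=1, threads_per_worker=4)

# Variant B (illustrative): more dask threads, OpenMP pinned to one thread.
# os.environ["OMP_NUM_THREADS"] = "1"
# client = Client(n_workers=1, threads_per_worker=8)
```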
That's what I understand. With EQM or DQM, the main bottleneck is the griddata calls that are apply_ufunc-wrapped. Underneath is [...]
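For context, here is a simplified illustration of the pattern described here: wrapping scipy's griddata in xr.apply_ufunc so that dask parallelizes it chunk by chunk. This only sketches the pattern, not xclim.sdba's actual code; all names, shapes and dummy data below are made up.

```python
import numpy as np
import xarray as xr
from scipy.interpolate import griddata

# Made-up stand-ins: a series to adjust (sim), the model quantiles it is
# compared against (hist_q) and adjustment factors defined on those
# quantiles (af). Shapes and names are hypothetical.
nlat, nlon, ntime, nq = 5, 5, 100, 20
sim = xr.DataArray(np.random.rand(nlat, nlon, ntime), dims=("lat", "lon", "time"))
hist_q = xr.DataArray(
    np.sort(np.random.rand(nlat, nlon, nq), axis=-1), dims=("lat", "lon", "quantiles")
)
af = xr.DataArray(np.random.rand(nlat, nlon, nq), dims=("lat", "lon", "quantiles"))

def interp_factors(x, xq, yq):
    # griddata on 1-D points falls back to linear interpolation and is
    # single-threaded: the parallelism comes from dask running one call
    # per chunk, not from OpenMP.
    return griddata(xq, yq, x, method="linear")

out = xr.apply_ufunc(
    interp_factors,
    sim.chunk({"lat": 1}),
    hist_q.chunk({"lat": 1}),
    af.chunk({"lat": 1}),
    input_core_dims=[["time"], ["quantiles"], ["quantiles"]],
    output_core_dims=[["time"]],
    vectorize=True,        # loop over the remaining (lat, lon) points
    dask="parallelized",
    output_dtypes=[float],
)
result = out.compute()
```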
Small update, for [...]: having chunks along a single dimension ([...]) [rest of the comment not preserved].
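The rest of that comment is missing, but as a general illustration of chunking along a single dimension with xarray/dask (the dimension name and chunk size here are arbitrary, and the time axis is kept in one chunk since the apply_ufunc pattern above treats it as a core dimension):

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")

# One chunk per block of 10 longitudes, time left whole (a single chunk),
# so each dask task sees the full time series for a slice of the domain.
ds = ds.chunk({"time": -1, "lon": 10})
print(ds.air.chunks)
```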
Not an issue or even a real benchmark, but I thought it would be interesting to share this comparison with the xclim people.
Especially: @RondeauG, @huard and @tlogan2000
I ran a basic DetrendedQuantileMapping adjustment on the infamous 'air_temperature' tutorial dataset of xarray (code below). I am using the DQM code in PR Ouranosinc/xclim#467, which is more efficient than what is on master right now.
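The code block referred to above ("code below") did not survive in this copy. The following is a rough sketch of that kind of setup, written against the present-day xclim.sdba API (which is organized differently from the PR Ouranosinc/xclim#467 code the comparison actually used); the data split and bias are illustrative:

```python
import xarray as xr
from xclim import sdba

# Tutorial data: 6-hourly air temperature over North America (~7 MB).
tas = xr.tutorial.open_dataset("air_temperature").air
tas.attrs["units"] = "K"      # make sure units are set for sdba
tas = tas.chunk({"lat": 5})   # dask chunks along space, time left whole

# Illustrative split: one year as "reference", a crudely biased copy of it
# as the "historical" simulation to be corrected.
ref = tas.sel(time="2013")
hist = ref + 2.0
hist.attrs["units"] = "K"
sim = hist

# Present-day train/adjust API, shown for illustration only.
DQM = sdba.DetrendedQuantileMapping.train(
    ref, hist, nquantiles=20, kind="+", group="time"
)
scen = DQM.adjust(sim)
scen.load()
```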
Initially, I was trying this because Gabriel and I had doubts about whether xclim/dask/numpy respect the "OMP_NUM_THREADS" environment variable. It turns out the flag is respected, at least on my home setup... So the following are all the exact same calculations, but with different dask / OMP configurations (a sketch of how each one can be set up follows the list). There are 8 cores on my laptop and the data size is around 7 MB.
Default dask (implicit 8 threads), OMP=1
Distributed dask, 1 worker, 8 threads, OMP=1
Distributed dask, 2 workers, 4 threads, OMP=1
Distributed dask, 1 worker, 4 threads, OMP=2
No Dask, OMP=8
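As referenced above, a sketch of how these configurations can be reproduced. The worker/thread counts come from the list itself; everything else (ordering of imports, LocalCluster defaults) is illustrative:

```python
import os

# Pin OpenMP before the numerical imports; switch to "2" or "8"
# to reproduce the other rows.
os.environ["OMP_NUM_THREADS"] = "1"

from dask.distributed import Client  # noqa: E402

# Row 1: default dask threaded scheduler (implicitly one thread per core),
# nothing to configure, just call .compute().

# Rows 2-4: explicit distributed clients.
client = Client(n_workers=1, threads_per_worker=8)    # 1 worker x 8 threads
# client = Client(n_workers=2, threads_per_worker=4)  # 2 workers x 4 threads
# client = Client(n_workers=1, threads_per_worker=4)  # with OMP_NUM_THREADS=2

# Row 5: no dask at all; load the data eagerly (ds.load()) and set
# OMP_NUM_THREADS=8 before importing numpy so OpenMP can use every core.
```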