xclim.sdba - Comparison of different multithreading configurations #4

Open
aulemahal opened this issue May 29, 2020 · 12 comments

@aulemahal
Contributor

Not an issue or even a real benchmark, but I thought it would be interesting to share this comparison with the xclim people.

In particular: @RondeauG, @huard and @tlogan2000

I ran a basic DetrendedQuantileMapping adjustment on the infamous 'air_temperature' tutorial dataset of xarray (code below). I am using the DQM code in PR Ouranosinc/xclim#467, which is more efficient than what is on master right now.

Initially, I was trying this because Gabriel and I were unsure whether xclim/dask/numpy respected the "OMP_NUM_THREADS" environment variable. Turns out the flag is respected, at least on my home setup... So the following are all the exact same calculations, but with different dask / OMP configurations. There are 8 cores on my laptop and the data size is around 7 MB.
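(For completeness, a minimal sketch of how the OMP flag can be pinned from Python instead of the shell; the only assumption is that the variable must be set before numpy/scipy first load the OpenMP runtime.)

import os

# Must happen before numpy/scipy are first imported: the OpenMP runtime
# reads OMP_NUM_THREADS once, at startup.
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402  (imported after setting the variable)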

Default dask (implicit 8 threads), OMP=1
[screenshot: Dask_default_8cpus_Omp1]

Distributed dask, 1 worker, 8 threads, OMP=1
[screenshot: Dask_1wk_8th_Omp1]

Distributed dask, 2 workers, 4 threads, OMP=1
[screenshot: Dask_2wk_4th_Omp1]

Distributed dask, 1 worker, 4 threads, OMP=2
[screenshot: Dask_1wk_4th_Omp2]

No Dask, OMP=8
[screenshot: NoDask_8cpus_Omp8]

import xarray as xr
from distributed import Client
from xclim.sdba.adjustment import DetrendedQuantileMapping


if __name__ == '__main__':

    # This run: 2 workers x 4 threads. The other configurations used
    # different Client arguments (or no Client at all for the "No Dask" case).
    client = Client(n_workers=2, threads_per_worker=4, memory_limit="4GB")
    print(client)

    # ~7 MB tutorial dataset, resampled to daily means.
    ds = xr.tutorial.open_dataset(
        'air_temperature', chunks={'time': -1, 'lat': 5, 'lon': 5}
    ).resample(time='D').mean().load()
    ref = ds.air
    hist = fut = ds.air * 1.1
    print(ref.chunks, hist.chunks)

    # Train DQM on (ref, hist), then adjust the "future" series.
    DQM = DetrendedQuantileMapping(nquantiles=50, kind='+', group='time.month')
    DQM.train(ref, hist)
    DQM.ds.load()
    scen = DQM.adjust(fut, interp='linear', extrapolation='constant', detrend=1)
    print('done')
    scen.load()
    print(scen.sum().values)
@RondeauG

I use export OMP_NUM_THREADS=12 in my .bashrc, which I would have thought limited my load to 1200% CPU. This was set up back when I used Matlab and Julia.

However, I see from your numbers that I don't quite understand the relationship between the number of workers, the number of threads per worker, and OMP, so maybe I'm wrong.

@tlogan2000
Contributor

Since dask is also multithreaded/multiprocess, it might cause some multiplication of the OMP setting?

@tlogan2000
Contributor

Some guidance here, maybe: https://docs.dask.org/en/latest/array-best-practices.html

Dask devs seem to suggest setting OMP_NUM_THREADS=1

@RondeauG

I had 2 workers and 4 threads per worker during my tests, and I reached more than 3000% CPU on 2 python processes.

I would have to test it, but that could suggest that each worker is allowed to reach threads × OMP. That would mean 2 times 4800% CPU as an upper bound, which would fit my results?

@aulemahal
Contributor Author

Yes @RondeauG, I believe that this is what happens. OMP_NUM_THREADS is per dask thread, so 4 dask threads with OMP=12 means a maximum of 48 threads!
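A back-of-the-envelope sketch of that rule (a heuristic upper bound, not something dask actually enforces):

def max_threads(n_workers, threads_per_worker, omp_num_threads):
    # Each dask worker runs threads_per_worker threads, and each of those
    # can spawn up to omp_num_threads OpenMP threads inside griddata/qhull.
    return n_workers * threads_per_worker * omp_num_threads

print(max_threads(2, 4, 12))  # 96 threads total, i.e. ~4800% CPU per worker
print(max_threads(2, 4, 1))   # 8 threads, matching the 8 cores above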

@aulemahal
Contributor Author

Also, I expected a greater difference between 2 workers × 4 threads and 1 worker × 8 threads... 1 worker does use (slightly) less memory, but the 2×4 setup seems somewhat faster.

@tlogan2000
Contributor

Not totally sure if all of these apply to our servers/config, but there are MKL and OPENBLAS settings as well in the dask 'best practices' guidance, e.g.:
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
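If setting environment variables is inconvenient, the threadpoolctl package can cap these pools from within Python (a generic sketch, not part of the dask guidance; note it only limits the process it runs in, so with distributed workers it would have to run on the workers themselves):

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    # Caps the OpenMP/MKL/OpenBLAS thread pools for everything in this
    # block, regardless of what the environment variables say.
    run_adjustment()  # hypothetical stand-in for the DQM script above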

@aulemahal
Contributor Author

aulemahal commented May 29, 2020

They might apply for other numpy processes, but griddata, and QHull underneath it, seem to only depend on OpenMP. With dask I agree on setting all of them to 1. However, without dask, for smaller datasets, griddata's parallelism is far more efficient than dask's.

@RondeauG

RondeauG commented Jun 5, 2020

After a few tests on my end with xclim.sdba, I seem to get better results (and reasonable CPU use) with:

export OMP_NUM_THREADS=10
client = Client(n_workers=2, threads_per_worker=1, memory_limit="10GB")

rather than:

export OMP_NUM_THREADS=1
client = Client(n_workers=2, threads_per_worker=10, memory_limit="10GB")

My guess is that for tasks that are not dask-optimized, OMP_NUM_THREADS=10 is more useful, and therefore lets the whole script run faster?

@aulemahal
Contributor Author

That's what I understand. With EQM or DQM, the main bottleneck is the griddata calls, which are wrapped in apply_ufunc; underneath that is qhull. So the dask parallelization is hard-coded by us, while scipy is using highly-optimized external code. I will try to explain this in an upcoming sdba notebook...
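For readers following along, a minimal sketch of the wrapping pattern being discussed. Names and dims are hypothetical, and np.interp stands in for scipy's griddata to keep it self-contained; the point is where dask's parallelism enters:

import numpy as np
import xarray as xr

def _interp1d(newx, oldx, oldy):
    # Runs once per (lat, lon) point. Any parallelism *inside* this call
    # (OpenMP under scipy's griddata/qhull) is invisible to dask.
    return np.interp(newx, oldx, oldy)

# sim, quantiles and adj_factors are hypothetical DataArrays.
scen = xr.apply_ufunc(
    _interp1d,
    sim, quantiles, adj_factors,
    input_core_dims=[["time"], ["quantiles"], ["quantiles"]],
    output_core_dims=[["time"]],
    vectorize=True,           # python-level loop over the spatial points
    dask="parallelized",      # one task per chunk: dask's only handle here
    output_dtypes=[float],
)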

@RondeauG

RondeauG commented Jun 5, 2020

Small update, for xclim.sdba at least.

Having chunks along a single dimension (lat, for example) gives me better performance than along two dimensions (lat and lon). With only lat chunked, I get closer to that 1000% CPU more often.
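Concretely, assuming the same tutorial dataset as the first comment, that would be the difference between:

import xarray as xr

# Chunked along lat only (the configuration that performed better for me):
ds = xr.tutorial.open_dataset('air_temperature', chunks={'time': -1, 'lat': 5})

# versus chunked along both lat and lon (many more, smaller chunks):
ds = xr.tutorial.open_dataset('air_temperature', chunks={'time': -1, 'lat': 5, 'lon': 5})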
