xclim.sdba - Comparison for different multithreading configuration #4
Comments
I use [...]. However, I see from your numbers that [...]. I don't quite understand the relationship between # workers / # threads / OMP, so maybe I'm wrong.
Since dask is also multithreaded/multiprocess, it might cause some multiplication of the OMP setting?
Some guidance here maybe: https://docs.dask.org/en/latest/array-best-practices.html
Dask devs seem to suggest setting OMP_NUM_THREADS=1.
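For reference, a minimal sketch of pinning those thread-pool variables (OMP_NUM_THREADS plus the MKL/OPENBLAS equivalents mentioned further down) before the numerical libraries spin up their pools; the variable names come from the Dask best-practices page, the rest is illustrative:

```python
import os

# Pin the native thread pools (OpenMP, MKL, OpenBLAS) to one thread each,
# so parallelism comes only from dask workers/threads. Set these before
# numpy/scipy are imported, otherwise the pools are already sized.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402  (imported after the env vars on purpose)
import xarray as xr  # noqa: E402
```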
I had 2 workers and 4 threads per worker during my tests, and I reached more than 3000% CPU on 2 python processes. I would have to test it, but that could suggest that each worker might be allowed to reach `threads_per_worker × OMP_NUM_THREADS` threads.
Yes @RondeauG, I believe that this is what happens. OMP_NUM_THREADS is per thread, so 4 dask threads with OMP=12 means a maximum of 48 threads!
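A quick back-of-the-envelope check of that multiplication, using the worker/thread counts from this thread (the OMP value of 12 is taken from the example above and is an assumption for the test runs):

```python
# Upper bound on simultaneously runnable threads when OMP_NUM_THREADS
# applies to each dask worker thread independently.
n_workers = 2
threads_per_worker = 4
omp_num_threads = 12  # assumed, from the "OMP=12" example above

per_worker = threads_per_worker * omp_num_threads  # 48 threads per worker
total = n_workers * per_worker                     # 96 threads overall
print(per_worker, total)
```

A bound of ~96 runnable threads is consistent with seeing more than 3000% CPU spread over the 2 worker processes.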
Also, I expected a greater difference between 2 workers × 4 threads and 1 worker × 8 threads... 1 worker does use (slightly) less memory, but the 2×4 setup seems somewhat faster.
Not totally sure if all of these apply to our servers/config, but there are MKL and OPENBLAS settings as well in the dask best-practices guidance.
They might apply for other numpy processes, but [...]
After a few tests on my end with xclim.sdba, I seem to get better results (and reasonable CPU use) with:
[settings block not preserved]
rather than:
[settings block not preserved]
My guess is that for non-dask-optimized tasks, [...]
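Since the two settings blocks above were lost, here is a purely hypothetical sketch of two such configurations: a client with few dask threads and OMP_NUM_THREADS left above 1, versus one with more dask threads and OMP_NUM_THREADS pinned to 1. The specific numbers are illustrative, not the ones from the original comment:

```python
import os
from dask.distributed import Client

# Variant A (illustrative): few dask threads, let OpenMP use a couple of cores.
os.environ["OMP_NUM_THREADS"] = "2"
client = Client(n_workers=1, threads_per_worker=4)

# Variant B (illustrative): more dask threads, OpenMP pinned to one thread.
# os.environ["OMP_NUM_THREADS"] = "1"
# client = Client(n_workers=1, threads_per_worker=8)
```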
That's what I understand. With EQM or DQM, the main bottleneck is the griddata calls that are apply_ufunc-wrapped. Underneath is [...]
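For context, here is a simplified illustration of the pattern described here: wrapping scipy's griddata in xr.apply_ufunc so that dask parallelizes it chunk by chunk. This only sketches the pattern, not xclim.sdba's actual code; all names, shapes and dummy data below are made up.

```python
import numpy as np
import xarray as xr
from scipy.interpolate import griddata

# Made-up stand-ins: a series to adjust (sim), the model quantiles it is
# compared against (hist_q) and adjustment factors defined on those
# quantiles (af). Shapes and names are hypothetical.
nlat, nlon, ntime, nq = 5, 5, 100, 20
sim = xr.DataArray(np.random.rand(nlat, nlon, ntime), dims=("lat", "lon", "time"))
hist_q = xr.DataArray(
    np.sort(np.random.rand(nlat, nlon, nq), axis=-1), dims=("lat", "lon", "quantiles")
)
af = xr.DataArray(np.random.rand(nlat, nlon, nq), dims=("lat", "lon", "quantiles"))

def interp_factors(x, xq, yq):
    # griddata on 1-D points falls back to linear interpolation and is
    # single-threaded: the parallelism comes from dask running one call
    # per chunk, not from OpenMP.
    return griddata(xq, yq, x, method="linear")

out = xr.apply_ufunc(
    interp_factors,
    sim.chunk({"lat": 1}),
    hist_q.chunk({"lat": 1}),
    af.chunk({"lat": 1}),
    input_core_dims=[["time"], ["quantiles"], ["quantiles"]],
    output_core_dims=[["time"]],
    vectorize=True,        # loop over the remaining (lat, lon) points
    dask="parallelized",
    output_dtypes=[float],
)
result = out.compute()
```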
Small update, for [...]: having chunks along a single dimension ([...]) [rest of the comment not preserved].
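The rest of that comment is missing, but as a general illustration of chunking along a single dimension with xarray/dask (the dimension name and chunk size here are arbitrary, and the time axis is kept in one chunk since the apply_ufunc pattern above treats it as a core dimension):

```python
import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")

# One chunk per block of 10 longitudes, time left whole (a single chunk),
# so each dask task sees the full time series for a slice of the domain.
ds = ds.chunk({"time": -1, "lon": 10})
print(ds.air.chunks)
```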
Not an issue or even a real benchmark, but I thought it would be interesting to share this comparison with the xclim people.
Especially: @RondeauG, @huard and @tlogan2000
I ran a basic DetrendedQuantileMapping adjustment on the infamous 'air_temperature' tutorial dataset of xarray (code below). I am using the DQM code in PR Ouranosinc/xclim#467, which is more efficient than what is on master right now.
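The code block referred to above ("code below") did not survive in this copy. The following is a rough sketch of that kind of setup, written against the present-day xclim.sdba API (which is organized differently from the PR Ouranosinc/xclim#467 code the comparison actually used); the data split and bias are illustrative:

```python
import xarray as xr
from xclim import sdba

# Tutorial data: 6-hourly air temperature over North America (~7 MB).
tas = xr.tutorial.open_dataset("air_temperature").air
tas.attrs["units"] = "K"      # make sure units are set for sdba
tas = tas.chunk({"lat": 5})   # dask chunks along space, time left whole

# Illustrative split: one year as "reference", a crudely biased copy of it
# as the "historical" simulation to be corrected.
ref = tas.sel(time="2013")
hist = ref + 2.0
hist.attrs["units"] = "K"
sim = hist

# Present-day train/adjust API, shown for illustration only.
DQM = sdba.DetrendedQuantileMapping.train(
    ref, hist, nquantiles=20, kind="+", group="time"
)
scen = DQM.adjust(sim)
scen.load()
```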
Initially, I was trying this because Gabriel and I had doubts about whether xclim/dask/numpy respect the "OMP_NUM_THREADS" environment variable. It turns out the flag is respected, at least on my home setup... So the following are all the exact same calculations, but with different dask / OMP configurations (a sketch of how each one can be set up follows the list). There are 8 cores on my laptop and the data size is around 7 MB.
Default dask (implicit 8 threads), OMP=1
Distributed dask, 1 worker, 8 threads, OMP=1
Distributed dask, 2 workers, 4 threads, OMP=1
Distributed dask, 1 worker, 4 threads, OMP=2
No Dask, OMP=8
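As referenced above, a sketch of how these configurations can be reproduced. The worker/thread counts come from the list itself; everything else (ordering of imports, LocalCluster defaults) is illustrative:

```python
import os

# Pin OpenMP before the numerical imports; switch to "2" or "8"
# to reproduce the other rows.
os.environ["OMP_NUM_THREADS"] = "1"

from dask.distributed import Client  # noqa: E402

# Row 1: default dask threaded scheduler (implicitly one thread per core),
# nothing to configure, just call .compute().

# Rows 2-4: explicit distributed clients.
client = Client(n_workers=1, threads_per_worker=8)    # 1 worker x 8 threads
# client = Client(n_workers=2, threads_per_worker=4)  # 2 workers x 4 threads
# client = Client(n_workers=1, threads_per_worker=4)  # with OMP_NUM_THREADS=2

# Row 5: no dask at all; load the data eagerly (ds.load()) and set
# OMP_NUM_THREADS=8 before importing numpy so OpenMP can use every core.
```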