DGX Nightly Benchmark run 20210504 #139

Open
quasiben opened this issue May 4, 2021 · 3 comments
quasiben commented May 4, 2021

Benchmark history

Benchmark Image

Raw Data

<Client: 'tcp://127.0.0.1:37895' processes=10 threads=10, memory=503.79 GiB>
Distributed Version: 2021.04.1+9.g233ec884

simple       6.015e-01 +/- 4.029e-02
shuffle      2.293e+01 +/- 6.89e-01
rand_access  5.897e-03 +/- 3.28e-03
anom_mean    1.064e+02 +/- 1.488e+00

Raw Values

simple
[0.61249677 0.55235898 0.62748533 0.55971711 0.61404489 0.58228994
0.68462942 0.57052608 0.56799022 0.6434607 ]
shuffle
[22.32830022 22.36646918 22.03776295 22.9777403 22.39722683 22.7571989
22.76775758 24.11153089 23.46421872 24.04703911]
rand_access
[0.0078597 0.00342729 0.00465147 0.00341666 0.00960192 0.00371119
0.00467234 0.01374343 0.00334072 0.00454751]
anom_mean
[106.24792361 105.50825132 105.92475792 110.57412884 105.40724264
105.72732002 106.89931533 105.76425968 106.86963659 105.23619132]

Dask Profiles

Scheduler Execution Graph

Sched Graph Image

@jakirkham (Collaborator) commented:

This includes Rick's recent HLG timeseries PR (dask/dask#7615).

@jakirkham (Collaborator) commented:

It's worth comparing this to the results in issue #137, where we profiled a workload that uses very little communication (details in that issue and its references). What stands out there is that transitions take barely any time once communication stops playing a significant role; in fact, that result shows no obviously slow parts at all.

The result here, by contrast, suggests more time is spent in communication (read at 13.67% and write at 7.17%) than in `_transition` (12.39%), and I think the previous result indicates even that may be underselling how much time communication itself eats up.

IOW, things like replacing the communication layer with asyncio (to leverage uvloop) (dask/distributed#4513), or simply using UCX and improving serialization, are likely more important at this stage. There's probably still some value in things like using C APIs for individual transitions (dask/distributed#4650), since ~3.57% is spent exclusively in `_transition`, likely due to current Python call overhead, but I expect that gain is smaller than the former two items.
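To make the comparison concrete, the quoted percentages can be tallied. A trivial sketch (the communication-vs-transition split is my reading of the profile above, not something the profiler reports directly):

```python
# Percentages quoted from the scheduler profile above.
profile = {"read": 13.67, "write": 7.17, "_transition": 12.39}

# Communication-side time is read + write.
comm = profile["read"] + profile["write"]
print(f"communication: {comm:.2f}%  vs  _transition: {profile['_transition']:.2f}%")
# communication (20.84%) exceeds _transition (12.39%)
```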

cc @jrbourbeau @madsbk @rjzamora @quasiben @mrocklin

@jakirkham (Collaborator) commented:

Building on Rick's non-shuffle example in PR #141, the results now look analogous to what is seen with shuffle: most time is spent in communication, followed by transitions.
