
Methodology for our comparisons vs. Groute #31

Open
jowens opened this issue Apr 3, 2017 · 4 comments

Comments

@jowens
Contributor

jowens commented Apr 3, 2017

As we noted in our email communications, we think the fairest comparisons to make between two graph frameworks are those that offer the best available performance for each at the time the comparisons were made. For Gunrock today, that would be the 0.4 release (10 November 2016). We recognize this version was not available at the time the Groute paper was submitted (although it would have been appropriate for camera-ready), so we ran comparisons against a Gunrock version dated July 11, 2016 (6eb6db5d09620701bf127c5acb13143f4d8de394). Yuechao notes that to build this version, we "need to comment out the lp related includes in tests/pr/test_pr.cu, line 33 to line 35, otherwise the build will fail".

In our group, we generally run primitives multiple times within a single binary launch and report the average time (Graph500 does this, for instance). We think the most important aspect is simply to run it more than once to mitigate any startup effects. In our comparisons, we use --iteration-num=32.
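For readers who want a concrete picture of this in-process averaging, here is a minimal sketch of the procedure, not Gunrock's actual harness: `primitive` is a placeholder for whatever operation is being timed, and 32 mirrors the --iteration-num=32 setting mentioned above.

```python
import time
import statistics

def run_averaged(primitive, num_iterations=32):
    """Time `primitive` several times within one process and report the average."""
    elapsed_ms = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        primitive()  # e.g., one BFS traversal from source vertex 0
        elapsed_ms.append((time.perf_counter() - start) * 1000.0)
    # Keep the per-iteration times as well, so any trend across iterations stays visible.
    return statistics.mean(elapsed_ms), elapsed_ms
```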

By default, we use a source vertex of 0, and depending on the test, we have used both 0-source and random-source in our publications. Getting good performance on a randomized source is harder, but avoids overtuning. In our comparisons, we use source 0, as Groute does.

@jowens
Contributor Author

jowens commented Apr 3, 2017

This methodology applies to the performance measurements we summarize in #32 and #33.

@sree314

sree314 commented Apr 6, 2017

A comment on your methodology (which is, of course, different from what we do):

"we generally run primitives multiple times within a single binary launch and report the average time"

I generally avoid this because it is a common cause of systematic errors in measurement.

Now, running multiple times and taking the average is an estimation procedure for the population mean. This procedure assumes the samples are i.i.d.

Here are the runtimes of the individual samples from the K80+METIS, non-idempotent, non-DO (https://github.com/gunrock/io/blob/master/gunrock-output/20170303/bfs_k80x2_metis_soc-LiveJournal1.txt) data in #32, in order of iterations:

61.02, 46.54, 46.41, 43.97, 42.10, 42.20, 42.07, 40.22, 38.91, 38.85, 38.92, 37.24, 36.25, 36.10, 36.08, 36.05, 35.41, 35.07, 35.03, 35.06, 34.77, 34.43, 34.47, 34.42, 34.41, 34.49, 34.30, 34.40, 33.80, 34.49, 34.49, 34.46

The trend of clearly decreasing runtimes for what should be random samples from the same population is worrying. You'll see this pattern in all of your data (it's more evident in your multi-GPU runs).

Is your procedure estimating the population mean correctly? I.e. is the average you compute using this procedure comparable to the population mean?
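One quick way to see that these samples are not behaving like i.i.d. draws is to compare the mean of the early iterations against the mean of the late ones. The sketch below is a rough illustration, not part of either framework's tooling; it uses the 32 samples listed above, and the halfway split is arbitrary.

```python
import statistics

# The 32 per-iteration runtimes (ms) quoted above, in order.
samples = [
    61.02, 46.54, 46.41, 43.97, 42.10, 42.20, 42.07, 40.22,
    38.91, 38.85, 38.92, 37.24, 36.25, 36.10, 36.08, 36.05,
    35.41, 35.07, 35.03, 35.06, 34.77, 34.43, 34.47, 34.42,
    34.41, 34.49, 34.30, 34.40, 33.80, 34.49, 34.49, 34.46,
]

early = statistics.mean(samples[:16])   # first half of the iterations
late = statistics.mean(samples[16:])    # second half of the iterations
overall = statistics.mean(samples)

print(f"early half : {early:.2f} ms")
print(f"late half  : {late:.2f} ms")
print(f"overall    : {overall:.2f} ms")
# If the samples were i.i.d., the two halves should agree up to noise;
# here the early half is noticeably slower, so it pulls the overall mean up.
```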

@sgpyc
Member

sgpyc commented Apr 10, 2017

Thanks for pointing that out; I do see such variance in the running times.
Further investigation suggests it is mostly variance (possibly with a small decreasing trend) rather than a steady decrease.
20170410.xlsx

[Screenshot, 2017-04-10: chart of running times normalized against the average running time excluding the first run, for different combinations of {GPU generation, number of GPUs, primitive, graph, partitioner}.] From what I observed (a sketch of this normalization follows the list below):

  1. The timings do go up and down;
  2. Some experiments show a decreasing running average, but the decrease is mostly bounded within about 10%;
  3. Some running conditions (say, {single GPU, K40 or P100, PR, road_usa}) may give more stable running times;
  4. The partitioning method does not appear to change the trend;
  5. Taking only the first run would clearly be wrong, since it includes the warm-up effect (on average, the first run takes 37% longer than the rest);
  6. The decreasing trend is more likely on {K80, M60} than on {K40, P100}.
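As a rough illustration of the normalization described above (not the actual spreadsheet computation), the sketch below reuses the 32 BFS samples quoted earlier in this thread as stand-in data. Note that the 37% warm-up figure above is an average over all experiments, so a single dataset will give a different number.

```python
import statistics

# Stand-in data: the 32 BFS runtimes (ms) quoted earlier in this thread.
samples = [
    61.02, 46.54, 46.41, 43.97, 42.10, 42.20, 42.07, 40.22,
    38.91, 38.85, 38.92, 37.24, 36.25, 36.10, 36.08, 36.05,
    35.41, 35.07, 35.03, 35.06, 34.77, 34.43, 34.47, 34.42,
    34.41, 34.49, 34.30, 34.40, 33.80, 34.49, 34.49, 34.46,
]

# Normalize against the average running time *excluding* the first run.
baseline = statistics.mean(samples[1:])
normalized = [t / baseline for t in samples]
print("first few normalized times:", [round(x, 2) for x in normalized[:4]])

# Warm-up overhead of the first run relative to that baseline.
overhead = (samples[0] - baseline) / baseline
print(f"first run is {overhead:.0%} longer than the average of the remaining runs")
```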

I still don't know the actual condition(s) and reason(s) behind this trend; here are my guesses:

  1. GPU power management. The GPU hardware + driver may dynamically change the running speed, both to protect the chip from overheating and to gain performance when possible. The fact that the dual-chip GPUs (K80 & M60) are far more likely than the single-chip GPUs (K40 and P100) to show the decreasing trend makes me think this may be the main reason;

  2. cache effects on the GPU.

  3. CPU-side optimizations, especially thread binding or memory binding to a die / core. The experiments all run on dual-CPU machines, and the running time may drop once data is cached on, or allocated to, the CPU closer to the GPU running a specific workload.

Currently I think 2) and 3) are less likely than 1), but all of them point to lower-level optimizations. From what I can tell, it seems far less likely that systematic errors are the cause.

@sree314

sree314 commented Apr 11, 2017

Hi sgpyc,

Things like power management, cache effects, and optimizations are systematic errors. It may help to think of them as "systematic bias."

In general, if the behaviour of later runs is affected by earlier runs, then your observations are not independent of each other. Their average is not a good estimator of the population mean.

I would advise figuring out exactly why the trend exists and controlling for it (for example, compiling with nvcc -gencode to avoid JIT overhead, using nvidia-smi to disable power management, pinning threads to CPUs manually, etc.).

If you do this, the average computed by running (say) BFS n times from the shell should not be significantly different from running n repetitions of BFS from within breadth_first_search.
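As a concrete version of this cross-check (a sketch only: `run_bfs_once` and `run_bfs_inprocess` are hypothetical helpers standing in for the two measurement paths, not anything that exists in Gunrock or Groute):

```python
from statistics import mean

from scipy.stats import ttest_ind  # Welch's t-test; any two-sample test would do

# Hypothetical helpers, assumed to be provided by your own harness:
#   run_bfs_once()        launches the BFS binary once and returns its runtime (ms)
#   run_bfs_inprocess(n)  returns the n per-repetition runtimes from a single launch
from my_harness import run_bfs_once, run_bfs_inprocess

n = 32
shell_times = [run_bfs_once() for _ in range(n)]  # n independent launches
inproc_times = run_bfs_inprocess(n)               # n repetitions in one launch

print(f"shell launches : {mean(shell_times):.2f} ms")
print(f"in-process runs: {mean(inproc_times):.2f} ms")

# Once the systematic effects are controlled for, the two averages should agree;
# a tiny p-value here would suggest they still differ significantly.
stat, pvalue = ttest_ind(shell_times, inproc_times, equal_var=False)
print(f"Welch's t-test: t = {stat:.2f}, p = {pvalue:.3f}")
```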

Hope this helps!
