Report and compare benchmark runs against two branches #5561
Comments
I could take this one up in the next few days if a new contributor does not end up picking this up as their first issue.
Unfortunately it was just ... and the "multiline cursor" feature of the IDE 🥲
Hi, I would like to take a crack at it. I will try to do it in Python.
That would be great @Taza53 -- thank you. I spent some time gathering data (into benchmarks.zip) so hopefully you don't have to actually make the datasets or run the benchmarks to make this script.
BTW the first thing I hope/plan to do with this script is gather enough data to do #4085
@isidentical had a script they shared here: https://gist.github.com/isidentical/4e3fff1350e9d49672e15d54d9e8299f
Thank you for gathering data, it's very helpful. I will take a look at it.
I am a bit unsure; can you elaborate on this?
Yes -- sorry -- all I was trying to say is that I am excited to use the script and will try it likely as soon as you have it available for a "real" use case (basically to test #4085).
I think this should support all benches in the benchmarks directory. This means the other benches would be modified to have the same machine-readable output option.
I think that sounds like a great idea -- thank you
python compare.py path1 path2
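(As an aside for readers skimming the thread: a minimal sketch of the kind of command-line interface such a comparison script might expose. This is illustrative only, not the actual compare.py discussed here; the argument names are assumptions.)

```python
# Hypothetical CLI skeleton for a benchmark-comparison script.
# Not the actual compare.py from this thread; argument names are assumptions.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Compare two machine-readable benchmark result files"
    )
    parser.add_argument("baseline", help="result file from the first run (e.g. main)")
    parser.add_argument("candidate", help="result file from the second run (e.g. a PR branch)")
    args = parser.parse_args()
    print(f"Comparing {args.candidate} against {args.baseline}")

if __name__ == "__main__":
    main()
```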
That output looks awesome! 🚀
Thanks @Taza53 -- I am testing it out now.
I tried it out and it worked great (see #5099 (comment)). I will prepare a PR with the script and some instructions.
I've added
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When we make PRs like @jaylmiller's #5292 or #3463 we often want to know "does this make existing benchmarks faster / slower?". To answer this question we would like to:
1. Run the benchmarks on main
2. Run the benchmarks on the PR branch
3. Compare the results of the two runs
This workflow is supported well for the criterion-based microbenchmarks in https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/benches (by using criterion directly or by using https://github.com/BurntSushi/critcmp).
However, for the "end to end" benchmarks in https://github.com/apache/arrow-datafusion/tree/main/benchmarks there is no easy way I know of to do two runs and compare results.
Describe the solution you'd like
There is a "machine readable" output format generated with the -o parameter (as shown below). So the workflow would be:
Step 1: Create two or more output files using -o. This produces files like the ones in benchmarks.zip (see that archive for an example).
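A rough sketch of how one of these -o output files might be inspected, assuming the format is JSON; the exact schema (top-level keys and field names) is an assumption, so treat this as a way to discover the real structure rather than a description of it:

```python
# Hypothetical: peek at the structure of a single `-o` output file.
# Assumes the file is JSON; the actual schema may differ.
import json
import sys

def summarize(path: str) -> None:
    with open(path) as f:
        data = json.load(f)
    # Show the top-level keys so the real schema can be discovered,
    # then a truncated pretty-printed view of the contents.
    print("top-level keys:", list(data) if isinstance(data, dict) else type(data).__name__)
    print(json.dumps(data, indent=2)[:1000])

if __name__ == "__main__":
    summarize(sys.argv[1])
```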
Step 2: Compare the two files and prepare a report
This would produce an output report of some type. Here is an example of an output report (from @korowa on #5490 (comment)). Maybe they have a script they could share.
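As a sketch of what this comparison/report step could look like, assuming each result file is JSON with a list of per-query entries carrying a query identifier and an elapsed time; the field names (queries, query, elapsed) and the file names in the example are assumptions, not the actual format:

```python
# Hypothetical Step 2: compare two parsed result files and report the
# per-query ratio of elapsed times. Field names are assumptions.
import json

def load_timings(path: str) -> dict[str, float]:
    """Return {query_name: elapsed}, assuming a `queries` list whose
    entries carry `query` and `elapsed` fields."""
    with open(path) as f:
        data = json.load(f)
    return {str(q["query"]): float(q["elapsed"]) for q in data["queries"]}

def report(baseline_path: str, candidate_path: str) -> None:
    base = load_timings(baseline_path)
    cand = load_timings(candidate_path)
    print(f"{'query':<10} {'baseline':>12} {'candidate':>12} {'ratio':>8}")
    for name in sorted(base.keys() & cand.keys()):
        ratio = cand[name] / base[name]
        print(f"{name:<10} {base[name]:>12.2f} {cand[name]:>12.2f} {ratio:>8.2f}")

if __name__ == "__main__":
    report("main.json", "branch.json")  # hypothetical file names
```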
Describe alternatives you've considered
Another possibility might be to move the specialized benchmark binaries into criterion (so they look like "microbench"es), but I think this is non-ideal because of the number of parameters supported by the benchmarks.
Additional context