
Add DataFusion solution #18

Merged · 46 commits · Jul 3, 2023

Conversation

Dandandan

No description provided.

@Tmonster
Collaborator

Hi @Dandandan, the GitHub Actions script has been merged. If you merge with master, push the latest changes, and get a green check mark, this PR is good to merge!

@Dandandan
Author

> Hi @Dandandan, the GitHub Actions script has been merged. If you merge with master, push the latest changes, and get a green check mark, this PR is good to merge!

Awesome!

@Dandandan
Author

@Tmonster can we run it again?

@Dandandan
Author

@Tmonster CI passed 🥳

Dandandan requested a review from Tmonster on July 1, 2023 at 12:18
@Tmonster
Collaborator

Tmonster commented Jul 3, 2023

@Dandandan Great! I looked at some of the .err files myself. It seems like some of the queries were killed (groupby Q10, join Q4). I'm going to merge anyway, since most solutions don't pass all queries. Just wanted to let you know 👍

Tmonster merged commit 6577d9d into duckdblabs:master on Jul 3, 2023 · 1 check passed
@Dandandan
Author

@Tmonster awesome 👍 let me know when results are published

@jangorecki

@Dandandan do you think you could provide a DataFusion script for the rolling statistics task that is being developed in #9? I looked at the DataFusion docs and it seems to support those.

@@ -728,3 +728,45 @@ Report was generated on: `r format(Sys.time(), usetz=TRUE)`.
```{r status_set_success}
cat("history\n", file=get_report_status_file(), append=TRUE)
```


This chunk of code has been added in the wrong place. It needs to be added after the last solution, not at the end of the document.

@@ -235,7 +248,8 @@ groupby.query.exceptions = {list(
"polars" = list(),
"arrow" = list("Expression row_number() <= 2L not supported in Arrow; pulling data into R" = "max v1 - min v2 by id3", "Expression cor(v1, v2, ... is not supported in arrow; pulling data into R" = "regression v1 v2 by id2 id4"),
"duckdb" = list(),
"duckdb-latest" = list()
"duckdb-latest" = list(),
"datafusion" = list(),


The trailing comma makes this R script unparseable.

@@ -445,7 +468,7 @@ join.data.exceptions = {list(
"J1_1e9_NA_5_0","J1_1e9_NA_0_1") # q1 r1
)},
"polars" = {list(
"out of memory" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1")
"out of memory" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1"),


Again, the trailing comma makes this R script unparseable.

@Dandandan
Author

Yes, this would probably be possible using window functions
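
For illustration, here is a minimal sketch of what a rolling mean could look like through the DataFusion Python bindings' SQL interface. The table name, column names, and the 100-row window are placeholders, not the actual task definition from #9:

```python
# Minimal sketch of a rolling mean using a DataFusion window function.
# Table name, column names, and the window size are illustrative only;
# the real rolling-statistics task in #9 may be defined differently.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("x", "x.csv")  # placeholder input file

rolling = ctx.sql("""
    SELECT
        id1,
        v1,
        AVG(v1) OVER (
            ORDER BY id1
            ROWS BETWEEN 99 PRECEDING AND CURRENT ROW
        ) AS rolling_mean_v1
    FROM x
""")
rolling.collect()  # or rolling.to_pandas() to inspect the result locally
```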

@alamb

alamb commented Aug 5, 2023

I noticed that https://duckdblabs.github.io/db-benchmark/ does not yet have the results of the DataFusion run.

Perhaps due to jangorecki's comments

@jangorecki is there something DataFusion-specific that we can help with? I looked at the instructions at https://github.com/duckdblabs/db-benchmark#reproduce but I am not familiar with R.

@jangorecki

Not because of my comment. DuckDB's team said they are going to run the benchmark in September.

Ensuring it runs smoothly after the DataFusion merge is likely to reduce the waiting time. Otherwise DuckDB's team needs to debug and fix the missing or incorrect pieces, and if that takes too much time, potentially exclude the problematic solution (this is what I used to do when I needed benchmark timings but didn't have time to fix breaking changes in some of the tools, although I ran the benchmark more frequently, so skipping a solution once was less of a problem).

The benchmark can easily be run on a laptop using only the 1e7 data sizes (configured in _control/data.csv, column active). Steps to reproduce are included in the repo. Knowledge of R is not necessary to run it.

@Tmonster
Collaborator

Tmonster commented Aug 6, 2023

Exactly what Jan said. DuckDB is planning a release for September 11. At that point we will run the benchmarks again for all solutions.

https://duckdb.org/dev/release-dates (DuckDB is working on making this more visible)

If you are wondering how DataFusion will compare, you can take a look at reproducing the environment using the [regression.yml](https://github.com/duckdblabs/db-benchmark/blob/master/.github/workflows/regression.yml) file.

This can be run on most Amazon Ubuntu 22.04 boxes. Then you just need to generate the data and create the report.
One thing I have noticed about DataFusion is that it is not finishing the join or groupby benchmarks in the GitHub Actions runs.
See https://github.com/duckdblabs/db-benchmark/actions/runs/5769687734/job/15641984997 under "validate benchmarks": the DataFusion output doesn't print "[joining|grouping] finished, took X s" like the other solutions do. This usually indicates that the process was killed by the OOM reaper.

@Dandandan maybe you would like to take a look at this as well. Since all of the other solutions complete, I imagine DataFusion should be able to complete as well.

@jangorecki

A few issues I spotted and mentioned in this PR have been fixed by me in #9, so if you want to run the whole benchmark, it may be easier to use #9 (or just cherry-pick those changes). #9 should be good to merge; it runs fine on a laptop. Next week I will run it on AWS and then confirm it is ready to merge.

@Dandandan
Author

Thanks @jangorecki @Tmonster for the comments

I checked that the DataFusion solution was runnable (and it passed CI) when implementing it, but I missed the R syntax issue. Thanks @jangorecki for fixing this in #9!

I probably won't have time to run & compare solutions yet, so I think we can either wait for @Tmonster to run the new solution or someone else will have to step up.

@Tmonster DataFusion 28 (Python bindings) was released yesterday, which improves memory usage for high-cardinality grouping; maybe that resolves the OOM situation for DataFusion.
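
For context, the workload in question is grouped aggregation over a high-cardinality key. A rough sketch of that kind of query through the DataFusion Python bindings is shown below; the CSV path, key column, and aggregates mimic the benchmark's G1_* schema but are illustrative, not the solution script's exact code:

```python
# Illustrative sketch only: a high-cardinality groupby with the DataFusion
# Python bindings. The path, key column, and aggregates are placeholders
# modeled on the db-benchmark G1_* schema, not the actual solution script.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("x", "data/G1_1e7_1e2_0_0.csv")  # placeholder path

# Grouping on id3 produces many distinct groups, so the aggregation state
# grows large; this is where improved aggregation memory usage would help.
ans = ctx.sql("SELECT id3, SUM(v1) AS v1, AVG(v3) AS v3 FROM x GROUP BY id3")
batches = ans.collect()  # materialize the result
```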

@jangorecki

jangorecki commented Aug 7, 2023

BTW, if one wants to stress memory usage, the G1_1e7_2e0_0_0 (k=2) dataset can be used. AFAIK DuckDB's team runs only the default G1_1e7_1e2_0_0 (k=100). Multiple solutions failed (at 1e8 and 1e9 rows) with k=2 while passing with the default k=100.

@MrPowers

@jangorecki - so happy to see you contributing to this repo. Your work on this initiative has been so helpful to me over the years.

I am reading through this comment thread and it seems like we're good to go now and all issues have been resolved.

> DuckDB is planning a release for September 11.

Do you know if this was released? Will DataFusion be included the next time the benchmarks are run?

@jangorecki

@MrPowers no idea, I am unfortunately not associated with DuckDB. AFAIK the coming DuckDB release is a big milestone, so delays will be quite natural.
