
Add DataFusion solution #18

Merged · 46 commits · Jul 3, 2023

Conversation

Dandandan

No description provided.

@Tmonster
Collaborator

Hi @Dandandan, the GitHub Actions script has been merged. If you merge with master, push the latest changes, and get a green check mark, this PR is good to merge!

@Dandandan
Author

> Hi @Dandandan, the GitHub Actions script has been merged. If you merge with master, push the latest changes, and get a green check mark, this PR is good to merge!

Awesome!

@Dandandan
Author

@Tmonster can we run it again?

@Dandandan
Author

@Tmonster CI passed 🥳

Dandandan requested a review from Tmonster on July 1, 2023 at 12:18
@Tmonster
Collaborator

Tmonster commented Jul 3, 2023

@Dandandan Great! I looked at some of the .err files myself. It seems like some of the queries were killed (groupby Q10, join Q4). I'm going to merge anyway, since most solutions don't pass all queries. Just wanted to let you know 👍

Tmonster merged commit 6577d9d into duckdblabs:master on Jul 3, 2023 · 1 check passed
@Dandandan
Author

@Tmonster awesome 👍 let me know when results are published

@jangorecki

@Dandandan do you think you could provide a DataFusion script for the rolling statistics task that is being developed in #9? I looked at the DataFusion docs and it seems to support those.

@@ -728,3 +728,45 @@ Report was generated on: `r format(Sys.time(), usetz=TRUE)`.
```{r status_set_success}
cat("history\n", file=get_report_status_file(), append=TRUE)
```


This chunk of code has been added in the wrong place. It needs to be added after the last solution, not at the end of the document.

@@ -235,7 +248,8 @@ groupby.query.exceptions = {list(
"polars" = list(),
"arrow" = list("Expression row_number() <= 2L not supported in Arrow; pulling data into R" = "max v1 - min v2 by id3", "Expression cor(v1, v2, ... is not supported in arrow; pulling data into R" = "regression v1 v2 by id2 id4"),
"duckdb" = list(),
"duckdb-latest" = list()
"duckdb-latest" = list(),
"datafusion" = list(),


The trailing comma makes this R script unparseable.

@@ -445,7 +468,7 @@ join.data.exceptions = {list(
"J1_1e9_NA_5_0","J1_1e9_NA_0_1") # q1 r1
)},
"polars" = {list(
"out of memory" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1")
"out of memory" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1"),


Again, the trailing comma makes this R script unparseable.

@Dandandan
Author

Yes, this would probably be possible using window functions
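
For illustration, here is a minimal sketch of what a rolling mean could look like through the DataFusion Python bindings' SQL interface. The table name, column names, and the 100-row window are placeholders, not the actual task definition from #9:

```python
# Minimal sketch of a rolling mean using a DataFusion window function.
# Table name, column names, and the window size are illustrative only;
# the real rolling-statistics task in #9 may be defined differently.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("x", "x.csv")  # placeholder input file

rolling = ctx.sql("""
    SELECT
        id1,
        v1,
        AVG(v1) OVER (
            ORDER BY id1
            ROWS BETWEEN 99 PRECEDING AND CURRENT ROW
        ) AS rolling_mean_v1
    FROM x
""")
rolling.collect()  # or rolling.to_pandas() to inspect the result locally
```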

@alamb

alamb commented Aug 5, 2023

I noticed that https://duckdblabs.github.io/db-benchmark/ does not yet have the results of the DataFusion run.

Perhaps due to jangorecki's comments

@jangorecki is there something DataFusion-specific that we can help with? I looked at the instructions at https://github.com/duckdblabs/db-benchmark#reproduce but I am not familiar with R.

@jangorecki

Not because of my comment. DuckDB's team said they are going to run the benchmark in September.

Ensuring it runs smoothly after the DataFusion merge is likely to reduce the waiting time. Otherwise DuckDB's team needs to debug and fix the missing or incorrect pieces, and if that takes too much time, potentially exclude the problematic solution (this is what I used to do when I needed benchmark timings but didn't have time to fix breaking changes in some of the tools, although I ran the benchmark more frequently, so skipping a solution once was less of a problem).

The benchmark can easily be run on a laptop using only the 1e7 data sizes (configured in _control/data.csv, column active). Steps to reproduce are included in the repo. Knowledge of R is not necessary to run it.

@Tmonster
Collaborator

Tmonster commented Aug 6, 2023

Exactly what Jan said. DuckDB is planning a release for September 11. At that point we will run the benchmarks again for all solutions.

https://duckdb.org/dev/release-dates (DuckDB is working on making this more visible)

If you are wondering how DataFusion will compare, you can take a look at reproducing the environment using the [regression.yml](https://github.com/duckdblabs/db-benchmark/blob/master/.github/workflows/regression.yml) file.

This can be run on most Amazon Ubuntu 22.04 boxes. Then you just need to generate the data and create the report.
One thing I have noticed about DataFusion is that it is not finishing the join or groupby benchmarks in the GitHub Actions runs.
See https://github.com/duckdblabs/db-benchmark/actions/runs/5769687734/job/15641984997 under "validate benchmarks": the DataFusion output doesn't print "[joining|grouping] finished, took X s" like the other solutions do. This usually indicates that the process was killed by the OOM reaper.

@Dandandan maybe you would like to take a look at this as well. Since all of the other solutions complete, I imagine DataFusion should be able to complete as well.

@jangorecki

A few issues I spotted and mentioned in this PR have been fixed by me in #9, so if you want to run the whole benchmark, it may be easier to use #9 (or just cherry-pick those changes). #9 should be good to merge; it runs fine on a laptop. Next week I will run it on AWS and then confirm it is ready to merge.

@Dandandan
Author

Thanks @jangorecki @Tmonster for the comments

I checked that the DataFusion solution was runnable (and it passed CI) when implementing it, but I missed the R syntax issue. Thanks @jangorecki for fixing this in #9!

I probably won't have time to run & compare solutions yet, so I think we can either wait for @Tmonster to run the new solution or someone else will have to step up.

@Tmonster DataFusion 28 (Python bindings) was released yesterday, which improves memory usage for high-cardinality grouping; maybe that resolves the OOM situation for DataFusion.
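
For context, the workload in question is grouped aggregation over a high-cardinality key. A rough sketch of that kind of query through the DataFusion Python bindings is shown below; the CSV path, key column, and aggregates mimic the benchmark's G1_* schema but are illustrative, not the solution script's exact code:

```python
# Illustrative sketch only: a high-cardinality groupby with the DataFusion
# Python bindings. The path, key column, and aggregates are placeholders
# modeled on the db-benchmark G1_* schema, not the actual solution script.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("x", "data/G1_1e7_1e2_0_0.csv")  # placeholder path

# Grouping on id3 produces many distinct groups, so the aggregation state
# grows large; this is where improved aggregation memory usage would help.
ans = ctx.sql("SELECT id3, SUM(v1) AS v1, AVG(v3) AS v3 FROM x GROUP BY id3")
batches = ans.collect()  # materialize the result
```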

@jangorecki

jangorecki commented Aug 7, 2023

BTW, if one wants to stress memory usage, the G1_1e7_2e0_0_0 (k=2) dataset can be used. AFAIK DuckDB's team runs only the default G1_1e7_1e2_0_0 (k=100). Multiple solutions failed (at 1e8 and 1e9 rows) with k=2 while passing with the default k=100.

@MrPowers

@jangorecki - so happy to see you contributing to this repo. Your work on this initiative has been so helpful to me over the years.

I am reading through this comment thread and it seems like we're good to go now and all issues have been resolved.

> DuckDB is planning a release for September 11.

Do you know if this was released? Will DataFusion be included the next time the benchmarks are run?

@jangorecki

@MrPowers no idea, I am unfortunately not associated with DuckDB. AFAIK the coming DuckDB release is a big milestone, so delays will be quite natural.
