Possible error in setup #5
Comments
Hi! I hear all the comments on the data.table result and I am surprised myself. I agree that when I run a smaller dataset (e.g. 1e6) I get an analysis similar to yours, but for some reason when I run it with more than 500M rows the DT performance drops. However, others who have tried the 1e9 dataset have not been able to reproduce my findings with specific regard to data.table. I need to understand why I am getting my data.table results -- I tried to create a clean environment for each test with system2() but it still isn't working well. I'm going to update my post on Twitter / the README to clarify that the data.table results are in dispute, as I don't want to spread misinformation. I'll update the polars code with what you suggested and keep working the data.table angle of this. |
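(A minimal sketch of that fresh-session idea, for reference: each backend gets its own self-contained script which is launched via system2() in a new R process and timed from the outside. The script name `bench_datatable.R` is hypothetical.)

```r
# Sketch only: run one benchmark script per fresh R session so that package
# state cannot leak between measurements. Assumes a self-contained script
# bench_datatable.R that reads the data and performs the grouped aggregation.
elapsed <- system.time(
  system2("Rscript", args = "bench_datatable.R", stdout = FALSE, stderr = FALSE)
)["elapsed"]
print(elapsed)
```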
I saw the tweet to you by @vincentarelbundock, and I also had a chance to hop onto a larger machine to validate my more traditional and standard use of benchmarking (i.e. write the expressions to measure directly in the call, and let the benchmarking package and function do the housekeeping) -- which complements Vincent's re-run of your script. In this large case of 1e9 rows, the difference between data.table and dplyr narrows, but both gain on collapse. So in short I am with Vincent here, and reiterate my concern from the top of this issue ticket. You likely want to build this up more carefully.

Code

```r
res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> collapse::fungroup(),
dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
                                      # …avg(measurement) from stable group by state"),
times = 10)
print(res)
ggplot2::autoplot(res)
```

Results Table

```r
> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+ times = 10)
> res
Unit: seconds
expr min lq mean median uq max neval
datatable 26.47838 28.13759 30.19073 29.81849 31.90281 35.00091 10
collapse 50.19573 50.93365 53.09633 52.23915 56.23458 57.41016 10
dplyr 26.84754 27.93581 30.34774 30.00457 30.68157 35.96263 10
>
```

Results Plot |
Hi! Thanks for taking the time to run the analysis -- I really value your and @vincentarelbundock's feedback, and I am working on how to better run the analysis to avoid setup issues. To clarify, when looking at your results, your DT and dplyr results are actually consistent with mine (e.g. both are around 30 seconds). For your collapse result -- if you run the analysis with fmean, fmax and fmin rather than the regular mean, min and max, do you still see the same outcomes? If you update with the collapse versions of the aggregation functions, I suspect you may get similar numbers to what I originally posted -- are you able to confirm? In my latest sourcing, I converted each file to a bash script and ran it separately with system2(), because when running it directly as a call in microbenchmark I was getting feedback that each simulation may be altering the current session, which I wanted to avoid. |
Good catch on me having dropped the collapse-specific f* functions:

```r
> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement), max=collapse::fmax(measurement), mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+ times = 10)
> res
Unit: seconds
expr min lq mean median uq max neval
datatable 29.06584 29.25456 30.59570 30.37740 31.55003 33.28734 10
collapse 51.01506 51.94234 53.87729 53.78601 54.91867 59.91626 10
dplyr 27.72498 29.45540 31.21465 30.65568 33.63867 35.84741 10
>
```

As for using … I also used … |
DuckDB registers an R data.frame as a virtual table in the database and should not actually load the data, so there should be little cost to registering the data as a table. (See line 20 in 83615d3.)
Of course, SQLite does not have such a feature, so copying data will always take time in SQLite. |
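(For reference, a minimal sketch of that registration from the R side -- the connection and the table name `stable` are illustrative, not taken from the benchmark script:)

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())           # in-memory DuckDB instance
duckdb_register(con, "stable", D)    # registers the data.frame D as a virtual table; no copy

dbGetQuery(con, "
  select state,
         min(measurement) as min_m,
         max(measurement) as max_m,
         avg(measurement) as mean_m
  from stable
  group by state
")

duckdb_unregister(con, "stable")
dbDisconnect(con, shutdown = TRUE)
```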
You may have misread / misinterpreted what I wrote and meant, and I did not share my code. In my exploratory tests I had an explicit … My objection, really, is about measuring the comparables, i.e. direct by-group summaries in … |
I see. By the way, as a complete aside, Polars has SQL support, so such a query can be handled completely in SQL if you use the `SQLContext`. |
Interesting. We may be starting to misuse this thread, but when I try that (with the Jammy build of `polars`) I get:

```r
> lf <- polars::pl$LazyFrame(D)
> res <- polars::pl$SQLContext(frame = lf)$execute("select min(measurement), max(measurement), avg(measurement) from frame group by state")
Error: Execution halted with the following contexts
0: In R: in $schema():
0: During function call [.rs.describeObject(<environment>, "res", TRUE)]
1: Encountered the following error in Rust-Polars:
duplicate: column with name 'measurement' has more than one occurrences
Error originated just after this operation:
DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"
Error originated just after this operation:
ErrorStateSync(AlreadyEncountered(duplicate: column with name 'measurement' has more than one occurrences
Error originated just after this operation:
DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"))
DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"
>
```

|
My guess is that, unlike most RDBMSs, Polars keeps the original column name for each aggregate rather than generating a unique one, so the three aggregates all collide on 'measurement'; giving each an alias should avoid the clash. |
Oh, for completeness, I am running the 1e9 examples on a large server on campus having 512 GB RAM (!!) and 48 (slow-ish) Xeon cores at 2.1 GHz. If memory serves, … |
Most excellent:

```r
> res <- polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")
> res
[1] "polars LazyFrame naive plan: (run ldf$describe_optimized_plan() to see the optimized plan)"
SELECT [col("min_m"), col("max_m"), col("mean_m")] FROM
AGGREGATE
[col("measurement").min().alias("min_m"), col("measurement").max().alias("max_m"), col("measurement").mean().alias("mean_m")] BY [col("state")] FROM
DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"
>
```

and wicked fast (same as …):

```r
> system.time(res$collect())
user system elapsed
70.505 19.169 2.140
>
```

And still very respectable, and even best in class, when we do it all in one go:

```r
> system.time({lf <- polars::pl$LazyFrame(D); res <- polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")$collect()})
user system elapsed
82.638 29.134 22.918
>
```

I will stick this into the benchmark proper. Big thank you, as always, @eitsupi! |
We have a new champion:

```r
> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement), max=collapse::fmax(measurement), mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+ polars = {{lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")$collect()}},
+ times = 10)
> res
Unit: seconds
expr min lq mean median uq max neval
datatable 27.04999 27.57103 28.90611 29.30950 29.95487 30.76303 10
collapse 51.02113 51.96048 53.08097 52.93228 53.61110 56.55661 10
dplyr 28.85432 29.22231 31.12528 30.57937 33.38361 34.94457 10
polars 23.64507 23.92012 25.25210 24.77925 26.39392 28.33606 10
> ggplot2::autoplot(res)
>
```

|
Very impressively:

```r
> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement), max=collapse::fmax(measurement), mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+ polars_sql = {{lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")$collect()}},
+ polars = polars::pl$DataFrame(D)$group_by("state")$agg(polars::pl$col("measurement")$min()$alias("min_m"),polars::pl$col("measurement")$max()$alias("max_m"),polars::pl$col("measurement")$mean()$alias("mean_m")),
+ times = 10)
> res
Unit: seconds
expr min lq mean median uq max neval
datatable 26.40483 27.36088 28.79502 28.61824 30.08184 32.24413 10
collapse 47.62153 49.16795 50.97285 50.07759 51.46359 60.20597 10
dplyr 28.40911 29.34691 30.85193 30.55275 31.97368 34.53324 10
polars_sql 22.37270 22.51855 24.28000 23.36114 26.76574 27.04358 10
polars 22.41094 22.66564 23.44019 22.94000 23.88544 25.63082 10
> ggplot2::autoplot(res)
>
```

|
And as the old 'it all depends' still holds, here is what I get back on my (slightly aged) computer with 1e6 and 1e7:

1e6

```r
> D <- data.table::fread("measurements.1e6.csv")
+ sqltxt <- "select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state"
+ res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),
+ min=min(measurement),
+ max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement),
+ max=collapse::fmax(measurement),
+ mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement),
+ max=max(measurement),
+ mean=mean(measurement)) |> dplyr::ungroup(),
+ polars_sql = {lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute(sqltxt)$collect()},
+ polars = polars::pl$DataFrame(D)$group_by("state")$agg(polars::pl$col("measurement")$min()$alias("min_m"),
+ polars::pl$col("measurement")$max()$alias("max_m"),
+ polars::pl$col("measurement")$mean()$alias("mean_m")),
+ times = 10)
+
+ print(res)
Unit: milliseconds
expr min lq mean median uq max neval cld
datatable 13.3638 13.7731 15.0158 14.9566 15.8077 17.4406 10 a
collapse 17.3992 18.0630 25.3196 23.7119 30.3520 42.5892 10 b
dplyr 20.3400 22.9261 25.4593 26.2902 26.8203 31.4192 10 b
polars_sql 32.3368 33.0572 34.8059 34.9782 35.3903 39.2898 10 c
polars 33.0885 33.2309 35.2959 35.1821 36.4092 38.9956 10 c
>
```

1e7

```r
> D <- data.table::fread("measurements.1e7.csv")
+ sqltxt <- "select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state"
+ res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),
+ min=min(measurement),
+ max=max(measurement)),by=state],
+ collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement),
+ max=collapse::fmax(measurement),
+ mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+ dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement),
+ max=max(measurement),
+ mean=mean(measurement)) |> dplyr::ungroup(),
+ polars_sql = {lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute(sqltxt)$collect()},
+ polars = polars::pl$DataFrame(D)$group_by("state")$agg(polars::pl$col("measurement")$min()$alias("min_m"),
+ polars::pl$col("measurement")$max()$alias("max_m"),
+ polars::pl$col("measurement")$mean()$alias("mean_m")),
+ times = 10)
+
+ print(res)
+ p1e7 <- ggplot2::autoplot(res) + ggplot2::ggtitle("N = 1e7") + tinythemes::theme_ipsum_rc()
Unit: milliseconds
expr min lq mean median uq max neval cld
datatable 138.280 153.186 179.441 163.553 172.607 325.932 10 a
collapse 387.709 401.260 437.516 426.304 475.471 526.429 10 b
dplyr 226.344 239.860 268.107 268.591 281.579 326.171 10 c
polars_sql 287.727 291.987 301.697 297.225 299.797 346.353 10 c
polars 291.714 292.332 298.029 297.448 301.144 307.805 10 c
>
```

With … |
We can be a little cute and use `range()` to get min and max in one pass:

```r
library(data.table)
fun <- function(x) {
range <- range(x)
list(mean = mean(x), min = range[1], max = range[2])
}
DT <- fread("measurements.csv")
(res <- microbenchmark::microbenchmark(
dt = DT[ ,.(mean=mean(measurement), min=min(measurement), max=max(measurement)), by=state],
dt_range = DT[, fun(measurement), by = state],
times = 1,
check = "equal"
))
#> Unit: seconds
#> expr min lq mean median uq max neval
#> dt 17.52600 17.52600 17.52600 17.52600 17.52600 17.52600 1
#> dt_range 10.23352 10.23352 10.23352 10.23352 10.23352 10.23352     1
```

|
That's good. Likely nixes out one pass over the data. |
For completeness, when I posted this on social media, @schochastics made two good points:
I am also starting to think that these should really be presented / aggregated over 1e6, 1e7, 1e8, and 1e9 rows. The latter is interesting (if only for the comparison with 1brc) but that size seems to flatten some differences, as the core time seems to be spent running over the data vector to compute the statistics (three times, as @TimTaylor observed -- which can be improved upon too). |
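(A rough sketch of how such a multi-size comparison could be collected in one pass -- file names follow the `measurements.<N>.csv` pattern used earlier in this thread, and only two backends are shown to keep it short:)

```r
# Sketch: run the same microbenchmark over several data sizes and stack the summaries.
sizes <- c("1e6", "1e7", "1e8")
timings <- do.call(rbind, lapply(sizes, function(n) {
  D <- data.table::fread(sprintf("measurements.%s.csv", n))
  res <- microbenchmark::microbenchmark(
    datatable = D[, .(mean = mean(measurement), min = min(measurement), max = max(measurement)), by = state],
    dplyr     = D |> dplyr::group_by(state) |> dplyr::summarise(min = min(measurement), max = max(measurement), mean = mean(measurement)),
    times = 5
  )
  cbind(rows = n, summary(res))      # summary() returns one data.frame row per expression
}))
timings
```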
Here is my base R benchmark excluding polars (and two versions of tapply, because I was wondering if do.call is faster):

```r
D <- data.table::fread("measurements1e6.csv", stringsAsFactors = TRUE)
sum_stats_dplyr <- function(x) {
rg <- range(x)
tibble::tibble(mean = mean(x), min = rg[1], max = rg[2])
}
sum_stats_list <- function(x) {
rg <- range(x)
list(mean = mean(x), min = rg[1], max = rg[2])
}
sum_stats_vec <- function(x) {
rg <- range(x)
c(mean = mean(x), min = rg[1], max = rg[2])
}
microbenchmark::microbenchmark(
dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(sum_stats_dplyr(measurement)) |> dplyr::ungroup(),
datatable = D[, .(sum_stats_list(measurement)), by = state],
collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(sum_stats_dplyr(measurement)) |> collapse::fungroup(),
aggregate = aggregate(measurement ~ state, data = D, FUN = sum_stats_vec),
tapply1 = do.call("rbind", tapply(D$measurement, D$state, sum_stats_vec)),
tapply2 = sapply(tapply(D$measurement, D$state, sum_stats_vec), rbind),
lapply = {
results <- lapply(split(D, D$state), function(x) {
rg <- range(x$measurement)
data.frame(
state = unique(x$state),
min = rg[1],
max = rg[2],
mean = mean(x$measurement)
)
})
do.call(rbind, results)
},
by = by(D$measurement, D$state, sum_stats_vec), times = 10
)
```

detailed for N=1e8
|
OK, there is an interesting contender:

```r
reduce = {
state_list <- split(D$measurement, D$state)
Reduce(function(x, y) {
res <- sum_stats_vec(state_list[[y]])
rbind(x, c(state = y, mean = res[1], min = res[2], max = res[3]))
}, names(state_list), init = NULL)
}
```

for N=1e8
|
BTW, I found that sorting by, say, median time helps. I found an open issue at … |
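(If memory serves, the `microbenchmark` print and summary methods take an `order` argument, so the sorting can be requested directly:)

```r
# Order the printed timings by the median column
print(res, order = "median")
```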
Sorry for spamming this issue so much, but I think this is the last one from me:

```r
rfast = {
lev_int <- as.numeric(D$state)
minmax <- Rfast::group(D$measurement, lev_int, method = "min.max")
data.frame(
state = levels(D$state),
mean = Rfast::group(D$measurement, lev_int, method = "mean"),
min = minmax[1, ],
max = minmax[2, ]
)
}
```

Again, N=1e8 (ordered by mean, thanks @eddelbuettel):
Unfortunately I am failing to install … |
you can actually gain a few secs on 1e9 if you apply an additional SAC to DT:
|
For completeness, and as @TimTaylor had pointed out to me in DM and at the …

Results

```r
> res
Unit: seconds
expr min lq mean median uq max neval
datatable 11.15210 12.11368 13.50438 13.47416 14.62594 16.71623 10
polars 22.73818 22.84256 24.19483 23.97759 24.61460 26.98668 10
polars_sql 22.65298 22.89407 24.85488 24.19280 26.52703 28.26696 10
dplyr 27.50917 30.36789 30.73162 31.01001 31.30612 33.00118 10
collapse 48.79312 50.55301 53.45232 52.58460 54.01983 61.88502 10
tapply 55.46752 56.28768 57.49995 57.24879 58.14358 61.21602 10
lapply 71.10101 73.53757 75.22146 74.20503 77.62213 80.32287 10
by 71.32586 72.49985 74.28661 74.38054 75.79082 78.18177 10
>
```

Plot

This uses the current development branch of data.table; the respective changes have been in that branch for a while as far as I know. |
Awesome - I hadn't compared with polars so that's an impressive surprise. |
@eddelbuettel I hear you on your reasons for not including DuckDB. But since Polars is using the same OLAP-type engine approach, albeit here embedded directly in an R "dialect", I feel it's only fair that DuckDB gets a showing too. (Also to show R users, specifically, just how freaking fast it is.) As @eitsupi says, registering a DuckDB table in memory from R is pretty quick and painless. No real difference IMO to setting/scanning a lazy Polars frame. Otherwise, very impressive speed improvements in the latest version of data.table. I believe an official 1.15.0 release is just around the corner... |
duckdb memory vs duckdb disk vs data.table (dev)
|
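(A sketch of what the memory-vs-disk DuckDB comparison might look like -- table and file names are illustrative, and the `dbWriteTable()` copy is deliberately left outside the timings:)

```r
library(DBI)
library(duckdb)

con_mem  <- dbConnect(duckdb())                          # in-memory database
con_disk <- dbConnect(duckdb(), dbdir = "bench.duckdb")  # database backed by a file on disk

dbWriteTable(con_mem,  "measurements", D)
dbWriteTable(con_disk, "measurements", D)

sql <- "select state, min(measurement), max(measurement), avg(measurement)
        from measurements group by state"

system.time(dbGetQuery(con_mem,  sql))
system.time(dbGetQuery(con_disk, sql))

dbDisconnect(con_mem,  shutdown = TRUE)
dbDisconnect(con_disk, shutdown = TRUE)
```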
I had these too (using a memory db), but hesitated showing them. Of course … Not accounting for creating the … |
on another note, having a big machine does not seem to be such a big deal with data.table (unless I got something wrong)
|
@eddelbuettel - I think the closest comparison (if you wanted to go that route) would be using the … That said - constraints are good... For my own timings I've made use of {inline} and a simple compiled loop ... 'tis all still R code but doesn't fit with the spirit so 🤷♂️ |
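(Not Tim's actual code, which is not shown here -- just a rough sketch of what such a compiled single-pass loop could look like, using `Rcpp::cppFunction()` instead of {inline} for brevity. It assumes `state` is a factor, e.g. via `fread(..., stringsAsFactors = TRUE)`, so that `as.integer()` yields 1-based group ids.)

```r
# Sketch of a single-pass grouped min/max/mean in compiled code.
Rcpp::cppFunction('
DataFrame group_stats(NumericVector x, IntegerVector g, int ngroups) {
  NumericVector mn(ngroups, R_PosInf), mx(ngroups, R_NegInf), tot(ngroups), cnt(ngroups);
  for (R_xlen_t i = 0; i < x.size(); ++i) {
    int k = g[i] - 1;                 // 1-based factor level -> 0-based index
    double v = x[i];
    if (v < mn[k]) mn[k] = v;
    if (v > mx[k]) mx[k] = v;
    tot[k] += v;
    cnt[k] += 1.0;
  }
  NumericVector avg = tot / cnt;      // element-wise division
  return DataFrame::create(_["min"] = mn, _["max"] = mx, _["mean"] = avg);
}')

res <- group_stats(D$measurement, as.integer(D$state), nlevels(D$state))
res$state <- levels(D$state)
res
```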
Setting up an index beforehand on the …
|
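(Presumably something along these lines -- a guess at the indexing step, not the exact code from the comment above: keying the table by the grouping column so the subsequent `by = state` aggregation operates on already-sorted data.)

```r
library(data.table)
DT <- fread("measurements.csv")

setkey(DT, state)       # physically sorts DT by state and marks it as keyed
# setindex(DT, state)   # alternative: a secondary index that leaves row order untouched

system.time(
  DT[, .(mean = mean(measurement), min = min(measurement), max = max(measurement)), by = state]
)
```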
I think what you see here is the Linux kernel helping you turn disk into memory by caching ... |
The threading example is good. As is using an index -- we'd get that via … |
It generally should not, because range() is not GForce-optimized. Are you by any chance on an older version of DT? Using verbose=TRUE is quite useful in such cases. |
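(For anyone following along, checking whether GForce kicks in is a one-liner -- the verbose output reports the optimized grouping plan:)

```r
# verbose = TRUE prints the query optimization details, including GForce usage
DT[, .(min = min(measurement), max = max(measurement), mean = mean(measurement)),
   by = state, verbose = TRUE]
```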
Yes, initially it was 1.14.10 but I have now switched to master, and the direct (min, max) approach is much faster. |
Maybe a bit late to the party, but here we go. Setup from lvalnegri:

```r
library(datasets)
library(data.table)
n <- 1e8
set.seed(2024)
y <- data.table(
m = stats::rnorm(n),
s = base::sample(datasets::state.abb, size = n, replace = TRUE)
)
setkey(y, 's')
setDTT <- \(x) { setDTthreads(x); y[ , .(mean(m), min(m), max(m)), s] }
yr <- microbenchmark::microbenchmark(
'dt02' = setDTT(2),
'dt04' = setDTT(4),
'dt06' = setDTT(6),
'dt08' = setDTT(8),
'dt10' = setDTT(10),
times = 10
)
yr
# 1.14.10
# Unit: seconds
# expr min lq mean median uq max neval cld
# dt02 3.003172 3.074757 3.148601 3.138230 3.240320 3.311348 10 a
# dt04 2.751539 2.840537 2.906491 2.877485 2.963776 3.119828 10 b
# dt06 2.699096 2.802055 2.850967 2.848263 2.914460 3.045719 10 b
# dt08 2.663448 2.682338 2.805176 2.787870 2.860383 3.063182 10 bc
# dt10 2.621527 2.682646 2.703065 2.705433 2.739840 2.802580 10 c
# 1.14.99 aka 1.15.0
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dt02 1401.196 1515.518 1546.238 1542.442 1600.611 1670.684 10 a
# dt04 1106.861 1290.204 1318.007 1303.570 1365.536 1450.535 10 b
# dt06 1074.459 1087.036 1143.726 1128.395 1223.271 1235.099 10 c
# dt08 1036.661 1042.637 1116.963 1127.266 1176.382 1195.730 10 cd
# dt10 967.258 1021.693 1041.920 1026.675 1043.143 1143.767 10 d
# gmin parallelized per group
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# dt02 1221.1177 1264.3856 1316.9177 1311.3068 1392.7166 1416.7932 10 a
# dt04 844.3419 910.7867 1144.4065 941.2959 1035.7934 2937.8885 10 ab
# dt06 767.6377 787.4808 833.9695 812.1181 842.2414 1011.9838 10 b
# dt08 785.2025 802.7681 847.5161 814.6925 848.3238 1024.4026 10 b
# dt10  697.5232  753.0673  809.0413  799.7735  883.6270  907.5998    10 b
```

With more groups or more cores the speedup of the gmin-parallelized version over 1.15.0 should be even higher. |
Hey everyone -- thanks especially to @eddelbuettel, @eitsupi, @TimTaylor and @schochastics for all the discussion! I had business travel to Thailand and couldn't work on this, and quite frankly the conversation is way above my skill level. What I have basically concluded is that it is a bit meaningless to call something a "benchmark" unless there are consistent and clear standards for said benchmark. Seeing the various results, I noticed there can be considerable variability in running the different platforms, which suggests that in addition to R environment / package issues, the underlying machine and configuration can add significant variability to results. The lesson for me is that different packages can perform differently depending on how you set them up, the dataset's structure and your machine's configuration -- but trying to declare a "winner" is futile unless the benchmark standards are clear. Thank you everyone for spending time and effort in helping to understand the issue. I do eventually want to do a write-up on the learnings here, but I won't have time any time soon. |
BTW, this is a good read: https://rdatatable.gitlab.io/data.table/articles/datatable-benchmarking.html -- basics mostly, but still important. |
Thanks for picking up the task and trying this. As the outcome 'does not pass the smell test', I had a really brief look in which I

- generated a smaller data set via `Rscript generate_data.R 1e6`, and
- compared `data.table` (an outlier in your results), `collapse` and `dplyr`.

My results are closer to what is expected:

We can also plot this thanks to `ggplot2::autoplot()`:

I am also not sure if going via `tidypolars` is fair to `rpolars`. It would be better to do this directly, and maybe add the `tidypolars` variant as an alternate.

The (very short) script follows.