Benchmark question #50
Speaking of benchmarks… It would be interesting to see MonetDBLite included: https://www.monetdb.org/blog/monetdblite-r
Hi @st-pasha, thanks for the praise. I posted the benchmark code here; your evaluation is much appreciated! The code includes the figures published on the fstpackage.org website. A few remarks on the benchmark:
If you have anything to share on your evaluations later on, please do!
Hi Mark, thanks for all the info. I did a talk yesterday, and while preparing for it I discovered the command …
Hi @mattdowle, thanks for the tip on measuring the real drive speed; that's a great tool, by the way. But the RevoDrive is in a server at my company which is stuck with Windows, so it will take some effort to get that measurement. The first thing that comes to mind on the benchmark results is that I am using the in-memory size (from R) as the base size for speed calculations. That size is equal to the uncompressed size. I will take a closer look at the exact performance measurements and will get back on that, thanks!
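(The base-size remark above matters more than it may seem: the same measured time yields very different "GB/s" figures depending on whether you divide the in-memory size or the compressed on-disk size by it. A minimal sketch, with made-up illustrative sizes, not numbers from this thread:)

```python
# Why the base size matters when quoting I/O speeds (illustrative numbers only).
in_memory_gb = 4.0    # hypothetical: uncompressed size of the data set in R
on_disk_gb   = 1.6    # hypothetical: compressed file size on disk
seconds      = 5.0    # hypothetical: measured (de)serialization time

speed_logical  = in_memory_gb / seconds  # speed relative to the in-memory size: 0.8 GB/s
speed_physical = on_disk_gb / seconds    # speed the drive actually sustains:    0.32 GB/s
print(speed_logical, speed_physical)
```

With compression, the "logical" speed can legitimately exceed the raw drive bandwidth, which is why the choice of base size has to be stated alongside any benchmark figure.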
I did some further benchmarking on packages … Additionally:
```r
# estimating total speed from column speeds
speed_tot <- 1 / ((1 / speed_col1) + (1 / speed_col2) + ...)
```

so basically taking the inverse of the sum of the inverted speeds for each column type used (the effect of average string length is not included here). These results show that when comparing performance between various solutions, the chosen data set is critically important, and it would be very nice to have a set of type-specific benchmark data sets that can be used as a baseline! I've posted the script for this uncompressed benchmark here (it's a bit raw, apologies for that :-))
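(The combination rule above can be sanity-checked numerically: if column i parses at s_i rows/sec, one full row costs sum(1/s_i) seconds, so the table parses at 1/sum(1/s_i) rows/sec. A minimal sketch, in Python for illustration; the per-column speeds are made-up numbers, not measurements from this thread:)

```python
def combined_speed(col_speeds):
    """Combine per-column parse speeds (rows/sec) into a total table speed.

    Each row costs sum(1/s_i) seconds across its columns, so the whole
    table parses at 1 / sum(1/s_i) rows/sec, matching the formula above.
    """
    return 1.0 / sum(1.0 / s for s in col_speeds)

# Hypothetical per-column speeds (rows/sec), for illustration only.
print(combined_speed([4e6, 4e6]))  # two equal columns -> half the per-column speed: 2000000.0
```

One consequence: a single slow column (e.g. long strings) dominates the total, which is exactly why the data set composition matters so much in these comparisons.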
Wow - this is awesome!
Hello again. Do you use Windows on your laptop, like the server? The reason I ask is that there has been a problem reported on Windows with …
Relief! Thanks for the info. I've borrowed a Windows 8.1 machine and managed to reproduce something similar. Investigating.
Found and fixed the problem. I confirmed it's fixed for me on Windows 8.1. Please try again.
Hi @mattdowle, multithreaded … Then I checked multiple-column csv files and suddenly the threads kicked in. Working on the 1e6 dataset from …

```r
fwrite(DT[, .(a)], "singlecol.csv")
fwrite(DT[, .(a, b)], "dualcol.csv")
```

```
> microbenchmark(fread("singlecol.csv"), times = 1)
Unit: milliseconds
                   expr      min       lq     mean   median       uq      max neval
 fread("singlecol.csv") 52.72078 52.72078 52.72078 52.72078 52.72078 52.72078     1
> microbenchmark(fread("dualcol.csv"), times = 1)
Unit: milliseconds
                 expr      min       lq     mean   median       uq      max neval
 fread("dualcol.csv") 18.77348 18.77348 18.77348 18.77348 18.77348 18.77348     1
```

So the single-column file, while exactly half the length, took 3 times longer to load. Apparently, the threads only kick in with more than 1 column (and my benchmark was using single-column files :-))
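(To put numbers on that observation, using only the two timings quoted above: the single-column file holds roughly half the bytes yet took close to 3x longer, so per byte the single-column path is parsing several times slower, consistent with it staying single-threaded:)

```python
# Sanity check on the single- vs dual-column fread timings quoted above.
single_ms = 52.72078   # fread("singlecol.csv"), from the output above
dual_ms   = 18.77348   # fread("dualcol.csv"),  from the output above

slowdown = single_ms / dual_ms      # wall-clock ratio, ~2.8x on ~half the data
per_byte_slowdown = 2 * slowdown    # ~5.6x slower per byte parsed
print(f"{slowdown:.2f}x wall clock, ~{per_byte_slowdown:.1f}x per byte")
```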
I'm guessing the …
Excellent - thanks for working this out!! Yes, there's no reason it shouldn't go MT on a single column, so hopefully it's a simple logic bug to fix somewhere. It also doesn't show the progress meter when ST; same area to fix. Will do …
Thanks, impressive work on the multi-threading! And it's interesting to see that OpenMP now works very well within the R tool chain. I was thinking of TinyThreads++ for a more fine-grained solution for 'fst', but I think OpenMP might do the job just fine for …
@MarcusKlik and @mattdowle, I really like both of your packages. Hopefully you guys will work closely together going forward.
Hi @wei-wu-nyc, thanks. First, in the interface milestone (#48) I have planned a 'simple' … It would be interesting to add a few sample use-cases for working with large data sets down the road. Common tasks such as:
Some of the Kaggle competitions have (open) data which would be very suitable for these use-cases, and they would represent real-life problems, so it would be interesting to explore the use of …
Ok ... single-column input now goes multi-threaded, and a few other problems are fixed too.
Hi @mattdowle, just a quick benchmark to confirm: all looks fine now! I will test with more observations and larger files soon. Thanks for the quick fix!
Only using 4 cores here, but the performance is already very impressive. And I expect much more of a boost in performance for …
And with the single/dual column test:

```r
nrOfRows = 2e8
DT <- data.table(a = 1:nrOfRows, b = 1:nrOfRows)
fwrite(DT[, .(a)], "singlecol.csv")
fwrite(DT[, .(a, b)], "dualcol.csv")
```

we get:

```
> microbenchmark(fread("singlecol.csv"), times = 1)
Read 200000000 rows x 1 columns from 1.945 GB file in 00:02.681 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Unit: seconds
                   expr      min       lq     mean   median       uq      max neval
 fread("singlecol.csv") 2.847271 2.847271 2.847271 2.847271 2.847271 2.847271     1
> microbenchmark(fread("dualcol.csv"), times = 1)
Read 200000000 rows x 2 columns from 3.705 GB file in 00:06.439 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Unit: seconds
                 expr      min       lq     mean   median       uq      max neval
 fread("dualcol.csv") 7.027425 7.027425 7.027425 7.027425 7.027425 7.027425     1
```

for a 200 million row data set with integers
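(The effective read throughput implied by those runs can be computed directly from the file sizes and wall-clock times quoted in the fread output above, i.e. size divided by time:)

```python
# Effective read throughput from the wall-clock figures quoted above.
# File sizes (GB) and times (seconds) are taken verbatim from the fread output.
runs = {
    "singlecol.csv": (1.945, 2.681),
    "dualcol.csv":   (3.705, 6.439),
}
for name, (gb, secs) in runs.items():
    print(f"{name}: {gb / secs:.3f} GB/s")  # ~0.725 and ~0.575 GB/s respectively
```

Both rates are relative to the on-disk csv size; as discussed earlier in the thread, the number would differ if measured against the in-memory size.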
Great, impressive results!
Hi @mattdowle, @st-pasha and @arunsrinivasan, the next release of …
@MarcusKlik Great -- yes please. Do you have a blog-preview link? |
Hi Marcus,
I've just learned about your package, and its performance on the benchmarks looks absolutely impressive!
However, could you please clarify some details about the test environment:
Thanks, and keep up the good work!