
Benchmark question #50

Open
st-pasha opened this issue Apr 12, 2017 · 24 comments

Comments

@st-pasha

Hi Marcus,

I've just learned about your package, and its performance on the benchmarks looks absolutely impressive!
However could you please clarify some details about the test environment:

  • What hard drive was in the machine where the benchmarks were run?
  • Which versions of the software packages (fst, read_feather, readRDS, fread) were tested?
  • What dataset was used? Is it public? Or if it is randomly-generated, could you publish the script to create the same dataset?
  • In general, could you please publish the code you used to run the benchmark, so that we can evaluate it against newest versions of packages or against various datasets?

Thanks, and keep up the good work!

@phillc73

Speaking of benchmarks.... It would be interesting to see MonetDBLite included:

https://www.monetdb.org/blog/monetdblite-r
https://github.com/hannesmuehleisen/MonetDBLite

@MarcusKlik
Collaborator

Hi @st-pasha , thanks for the praise, I posted the benchmark code here, your evaluation is much appreciated! The code includes the figures published on the fstpackage.org website. A few remarks on the benchmark:

  • I use the microbenchmark package to measure performance but use only a single iteration per benchmark. I found that using more iterations resulted in unrealistically high performance measurements due to caching of the SSD disk used (which is very effective).
  • The published figures refer to the benchmark script run on a Xeon E5 CPU @ 2.5GHz. It has a lot of cores, but only a single core is used by the fst package currently (that will change, however, when multi-threading is implemented; see also #48, Currently planned milestones for fst)
  • I will update the results for fread and fwrite in the near future, now that your colleagues @mattdowle and @arunsrinivasan have implemented multi-threading in the data.table package :-)
  • New and more extensive benchmarks are planned for the near future. Those benchmarks will include separate performance measurements for each specific type of column. Performance depends greatly on the column type (up to an order of magnitude). For example, character vectors in R are implemented in a very computationally expensive manner and show poor performance (although there are some ideas to circumvent that).
  • The data-set is randomly generated and the code is included in the gist.
  • The SSD drive used was an OCZ RevoDrive 350
  • In general I found that Xeon processors show very good performance on the blocked format of fst (probably due to effective branch prediction)

If you have anything to share on your evaluations later on, please do!

@MarcusKlik
Collaborator

@phillc73 , great tip! I noticed MonetDB being mentioned in a data.table issue on serialization as well. I will make sure it is included in the new benchmarks, thanks!

@MarcusKlik MarcusKlik added this to the Benchmarking Suite milestone Apr 12, 2017
@mattdowle

Hi Mark, thanks for all the info. I did a talk yesterday, and while preparing for it I discovered the command hdparm -t /dev/sda to measure the true sustained read speed of a device. Looking up your OCZ RevoDrive 350, the manufacturer's stated max read appears to be 1800 MB/s (1.8 GB/s), which fits with it being MLC (somewhere between SSD and NVMe), iiuc. Does hdparm -t /dev/sda agree?

However, the speed of read.fst on the first row of the benchmark table is stated as 3271.9 MB/s. That's much higher than that device is capable of (1800 MB/s). Could it be that that timing was reading from RAM cache, not the device? Or maybe we have our B's and b's mixed up! I can't see use of sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches' or similar in the benchmark code. I can see that it uses one iteration, so the intent is there, but if caches haven't been dropped the file is likely in RAM cache from a previous run.

Thanks, Matt

@MarcusKlik
Collaborator

MarcusKlik commented Apr 13, 2017

Hi @mattdowle , thanks for the tip on measuring the real drive speed, that's a great tool by the way. But the RevoDrive is in a server at my company which is stuck with Windows, so it will take some effort to get that measurement. The first thing that comes to mind on the benchmark results is that I am using the in-memory size (from R) as the base size for the speed calculations. That size is equal to the uncompressed rds file. But fst files are usually smaller than that because some compression is done even with compress = 0 (for example, bit-packing for logicals). So the reported speed is the in-memory size divided by the time for a read or write. That can be higher than the maximum drive speed if 'base compression' already brings the file size down. I see, for example, that the uncompressed fst file is only 66 percent of the uncompressed rds file (which would more or less explain the difference).
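The arithmetic here can be sketched in a few lines: the reported speed is the in-memory size divided by the elapsed time, so any on-disk shrinkage inflates it past the raw drive bandwidth. The numbers below are illustrative only (a 0.66 on-disk ratio and the ~1800 MB/s rated drive speed from the discussion; the 1000 MB in-memory size is an arbitrary placeholder):

```python
# Sketch of the speed calculation described above (illustrative numbers only).
# Reported speed uses the in-memory size as the base, so on-disk compression
# can push it past the raw drive bandwidth.

def reported_speed(in_memory_mb, on_disk_ratio, drive_mb_s):
    """Effective speed when the drive streams the (smaller) on-disk file."""
    on_disk_mb = in_memory_mb * on_disk_ratio
    seconds = on_disk_mb / drive_mb_s   # time to read the on-disk file itself
    return in_memory_mb / seconds       # speed relative to the in-memory size

# fst file ~66% of the uncompressed rds size, drive rated at ~1800 MB/s:
print(round(reported_speed(1000, 0.66, 1800)))  # 2727, above the drive speed
```

So a 66% on-disk ratio alone lifts the apparent speed to roughly 2700 MB/s, which goes some way toward the 3271.9 MB/s figure without any RAM caching.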

I will take a closer look at the exact performance measurements and will get back on that, thanks!

@MarcusKlik
Collaborator

MarcusKlik commented Apr 14, 2017

I did some further benchmarking on the packages data.table, fst, feather and the base RDS methods. As expected, there are large differences in performance when serializing the various column types, as can be seen in the figure below. In that figure, you can see the tested column types (horizontal) versus the mode (read/write) for all 4 serializers. The performance of the multi-threaded fwrite really stands out for character columns, very impressive! The second figure shows the same data with the individual measurements and different axes (to get some idea of the performance stability). By the way, these benchmarks were not computed on the Xeon / RevoDrive (the benchmarks on the fstpackage.org site), but on my very modest laptop (4/8 cores, i7-4710HQ @ 2.5GHz) with a Samsung EVO 850 SSD.

Additionally:

  • The read and write performance of logical columns is very high for the fst package because of an effective bit-packing algorithm. The actual file size for logical columns is a factor of 16 smaller than that of the rds file (1 logical is packed into 2 bits instead of the 32 bits that R uses). This explains the 'larger than drive speed' performance for logicals.
  • Serialization of character columns is a difficult task for all serializers except for fwrite. Getting strings in and out of R's global string pool takes a lot of CPU power and the multi-threaded fwrite really has a large advantage here (the task is CPU bound).
  • The benchmark has 2470 observations (about 1 hour of computational time). To be sure all disk and RAM caching is excluded, it would be better to generate a unique data set for each observation. That will make the benchmark take more time however.
  • In general, the total measured speed for serializing a data set to disk can be estimated by:
# estimating total speed from column speeds
speed_tot <- 1 / ((1 / speed_col1) + (1 / speed_col2) + ...)

so basically taking the inverse of the sum of the inverted speeds for each column type used (the effect of average string length is not included here). These results show that when comparing performance between various solutions, the chosen data set is critically important, and it would be very nice to have a set of type-specific benchmark data sets which can be used as a baseline!
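That combination rule can be sketched directly (speeds below are made-up placeholders, not measurements): total time is the sum of per-column times, so the total speed is a harmonic-style combination of the per-column speeds, assuming equal-sized columns.

```python
# Sketch of the total-speed estimate above: columns are written sequentially,
# so total time is the sum of per-column times, and total speed is the
# inverse of the sum of inverted per-column speeds (equal-sized columns
# assumed; the effect of average string length is ignored, as noted).

def total_speed(col_speeds):
    return 1.0 / sum(1.0 / s for s in col_speeds)

# A fast column cannot rescue a slow one: the slowest type dominates.
print(total_speed([2000.0, 2000.0]))  # 1000.0
print(total_speed([2000.0, 200.0]))   # ~181.8, dragged toward the slow column
```

This is why a data set heavy in character columns benchmarks so differently from one of integers, even on the same serializer.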

I've posted the script for this uncompressed benchmark here (it's a bit raw, apologies for that :-))

image

image

@mattdowle

Wow - this is awesome!

@mattdowle

Hello again. Do you use Windows on your laptop like the server? The reason I ask is that there has been a problem reported on Windows with fread in dev, where either the parallel threads aren't kicking in, or they are but performing badly, with worse performance than single-threaded. Did it look as though fread was using all the laptop's cores efficiently when it ran?

@MarcusKlik
Collaborator

image

Hi, the laptop has Windows 10, and the threads are kicking in during the fwrite, but I can only see a single core working with fread, so indeed they are not kicking in for fread it seems!

@mattdowle

Relief! Thanks for the info. I've borrowed a Windows 8.1 machine and managed to reproduce something similar. Investigating.
Aside: did you compile data.table dev 1.10.5 yourself with Rtools (mingw) or download the AppVeyor .zip? There was another Windows issue (unrelated, I think) that needed the latest rtools.exe 3.4 (which AppVeyor uses), just to double check.

@mattdowle

Found and fixed the problem. I confirmed it's fixed for me on Windows 8.1. Please try again.
data.table 1.10.5 IN DEVELOPMENT built 2017-04-15 11:11:18 UTC; appveyor
install.packages("https://ci.appveyor.com/api/buildjobs/1txx94ruv769jc42/artifacts/data.table_1.10.5.zip", repos=NULL)

@MarcusKlik
Collaborator

MarcusKlik commented Apr 15, 2017

Hi @mattdowle , multi-threaded fread almost works fine now on Windows :-). I still only saw a single thread at work in the benchmark with fread. I tried the AppVeyor precompiled build and I also built data.table from your latest commit. Then I reinstalled Rtools 3.4 and tried again. Also, manually setting nThread = 8 had no effect on CPU load.

Then I checked multiple-column csv files and suddenly the threads kicked in. Working on the 1e6 dataset from ?fread:

fwrite(DT[, .(a)], "singlecol.csv")
fwrite(DT[, .(a, b)], "dualcol.csv")

microbenchmark( fread("singlecol.csv"), times = 1)

Unit: milliseconds
                   expr      min       lq     mean   median       uq      max neval
 fread("singlecol.csv") 52.72078 52.72078 52.72078 52.72078 52.72078 52.72078     1

> microbenchmark( fread("dualcol.csv"), times = 1)
Unit: milliseconds
                 expr      min       lq     mean   median       uq      max neval
 fread("dualcol.csv") 18.77348 18.77348 18.77348 18.77348 18.77348 18.77348     1

So, the single-column file, while exactly half the size, took almost 3 times longer to load. Apparently, the threads only kick in with more than 1 column (and my benchmark was using single-column files :-))
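Putting the quoted medians side by side makes the gap concrete (the factor-of-2 size relationship comes from the dual-column file holding twice the data):

```python
# Quick check of the observation above, using the microbenchmark medians
# quoted earlier: single-column csv vs a dual-column csv of twice the size.

single_ms, dual_ms = 52.72078, 18.77348

# Wall-clock ratio: the smaller file is nonetheless ~2.8x slower to read.
print(round(single_ms / dual_ms, 1))            # 2.8

# Per unit of data (1 column vs 2), the single-threaded read is ~5.6x slower.
print(round((single_ms / 1) / (dual_ms / 2), 1))  # 5.6
```

That ~5.6x per-column gap is consistent with the threads simply never starting for the one-column case.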

@MarcusKlik
Collaborator

I'm guessing the data.table benchmark speeds will go up by a factor of 6 for all investigated column types on a 4/8 core machine. When the single-column boundary case is fixed, I'll make sure the benchmark is updated and repeated on a machine with more cores!

@mattdowle

Excellent - thanks for working this out!! Yes there's no reason it shouldn't go MT on a single column, so hopefully a simple logic bug to fix somewhere. It also doesn't do the progress meter when ST - same area to fix. Will do ...

@MarcusKlik
Collaborator

Thanks, impressive work on the multi-threading! And interesting to see that OpenMP works very well now within the R tool chain. I was thinking TinyThreads++ for a more fine-grained solution for 'fst', but I think OpenMP might do the job just fine for fst as well.

@wei-wu-nyc

@MarcusKlik and @mattdowle, I really like both of your packages. Hopefully you guys will work closely together going forward.
I am really looking forward to having very good serialization solutions in R for very big data sets. One of the features I mentioned to @MarcusKlik is the ability to work on a very large data.table off- and in-memory on demand, depending on the available resources. That would be really helpful.
Thanks.

@MarcusKlik
Collaborator

MarcusKlik commented Apr 17, 2017

Hi @wei-wu-nyc , thanks. First, in the interface milestone (#48) I have planned a 'simple' data.table interface. From the advanced operations milestone on, the plan is to implement a data.table interface to fst files to be able to effectively group, sort, merge, append (columns and rows) and select (all operations on-disk, requiring very little memory). These operations can also be performed on a group of fst files without any difference in the interface (fst has a blocked format anyway). For the planned parallel 'merge sort' algorithm, the idea is to sort the individual chunks using data.table's very fast sorting algorithm. So you see, the fst package will depend heavily on data.table (not so much the other way around, I'm sure :-)). Oh, and I would like to be able to convert a csv file directly to a fst file without memory overhead, so the code behind the new multi-threaded fread will be a key component for that.

It would be interesting to add a few sample use-cases for working with large data sets down the road. Common tasks such as:

  • I have 50 csv files 10 GB each, how can I calculate some method for each group in the data?
  • How do I sort such a large collection of files?
  • I have a 100 GB (fst) file, how can I calculate some statistics on that?
  • I only need to select a single year from my data, but I do not have enough memory to read the csv files, what to do?
  • I want to do calculations from R on data at the lowest level from my company's large database, but the whole data set doesn't fit into memory; how can I stream it to a fst file and perform my calculations from there?

Some of the Kaggle competitions have (open) data which would be very suitable for these use-cases, and they would represent real-life problems, so it would be interesting to explore the use of fst in solving some of these 'large-data' problems. When fst has a more mature interface, I could set up the wiki to collect some of these tasks (or let users add them).

@mattdowle

Ok ... single column input now goes multi-threaded and a few other problems fixed too.
Please try again with latest dev. Fingers crossed!

@MarcusKlik
Collaborator

MarcusKlik commented Apr 19, 2017

image
(vertical axis is in MB/s taking the object.size() of the column as base size)

Hi @mattdowle , just a quick benchmark to confirm, all looks fine now! I will test with more observations and larger files soon. Thanks for the quick fix!

@MarcusKlik
Collaborator

Only using 4 cores here, but the performance is already very impressive. And I expect much more of a boost in performance for data.table when we use an order of magnitude more cores!

@MarcusKlik
Collaborator

MarcusKlik commented Apr 19, 2017

And with the single/dual column test:

nrOfRows = 2e8
DT <- data.table(a = 1:nrOfRows, b = 1:nrOfRows)

fwrite(DT[, .(a)], "singlecol.csv")
fwrite(DT[, .(a, b)], "dualcol.csv")

we get:

> microbenchmark( fread("singlecol.csv"), times = 1)
Read 200000000 rows x 1 columns from 1.945 GB file in 00:02.681 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Unit: seconds
                   expr      min       lq     mean   median       uq      max neval
 fread("singlecol.csv") 2.847271 2.847271 2.847271 2.847271 2.847271 2.847271     1
> microbenchmark( fread("dualcol.csv"), times = 1)
Read 200000000 rows x 2 columns from 3.705 GB file in 00:06.439 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Unit: seconds
                 expr      min       lq     mean   median       uq      max neval
 fread("dualcol.csv") 7.027425 7.027425 7.027425 7.027425 7.027425 7.027425     1

for a 200-million-row data set of integers.

@MarcusKlik
Collaborator

Great, impressive results!

@MarcusKlik
Collaborator

Hi @mattdowle, @st-pasha and @arunsrinivasan,

the next release of fst is due in a few days and I've prepared a blog post to publish at the same time. In the post I explore the performance of write_fst/read_fst and compare it against fread/fwrite, saveRDS/readRDS and write_feather/read_feather. I'm also looking at the multi-threaded enhancements (for data.table and fst), compression (fst) and the effects of file size (all). If you would like to glance over the results to see if the data.table numbers look as you'd expect, please let me know so I can send you the blog-preview link!

@mattdowle

@MarcusKlik Great -- yes please. Do you have a blog-preview link?

@MarcusKlik MarcusKlik removed this from the Multi-threading milestone Sep 7, 2020