Use binary load for Dask and Pandas grouping tests #47
@st-pasha That approach is actually what we use for dplyr: R data.table reads the data. In python it is a little more complicated because each python solution has its own virtualenv. Will check binary formats first.
@jangorecki This is actually something I was wondering about. Is it necessary to have a separate virtualenv for each python library? Are there any that cannot coexist in a single virtualenv? If we could drop the requirement of a separate virtualenv for each solution, it would greatly simplify the benchmarking process.
Agree, but dask (and eventually other solutions) requires a particular version of pandas. Initially I started with a single virtualenv for everything, but then I got version conflicts.
@st-pasha any suggestion on a binary format to use for python? There doesn't seem to be any single one that is good across the board: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization
I'm looking at Dask's setup.py, and they have only minimum required versions for all packages, no maximums. So it should work with the most recent version of pandas. In fact, I just tried installing the most recent Dask (0.20.1), and it works fine with the most recent pandas (0.23.4). Do you have a different experience with the bleeding-edge versions of either pandas or dask? If there are any incompatibilities, that is something worth raising upstream. Regarding the binary format, there is no single best solution (as evident from the link you posted). For example, if you want to avoid reading a huge CSV file with pandas, then one possibility is the approach sketched below.
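A minimal sketch of that idea, assuming datatable is installed in the same environment as pandas: parse the CSV with datatable's fread and hand the result to pandas (the file name is just an example from this benchmark; this matches the conversion code used later in the thread).

```python
import datatable as dt

# fread parses the csv with datatable's multi-threaded reader,
# to_pandas() then converts the Frame into a pandas DataFrame.
x = dt.fread("G1_1e7_1e2_0_0.csv").to_pandas()
print(x.shape)
```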
@st-pasha OK, I found it was not dask but modin: 19cc60c
@jangorecki Ah, I see. Indeed, they specify an exact version of pandas in their setup.py. Luckily, it is the latest version (0.23.4). Or does it not work if you install the development version of pandas? In any case, we could probably ask them to relax the pinned version in their setup.py.
Previously it required 0.22 when 0.23.2 was already out.
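For illustration, a hypothetical setup.py fragment (package name and versions are illustrative only, not modin's actual file) showing the kind of change being discussed: replacing an exact pandas pin with a minimum-version requirement.

```python
# Hypothetical setup.py fragment; names and versions are examples only.
from setuptools import setup

setup(
    name="example-solution",
    version="0.1.0",
    install_requires=[
        # "pandas==0.23.4",  # exact pin: blocks any newer pandas release
        "pandas>=0.23.4",    # minimum version: allows newer releases
    ],
)
```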
For reference, the scripts used to produce the binary formats.

R, converting the csv files to fst:

```r
library(data.table)
library(fst)
# build the G1_<rows>_<groups>_0_0.csv and G1_<rows>_<groups>_0_1.csv file names
files = outer(c("1e7","1e8","1e9"), c("1e2","1e1","2e0"), paste, sep="_")
files = sprintf("G1_%s.csv", c(
  paste(files, "0", "0", sep="_"),
  paste(files[,1], "0", "1", sep="_")
))
# for python: cat(paste0("files=[",paste(paste0("'",files,"'"), collapse=","), "]"))
for (file in files) {
  cat("fread", file, "\n")
  print(system.time(df<-fread(file, stringsAsFactors=TRUE, data.table=FALSE, showProgress=FALSE)))
  ofile = gsub("csv","fst",file,fixed=TRUE)
  cat("write.fst", ofile, "\n")
  write.fst(df, ofile)
}
cat("done\n")
if (!interactive()) q("no")
```

Python, converting the csv files to pandas pickles (id columns stored as categories):

```python
import datatable as dt
import pandas as pd
import pickle
import re

files=['G1_1e7_1e2_0_0.csv','G1_1e8_1e2_0_0.csv','G1_1e9_1e2_0_0.csv','G1_1e7_1e1_0_0.csv','G1_1e8_1e1_0_0.csv','G1_1e9_1e1_0_0.csv','G1_1e7_2e0_0_0.csv','G1_1e8_2e0_0_0.csv','G1_1e9_2e0_0_0.csv','G1_1e7_1e2_0_1.csv','G1_1e8_1e2_0_1.csv','G1_1e9_1e2_0_1.csv']
for file in files:
    print("fread %s" % file)
    x = dt.fread(file).to_pandas()
    x['id1'] = x['id1'].astype('category')
    x['id2'] = x['id2'].astype('category')
    x['id3'] = x['id3'].astype('category')
    ofile = re.sub("csv", "pkl", file)
    print("write %s" % ofile)
    pd.to_pickle(x, ofile)
print("done")
```

R, converting the csv files to feather:

```r
library(data.table)
library(feather)
files = outer(c("1e7","1e8"), c("1e2","1e1","2e0"), paste, sep="_") #,"1e9" is on another machine
#files = outer(c("1e9"), c("1e2","1e1","2e0"), paste, sep="_")
files = sprintf("G1_%s.csv", c(
  paste(files, "0", "0", sep="_"),
  paste(files[,1], "0", "1", sep="_")
))
for (file in files) {
  cat("fread", file, "\n")
  print(system.time(df<-fread(file, stringsAsFactors=TRUE, data.table=FALSE, showProgress=FALSE)))
  ofile = gsub("csv","fea",file,fixed=TRUE)
  cat("write_feather", ofile, "\n")
  write_feather(df, ofile)
}
cat("done\n")
```
Surprisingly, there is no direct API for loading feather/arrow in most of the solutions. It has now been added for data.table, dplyr and pandas. Other tools will follow; the status for each can be looked up in the repository (lines 58 to 65 at commit 9d59592).
Closing this issue so it does not sit stale while blocked by other projects. What could be accomplished now was done in 2b4d3bd.
For both dplyr and data.table, where
Frustrating!
Feather doesn't seem to be a good idea... in R it doesn't even work for the 1e7-row data, and in python it segfaults on 1e9.
Reading feather into pandas directly used to work, but it got broken by a dependency update: pandas-dev/pandas#23053. Re-opening this issue; I will close it when no further work is planned.
For pandas you can use datatable (see the snippet in the reply below).
If virtual environments need to be kept separate, you can always pip-install the latest stable datatable version (0.7.0) into the pandas env.
@st-pasha yes, this is what I will have to do. But the plan was to use some format I could re-use in other tools, without going through pandas. So now loading data for dask will have to go through pandas as well.
@st-pasha unfortunately datatable will not help much:

```python
>>> x = dt.open(os.path.join("data", src_grp)).to_pandas()  # src_grp="G1_1e9_1e2_0_0.jay"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/datatable/frame.py", line 450, in to_pandas
    x = srcdt.window(0, self.nrows, i, i + 1).data[0]
MemoryError
```

So the next option is to try pickle...
@jangorecki pickle won't help: the column simply doesn't fit into the available memory (at least when it is represented as pyobjects, which is what pandas uses for string columns).
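A minimal sketch (a small synthetic example, not the benchmark data) of that point: an object-dtype string column stores one full Python object per row, which is far larger than the on-disk representation, while the category dtype (as used in the pickle script above) stores only integer codes plus a small dictionary.

```python
import pandas as pd

# One million short strings drawn from 100 distinct values.
s_obj = pd.Series(["id%03d" % (i % 100) for i in range(1_000_000)])  # object dtype
s_cat = s_obj.astype("category")                                     # dictionary-encoded

print(s_obj.memory_usage(deep=True))  # tens of MB: one Python str object per row
print(s_cat.memory_usage(deep=True))  # a few MB: integer codes + categories index
```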
It appears that binary formats will not help with the memory errors of pandas and dask on the 50GB input data. Pandas now uses the jay binary format from py datatable. Dask still uses csv; we could go jay → pandas → dask, but importing from pandas into dask requires providing the number of partitions (unlike when reading from csv), and a good value is, I believe, data dependent, so I would prefer to stay away from investigating the data and leave dask as is. Closing the issue as there are no actions defined any more. Feel free to re-open and provide scenarios to be checked.
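For context, a minimal sketch (hypothetical file name and partition count; column names as in the benchmark's G1 data) of the binary → pandas → dask route and why it needs an explicit partition count:

```python
import pandas as pd
import dask.dataframe as dd

# dd.read_csv derives partitions from the file via a blocksize;
# dd.from_pandas does not, so npartitions must be chosen by the caller.
df = pd.read_pickle("G1_1e7_1e2_0_0.pkl")   # pre-converted binary, as above
ddf = dd.from_pandas(df, npartitions=8)     # partition count picked by hand
print(ddf.groupby("id1")["v1"].sum().compute())
```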
Currently the Dask and Pandas grouping tests are shown as a fail at the 50GB size, where other products work. But this is only because reading the csv file fails, which has nothing to do with grouping per se. These grouping tests could instead use pickle or feather to load the dataset; the time to load the data is not included in the test anyway. It would also be faster to run the grouping tests, since the csv read would not need to happen first. Reading data from csv is due to be added to db-bench as a separate set of tests, where that fail point would be fairly represented.
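A minimal sketch (hypothetical timing harness, file names from the conversion scripts above) of how the csv and binary load times could be compared on one of the smaller files:

```python
import time
import pandas as pd

t0 = time.time()
df_csv = pd.read_csv("G1_1e7_1e2_0_0.csv")        # full csv parse
print("read_csv:    %.1fs" % (time.time() - t0))

t0 = time.time()
df_bin = pd.read_pickle("G1_1e7_1e2_0_0.pkl")     # pre-converted binary
print("read_pickle: %.1fs" % (time.time() - t0))
```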
Similarly, pydatatable could read the test data from its memory map before grouping, and data.table could read from fst. So long as the result of reading from these binary formats is exactly the same as if read from csv (so no pre-computed data such as indexes allowed (**)), it would be faster for db-bench to run, as well as producing a timing for Dask and Pandas, which probably do in fact work at this size on this machine.
(**) separate tests to be added in future where pre-computed indexes and similar are allowed.
With this done, #45 could be enabled again.