Use binary load for Dask and Pandas grouping tests #47
@st-pasha That approach is actually what we use for dplyr: R data.table reads the data. In python it is a little more complicated because each python solution has its own virtualenv. Will check binary formats first.
@jangorecki This is actually something I was wondering about. Is it necessary to have a separate virtualenv for each python library? Are there any that cannot coexist in a single virtualenv? If we could drop the requirement of a separate virtualenv for each solution, it would greatly simplify the benchmarking process.
Agree, but dask (and eventually other solutions) requires a particular version of pandas. Initially I started with a single virtualenv for everything, but then I got version conflicts.
@st-pasha any suggestion on a binary format to use for python? There doesn't seem to be any single one that is good across the board: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization
I'm looking at Dask's setup.py, and they have only minimum required versions for all packages, no maximums. So it should work with the most recent version of pandas. In fact, I just tried installing the most recent Dask (0.20.1), and it works fine with the most recent pandas (0.23.4). Do you have a different experience with the bleeding-edge versions of either pandas or dask? If there are any incompatibilities, that is something worth raising upstream. Regarding the binary format, there is no single best solution (as evident from the link you posted). For example, if you want to avoid reading a huge CSV file with pandas, then one possibility is the approach sketched below.
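A minimal sketch of that idea, assuming datatable is installed in the same environment as pandas: parse the CSV with datatable's fread and hand the result to pandas (the file name is just an example from this benchmark; this matches the conversion code used later in the thread).

```python
import datatable as dt

# fread parses the csv with datatable's multi-threaded reader,
# to_pandas() then converts the Frame into a pandas DataFrame.
x = dt.fread("G1_1e7_1e2_0_0.csv").to_pandas()
print(x.shape)
```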
@st-pasha OK, I found it was not dask but modin: 19cc60c
@jangorecki Ah, I see. Indeed, they specify an exact version of pandas in their setup.py. Luckily, it is the latest version (0.23.4). Or does it not work if you install the development version of pandas? In any case, we could probably ask them to relax the pinned version in their setup.py.
Previously it required 0.22 when 0.23.2 was already out.
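For illustration, a hypothetical setup.py fragment (package name and versions are illustrative only, not modin's actual file) showing the kind of change being discussed: replacing an exact pandas pin with a minimum-version requirement.

```python
# Hypothetical setup.py fragment; names and versions are examples only.
from setuptools import setup

setup(
    name="example-solution",
    version="0.1.0",
    install_requires=[
        # "pandas==0.23.4",  # exact pin: blocks any newer pandas release
        "pandas>=0.23.4",    # minimum version: allows newer releases
    ],
)
```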
For reference, the scripts used to produce the binary formats.

R, converting the csv files to fst:

```r
library(data.table)
library(fst)
# build the G1_<rows>_<groups>_0_0.csv and G1_<rows>_<groups>_0_1.csv file names
files = outer(c("1e7","1e8","1e9"), c("1e2","1e1","2e0"), paste, sep="_")
files = sprintf("G1_%s.csv", c(
  paste(files, "0", "0", sep="_"),
  paste(files[,1], "0", "1", sep="_")
))
# for python: cat(paste0("files=[",paste(paste0("'",files,"'"), collapse=","), "]"))
for (file in files) {
  cat("fread", file, "\n")
  print(system.time(df<-fread(file, stringsAsFactors=TRUE, data.table=FALSE, showProgress=FALSE)))
  ofile = gsub("csv","fst",file,fixed=TRUE)
  cat("write.fst", ofile, "\n")
  write.fst(df, ofile)
}
cat("done\n")
if (!interactive()) q("no")
```

Python, converting the csv files to pandas pickles (id columns stored as categories):

```python
import datatable as dt
import pandas as pd
import pickle
import re

files=['G1_1e7_1e2_0_0.csv','G1_1e8_1e2_0_0.csv','G1_1e9_1e2_0_0.csv','G1_1e7_1e1_0_0.csv','G1_1e8_1e1_0_0.csv','G1_1e9_1e1_0_0.csv','G1_1e7_2e0_0_0.csv','G1_1e8_2e0_0_0.csv','G1_1e9_2e0_0_0.csv','G1_1e7_1e2_0_1.csv','G1_1e8_1e2_0_1.csv','G1_1e9_1e2_0_1.csv']
for file in files:
    print("fread %s" % file)
    x = dt.fread(file).to_pandas()
    x['id1'] = x['id1'].astype('category')
    x['id2'] = x['id2'].astype('category')
    x['id3'] = x['id3'].astype('category')
    ofile = re.sub("csv", "pkl", file)
    print("write %s" % ofile)
    pd.to_pickle(x, ofile)
print("done")
```

R, converting the csv files to feather:

```r
library(data.table)
library(feather)
files = outer(c("1e7","1e8"), c("1e2","1e1","2e0"), paste, sep="_") #,"1e9" is on another machine
#files = outer(c("1e9"), c("1e2","1e1","2e0"), paste, sep="_")
files = sprintf("G1_%s.csv", c(
  paste(files, "0", "0", sep="_"),
  paste(files[,1], "0", "1", sep="_")
))
for (file in files) {
  cat("fread", file, "\n")
  print(system.time(df<-fread(file, stringsAsFactors=TRUE, data.table=FALSE, showProgress=FALSE)))
  ofile = gsub("csv","fea",file,fixed=TRUE)
  cat("write_feather", ofile, "\n")
  write_feather(df, ofile)
}
cat("done\n")
```
Surprisingly, there is no direct API for loading feather/arrow in most of the solutions. It has now been added for data.table, dplyr and pandas. Other tools will follow; the status for each can be looked up in the repository (lines 58 to 65 at commit 9d59592).
Closing this issue so it does not sit stale while blocked by other projects. What could be accomplished now was done in 2b4d3bd.
For both dplyr and data.table, where
Frustrating!
Feather doesn't seem to be a good idea... in R it doesn't even work for the 1e7-row data, and in python it segfaults on 1e9.
Reading feather into pandas directly used to work, but it got broken by a dependency update: pandas-dev/pandas#23053. Re-opening this issue; I will close it when no further work is planned.
For pandas you can use datatable (see the snippet in the reply below).
If virtual environments need to be kept separate, you can always pip-install the latest stable datatable version (0.7.0) into the pandas env.
@st-pasha yes, this is what I will have to do. But the plan was to use some format I could re-use in other tools, without going through pandas. So now loading data for dask will have to go through pandas as well.
@st-pasha unfortunately datatable will not help much:

```python
>>> x = dt.open(os.path.join("data", src_grp)).to_pandas()  # src_grp="G1_1e9_1e2_0_0.jay"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jan/git/db-benchmark/pandas/py-pandas/lib/python3.6/site-packages/datatable/frame.py", line 450, in to_pandas
    x = srcdt.window(0, self.nrows, i, i + 1).data[0]
MemoryError
```

So the next option is to try pickle...
@jangorecki pickle won't help: the column simply doesn't fit into the available memory (at least when it is represented as pyobjects, which is what pandas uses for string columns).
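A minimal sketch (a small synthetic example, not the benchmark data) of that point: an object-dtype string column stores one full Python object per row, which is far larger than the on-disk representation, while the category dtype (as used in the pickle script above) stores only integer codes plus a small dictionary.

```python
import pandas as pd

# One million short strings drawn from 100 distinct values.
s_obj = pd.Series(["id%03d" % (i % 100) for i in range(1_000_000)])  # object dtype
s_cat = s_obj.astype("category")                                     # dictionary-encoded

print(s_obj.memory_usage(deep=True))  # tens of MB: one Python str object per row
print(s_cat.memory_usage(deep=True))  # a few MB: integer codes + categories index
```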
It appears that binary formats will not help with the memory errors of pandas and dask on the 50GB input data. Pandas now uses the jay binary format from py datatable. Dask still uses csv; we could go jay → pandas → dask, but importing from pandas into dask requires providing the number of partitions (unlike when reading from csv), and a good value is, I believe, data dependent, so I would prefer to stay away from investigating the data and leave dask as is. Closing the issue as there are no actions defined any more. Feel free to re-open and provide scenarios to be checked.
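For context, a minimal sketch (hypothetical file name and partition count; column names as in the benchmark's G1 data) of the binary → pandas → dask route and why it needs an explicit partition count:

```python
import pandas as pd
import dask.dataframe as dd

# dd.read_csv derives partitions from the file via a blocksize;
# dd.from_pandas does not, so npartitions must be chosen by the caller.
df = pd.read_pickle("G1_1e7_1e2_0_0.pkl")   # pre-converted binary, as above
ddf = dd.from_pandas(df, npartitions=8)     # partition count picked by hand
print(ddf.groupby("id1")["v1"].sum().compute())
```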
Currently the Dask and Pandas grouping tests are shown as a fail at the 50GB size, where other products work. But this is only because reading the csv file fails, which has nothing to do with grouping per se. These grouping tests could instead use pickle or feather to load the dataset; the time to load the data is not included in the test anyway. It would also be faster to run the grouping tests, since the csv read would not need to happen first. Reading data from csv is due to be added to db-bench as a separate set of tests, where that fail point would be fairly represented.
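A minimal sketch (hypothetical timing harness, file names from the conversion scripts above) of how the csv and binary load times could be compared on one of the smaller files:

```python
import time
import pandas as pd

t0 = time.time()
df_csv = pd.read_csv("G1_1e7_1e2_0_0.csv")        # full csv parse
print("read_csv:    %.1fs" % (time.time() - t0))

t0 = time.time()
df_bin = pd.read_pickle("G1_1e7_1e2_0_0.pkl")     # pre-converted binary
print("read_pickle: %.1fs" % (time.time() - t0))
```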
Similarly, pydatatable could read the test data from its memory map before grouping, and data.table could read from fst. So long as the result of reading from these binary formats is exactly the same as if read from csv (so no pre-computed data such as indexes allowed (**)), it would be faster for db-bench to run, as well as producing a timing for Dask and Pandas, which probably do in fact work at this size on this machine.
(**) separate tests to be added in future where pre-computed indexes and similar are allowed.
With this done, #45 could be enabled again.