File-backed data.tables #1336
Comments
Agreed. Probably for v2.0.0, depending on how much time and motivation we have.
The links in the original post by @zachmayer are no longer valid. The GitHub repo of Graphlab/Dato/Turi can be found here. Because Graphlab/Dato/Turi has been acquired by Apple, this repo has been moved to here. It looks like it has evolved into a library for developing machine learning models. In case the above two links stop working, I've created a fork in my own profile.
One potential implementation strategy is via R's custom allocator mechanism. I constructed a file-backed See this gist, where I create the 2B row dataset (~75GB) from the benchmarks and run some aggregations on my laptop (16GB RAM). There are many missing pieces that make this far from a user-friendly solution, though. Among them: R's custom allocator is used for the entire array object, so there is an R-implementation-specific header prepended to the data; because of that header, the data can't be shared between R sessions, even read-only; data.table allocations for new objects (columns/indices) can't be hooked, so they won't be memory-mapped; there is no support for real string columns; and column attributes must be persisted manually. All those caveats aside, I've already found it to be quite useful when working with a large number of moderately sized datasets, where each is sequentially memory mapped, data.table is told they're already sorted (
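The memory-mapping idea and the header caveat can be illustrated outside R. The following is a minimal Python sketch. It is not the gist's code and it does not use R's allocator API. It writes a column of doubles to a file behind a placeholder header (standing in for the R-specific object header), maps the file read-only, and aggregates over the data region while letting the OS page it in on demand. `HEADER_BYTES`, `write_column`, and `mapped_sum` are hypothetical names chosen for illustration, and the header size is an arbitrary assumption.

```python
import mmap
import os
import tempfile
from array import array

# Hypothetical header size standing in for an implementation-specific
# object header; assumed to be a multiple of 8 so the doubles stay aligned.
HEADER_BYTES = 48


def write_column(path, values):
    """Write a zeroed placeholder header followed by native-endian doubles."""
    with open(path, "wb") as f:
        f.write(b"\x00" * HEADER_BYTES)
        array("d", values).tofile(f)


def mapped_sum(path):
    """Memory-map the file read-only and sum the doubles after the header.

    Returns (total, number_of_values). Pages are loaded by the OS as they
    are touched, so the whole file never has to be read into memory at once.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Skip the header, then reinterpret the remaining bytes as doubles.
            with memoryview(mm) as raw, raw[HEADER_BYTES:] as body, \
                    body.cast("d") as data:
                return sum(data), len(data)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "column.bin")
    write_column(path, [1.5, 2.5, 4.0])
    print(mapped_sum(path))
```

Any reader of such a file has to know the header size in order to find the data, which is one way to see why a header whose layout is specific to one R implementation makes the files awkward to share.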
@jonekeat disk.frame is possibly an alternative but I haven't tried it myself.
As a current-day workaround, what about the use of
This is currently out of scope https://github.com/Rdatatable/data.table/blob/master/GOVERNANCE.md#the-r-package and I don't think anyone has the time/interest/skill to implement, so I'm closing.
I don't disagree, it's definitely big-scope. I offered my comment to illustrate alternative paths.
To clarify, I'd be glad to have the scope expanded for this high-demand FR, but as noted, the current maintainer core has no time/ability to support this. Outside contributions (and commitment to ownership) are welcome.
SFrames are GraphLab Create's version of data.frames, and have some impressive performance benchmarks on single machines.
I'd really love to see something similar for data.table that could use disk rather than RAM to store the data.