memory-mapped DataFrame #25

Closed

HarlanH opened this issue Jul 16, 2012 · 16 comments

@HarlanH
Contributor

HarlanH commented Jul 16, 2012

There are several options here, from indexing CSV rows to allow random access to giant CSV files, to creating a new binary file format for storing DataVecs that supports mmap-style direct memory mapping.
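A minimal sketch of the first option (hypothetical helper names, not a proposed API): scan the CSV once to record where each line starts, then seek directly to any row on demand.

```julia
# Build a byte-offset index so individual rows of a large CSV
# can be read later without rereading the whole file.
function build_rowindex(path::AbstractString)
    offsets = Int[0]                  # row 1 starts at byte 0
    open(path) do io
        while !eof(io)
            readline(io)
            push!(offsets, position(io))
        end
    end
    return offsets
end

function read_row(path::AbstractString, offsets::Vector{Int}, i::Int)
    open(path) do io
        seek(io, offsets[i])          # jump straight to row i
        split(readline(io), ',')      # naive parse; ignores quoting
    end
end
```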

@ViralBShah
Contributor

This would be really cool and useful. Cc: @tanmaykm

@simonster
Contributor

I've been thinking about trying to create a binary data format that could store any Julia data structure but mmaps Arrays/BitArrays. Doing this right could take some effort, but it would give you mmapped DataFrames for free.
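For flavor, a minimal sketch of the mmapped-column half of that idea, using what is now the Mmap standard library (the file name and element type are assumptions for illustration):

```julia
using Mmap

# Map a file of raw Float64s directly into a Vector: no copy is made,
# and pages are faulted in from disk only as elements are accessed.
open("column.bin", "r") do io
    n = filesize(io) ÷ sizeof(Float64)
    col = Mmap.mmap(io, Vector{Float64}, n)
    sum(col)   # use it like any ordinary Array
end
```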

@johnmyleswhite
Contributor

That would be great.

@StefanKarpinski
Member

Sounds cool. Any thoughts to share? Possibly related ideas: I've sometimes considered starting a language-agnostic Sane Formats Project™ for reasonable data formats. HDF5 seems like it started out with a reasonable core and then sprouted some insanity when some committee got its grubby mitts on the standard. Likewise, neither CSV nor TSV is actually standardized, but both are very handy because of their simplicity. Another thought that @JeffBezanson and I were discussing last week is the idea of making in-memory data structures position-independent so that you can serialize them trivially. Might be relevant.

@simonster
Contributor

I think I'm going to start by trying to mmap contiguous datasets in HDF5/JLD files, which doesn't look too hard and would avoid the standards problem. I'm not sure it's possible to create a standard that's as flexible, expressive, and language-agnostic as HDF5 without importing most of its complexity. It's only worth creating a new format if performance can be appreciably improved or complexity appreciably reduced. The HDF5 API is kind of insane, but @timholy has done a great job turning it into something more Julian. I'm also interested in how http://symas.com/mdb/ approaches mmapped storage, although the goals of a database are very different from those of a file format.
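This is roughly what HDF5.jl's readmmap now provides for contiguous datasets; a hedged usage sketch with placeholder file and dataset names:

```julia
using HDF5, Statistics

# readmmap works only for contiguous (unchunked, uncompressed) datasets;
# the returned Array is backed directly by the mapped file, not a copy.
h5open("data.h5", "r") do file
    x = HDF5.readmmap(file["x"])
    mean(x)
end
```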

Position-independent in-memory data structures seem very useful for parallel usage, but I'm not sure of the wisdom of using them for a data format. I don't see how it would be possible to mmap data structures that aren't bits types directly from disk. If they're mmapped read-only, an attempt to write would crash Julia, and if they're mmapped read/write, it would be possible to accidentally introduce pointers to data structures that aren't stored on disk. Reading into memory would require fewer cycles, but the CPU shouldn't be the bottleneck anyway. Changes in Julia's in-memory data format could also break forward/backward compatibility; as it stands this is one of the primary reasons to use JLD instead of serialize()/deserialize().

@johnmyleswhite
Contributor

Given that Base's readcsv now uses mmap, I have a much clearer sense of how to do this. I may get to work on it soon, although I don't think it's the most pressing issue.

@ViralBShah
Contributor

Agree - this is not a pressing issue for the moment. Also, streaming would perhaps be a better way to handle this, since you need to do much more with a large data frame later than just read it. Even with readcsv in Base, you cannot really read large files and work with them, because the data will not fit in memory. However, mmap loads the data faster because it is zero-copy, and that could certainly be useful here too. @tanmaykm should correct me if I am wrong.

@johnmyleswhite
Contributor

Next week I'm going to do a major pass through our streaming architecture. For data sets that are purely numeric, I think we can basically do exactly one memory allocation step at the start and then reuse memory very efficiently. If we can fit GLMs to 10-100 GB data sets in Julia with a nice API, I think we'll have produced the killer Julia statistics app.
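A toy sketch of that allocate-once pattern for a purely numeric stream, in today's syntax (assuming, for simplicity, that the stream holds raw Float64s and its length is a multiple of the chunk size):

```julia
# One buffer is allocated up front and refilled in place for every chunk,
# so the streaming pass itself does no further allocation.
function stream_sum(io::IO, chunklen::Int)
    buf = Vector{Float64}(undef, chunklen)   # the single allocation
    total = 0.0
    while !eof(io)
        read!(io, buf)                       # refill the same buffer
        total += sum(buf)
    end
    return total
end
```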

@timholy
Contributor

timholy commented Jun 21, 2013

@johnmyleswhite and @tanmaykm, your dramatic performance improvements to I/O are quite amazing!

@timholy
Contributor

timholy commented Jun 21, 2013

I should also add that HDF5 lets you read and write chunks of huge arrays very easily (just using array subregion syntax, like with mmap). Great to see similar capabilities being implemented for other file formats where it's not done "for us" by an outside library.
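For readers unfamiliar with it, the subregion syntax looks like this (placeholder names; the dataset stays on disk and only the indexed block moves):

```julia
using HDF5

h5open("big.h5", "r+") do file
    dset = file["data"]                 # e.g. a huge 2-d dataset on disk
    block = dset[1:10_000, :]           # reads just this subregion
    dset[1:10_000, 1] = zeros(10_000)   # writes just this subregion
end
```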

@johnmyleswhite
Contributor

Thanks, @timholy!

We should try to extract some of the techniques we're using to provide a more general framework for doing this kind of array extraction.

@ViralBShah
Contributor

I think it would be useful to have a new issue for the DataStream overhaul that you mention.

Given that we have hooked up HDFS (not HDF5!) to Julia, it should be pretty straightforward to add DataFrames-like capabilities to datasets stored in large distributed filesystems. For example, HDFS provides an API to consume large files in chunks over a distributed cluster. It would be nice to expose this to the user through a DataStream-like API; it would really be a DistributedDataStream, but many of the operations one will want to do will be similar.

@johnmyleswhite
Contributor

Viral, let's set up some time next week to strategize how we'll implement a simple type of DistributedDataStream.

@skanskan

I agree, we need a platform able to work transparently with data of any size, at least with data bigger than memory, and ideally with data distributed across several computers.

Solutions such as Spark are not complete; they only offer basic functionality on which to build something else.

We don't just need to be able to compute summaries, as we do with databases; we need to be able to do all operations on big data, such as multiplying two big matrices, fitting mixed-effects models, running MCMC, etc.

bigmemory and ff let you do some simple things, but they cannot be used by other packages like lme4.

@quinnj
Member

quinnj commented Jul 24, 2016

I think this can be closed in light of the new Feather.jl package, which is essentially a cross-language binary storage format for dataframes (R, pandas, etc.). It's fast and efficient because the files are mmapped directly and each column is unsafe_wrap(Array)'d into NullableArrays, which then make up the DataFrame. Obviously, NullableArray support is not quite 100% with DataFrames, but it should be soon.
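Usage is a one-liner (hypothetical file name):

```julia
using Feather

# Reading mmaps the file; columns wrap the mapped bytes without copying.
df = Feather.read("mydata.feather")   # returns a DataFrame
```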

@ViralBShah
Contributor

Closing. Can be reopened in the context of Feather.jl.
