memory-mapped DataFrame #25

Closed

HarlanH opened this issue Jul 16, 2012 · 16 comments

@HarlanH
Contributor

HarlanH commented Jul 16, 2012

There are several options here, from indexing CSV rows to allow random access to giant CSV files, to creating a new binary file format for storing DataVecs that supports mmap-style direct memory mapping.
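A minimal sketch of the first option (hypothetical helper names, not a proposed API): scan the CSV once to record where each line starts, then seek directly to any row on demand.

```julia
# Build a byte-offset index so individual rows of a large CSV
# can be read later without rereading the whole file.
function build_rowindex(path::AbstractString)
    offsets = Int[0]                  # row 1 starts at byte 0
    open(path) do io
        while !eof(io)
            readline(io)
            push!(offsets, position(io))
        end
    end
    return offsets
end

function read_row(path::AbstractString, offsets::Vector{Int}, i::Int)
    open(path) do io
        seek(io, offsets[i])          # jump straight to row i
        split(readline(io), ',')      # naive parse; ignores quoting
    end
end
```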

@ViralBShah
Contributor

This would be really cool and useful. Cc: @tanmaykm

@simonster
Contributor

I've been thinking about trying to create a binary data format that could store any Julia data structure but mmaps Arrays/BitArrays. Doing this right could take some effort, but it would give you mmapped DataFrames for free.
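For flavor, a minimal sketch of the mmapped-column half of that idea, using what is now the Mmap standard library (the file name and element type are assumptions for illustration):

```julia
using Mmap

# Map a file of raw Float64s directly into a Vector: no copy is made,
# and pages are faulted in from disk only as elements are accessed.
open("column.bin", "r") do io
    n = filesize(io) ÷ sizeof(Float64)
    col = Mmap.mmap(io, Vector{Float64}, n)
    sum(col)   # use it like any ordinary Array
end
```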

@johnmyleswhite
Contributor

That would be great.

@StefanKarpinski
Member

Sounds cool. Any thoughts to share? Possibly related ideas: I've sometimes considered starting a language-agnostic Sane Formats Project™ for reasonable data formats. HDF5 seems like it started out with a reasonable core and then sprouted some insanity when some committee got its grubby mitts on the standard. Likewise, neither CSV nor TSV is actually standardized, but both are very handy because of their simplicity. Another thought that @JeffBezanson and I were discussing last week is the idea of making in-memory data structures position-independent so that you can serialize them trivially. Might be relevant.

@simonster
Contributor

I think I'm going to start by trying to mmap contiguous datasets in HDF5/JLD files, which doesn't look too hard and would avoid the standards problem. I'm not sure it's possible to create a standard that's as flexible, expressive, and language-agnostic as HDF5 without importing most of its complexity. It's only worth creating a new format if performance can be appreciably improved or complexity appreciably reduced. The HDF5 API is kind of insane, but @timholy has done a great job turning it into something more Julian. I'm also interested in how http://symas.com/mdb/ approaches mmapped storage, although the goals of a database are very different from those of a file format.
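This is roughly what HDF5.jl's readmmap now provides for contiguous datasets; a hedged usage sketch with placeholder file and dataset names:

```julia
using HDF5, Statistics

# readmmap works only for contiguous (unchunked, uncompressed) datasets;
# the returned Array is backed directly by the mapped file, not a copy.
h5open("data.h5", "r") do file
    x = HDF5.readmmap(file["x"])
    mean(x)
end
```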

Position-independent in-memory data structures seem very useful for parallel usage, but I'm not sure of the wisdom of using them for a data format. I don't see how it would be possible to mmap data structures that aren't bits types directly from disk. If they're mmapped read-only, an attempt to write would crash Julia, and if they're mmapped read/write, it would be possible to accidentally introduce pointers to data structures that aren't stored on disk. Reading into memory would require fewer cycles, but the CPU shouldn't be the bottleneck anyway. Changes in Julia's in-memory data format could also break forward/backward compatibility; as it stands this is one of the primary reasons to use JLD instead of serialize()/deserialize().

@johnmyleswhite
Contributor

Given that Base's readcsv now uses mmap, I have a much clearer sense of how to do this. I may get to work on it soon, although I don't think it's the most pressing issue.

@ViralBShah
Contributor

Agree - this is not a pressing issue for the moment. Also, streaming would perhaps be a better way to handle this, since you need to do much more with a large data frame later than just read it. Even with readcsv in Base, you cannot really read large files and work with them, because the data will not fit in memory. However, mmap loads the data faster because it is zero-copy, and that could certainly be useful here too. @tanmaykm should correct me if I am wrong.

@johnmyleswhite
Contributor

Next week I'm going to do a major pass through our streaming architecture. For data sets that are purely numeric, I think we can basically do exactly one memory allocation step at the start and then reuse memory very efficiently. If we can fit GLMs to 10-100 GB data sets in Julia with a nice API, I think we'll have produced the killer Julia statistics app.
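A toy sketch of that allocate-once pattern for a purely numeric stream, in today's syntax (assuming, for simplicity, that the stream holds raw Float64s and its length is a multiple of the chunk size):

```julia
# One buffer is allocated up front and refilled in place for every chunk,
# so the streaming pass itself does no further allocation.
function stream_sum(io::IO, chunklen::Int)
    buf = Vector{Float64}(undef, chunklen)   # the single allocation
    total = 0.0
    while !eof(io)
        read!(io, buf)                       # refill the same buffer
        total += sum(buf)
    end
    return total
end
```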

@timholy
Contributor

timholy commented Jun 21, 2013

@johnmyleswhite and @tanmaykm, your dramatic performance improvements to I/O are quite amazing!

@timholy
Contributor

timholy commented Jun 21, 2013

I should also add that HDF5 lets you read and write chunks of huge arrays very easily (just using array subregion syntax, like with mmap). Great to see similar capabilities being implemented for other file formats where it's not done "for us" by an outside library.
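For readers unfamiliar with it, the subregion syntax looks like this (placeholder names; the dataset stays on disk and only the indexed block moves):

```julia
using HDF5

h5open("big.h5", "r+") do file
    dset = file["data"]                 # e.g. a huge 2-d dataset on disk
    block = dset[1:10_000, :]           # reads just this subregion
    dset[1:10_000, 1] = zeros(10_000)   # writes just this subregion
end
```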

@johnmyleswhite
Contributor

Thanks, @timholy!

We should try to extract some of the techniques we're using to provide a more general framework for doing this kind of array extraction.

@ViralBShah
Contributor

I think it would be useful to have a new issue for the DataStream overhaul that you mention.

Given that we have hooked up HDFS (not HDF5!) to Julia, it should be pretty straightforward to add DataFrames-like capabilities to datasets stored in large distributed filesystems. For example, HDFS provides an API to consume large files in chunks over a distributed cluster. It would be nice to expose this to the user through a DataStream-like API; it would really be a DistributedDataStream, but many of the operations one will want to do will be similar.

@johnmyleswhite
Contributor

Viral, let's set up some time next week to strategize how we'll implement a simple type of DistributedDataStream.

@skanskan

I agree, we need a platform able to work transparently with data of any size, at least with data bigger than memory, and ideally with data distributed across several computers.

Solutions such as Spark are not complete; they only offer basic functionality on which to build something else.

We don't just need to be able to compute summaries, as we do with databases; we need to be able to do all operations on big data, such as multiplying two big matrices, fitting mixed-effects models, running MCMC, etc.

bigmemory and ff let you do some simple things, but they cannot be used by other packages like lme4.

@quinnj
Member

quinnj commented Jul 24, 2016

I think this can be closed in light of the new Feather.jl package, which is essentially a cross-language binary storage format for dataframes (R, pandas, etc.). It's fast and efficient because the files are mmapped directly and each column is unsafe_wrap(Array)'d into NullableArrays, which then make up the DataFrame. Obviously, NullableArray support is not quite 100% with DataFrames, but it should be soon.
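Usage is a one-liner (hypothetical file name):

```julia
using Feather

# Reading mmaps the file; columns wrap the mapped bytes without copying.
df = Feather.read("mydata.feather")   # returns a DataFrame
```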

@ViralBShah
Contributor

Closing. Can be reopened in the context of Feather.jl.
