ParquetLoader Exceptions Due to Block Size #118
Comments
I'd recommend improving the ParquetLoader to handle larger data by moving from standard arrays to our BigArray implementation.
@justinormont Pestering a bit: would you consider whether using .NET ArrayPool would help with GC and other performance-related things? It looks like there might be heap-fragmentation issues with how BigArray allocates that structure.
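For illustration, here is a minimal sketch of the ArrayPool pattern being suggested, assuming an arbitrary 1 MB block buffer (the buffer size and the ProcessBlock helper are hypothetical, not part of the ParquetLoader API). Renting reuses pooled buffers instead of allocating a fresh large array per block, which keeps repeated reads off the large object heap (arrays of 85,000 bytes or more land there).

```csharp
using System;
using System.Buffers;

public static class PooledBlockReader
{
    // Hypothetical block size; real Parquet block sizes come from the file metadata.
    private const int BlockSize = 1 << 20; // 1 MB

    public static void ProcessBlock(Action<byte[], int> process)
    {
        // Rent may hand back a buffer larger than requested, so track the logical length separately.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(BlockSize);
        try
        {
            // ... fill `buffer` with up to BlockSize bytes from the current block ...
            process(buffer, BlockSize);
        }
        finally
        {
            // Returning the buffer lets the next block read reuse it instead of allocating again.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```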
@veikkoeeva: As I'm a bit out of my element, my normal fall-back answer is to run perf tests and see whether GC performance is being impacted. Stepping back a bit, one larger question for the ParquetLoader is whether we can operate more in a streaming fashion and not store large blocks. The TransposeLoader accomplishes a similar task to the ParquetLoader by loading a column-oriented dataset.
@justinormont Working in chunks was actually what I was thinking too: load data in, process it in chunks, and release arrays that aren't needed anymore. Depending on how this particular type is used, I would imagine it having serious fragmentation issues in "production systems", i.e. where this code isn't run in isolation. I'm still learning this and try not to hijack issues for other discussions... but sometimes I just try to be quick about it. This goes with what I was thinking at #109 (comment) too: have a few sets, column- and row-oriented, and operate on them, preferably making the sources and sinks such that one can pass chunks in and out, thereby relieving memory pressure (which is basically a blocker in many production loads). I think the idea is there, but I don't yet see how to pull all of this together in my head, and I do wonder if there are too many "specific types" instead of more common types building on .NET idioms and then extended a bit in specific ways (I might be wrong about this). :) I'll take a look at TransposeLoader.
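As a rough sketch of the chunked, streaming shape both comments are describing (the row source and chunk size here are assumptions, not the actual loader API): only one chunk is materialized at a time, and its buffer goes back to the pool as soon as the consumer moves on, so memory use stays bounded regardless of dataset size.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

public static class ChunkedPipeline
{
    // Pulls rows from an arbitrary source and yields them in fixed-size, pooled chunks.
    public static IEnumerable<ArraySegment<T>> ReadInChunks<T>(IEnumerator<T> rows, int chunkSize)
    {
        var pool = ArrayPool<T>.Shared;
        while (true)
        {
            T[] buffer = pool.Rent(chunkSize);
            int count = 0;
            while (count < chunkSize && rows.MoveNext())
                buffer[count++] = rows.Current;

            if (count == 0)
            {
                pool.Return(buffer);
                yield break; // source exhausted
            }

            try
            {
                yield return new ArraySegment<T>(buffer, 0, count);
            }
            finally
            {
                // The consumer must be done with the chunk before requesting the next one.
                pool.Return(buffer);
            }
        }
    }
}
```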
I think the loader isn't keeping block data around. The issue is: when coming up with a sequence of blocks to load, when using shuffling to see the data in a different order (important for some stochastic learners), the problem is the sequence. So the problem isn't that we're loading the file into memory and that's too large; this issue is meant to address a situation in which the total number of blocks or number of rows per block is so large you cannot allocate an array to describe what order you should be serving them in. My question would be: did we actually observe this happening, or is it more of a theoretical concern that it might happen under wildly improbable conditions? I'm just trying to imagine what an actual failure would look like. In the past I've seen parquet files with block sizes in the hundreds of MBs, with a typical dataset having perhaps dozens to hundreds of such blocks. Even if we somehow configured the block size at 1 MB, the idea of a parquet file with billions of blocks in it is a bit hard to swallow... I've seen petabyte-sized datasets, but I don't think I've ever seen an attempt to store such a thing in a single Parquet file. Maybe I'm behind the times and usage has changed enormously; I haven't had occasion to use Spark in about a year or so... If it's something we think might happen, then @justinormont's suggestion of BigArray seems reasonable.
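For context on why a plain array eventually breaks here: a .NET array is indexed by Int32, so a flat permutation of block or row indices tops out around two billion entries and, well before that, needs one very large contiguous allocation. Below is my own illustration of the jagged "big array" idea (not ML.NET's actual BigArray type): the index sequence is split into fixed-size pages, addressed by a 64-bit position, and shuffled in place.

```csharp
using System;

// Illustrative only: a paged index sequence addressed by a 64-bit position,
// roughly the role a BigArray-style type would play for the shuffle order.
public sealed class PagedIndexSequence
{
    private const int PageSize = 1 << 20;   // ~1M entries per page (arbitrary)
    private readonly long[][] _pages;
    public long Length { get; }

    public PagedIndexSequence(long length)
    {
        Length = length;
        long pageCount = (length + PageSize - 1) / PageSize;
        _pages = new long[pageCount][];
        for (long p = 0; p < pageCount; p++)
        {
            int size = (int)Math.Min(PageSize, length - p * PageSize);
            _pages[p] = new long[size];
            for (int i = 0; i < size; i++)
                _pages[p][i] = p * PageSize + i;   // identity order to start with
        }
    }

    public long this[long position]
    {
        get => _pages[position / PageSize][position % PageSize];
        set => _pages[position / PageSize][position % PageSize] = value;
    }

    // Fisher-Yates shuffle over the full 64-bit range of positions.
    public void Shuffle(Random rng)
    {
        for (long i = Length - 1; i > 0; i--)
        {
            // NextDouble keeps the sketch short; the clamp guards against rounding at the top end.
            long j = Math.Min((long)(rng.NextDouble() * (i + 1)), i);
            (this[i], this[j]) = (this[j], this[i]);
        }
    }
}
```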
@TomFinley I wonder if you're writing to me also. :) It'd be a separate issue to go into the internals here. The reason I wrote that the jagged array might be problematic is: the loader doesn't need to keep block data around; it's enough that it just creates them. They're potentially large, and there might be many loaders. Say someone writes a tool to tune parameters and handle data in https://github.com/dotnet/orleans and uses this library (potentially many objects instantiated and removed frequently). It would be ideal to have benchmarking for potentially critical pieces, but in general terms I think the problem gets worse if one uses more loaders, or if the system otherwise uses large chunks of memory or operations that cause pinning (which production systems with Sockets or other native pointers do). One effective way to combat these particular problems is allocating a large slab of memory and then giving out chunks as needed, effectively pooling. Pinning inside these large, contiguous areas of memory isn't a problem since the memory doesn't need to be moved, and similarly fragmentation doesn't become a problem (so this particular issue would also be mitigated). Poolers such as ArrayPool do exactly this. My experience has been -- this is anecdotal and subjective as written, but searching the Internet reveals more accurate accounts -- that code allocating large arrays (or jagged arrays) that are even a bit longer-lived results in heap fragmentation, which has then led to OOMs since there isn't a sufficiently large contiguous block to allocate -- or just performance degradation. What might that look like? Some cases for perusal:
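A bare-bones sketch of the "allocate a large slab, hand out chunks" approach described above; the 64 MB default and the sequential hand-out policy are assumptions for illustration. As the comment argues, the slab is a single long-lived allocation, so handing out and reclaiming chunks never fragments the heap, and pinning a pointer into the slab costs little because the slab isn't going to move anyway.

```csharp
using System;

// Illustrative slab allocator: one large backing array, chunks handed out as slices.
// Not thread-safe; chunks are reclaimed all at once via Reset().
public sealed class SlabAllocator
{
    private readonly byte[] _slab;
    private int _offset;

    public SlabAllocator(int slabSize = 64 * 1024 * 1024)  // 64 MB, an assumed default
    {
        // A single allocation this size goes straight to the large object heap
        // and then stays put for the lifetime of the allocator.
        _slab = new byte[slabSize];
    }

    public ArraySegment<byte> Allocate(int size)
    {
        if (_offset + size > _slab.Length)
            throw new InvalidOperationException("Slab exhausted; allocate a new slab or reset.");

        var chunk = new ArraySegment<byte>(_slab, _offset, size);
        _offset += size;
        return chunk;
    }

    // Makes the whole slab reusable; callers must not touch previously handed-out chunks.
    public void Reset() => _offset = 0;
}
```

ArrayPool and a slab solve slightly different problems: the pool reuses many similarly sized buffers, while the slab keeps chunks inside one contiguous, stationary block of memory.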
A nice case: https://www.infoq.com/presentations/bing-net-performance; some future discussion: https://github.com/dotnet/coreclr/issues/12426. <edit: For those interested in the new memory-handling classes: ...>
These exceptions should be caught and reported with a suggestion to alter the block size as a fix.
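A hedged sketch of what that could look like at the call site; the exception types caught and the blockSizeInUse parameter are assumptions, since the actual ParquetLoader surface may differ. The idea is simply to turn a bare allocation failure into an actionable message that points at the block-size setting.

```csharp
using System;

public static class ParquetLoadGuard
{
    // Wraps a block-materialization step so failures caused by an oversized block
    // surface an actionable message instead of a bare allocation error.
    public static T LoadWithBlockSizeHint<T>(Func<T> loadBlocks, long blockSizeInUse)
    {
        try
        {
            return loadBlocks();
        }
        catch (Exception ex) when (ex is OutOfMemoryException || ex is OverflowException)
        {
            throw new InvalidOperationException(
                $"Failed to materialize Parquet blocks with block size {blockSizeInUse}. " +
                "Consider re-running with a smaller block size (or fewer rows per block).",
                ex);
        }
    }
}
```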