-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report / blog on parquet metadata sizes for "large" (1000+) numbers of columns #5770
Comments
I have hopes that I will find time to work on this next week but am not sure If anyone knows of prior art in this area (measuring the actual metadata size) it would be great if they linked it |
One idea would be to create a dataset with Measure the size of the metadata with
|
10K columns by 10 row groups by 1M rows is 100B values (400GB with int32). I don't think anyone has data like that (this is presumptuous, I am probably wrong). My experience has been either:
|
Also, I did do some experimentation here (what feels like a long time ago now but was at least a year ago). Two things I observed:
I realize this second point is kind of what you are trying to show. I'd be interested in measurements of the first point, at least enough to prove I was missing something (or this problem is unique to parquet-c++) |
I don't have specific data by now. At my use case, usally row-group would be limit by size rather than row-count, and we would disable statistics for most of these column. And only write page index without column index in this case |
Yes, this is what I would expect best practice to be for large schemas (especially if the actual data is sparse as @westonpace says). In general I would expect the value for query engines of min/max statistics in parquet metadata to be quite low for columns where the data isn't sorted or clustered in some way. Thus, if the default parquet writer settings always write this metadata for all columns maybe we need to adjust the default and docuemntation or something |
It may also be worth pointing out that recently support was added to truncate column statistics, page indexes have always supported this, and spark was actually already doing this for columns statistics, with dubious correctness properties. The net result though, is that even is you do write column statistics for all columns, they need not be very large |
I've also filed #5775 to track something I suggested on the mailing list, and would likely drastically reduce the overheads from metadata parsing. |
So I created a toy benchmark (https://github.com/tustvold/arrow-rs/tree/thrift-bench) of a 10,000 column parquet file, with f32 columns, no statistics and 10 row groups. Default Config The file metadata for this comes to a chunky 11MB, and it takes the #5777 thrift decoder on 45ms to parse this thrift payload. If we drop the column count to 1000, the metadata drops to 1MB and parses in ~3.7ms. If we also drop the row group count to 1, the metadata drops to 100KB and parses in 460 us. So at a very rough level the cost is 10 bytes and ~450ns per column chunk Drop ColumnMetadata Dropping the ColumnMetadata from ColumnChunk drops the size down to 3.6MB and the parsing speed down to 23ms Drop ColumnChunk Dropping the columns drops the size down to 109KB and the parsing speed down to 718 us. At this point we have effectively dropped everything apart from the schema. Arrow Schema IPC Now for comparison, I encoded a similar schema using arrow IPC to a flatbuffer. This came to a still pretty chunky 500KB. Validating the offsets in this flatbuffer takes ~1ms, which is longer than decoding the equivalent data from thrift. Thoughts
Phrasing the above differently, assuming single page ColumnChunks of 1MB, we can parse the metadata for a column chunk in 450ns. This means for a very large 10GB file we can still parse the metadata within single-digit milliseconds. |
Nice work. I do think #5770 (comment) is probably the key, that statistics would make things considerably worse, adjusting defaults across implementations probably make sense. I'm not sure what Java is set at but I think C++ will always write all statistics. |
Hi, we use parquet files with 100 row groups and 50k columns (and this is after the dataset has been split into many individual parquet files). What is worse, our use case is reading individual row groups and only subset of columns. To deal with the problem we've implemented a tool that stores an index information in a separate file which allows for reading only a necessary subset of metadata. (https://github.com/G-Research/PalletJack).. |
How big are these files? At 1MB per page (the recommended size) this would come to 5TB?! For extremely sparse data, such as feature stores, I can't help wondering if MapArray or similar would be a better way to encode this. You would lose the ability to do projection pushdown in the traditional sense, but maybe this is ok? |
I think file size depends on the number of rows per row group. With 100 row groups, 50k columns and single row per row group the file has 700MB (no compression, default page size, no stats). |
Err... Is this a reasonable size for a single parquet file? I'm more accustomed to seeing parquet files on the order of 100MB to single digit GB, with a separate catalog combining multiple files together for query |
1TB does seem to be quite large but using smaller files requires reading more metadata (with their associated IO). In out system we write files in the range of 50-300MB and reading metadata was so expensive on large queries (we measured it as taking ~30% of total query processing time in some cases) that we built an entire separate system similar to https://github.com/G-Research/PalletJack to deal with it. |
My reading of https://github.com/G-Research/PalletJack is it is filling a similar role to a catalog like Hive, Deltalake or iceberg. It makes sense, at least to me, that applications would want to build additional metadata structures over the top of collections of parquet files, that are then optimised for their particular read/write workloads? |
I'm not sure that's right. A catalog can do many things and certainly some of those things don't belong in the storage format (eg grouping individual parquet files together into to some logical group relevant to the query or something), but if the catalog is just duplicating directly data from the parquet footer because reading the footer is too expensive that seems like something that should be addressed in the storage format itself. |
But data locality is extremely important? If you have to scan a load of files only to ascertain they're not of interest, that will be wasteful regardless of how optimal the storage format is? Most catalogs collocate file statistics from across multiple files so that the number of files can be quickly and cheaply whittled down. Only then does it consult those files that haven't been eliminated and perform more granular push down to the row group and page level using the statistics embedded in those files. Or at least that's the theory... |
Right, but there are two different levels here. You can store statistics that allow you to prune based only on metadata, but what's left you just have to go and scan. And the scanning requires reading the column chunk metadata. So if after the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file to get the 5 columns chunks is going to really add up. How much of that cost is related to thrift decoding specifically? I think not much, and the real issue is just IO. But if you could read a file-specific metadata header that allows you to prune IO by reading only the column chunk metadata you actually need (whether it is encoded using thrift or flatbuffers or protobuf or XYZ) then I think that could definitely help a lot. Then again, when dealing with object storage doing a bunch of random access reads can be counter-productive so ¯_(ツ)_/¯ |
This 1000%, even with the new SSD backed stores latencies are still on the order of milliseconds
Yeah at that point your only option really is to alter the write path to distribute the data in such a way that queries can eliminate files, no metadata chicanery is going to save you if you have to open 100K files 😅 |
Nimble (the new Meta storage format) has an interesting approach to this. From what I understand they store most of the data that would go in the footer in parquet inline in the column chunk itself. So the footer just needs to describe how the actual column chunks are laid out (along with the schema) and since you need to fetch the column chunk anyway you can avoid reading all the column-chunk specific stuff for columns you don't care about. But they also don't support predicate pushdown and so once you add that in you end with more IOPS to prune data pages (I think) |
This sounds like "drop column chunk metadata" in this section? #5770 (comment) This can be enabled also in standard parquet then? |
I don't think so because you need the column chunk metadata to actually read the column. |
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863-L867 |
Yeah, you're right and I think I just misunderstood the original comment (it gets confusing between the rust types in parquet and the thrift structs defined by the parquet thrift IDL :)). But I guess you could just have the column chunk as the bare minimum of just an offset and path and then just do decoding purely based on the data page headers. |
I created a thread on the parquet mailing list to get some other perspectives on this topic, as I think it is fundamental to the expectations of parquet's metadata - https://lists.apache.org/thread/lv5gtfsbqhs7nrxclrq97nt50msxdst2 |
If the size of entire dataset is tens of TBs then you need to use larger files or you end up having a lot of files (many thousands) which is hard to manage. Also, thinking about te future of ML, it is clear to me that a requirement for even larger datasets is not unrealistic. |
PalletJack is on a lower level than catalogs like Hive, Deltalake or Iceberg. It is designed for use with a individual parquet files for a specific use case of "To be able to decode/parse only a minimum amount of parquet metadata to be able to read the requested sample of data from a parquet file". |
Catalogs are designed to easily scale to PBs across potentially millions of files, if scalability is the goal an approach tied to single files will inevitably hit a scaling wall
This what catalogs are designed to provide, but can do so across file boundaries, significantly improving performance and scalability |
@XiangpengHao has kindly agreed to do some more analysis and a writeup |
I'm happy to experiment with this, please assign me! |
@XiangpengHao and @tustvold and @wiedld spoke a little about this, and here was my understanding of the plan: We will run experiments (perhaps based on @tustvold 's tool from #5770 (comment)) Specifically create these scenario:
The for each scenario, measure:
This should result in a table like for each of "metadata size in bytes" and "decode perfor
Other potential experiments to run:
|
Is anyone willing to review a draft blog post on this subject? Perhaps @jhorstmann ? If so please email me at alamb@influxdata.com |
Blog is published: https://www.influxdata.com/blog/how-good-parquet-wide-tables/ 🎉 |
|
|
|
I'll point out that we've had good results by thinking about what sorts of queries we want to run on our data and only storing stats for those columns. It's a trivial optimization, similar to not creating an index for every column in a OLTP database, but I don't see it discussed / a lot of high level writers don't give you that choice. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are several proposals for remedying perceived issues with parquet which generally propose new formats. For example Lance V2 and Nimble
One of the technical challenges raised about Parquet is that the metadata is encoded such that the entire footer must be read and decoded prior to reading any data.
As the numer of columns increases, the argument goes, the size of the parquet metadata increases beyond the ~8MB sweet spot for a single object store request as well as requiring substantial CPU to decode
However, my theory is that the reason that parquet metadata is typically so large for schemas with many columns is the embedded min/max statistical values for columns / pages
Describe the solution you'd like
I would like to gather data on parquet footer metadata size as a function of:
And then report this in a blog with some sort of conclusion about how well parquet can handle large schemas
Bonus points if we can also measure in memory size (though this will of course vary from implementation to implementation)
Describe alternatives you've considered
Additional context
Related discussion with @wesm on twitter: https://twitter.com/wesmckinn/status/1790884370603024826
Cited issue apache/arrow#39676
The text was updated successfully, but these errors were encountered: