Support "Avro Old List Structure" in Parquet reader #764
Comments
@philrz I think the title should be "old" list structure, no? With that setting set to false, we are not writing the old structure which means we are writing the new structure and that works. But with that setting true, we are writing the old structure and that does not work. Do I have that right?
@aswan: Indeed, I think you're right. I flipped it around in the title. Stupid double negatives.
I revisited this one during an old issue scrub just to see if it maybe got magically fixed. As it turns out, it got a little worse: I confirmed via binary search that, starting at Zed commit 3f05294 (associated with the Parquet rewrite in #2227, cc: @nwt), attempting to read the test data in the "old list structure" format now produces a crash instead of an error message.
I've confirmed that with current Zed GA tagged
Reading and writing are much faster with it than with github.com/fraugster/parquet-go. Its only apparent drawback is that it offers no easy way to support Zed's duration and float16 types, and writing a value containing either produces a cryptic error.
$ echo '{a:1.(float16)}' | zq -f parquet -
parquetio: unsupported type: not implemented yet
Closes #764, closes #4278, and closes #4527.
Verified in Zed commit deea4a4. The Parquet format that could not be read previously is now readable.
Thanks @nwt!
I'm kinda parroting back concepts here that I don't fully understand, so please bear with me.
When outputting Parquet sample data with the Nifi ParquetRecordSetWriter, data that had started life as JSON arrays was written in a format that zq (as of commit c1360a8) choked on. These two are one such example:
dns.parquet.gz
http.parquet.gz
The error message is:
Looking at the zq code for where this error came from led me to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists, which in turn inspired me to fiddle with the knobs in the ParquetRecordSetWriter. Through trial and error, I stumbled onto this setting:
That setting normally defaults to True, and the attachments above that zq choked on were generated when it was set to True. When I change it to False, the outputs are the attachments shown below, which zq reads without complaint.
dns.parquet.gz
http.parquet.gz
At some point we should investigate this further and confirm whether we need to expand the list variations we support so that we can read the default output from Nifi without problems.
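To make the two list variations concrete, here is a rough Python sketch of the backward-compatibility rules from the LogicalTypes.md page linked above, as I understand them. The `Node` class and `list_element` function are hypothetical names made up for illustration; they are not part of zq or of any Parquet library. The sketch also assumes that the "old" structure is the two-level layout whose repeated field is named `array`, and that the "new" structure is the three-level `list`/`element` layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a Parquet schema node."""
    name: str
    repetition: str                          # "required", "optional", or "repeated"
    children: Optional[List["Node"]] = None  # None means a primitive field

    @property
    def is_group(self) -> bool:
        return self.children is not None


def list_element(list_node: Node) -> Node:
    """Return the node that describes the elements of a LIST-annotated group."""
    if not list_node.is_group or len(list_node.children) != 1:
        raise ValueError("LIST group must contain exactly one field")
    repeated = list_node.children[0]
    if repeated.repetition != "repeated":
        raise ValueError("the field inside a LIST group must be repeated")

    # Rule 1: a repeated primitive is itself the element (two-level layout).
    if not repeated.is_group:
        return repeated
    # Rule 2: a repeated group with several fields is itself the element.
    if len(repeated.children) > 1:
        return repeated
    # Rule 3: a one-field group named "array" or "<list name>_tuple"
    # is itself the element.
    if repeated.name in ("array", list_node.name + "_tuple"):
        return repeated
    # Otherwise assume the standard three-level layout, where the element
    # is the single child of the repeated group.
    return repeated.children[0]


# Assumed two-level ("old") layout: optional group answers (LIST) { repeated ... array; }
old_style = Node("answers", "optional", [Node("array", "repeated")])

# Standard three-level layout: answers (LIST) { repeated group list { optional ... element; } }
new_style = Node("answers", "optional",
                 [Node("list", "repeated", [Node("element", "optional")])])

print(list_element(old_style).name)  # array
print(list_element(new_style).name)  # element
```

A reader that only handles the final case would reject or mishandle the first layout, which may be what is happening with the default Nifi output here, though I have not confirmed that against the zq code.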