Support "Avro Old List Structure" in Parquet reader #764

Closed
philrz opened this issue May 12, 2020 · 5 comments · Fixed by #4547

Comments

philrz commented May 12, 2020

I'm kinda parroting back concepts here that I don't fully understand, so please bear with me.

When outputting Parquet sample data with the NiFi ParquetRecordSetWriter, data that had started life as JSON arrays was written in a format that zq (as of commit c1360a8) choked on. These two files are examples:

dns.parquet.gz
http.parquet.gz

The error message is:

$ zq -t -i parquet dns.parquet 
dns.parquet: LIST element (Answers) should have 1 child

Looking at the zq code for where this error came from led me to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists, which in turn inspired me to fiddle with the knobs in the ParquetRecordSetWriter. Through trial and error, I stumbled onto this setting:

image.png

That setting normally defaults to True, and the attachments above that zq choked on were generated when that was set to True. When I change it to False as is shown here, now the outputs are the attachments shown below, which zq reads without complaint.

dns.parquet.gz
http.parquet.gz

At some point we should investigate this further and confirm whether we need to support more list structure variations so that zq can read the default output from NiFi without problems.

aswan commented May 12, 2020

@philrz I think the title should be "old" list structure, no? With that setting set to false, we are not writing the old structure which means we are writing the new structure and that works. But with that setting true, we are writing the old structure and that does not work. Do I have that right?
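
To make the old/new distinction concrete, here is a minimal sketch of the backward-compatibility rules for LIST groups described in the LogicalTypes.md page linked above. It is not zq's code and uses no Parquet library: `Node` and `list_element` are hypothetical names over a simplified schema tree. It also assumes that the "old" Avro layout is the two-level form with a repeated primitive field named `array`, and that the "new" layout is the standard three-level `list`/`element` form.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Simplified Parquet schema node (hypothetical model, not a library type)."""
    name: str
    repetition: str  # "required", "optional", or "repeated"
    children: List["Node"] = field(default_factory=list)  # empty for primitives

    @property
    def is_group(self) -> bool:
        return bool(self.children)


def list_element(list_group: Node) -> Node:
    """Return the node describing the elements of a LIST-annotated group."""
    if len(list_group.children) != 1:
        raise ValueError(f"LIST group ({list_group.name}) should have 1 child")
    repeated = list_group.children[0]
    if repeated.repetition != "repeated":
        raise ValueError(f"LIST group ({list_group.name}) child must be repeated")
    # Rule 1: a repeated primitive is itself the element (two-level list).
    if not repeated.is_group:
        return repeated
    # Rule 2: a repeated group with several fields is itself the element.
    if len(repeated.children) > 1:
        return repeated
    # Rule 3: legacy names mark a repeated single-field group as the element.
    if repeated.name in ("array", list_group.name + "_tuple"):
        return repeated
    # Rule 4: otherwise this is the standard three-level layout.
    return repeated.children[0]


# Assumed "old" layout: optional group answers (LIST) { repeated binary array; }
old = Node("answers", "optional", [Node("array", "repeated")])
# Standard layout: optional group answers (LIST) { repeated group list { optional binary element; } }
new = Node("answers", "optional",
           [Node("list", "repeated", [Node("element", "optional")])])

print(list_element(old).name)  # array
print(list_element(new).name)  # element
```

A reader that only handles the last rule would look for a child under the repeated field. In the assumed old layout that field has no children, which would be consistent with the "should have 1 child" error reported above.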

@philrz philrz changed the title Support "Avro New List Structure" in Parquet reader Support "Avro Old List Structure" in Parquet reader May 12, 2020

philrz commented May 12, 2020

@aswan: Indeed, I think you're right. I flipped it around in the title. Stupid double negatives.

philrz commented Sep 16, 2021

I revisited this one during an old issue scrub just to see if it had maybe been magically fixed. As it turns out, it got a little worse. I confirmed via binary search that starting at Zed commit 3f05294, which is associated with the Parquet rewrite in #2227 (cc: @nwt), attempting to read the test data in the "old list structure" format now produces a crash instead of an error message.

$ zq -version
Version: v0.29.0-156-g3f05294b

$ zq -f tzng -i parquet dns-old-list-structure.parquet
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/brimsec/zq/zio/parquetio.newType(0xc0001381b8, 0xc0000a2000)
	/Users/phil/work/zed/zio/parquetio/type.go:47 +0x30c
github.com/brimsec/zq/zio/parquetio.newRecordType(0xc0001381b8, {0xc0002c6b00, 0x19, 0x0})
	/Users/phil/work/zed/zio/parquetio/type.go:16 +0x9e
github.com/brimsec/zq/zio/parquetio.NewReader({0x1903ec0, 0xc0001381c0}, 0x0)
	/Users/phil/work/zed/zio/parquetio/reader.go:28 +0x8f
github.com/brimsec/zq/zio/detector.lookupReader({0x1903ec0, 0xc0001381c0}, 0xc0001381b8, {0x7ffeefbffa26, 0xc00021f798}, {{0x7ffeefbffa1e, 0x7}, {0x1, 0x80000, 0xa00000}, ...})
	/Users/phil/work/zed/zio/detector/lookup.go:92 +0x6b0
github.com/brimsec/zq/zio/detector.OpenFromNamedReadCloser(0x0, {0x37e1c18, 0xc0001381c0}, {0x7ffeefbffa26, 0x1e}, {{0x7ffeefbffa1e, 0x7}, {0x1, 0x80000, 0xa00000}, ...})
	/Users/phil/work/zed/zio/detector/file.go:46 +0x170
github.com/brimsec/zq/zio/detector.OpenFileWithContext({0x1911270, 0xc00013c008}, 0x203000, {0x7ffeefbffa26, 0x1e}, {{0x7ffeefbffa1e, 0x7}, {0x1, 0x80000, 0xa00000}, ...})
	/Users/phil/work/zed/zio/detector/file.go:33 +0x178
github.com/brimsec/zq/zio/detector.OpenFile(...)
	/Users/phil/work/zed/zio/detector/file.go:21
github.com/brimsec/zq/cli/inputflags.(*Flags).Open(0xc0001ffbc0, 0xc00021fe18, {0xc00013a170, 0x1, 0x1784d46}, 0x1)
	/Users/phil/work/zed/cli/inputflags/flags.go:91 +0x170
main.(*Command).Run(0xc0001ffba0, {0xc00013a170, 0x1, 0x1})
	/Users/phil/work/zed/cmd/zq/zq.go:158 +0x62b
github.com/mccanne/charm.(*instance).run(0xc000163420, {0xc00013a130, 0x1d82f40, 0xc000068738})
	/Users/phil/.go/pkg/mod/github.com/mccanne/charm@v0.0.3-0.20191224190439-b05e1b7b1be3/instance.go:53 +0x19b
github.com/mccanne/charm.(*Spec).ExecRoot(0xc0000001a0, {0xc00013a130, 0x5, 0x5})
	/Users/phil/.go/pkg/mod/github.com/mccanne/charm@v0.0.3-0.20191224190439-b05e1b7b1be3/charm.go:77 +0x68
main.main()
	/Users/phil/work/zed/cmd/zq/main.go:9 +0x5f

philrz commented Nov 29, 2022

I've confirmed that this issue still remains in the current Zed GA release, tagged v1.3.0. However, the new Parquet library that should come with the changes for #4226 may magically fix this, so I've created a dependency as a reminder to check that when it's done.

$ zq -version
Version: v1.3.0

$ zq -i parquet dns-old-list-structure.parquet
panic: runtime error: index out of range [0] with length 0

goroutine 21 [running]:
github.com/brimdata/zed/zio/parquetio.newType(0x17106a0?, 0xc0000c6600?)
	/Users/phil/work/zed/zio/parquetio/type.go:46 +0x2e6
github.com/brimdata/zed/zio/parquetio.newRecordType(0x29fdfb60?, {0xc000262400, 0x19, 0x17fd9d8?})
	/Users/phil/work/zed/zio/parquetio/type.go:15 +0xa5
github.com/brimdata/zed/zio/parquetio.NewReader(0x100b7fd?, {0x29fdfb40?, 0xc0003900e0?})
	/Users/phil/work/zed/zio/parquetio/reader.go:28 +0x79
github.com/brimdata/zed/zio/anyio.lookupReader(0xc0000ca720, {0x29fdfb40?, 0xc0003900e0}, {{0x7ff7bfeff9b6, 0x7}, {0x0, 0x80000, 0x40000000, 0x0}, 0x0})
	/Users/phil/work/zed/zio/anyio/lookup.go:46 +0xaa8
github.com/brimdata/zed/zio/anyio.NewFile(0x0?, {0x29fdfb18?, 0xc0003900e0}, {0x7ff7bfeff9be, 0x1e}, {{0x7ff7bfeff9b6, 0x7}, {0x0, 0x80000, 0x40000000, ...}, ...})
	/Users/phil/work/zed/zio/anyio/file.go:54 +0x19e
github.com/brimdata/zed/zio/anyio.Open.func1()
	/Users/phil/work/zed/zio/anyio/file.go:31 +0x17a
created by github.com/brimdata/zed/zio/anyio.Open
	/Users/phil/work/zed/zio/anyio/file.go:22 +0x28e

nwt added a commit that referenced this issue Apr 25, 2023
Reading and writing are much faster with it than with
github.com/fraugster/parquet-go.  Its only apparent drawback is that it
offers no easy way to support Zed's duration and float16 types, and
writing a value containing either produces a cryptic error.

    $ echo '{a:1.(float16)}' | zq -f parquet -
    parquetio: unsupported type: not implemented yet

Closes #764, closes #4278, and closes #4527.
@nwt nwt closed this as completed in deea4a4 Apr 27, 2023
@nwt nwt closed this as completed in #4547 Apr 27, 2023

philrz commented Apr 27, 2023

Verified in Zed commit deea4a4.

The Parquet format that could not be read previously is now readable.

$ zq -version
Version: v1.7.0-50-gdeea4a47

$ zq -i parquet dns-old-list-structure.parquet
{_path:"dns",ts:1.521835106346483e+09,uid:"CVjl7O3baEWuwy9z7h",id_orig_h:"10.47.6.153",id_orig_p:48895,id_resp_h:"10.0.0.100",id_resp_p:53,proto:"udp",trans_id:25071,rtt:0.0008471012115478516,query:"ise.wrccdc.org",qclass:1,qclass_name:"C_INTERNET",qtype:1,qtype_name:"A",rcode:0,rcode_name:"NOERROR",AA:false,TC:false,RD:true,RA:true,Z:0,answers:["ise.wrccdc.cpp.edu","134.71.3.16"],TTLs:[3213.,42813.],rejected:false}
{_path:"dns",ts:1.521835106550864e+09,uid:"CK8Y7O3sMvrW8K3jAj",id_orig_h:"10.47.6.153",id_orig_p:25682,id_resp_h:"10.0.0.100",id_resp_p:53,proto:"udp",trans_id:32379,rtt:0.0010280609130859375,query:"ise.wrccdc.org",qclass:1,qclass_name:"C_INTERNET",qtype:1,qtype_name:"A",rcode:0,rcode_name:"NOERROR",AA:false,TC:false,RD:true,RA:true,Z:0,answers:["ise.wrccdc.cpp.edu","134.71.3.16"],TTLs:[3213.,42813.],rejected:false}
{_path:"dns",ts:1.521835107572815e+09,uid:"CWfcAe1JxOK6ySGqg1",id_orig_h:"10.47.8.155",id_orig_p:65313,id_resp_h:"10.0.0.100",id_resp_p:53,proto:"udp",trans_id:40411,rtt:0.000843048095703125,query:"pagead2.googlesyndication.com",qclass:1,qclass_name:"C_INTERNET",qtype:1,qtype_name:"A",rcode:0,rcode_name:"NOERROR",AA:true,TC:false,RD:true,RA:true,Z:0,answers:["0.0.0.0"],TTLs:[0.],rejected:false}
{_path:"dns",ts:1.521835108206925e+09,uid:"CahmAl1YJk8G8ZRkx8",id_orig_h:"10.47.2.156",id_orig_p:50444,id_resp_h:"10.0.0.100",id_resp_p:53,proto:"udp",trans_id:51310,rtt:0.0006799697875976562,query:"img03.en25.com",qclass:1,qclass_name:"C_INTERNET",qtype:1,qtype_name:"A",rcode:0,rcode_name:"NOERROR",AA:true,TC:false,RD:true,RA:true,Z:0,answers:["0.0.0.0"],TTLs:[0.],rejected:false}
{_path:"dns",ts:1.521835108259995e+09,uid:"CiXa3Vdju3WPYCYea",id_orig_h:"10.47.2.156",id_orig_p:59419,id_resp_h:"10.0.0.100",id_resp_p:53,proto:"udp",trans_id:5540,rtt:0.0011110305786132812,query:"d16u07fvuodaqr.cloudfront.net",qclass:1,qclass_name:"C_INTERNET",qtype:1,qtype_name:"A",rcode:0,rcode_name:"NOERROR",AA:false,TC:false,RD:true,RA:true,Z:0,answers:["52.85.83.143","52.85.83.20","52.85.83.166","52.85.83.104","52.85.83.215","52.85.83.224","52.85.83.117","52.85.83.177"],TTLs:[180.,180.,180.,180.,180.,180.,180.,180.],rejected:false}
{_path:"dns",ts:1.521835108275232e+09,uid:"CJbycPYKdHFUatqT8",id_orig_h:"10.47.2.156",id_orig_p:53533,id_resp_h:"10.0.0.100",id_resp_p:53,proto:"udp",trans_id:53336,rtt:0.022866010665893555,query:"e5211.dscb.akamaiedge.net",qclass:1,qclass_name:"C_INTERNET",qtype:28,qtype_name:"AAAA",rcode:0,rcode_name:"NOERROR",AA:false,TC:false,RD:true,RA:true,Z:0,answers:["2600:1406:3:383::145b","2600:1406:3:380::145b"],TTLs:[20.,20.],rejected:false}
{_path:"dns",ts:1.521835108472163e+09,uid:"CHEGm52ZEuDizHUbla",id_orig_h:"10.47.2.156",id_orig_p:56577,id_resp_h:"10.0.0.100",id_resp_p:53,proto:"udp",trans_id:5534,rtt:0.0009570121765136719,query:"hm.baidu.com",qclass:1,qclass_name:"C_INTERNET",qtype:1,qtype_name:"A",rcode:0,rcode_name:"NOERROR",AA:true,TC:false,RD:true,RA:true,Z:0,answers:["0.0.0.0"],TTLs:[0.],rejected:false}
{_path:"dns",ts:1.521835108492593e+09,uid:"C8EaBf3Vmkqgvu9JJj",id_orig_h:"10.47.2.156",id_orig_p:64234,id_resp_h:"10.0.0.100",id_resp_p:53,proto:"udp",trans_id:53213,rtt:0.0006020069122314453,query:"img03.en25.com",qclass:1,qclass_name:"C_INTERNET",qtype:1,qtype_name:"A",rcode:0,rcode_name:"NOERROR",AA:true,TC:false,RD:true,RA:true,Z:0,answers:["0.0.0.0"],TTLs:[0.],rejected:false}
{_path:"dns",ts:1.521835109548778e+09,uid:"CCIFlX3wAQaekXmWs2",id_orig_h:"10.47.3.156",id_orig_p:26388,id_resp_h:"10.0.0.100",id_resp_p:53,proto:"udp",trans_id:20882,rtt:0.0011548995971679688,query:"pkg.cdn.trueos.org",qclass:1,qclass_name:"C_INTERNET",qtype:28,qtype_name:"AAAA",rcode:0,rcode_name:"NOERROR",AA:false,TC:false,RD:true,RA:true,Z:0,answers:["pkg.pcbsd.scaleengine.net","pcbsd-pkg.secdn.net","2001:470:1:63a::3:81"],TTLs:[77305.,173.,173.],rejected:false}
{_path:"dns",ts:1.521835109553831e+09,uid:"CiBlLh3i6wWc34BaRa",id_orig_h:"10.47.6.153",id_orig_p:37584,id_resp_h:"10.0.0.100",id_resp_p:53,proto:"udp",trans_id:50067,rtt:0.0007789134979248047,query:"safebrowsing.google.com",qclass:1,qclass_name:"C_INTERNET",qtype:1,qtype_name:"A",rcode:0,rcode_name:"NOERROR",AA:false,TC:false,RD:true,RA:true,Z:0,answers:["sb.l.google.com","172.217.11.78"],TTLs:[21558.,258.],rejected:false}

Thanks @nwt!
