This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

About parquet file read and write problem #529

Closed
ives9638 opened this issue Oct 14, 2021 · 5 comments
Labels
bug · Something isn't working
no-changelog · Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@ives9638

I tried to use parquet compression and found several problems:

  • With compression enabled, the decompression process is very slow.
  • Comparing codecs: for strings, the output sizes of zstd and lz4 differ greatly.
  • For Encoding::DeltaLengthByteArray, when is_optional=true, the writer returns Err(NotYetImplemented).

Some data:
schema = DataSchema::new( DataField::new("sec", DataType::String, true) );
The size of a single value is about 4 KB, and the Arrow2 array has 81920 rows in total.

Using zstd compression, the disk file size is about 10.8 MB; using lz4 compression, it is about 18 MB.
Decompression speed with zstd: one data page (8192 rows) takes about 959 ms to decompress.
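For reference, here is a stand-alone sketch of how that codec size comparison could be reproduced outside of arrow2. The zstd and lz4_flex crates, the default compression level, and the synthetic payload are all assumptions here and are not necessarily what arrow2/parquet2 bind to; real string data will compress differently.

```rust
// Hypothetical stand-alone comparison, not arrow2's code path: compress one
// "page" worth of data (8192 rows x ~4 KB) with zstd and lz4 and print the sizes.
fn main() -> std::io::Result<()> {
    // Deterministic pseudo-random lowercase ASCII so the codecs have
    // non-trivial input (simple LCG, no external RNG crate).
    let mut state: u64 = 0x9E3779B97F4A7C15;
    let mut page = Vec::with_capacity(8192 * 4 * 1024);
    for _ in 0..8192 * 4 * 1024 {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        page.push(b'a' + ((state >> 59) as u8) % 26);
    }

    let zstd_bytes = zstd::encode_all(&page[..], 0)?; // level 0 = zstd's default level
    let lz4_bytes = lz4_flex::compress_prepend_size(&page);

    println!("uncompressed: {} bytes", page.len());
    println!("zstd:         {} bytes", zstd_bytes.len());
    println!("lz4:          {} bytes", lz4_bytes.len());
    Ok(())
}
```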

Hoping to get some help.

@ives9638
Author

I found that this line:
buffer.resize(compressed_page.uncompressed_size(), 0);
is the main cause of the slowness.

@jorgecarleitao
Owner

Hey @ives9638, thanks a lot for these!

I do not know how to avoid that resize: we need to resize the buffer before decompressing data into it. We could .clear it so that we avoid the extra memcopy, but if a new page is larger, we will need the allocation anyways, right?
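To make that trade-off concrete, here is a minimal sketch in plain std (the helper name prepare is made up; this is not arrow2's code):

```rust
// clear() drops the old contents but keeps the capacity, so a later resize()
// that has to reallocate copies nothing; without clear(), resize() would first
// copy the old bytes into the new allocation. Either way, a page larger than
// anything seen before still triggers a reallocation.
fn prepare(buffer: &mut Vec<u8>, uncompressed_size: usize) {
    buffer.clear();
    buffer.resize(uncompressed_size, 0); // zero-fill up to the page's uncompressed size
}
```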

For the compression speed itself, IMO we have no easy answer here: we depend on libraries to do the compression/decompression, which have varying degrees of performance. Or do you think there is an issue in how we are using them that is causing the slowness?

For the delta-encoding, yeap, on the todo list. Feel welcome to patch it :P

@houqp
Collaborator

houqp commented Oct 16, 2021

Would be good to benchmark to see whether the resize overhead is actually coming from the value initialization or from the memory reallocation. I think using resize might be overkill here, because setting the newly allocated memory to a default value of 0 doesn't provide any value: we will overwrite those 0s during the subsequent decompression anyway. Perhaps it's better to manually check the vector length and use reserve instead?
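A minimal std-only sketch of such a benchmark, assuming a made-up page size of 32 MB (roughly 8192 rows × 4 KB) and an arbitrary iteration count; it only separates the zero-initialization cost of resize from the pure allocation cost of reserve and is not arrow2's actual read path:

```rust
use std::time::Instant;

fn main() {
    // Hypothetical sizes, chosen only to make the two costs visible.
    const UNCOMPRESSED_SIZE: usize = 32 * 1024 * 1024;
    const ITERS: usize = 200;

    // A: zero-fill the whole buffer on every page (what `resize(.., 0)` does).
    let mut buf: Vec<u8> = Vec::new();
    let t = Instant::now();
    for _ in 0..ITERS {
        buf.clear();
        buf.resize(UNCOMPRESSED_SIZE, 0); // allocates once, then zero-fills every iteration
    }
    println!("clear + resize(.., 0): {:?} (cap {})", t.elapsed(), buf.capacity());

    // B: only make sure the capacity is there (what a `reserve`-based approach
    // would do); the decompressor would then have to append into the Vec
    // instead of filling a pre-sized slice with `read_exact`.
    let mut buf: Vec<u8> = Vec::new();
    let t = Instant::now();
    for _ in 0..ITERS {
        buf.clear();
        buf.reserve(UNCOMPRESSED_SIZE); // allocates once, no initialization afterwards
    }
    println!("clear + reserve(..):   {:?} (cap {})", t.elapsed(), buf.capacity());
}
```

After the first iteration the reserve variant should be nearly free, while the resize variant pays a full memset of the page size every time.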

@jorgecarleitao
Owner

I checked the APIs of all the decompressors we have atm, and they all use read_exact, which requires an already-allocated &mut [u8]. So, I think we do need the resize.
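For context, this is the shape of the constraint (plain std sketch, not the actual arrow2/parquet2 code; decompress_into is a made-up name): read_exact fills an existing slice, so the Vec must already hold uncompressed_size initialized bytes before the call.

```rust
use std::io::{self, Read};

// Sketch only: any decompressor exposed as `Read` is drained with `read_exact`,
// which fills a pre-existing `&mut [u8]`; hence the buffer has to be resized
// (and therefore zero-initialized) to the uncompressed size first.
fn decompress_into<R: Read>(
    mut decompressor: R,      // e.g. a streaming zstd/lz4 decoder over the compressed page
    uncompressed_size: usize,
    buffer: &mut Vec<u8>,
) -> io::Result<()> {
    buffer.clear();
    buffer.resize(uncompressed_size, 0); // the slice must already have the target length
    decompressor.read_exact(&mut buffer[..])
}
```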

@jorgecarleitao
Owner

Hey, curious why you closed it: did you find the root cause of the slowness? I am trying to systematize the benchmarks, but I am curious how we can improve it further ^_^

@jorgecarleitao added the no-changelog label on Oct 29, 2021

3 participants