This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

About parquet file read and write problem #529

Closed
ives9638 opened this issue Oct 14, 2021 · 5 comments
Labels
bug · Something isn't working
no-changelog · Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@ives9638

I tried to use parquet compression and found several problems:

  • With compression enabled, the decompression process is very slow.
  • Comparing codecs: for strings, the output sizes of zstd and lz4 differ greatly.
  • For Encoding::DeltaLengthByteArray, when is_optional=true, the writer returns Err(NotYetImplemented).

Some data:
schema = DataSchema::new( DataField::new("sec", DataType::String, true) );
The size of a single value is about 4 KB, and the Arrow2 array has 81920 rows in total.

Using zstd compression, the disk file size is about 10.8 MB; using lz4 compression, it is about 18 MB.
Decompression speed with zstd: one data page (8192 rows) takes about 959 ms to decompress.
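For reference, here is a stand-alone sketch of how that codec size comparison could be reproduced outside of arrow2. The zstd and lz4_flex crates, the default compression level, and the synthetic payload are all assumptions here and are not necessarily what arrow2/parquet2 bind to; real string data will compress differently.

```rust
// Hypothetical stand-alone comparison, not arrow2's code path: compress one
// "page" worth of data (8192 rows x ~4 KB) with zstd and lz4 and print the sizes.
fn main() -> std::io::Result<()> {
    // Deterministic pseudo-random lowercase ASCII so the codecs have
    // non-trivial input (simple LCG, no external RNG crate).
    let mut state: u64 = 0x9E3779B97F4A7C15;
    let mut page = Vec::with_capacity(8192 * 4 * 1024);
    for _ in 0..8192 * 4 * 1024 {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        page.push(b'a' + ((state >> 59) as u8) % 26);
    }

    let zstd_bytes = zstd::encode_all(&page[..], 0)?; // level 0 = zstd's default level
    let lz4_bytes = lz4_flex::compress_prepend_size(&page);

    println!("uncompressed: {} bytes", page.len());
    println!("zstd:         {} bytes", zstd_bytes.len());
    println!("lz4:          {} bytes", lz4_bytes.len());
    Ok(())
}
```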

Hoping to get some help.

@ives9638
Author

I found that this line:
buffer.resize(compressed_page.uncompressed_size(), 0);
is the main cause of the slowness.

@jorgecarleitao
Owner

Hey @ives9638, thanks a lot for these!

I do not know how to avoid that resize: we need to resize the buffer before decompressing data into it. We could .clear it so that we avoid the extra memcopy, but if a new page is larger, we will need the allocation anyways, right?
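To make that trade-off concrete, here is a minimal sketch in plain std (the helper name prepare is made up; this is not arrow2's code):

```rust
// clear() drops the old contents but keeps the capacity, so a later resize()
// that has to reallocate copies nothing; without clear(), resize() would first
// copy the old bytes into the new allocation. Either way, a page larger than
// anything seen before still triggers a reallocation.
fn prepare(buffer: &mut Vec<u8>, uncompressed_size: usize) {
    buffer.clear();
    buffer.resize(uncompressed_size, 0); // zero-fill up to the page's uncompressed size
}
```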

For the compression speed itself, IMO we have no easy answer here: we depend on libraries to do the compression/decompression, which have varying degrees of performance. Or do you think there is an issue in how we are using them that is causing the slowness?

For the delta-encoding, yeap, on the todo list. Feel welcome to patch it :P

@houqp
Collaborator

houqp commented Oct 16, 2021

Would be good to benchmark to see whether the resize overhead is actually coming from the value initialization or from the memory reallocation. I think using resize might be overkill here, because setting the newly allocated memory to a default value of 0 doesn't provide any value: we will overwrite those 0s during the subsequent decompression anyway. Perhaps it's better to manually check the vector length and use reserve instead?
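A minimal std-only sketch of such a benchmark, assuming a made-up page size of 32 MB (roughly 8192 rows × 4 KB) and an arbitrary iteration count; it only separates the zero-initialization cost of resize from the pure allocation cost of reserve and is not arrow2's actual read path:

```rust
use std::time::Instant;

fn main() {
    // Hypothetical sizes, chosen only to make the two costs visible.
    const UNCOMPRESSED_SIZE: usize = 32 * 1024 * 1024;
    const ITERS: usize = 200;

    // A: zero-fill the whole buffer on every page (what `resize(.., 0)` does).
    let mut buf: Vec<u8> = Vec::new();
    let t = Instant::now();
    for _ in 0..ITERS {
        buf.clear();
        buf.resize(UNCOMPRESSED_SIZE, 0); // allocates once, then zero-fills every iteration
    }
    println!("clear + resize(.., 0): {:?} (cap {})", t.elapsed(), buf.capacity());

    // B: only make sure the capacity is there (what a `reserve`-based approach
    // would do); the decompressor would then have to append into the Vec
    // instead of filling a pre-sized slice with `read_exact`.
    let mut buf: Vec<u8> = Vec::new();
    let t = Instant::now();
    for _ in 0..ITERS {
        buf.clear();
        buf.reserve(UNCOMPRESSED_SIZE); // allocates once, no initialization afterwards
    }
    println!("clear + reserve(..):   {:?} (cap {})", t.elapsed(), buf.capacity());
}
```

After the first iteration the reserve variant should be nearly free, while the resize variant pays a full memset of the page size every time.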

@jorgecarleitao
Owner

I checked the APIs of all the decompressors we have atm, and they all use read_exact, which requires an already-allocated &mut [u8]. So, I think we do need the resize.
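For context, this is the shape of the constraint (plain std sketch, not the actual arrow2/parquet2 code; decompress_into is a made-up name): read_exact fills an existing slice, so the Vec must already hold uncompressed_size initialized bytes before the call.

```rust
use std::io::{self, Read};

// Sketch only: any decompressor exposed as `Read` is drained with `read_exact`,
// which fills a pre-existing `&mut [u8]`; hence the buffer has to be resized
// (and therefore zero-initialized) to the uncompressed size first.
fn decompress_into<R: Read>(
    mut decompressor: R,      // e.g. a streaming zstd/lz4 decoder over the compressed page
    uncompressed_size: usize,
    buffer: &mut Vec<u8>,
) -> io::Result<()> {
    buffer.clear();
    buffer.resize(uncompressed_size, 0); // the slice must already have the target length
    decompressor.read_exact(&mut buffer[..])
}
```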

@jorgecarleitao
Owner

Hey, curious why you closed it: did you find the root cause of the slowness? I am trying to systematize the benchmarks, but I am curious how we can improve it further ^_^

@jorgecarleitao added the no-changelog label on Oct 29, 2021

3 participants