About parquet file read and write problem #529
I found: […]
Hey @ives9638, thanks a lot for these! I do not know how to avoid that resize: we need to resize the buffer before decompressing data into it. We could […] For the compression speed itself, IMO we have no easy answer here: we depend on libraries to do the compression/decompression, and they have varying degrees of performance. Or do you think there is an issue in how we are using them that is causing the slowness? For the delta-encoding, yeap, it is on the todo list. Feel welcome to patch it :P
Would be good to benchmark whether the resize overhead actually comes from the value initialization or from the memory reallocation. I think using resize might be overkill here, because setting the newly allocated memory to a default value of 0 provides no benefit: we overwrite those zeros during the subsequent decompression anyway. Perhaps it's better to manually check the vector's length and use […] A sketch of the two strategies follows below.
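A minimal sketch of the two buffer-preparation strategies under discussion, not arrow2's actual code; the helper names and `uncompressed_size` are hypothetical:

```rust
/// Strategy A: `resize` zero-fills any newly grown region before the
/// decompressor overwrites it. Safe, but the zeroing is wasted work.
/// If the same `Vec` is reused across pages, only growth beyond the
/// current length is zero-filled, so the cost amortizes over time.
fn prepare_with_resize(buf: &mut Vec<u8>, uncompressed_size: usize) {
    buf.resize(uncompressed_size, 0);
}

/// Strategy B: grow capacity without initializing, then set the length.
/// CAUTION: exposing uninitialized bytes through `&mut [u8]` is
/// undefined behavior by Rust's rules, even if the decompressor
/// overwrites every byte; this is why the zero-fill is hard to avoid.
fn prepare_with_set_len(buf: &mut Vec<u8>, uncompressed_size: usize) {
    buf.clear();
    buf.reserve(uncompressed_size);
    unsafe { buf.set_len(uncompressed_size) };
}
```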
I checked the APIs of all the decompressors we have atm, and they all use […]
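For illustration, a self-contained sketch of the decompress-into-a-caller-provided-slice pattern that such APIs follow, using the `lz4_flex` crate as a stand-in codec (the crate version is an assumption):

```rust
// Cargo.toml: lz4_flex = "0.11"
use lz4_flex::block::{compress, decompress_into};

fn main() {
    let raw = vec![42u8; 8192]; // pretend this is one uncompressed data page
    let compressed = compress(&raw); // a raw LZ4 block, no size prefix

    // The codec writes into a caller-provided `&mut [u8]`, so the buffer
    // must already have `len == uncompressed_size`; hence the `resize`
    // discussed above.
    let mut buf = Vec::new();
    buf.resize(raw.len(), 0);
    let n = decompress_into(&compressed, &mut buf).expect("invalid LZ4 block");
    assert_eq!(&buf[..n], &raw[..]);
}
```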
Hey, curious why you closed it: did you find the root cause of the slowness? I am trying to systematize the benchmarks, but I am curious how we can improve it further ^_^
I tried to use parquet compression and found several problems:
Some data:
schema = DataSchema::new(DataField::new("sec", DataType::String, true));
A single value is about 4 KB.
The arrow2 array has 81920 rows in total.
With zstd compression the file on disk is about 10.8 MB; with lz4 compression it is about 18 MB.
zstd decompression speed: one data page (8192 rows) takes 959 ms to decompress. (A standalone sketch for reproducing this kind of comparison follows below.)
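A self-contained sketch for comparing the two codecs on synthetic data shaped like the report above (8192 values of roughly 4 KB each). It goes through the `zstd` and `lz4_flex` crates directly rather than arrow2's parquet writer, and the crate versions are assumptions, so absolute numbers will not match the reporter's:

```rust
// Cargo.toml: zstd = "0.13", lz4_flex = "0.11"
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Fabricate one "data page": 8192 values of ~4 KB each, mildly repetitive.
    let page: Vec<u8> = (0..8192u32)
        .flat_map(|i| format!("{i:04} ").repeat(800).into_bytes())
        .collect();

    let zstd_compressed = zstd::encode_all(&page[..], 3)?;
    let lz4_compressed = lz4_flex::compress_prepend_size(&page);
    println!(
        "raw: {} B, zstd: {} B, lz4: {} B",
        page.len(),
        zstd_compressed.len(),
        lz4_compressed.len()
    );

    let t = Instant::now();
    let out = zstd::decode_all(&zstd_compressed[..])?;
    println!("zstd decompress: {:?} ({} B)", t.elapsed(), out.len());

    let t = Instant::now();
    let out = lz4_flex::decompress_size_prepended(&lz4_compressed).unwrap();
    println!("lz4 decompress: {:?} ({} B)", t.elapsed(), out.len());
    Ok(())
}
```

The usual trade-off should show up: lz4 decompresses faster while zstd compresses smaller, which matches the 10.8 MB vs. 18 MB file sizes reported above.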
I hope to get some help.