[python] File size too large - maybe stats related #2965

Open · convoi opened this issue Oct 29, 2024 · 1 comment
Labels: question (Further information is requested)
convoi commented Oct 29, 2024

Environment

Delta-rs version: deltalake 0.20.2

Binding: python

Environment:

  • Cloud provider:
  • OS: Mac OS Sonoma (Apple Silicon)
  • Other:

Bug

What happened:
I'm storing large but highly compressible strings in a Delta table. If I write the data frame directly to Parquet with pandas, the resulting file is very small (about 2 KB for a 1 MB input string consisting of just the letter "a").
If I write the same data frame to a Delta table, the resulting file is 2.0 MB.
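
A minimal sketch of that direct-pandas baseline (the file name test_direct.parquet is illustrative, not from the original report):

import os
import pandas as pd

# The same single-row frame written straight to Parquet with pandas/pyarrow:
# ZSTD compresses the repetitive 1 MB string down to a couple of KB.
one_mb = "a" * 1024 * 1024
df = pd.DataFrame({"name": ["vin1"], "date": ["2022-01-01"], "large_data": [one_mb]})
df.to_parquet("test_direct.parquet", compression="zstd")
print(os.path.getsize("test_direct.parquet"))  # ~2 KB per the report above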

What you expected to happen:
I expect the delta parquet files to have a similar size as the normal parquet files.
I explicitly stated to create the files without large statistics (truncate it to 16 chars) and it seems this is is true on the row group statistics level, as the output of parquet meta suggests.
However, when inspecting the file with a hex editor I still see the uncompressed strings. Is there a column index written, and if so, how do I turn it of?

How to reproduce it:

import deltalake as dt
import pandas as pd
from deltalake import BloomFilterProperties, ColumnProperties

# A single row whose "large_data" value is a highly compressible 1 MB string.
one_mb = "a" * 1024 * 1024
df = pd.DataFrame({
    "name": ["vin1"],
    "date": ["2022-01-01"],
    "large_data": [one_mb],
})

dt.write_deltalake(
    "test_delta", df, engine="rust", mode="overwrite",
    writer_properties=dt.WriterProperties(
        compression="ZSTD",
        statistics_truncate_length=16,
        default_column_properties=ColumnProperties(
            dictionary_enabled=False, max_statistics_size=1),
        column_properties={
            "large_data": ColumnProperties(
                dictionary_enabled=False,
                max_statistics_size=1,
                bloom_filter_properties=BloomFilterProperties(
                    set_bloom_filter_enabled=False)),
        },
    ),
)
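
A quick size check on what delta-rs wrote (a sketch; the parquet data file inside test_delta has a generated name, so it is looked up rather than hard-coded):

import os

# Locate the generated parquet data file inside the Delta table directory.
data_file = next(f for f in os.listdir("test_delta") if f.endswith(".parquet"))
path = os.path.join("test_delta", data_file)
print(os.path.getsize(path))  # ~2 MB here, vs ~2 KB for the direct pandas write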

More details:
The parquet meta output on the two files also differs.
On the direct pandas file:

Row group 0:  count: 1  299.00 B records  start: 4  total(compressed): 299 B total(uncompressed):1.000 MB 
--------------------------------------------------------------------------------
            type      encodings count     avg size   nulls   min / max
name        BINARY    Z _ R     1         82.00 B    0       "vin1" / "vin1"
date        BINARY    Z _ R     1         100.00 B   0       "2022-01-01" / "2022-01-01"
large_data  BINARY    Z _ R     1         117.00 B   0       

On the Delta table file:

Row group 0:  count: 1  2.000 MB records  start: 4  total(compressed): 2.000 MB total(uncompressed):3.000 MB 
--------------------------------------------------------------------------------
            type      encodings count     avg size   nulls   min / max
name        BINARY    Z RB_     1         56.00 B            "vin1" / "vin1"
date        BINARY    Z RB_     1         74.00 B            "2022-01-01" / "2022-01-01"
large_data  BINARY    Z RB_     1         2.000 MB           "aaaaaaaaaaaaaaaa" / "aaaaaaaaaaaaaaab"

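A sketch (reusing the path variable from the size check above) that reproduces both observations programmatically: the row-group min/max are truncated, yet a full plaintext copy of the value still shows up in the raw bytes:

import pyarrow.parquet as pq

# Row-group statistics: truncated to 16 chars, matching the parquet meta output.
stats = pq.ParquetFile(path).metadata.row_group(0).column(2).statistics
print(stats.min, stats.max)  # "aaaaaaaaaaaaaaaa" / "aaaaaaaaaaaaaaab"

# The hex-editor observation: look for a long uncompressed run of "a" in the file.
with open(path, "rb") as f:
    raw = f.read()
print(b"a" * 1024 in raw)  # True means a plaintext copy is stored somewhere
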
@convoi convoi added the bug Something isn't working label Oct 29, 2024
@ion-elgreco ion-elgreco added question Further information is requested and removed bug Something isn't working labels Nov 24, 2024
@ion-elgreco (Collaborator)

We are passing all the options correctly through to the ArrowWriter. I suggest you ask the arrow-rs maintainers which option you need to set to get the behavior you want: https://github.com/apache/arrow-rs/issues
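
One possible direction, not confirmed in this thread: deltalake's ColumnProperties exposes a statistics_enabled level, and page-level statistics are what feed the Parquet column index. A hedged sketch of restricting the large column to chunk-level statistics only; whether this removes the embedded full-length values is an assumption to verify against your deltalake version:

# Hedged workaround sketch: statistics_enabled="CHUNK" is assumed to map to
# arrow-rs's EnabledStatistics::Chunk, i.e. row-group stats only, no page-level
# stats and hence (presumably) no column index entries for this column.
dt.write_deltalake(
    "test_delta", df, engine="rust", mode="overwrite",
    writer_properties=dt.WriterProperties(
        compression="ZSTD",
        statistics_truncate_length=16,
        column_properties={
            "large_data": ColumnProperties(
                dictionary_enabled=False,
                statistics_enabled="CHUNK",  # assumption: skips page statistics
            ),
        },
    ),
)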
