Commit
fix: use epoch instead of ce for date stats (#1672)
# Description

The date32 statistics logic was subjectively wrong. It was using `from_num_days_from_ce_opt`, which

> Makes a new NaiveDate from a day's number in the proleptic Gregorian calendar, with January 1, 1 being day 1.

while date32 is commonly represented as days since the UNIX epoch (1970-01-01).

# Related Issue(s)

closes #1670

# Documentation

It doesn't seem like Parquet actually has a spec for what a `date` should be, but many other tools seem to use the epoch logic. DuckDB and Polars seem to use epoch instead of Gregorian. Also, the Arrow spec states that date32 should be epoch.

For example, if I write using Polars

```py
import polars as pl

# %%
df = pl.DataFrame(
    {
        "a": [
            10561,
            9200,
            9201,
            9202,
            9203,
            9204,
            9205,
            9206,
            9207,
            9208,
            9199,
        ]
    }
)
# %%
df.select(pl.col("a").cast(pl.Date)).write_delta("./db/polars/")
```

the stats are correctly interpreted

```
{"add":{"path":"0-7b8f11ab-a259-4673-be06-9deedeec34ff-0.parquet","size":557,"partitionValues":{},"modificationTime":1695779554372,"dataChange":true,"stats":"{\"numRecords\": 11, \"minValues\": {\"a\": \"1995-03-10\"}, \"maxValues\": {\"a\": \"1998-12-01\"}, \"nullCount\": {\"a\": 0}}"}}
```
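To illustrate the difference between the two interpretations, here is a minimal sketch using only Python's standard library. It is not the code changed by this commit; the helper names `date32_from_epoch` and `date32_from_ce` are hypothetical. `date.fromordinal` follows the same convention as the quoted documentation (January 1 of year 1 is day 1), so it stands in for the old behaviour here.

```py
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)


def date32_from_epoch(days: int) -> date:
    """Interpret a date32 value as days since 1970-01-01 (hypothetical helper)."""
    return UNIX_EPOCH + timedelta(days=days)


def date32_from_ce(days: int) -> date:
    """Interpret the same value as a proleptic Gregorian ordinal,
    with January 1 of year 1 being day 1 (hypothetical helper)."""
    return date.fromordinal(days)


# Smallest and largest values from the Polars example above.
for raw in (9199, 10561):
    print(raw, date32_from_epoch(raw).isoformat(), date32_from_ce(raw).isoformat())
```

With the epoch interpretation, 9199 and 10561 map to 1995-03-10 and 1998-12-01, matching the `minValues` and `maxValues` in the log entry above. The ordinal interpretation puts the same integers in the first century.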