Checkpoint stats maxValues is incorrect #2571
Comments
Please add a reproducible example :)

It seems like it requires a combination of compact + checkpoint to break it. I tested this against a newer version of delta-rs (0.17.3), and it seems to be fixed. Do you know which PR fixed this? (I'm unable to upgrade due to some breaking changes I haven't handled yet, but I could do a local version release including the fix for now.)

I am not entirely sure; throughout the releases there have been multiple occasions where I've touched timestamp types. I suggest you just pip install each release and check when it fails. If you can tell me which release it got fixed in, I might be able to tell faster.
Going through the versions, it seems the fix was implicit with the addition of Adding
I still see the bad output from the second print:
Hmm, maybe the precision is lost after checkpointing. @echai58 can you manually read the contents of the checkpoint with pyarrow and then grab the column with the stats for that add action? I'm curious what's written there now.
I previously stated that the issue was only when you compacted, then checkpointed, but that turns out not to be true. @ion-elgreco
This is the file if I compact before I checkpoint:
And this is the metadata from the compact log, which has the correct maxValues
Maybe I'm lost on the order here: why is the checkpoint not referencing the same path?
I reran the script twice, once with the compact and once without. The path in the compact log matches the path in the second checkpoint parquet I pasted. Sorry for the confusion.
That "stats_parsed" col has lost some precision. I am trying to follow through the code where this happens, but that might be the main culprit.
I agree that in the case where I first compacted before checkpointing, it seems the precision is lost. But in the first case, where I did not compact first, it seems it computed the incorrect maxValues: the checkpoint has the incorrect value, even before parsing it.
The last one seems fine because it's only 1 record, so 1 timestamp |
Ohh right yeah, that's the
@echai58 @ion-elgreco - this is literally a footnote, but timestamps are truncated to milliseconds when computing stats. Could this be what we are seeing here?
Ah, yeah that would explain it. |
Uhm, so the protocol states it should be milliseconds :s. That's quite wrong, because we are then passing rounded stats to the pyarrow.dataset fragments, which in turn will return wrong results.
Instead of truncating, could we round up to the next millisecond when computing the stats? That would prevent this edge case from returning wrong results |
Rounding up or down can result in retrieving fewer or more records than expected, so neither works.
@ion-elgreco isn't rounding up always safe? Because if the predicate is <= the rounded up value, it performs the filtering on the loaded in data? |
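A sketch of that argument (the `ceil_to_ms` helper is hypothetical, not delta-rs code): rounding the max up, instead of truncating it down, keeps the stat conservative for pruning.

```python
from datetime import datetime, timedelta

def ceil_to_ms(ts: datetime) -> datetime:
    # Hypothetical stat-writing rule: round up to the next whole millisecond.
    floored = ts.replace(microsecond=(ts.microsecond // 1000) * 1000)
    return floored if floored == ts else floored + timedelta(milliseconds=1)

actual_max = datetime(2023, 3, 30, 0, 0, 0, 902)
stat_max = ceil_to_ms(actual_max)

# A rounded-up max is always >= the true max, so for a "<= value" predicate
# a file is never wrongly pruned. At worst, extra files are scanned and the
# predicate is re-applied to the loaded rows, which is the safety property
# the comment above is relying on.
assert stat_max >= actual_max
```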
Environment
Delta-rs version: 0.15.3
Binding: python
Bug
What happened:
I have a delta table where if I do `to_pyarrow_table` with a filter of `<= datetime.datetime(2023, 3, 30, 0, 0, 0, 0)` on a timestamp column, it leaves in a row that has value `2023-03-30 00:00:00.000902`. When inspecting the fragments of the pyarrow dataset, I see an expression saying `(timestamp <= 2023-03-30 00:00:00.000000)`, which is an incorrect max_value for this file. Because of this, it takes advantage of this incorrect expression to do the filtering, thus assuming all values in the file satisfy the predicate.

Looking through the delta-rs code, it seems like this is parsed from the `maxValues` field of the checkpoint file. I looked at this checkpoint file, and I do indeed see the incorrect `maxValues` value: it seems to not include the microseconds field. I see that another timestamp column in this table has microsecond precision in its `maxValues` field.

Can someone help look into why this could happen?