Cannot read delta table partitioned by a date typed column #563
Comments
@burakyilmaz321 Thanks for reporting this. I think we should be able to fix this in delta-rs. PyArrow does seem capable of parsing the partition column as a date:
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> from datetime import date
>>> from tempfile import mkdtemp
>>>
>>> tab = pa.table({
... 'd': pa.array([date(2020, 1, x) for x in range(1, 4)]),
... 'x': pa.array([1, 2, 3]),
... })
>>>
>>> tmp_dir = mkdtemp()
>>> part = ds.partitioning(
... pa.schema([("d", pa.date32())]), flavor="hive"
... )
>>> ds.write_dataset(tab, tmp_dir, partitioning=part, format='parquet')
>>> ds.dataset(tmp_dir, partitioning="hive").to_table()
pyarrow.Table
x: int64
d: string
----
x: [[1],[2],[3]]
d: [["2020-01-01"],["2020-01-02"],["2020-01-03"]]
>>> ds.dataset(tmp_dir, partitioning=part).to_table()
pyarrow.Table
x: int64
d: date32[day]
----
x: [[1],[2],[3]]
d: [[2020-01-01],[2020-01-02],[2020-01-03]]
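The difference between the two reads above comes down to whether a type is supplied for the partition key. A minimal standard-library sketch of that idea follows. It is not PyArrow's or delta-rs's implementation: `parse_hive_path`, its `schema` argument, and the file name `part-0.parquet` are hypothetical names used only for illustration.

```python
from datetime import date
from typing import Any, Callable, Dict, Optional


def parse_hive_path(
    path: str,
    schema: Optional[Dict[str, Callable[[str], Any]]] = None,
) -> Dict[str, Any]:
    """Extract key=value partition segments from a hive-style path.

    Hypothetical helper: `schema` maps a partition key to a converter.
    Keys without a converter keep their raw string value, which mirrors
    the string-typed `d` column in the first read above.
    """
    converters = schema or {}
    values: Dict[str, Any] = {}
    for segment in path.split("/"):
        if "=" not in segment:
            continue  # skip the data file name and non-partition directories
        key, raw = segment.split("=", 1)
        values[key] = converters.get(key, str)(raw)
    return values


path = "d=2020-01-01/part-0.parquet"  # illustrative path, not from the report

# No schema: the partition value stays a string.
print(parse_hive_path(path))

# With a converter for "d": the value becomes a datetime.date.
print(parse_hive_path(path, {"d": date.fromisoformat}))
```

The point of the sketch is only that the partition value is always text in the path, so some layer has to know the intended type in order to convert it.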
@wjones127 I made a patch here and it works for me. I can create a PR if you are interested.
@burakyilmaz321 That looks like it will solve your case, but it will break the pass-through of file statistics to datasets. (In particular, I think this test will fail with this change: https://github.com/delta-io/delta-rs/blob/main/python/tests/test_table_read.py#L148). I will likely look at fixing this issue this weekend.
@burakyilmaz321 Thanks for reporting this! 🙌 It looks like this was a regression introduced in 0.5.5; if you downgrade to 0.5.4 you should be able to read the table. I have an open PR to fix this and add tests to prevent this issue from coming up again.
Great news! Thanks ✋
Hi! I am facing exactly the same issue on version 0.10.1. Does anybody know why they have removed this fix from version 0.5.5? Thank you! 🙏
Are you using PyArrow 13.0.0? There was a regression in the PyArrow library that may cause this to fail. Working on fixing those soon: #1602
I downgraded pyarrow to 12.0.0 and it works.
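For anyone who wants to check their environment before downgrading, here is a small sketch using only the standard library. It assumes the only affected release is the 13.0.0 mentioned above, which this thread does not confirm; `parse_version`, `matches_reported_regression`, and `installed_version` are hypothetical helper names.

```python
from importlib import metadata
from itertools import takewhile
from typing import Optional, Tuple

# Assumption: only the release named in this thread is treated as affected.
REPORTED_REGRESSION = (13, 0, 0)


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn '13.0.0' into (13, 0, 0), ignoring suffixes such as 'rc1'."""
    numbers = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        numbers.append(int(digits) if digits else 0)
    while len(numbers) < 3:
        numbers.append(0)  # pad short versions such as '13'
    return (numbers[0], numbers[1], numbers[2])


def matches_reported_regression(version: str) -> bool:
    """True if the version equals the PyArrow release reported above."""
    return parse_version(version) == REPORTED_REGRESSION


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("pyarrow")
if found is None:
    print("pyarrow is not installed")
elif matches_reported_regression(found):
    print(f"pyarrow {found} matches the release reported in this thread")
else:
    print(f"pyarrow {found} does not match the reported release")
```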
Environment
Delta-rs version: 0.5.6
Binding: python
Environment: any
Bug
What happened: Cannot read delta table partitioned by a date type column.
What you expected to happen: It should be able to read tables partitioned by date.
How to reproduce it:
Generate fake data and save it
Read this with deltalake
This raises
ArrowNotImplementedError: Unsupported cast from string to date32 using function cast_date32
Full trace:
More details:
As the exception message says, it seems like there is no implementation for string to date32 casting in Arrow. I checked Arrow and saw that it eventually calls this one, I guess, where string to date32 casting is not implemented.
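To make the missing cast concrete: Arrow's date32 type stores a date as a 32-bit count of days since the UNIX epoch (1970-01-01), so a string to date32 cast amounts to parsing the text and counting days. The sketch below shows that conversion with the standard library only. It is not Arrow's `cast_date32` kernel; `string_to_date32` and `date32_to_date` are hypothetical names, and ISO `YYYY-MM-DD` input is an assumption.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)


def string_to_date32(text: str) -> int:
    """Parse an ISO 'YYYY-MM-DD' string into days since the UNIX epoch."""
    return (date.fromisoformat(text) - UNIX_EPOCH).days


def date32_to_date(days: int) -> date:
    """Inverse of string_to_date32: days since the epoch back to a date."""
    return UNIX_EPOCH + timedelta(days=days)


# The partition values from the example table in this thread.
for raw in ["2020-01-01", "2020-01-02", "2020-01-03"]:
    days = string_to_date32(raw)
    print(raw, "->", days, "->", date32_to_date(days).isoformat())
```

A round trip through both helpers returns the original string, which is a quick way to sanity-check partition values before handing them to a typed reader.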
Question: Is this a concern for the deltalake project, or should this be handled within pyarrow?