Reading delta table as pyarrow dataset does not work #131

Closed
dudzicp opened this issue Jan 17, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@dudzicp

dudzicp commented Jan 17, 2023

Describe the bug
I am unable to display the contents of Delta tables stored locally.

To Reproduce

[tool.poetry.dependencies]
python = "^3.10"
datafusion = "^0.7.0"
deltalake = "^0.6.4"

then run the following code:

import pyarrow as pa
import pyarrow.dataset as ds

from deltalake import DeltaTable
import datafusion

ctx = datafusion.SessionContext()

delta_table = DeltaTable("/local_delta_path/")
pa_dataset = dt.to_pyarrow_dataset()

ctx.register_dataset("pa_dataset", pa_dataset)

tmp = ctx.sql("SELECT * FROM pa_dataset limit 10")
tmp.show()

When executed in a notebook in VS Code, this script can run for more than 20 minutes, and I am unable to interrupt the execution.

Expected behavior
The top 10 rows are displayed.

@dudzicp dudzicp added the bug Something isn't working label Jan 17, 2023
@dudzicp
Author

dudzicp commented Jan 17, 2023

I have also tried to convert the table to a dataset in the following way:

from pyarrow.dataset import ParquetReadOptions

dataset = dt.to_pyarrow_dataset(
    parquet_read_options=ParquetReadOptions(coerce_int96_timestamp_unit="ms")
)

but the result is the same.

@kylebrooks-8451
Contributor

kylebrooks-8451 commented May 1, 2023

@dudzicp - Could you give us the delta files to reproduce this?

@jordandakota

Where in the code snippet is dt defined, or did you mean delta_table?

@dudzicp
Author

dudzicp commented Jun 6, 2023

Yes, I meant delta_table. It's been 5 months since I reported this. Let me reproduce it with the latest version of datafusion, and then I will provide a sample delta table.
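
With the dt/delta_table mix-up corrected, the reproduction script could look like the sketch below. It reuses only the calls from the original report. The helper names `preview_query` and `show_top_rows` are hypothetical and introduced purely for illustration, and the `datafusion` and `deltalake` imports are deferred so the query helper can be used without those packages installed.

```python
def preview_query(table_name: str, limit: int = 10) -> str:
    """Build the preview query from the report (hypothetical helper)."""
    # Guard against names that are not plain identifiers, since the
    # name is interpolated directly into the SQL text.
    if not table_name.isidentifier():
        raise ValueError(f"not a plain table name: {table_name!r}")
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return f"SELECT * FROM {table_name} LIMIT {limit}"


def show_top_rows(delta_path: str, table_name: str = "pa_dataset", limit: int = 10) -> None:
    """Register a local Delta table with DataFusion and print its first rows."""
    # Deferred imports: assumption made for this sketch so that
    # preview_query() works without these packages.
    import datafusion
    from deltalake import DeltaTable

    ctx = datafusion.SessionContext()

    # Use one variable name consistently (the report mixed dt and delta_table).
    delta_table = DeltaTable(delta_path)
    pa_dataset = delta_table.to_pyarrow_dataset()

    ctx.register_dataset(table_name, pa_dataset)
    ctx.sql(preview_query(table_name, limit)).show()
```

Calling `show_top_rows("/local_delta_path/")` would then run the same steps as the original script. This only removes the undefined name; it is not expected to change how long the query takes.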

@timsaucer
Contributor

I have been able to read delta tables using the approach above on DF 42.0.0. However, it can be extremely slow for large tables because we don't push down filters. A follow-up solution will be provided in #921 and a corresponding update to delta-rs.

I am going to close this because the immediate problem is resolved. Please reopen the issue if you continue to have problems.
