Cannot read partitions with special characters (including space) with pyarrow >= 11 #1393

Closed
emanueledomingo opened this issue May 26, 2023 · 2 comments · Fixed by #1613
Labels
bug Something isn't working

Comments

@emanueledomingo

Environment

Delta-rs version:
Binding: Python
Environment: Python==3.10, deltalake==0.9.0, pyarrow==11.0.0

  • Cloud provider: N/A
  • OS: Ubuntu 22.04
  • Other:

Bug

What happened:

If a column's values contain special characters (including spaces), the writer percent-encodes them in the directory name when that column is used as a partition. Reading the table back with a partition filter on the same column then returns no rows.

This happens with pyarrow >= 11. It works with pyarrow 10.0.1, where special characters are not encoded.

What you expected to happen: Partitions on a column with special characters are read correctly, even if the values are encoded on disk.

How to reproduce it:

import deltalake as dl
import pyarrow as pa

n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle Stars", "Centipede"])
names = ["n_legs", "animals"]

pa_table = pa.Table.from_arrays([n_legs, animals], names=names)

dt_table_uri = "tmp"
dl.write_deltalake(dt_table_uri, pa_table, partition_by=["animals"], mode="overwrite")

dt_table = dl.DeltaTable(dt_table_uri)
dt_table.to_pyarrow_table(partitions=[("animals", "=", "Brittle Stars")]).num_rows

num_rows returns 0; 1 is expected.

More details:

The content of the tmp/ folder is:

$ ls -1 tmp/
'animals=Brittle%2520Stars'
'animals=Centipede'
'animals=Flamingo'
'animals=Horse'
_delta_log

Note: even if I pass Brittle%2520Stars as the partition value, num_rows returns 0.

With pyarrow 10.0.1 the same script gives num_rows equal to 1, and the folder contains:
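The directory name in the listing is consistent with the space being percent-encoded twice (space to %20, then % to %25). This is only an inference from the name itself, not a confirmed diagnosis of where in the write path it happens. A minimal stdlib sketch of that double encoding, using urllib.parse rather than any deltalake or pyarrow internals:

```python
from urllib.parse import quote, unquote

value = "Brittle Stars"

# First pass: the space becomes %20.
once = quote(value)

# Second pass: the '%' itself is encoded as %25, giving %2520.
twice = quote(once)

print(once)   # Brittle%20Stars
print(twice)  # Brittle%2520Stars

# Two decoding passes are needed to recover the original value.
assert unquote(unquote(twice)) == value
```

If this is what happens, a filter on either "Brittle Stars" or "Brittle%2520Stars" would only match if the reader applied the same number of encoding passes as the writer.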

$ ls -1 tmp/
'animals=Brittle Stars'
'animals=Centipede'
'animals=Flamingo'
'animals=Horse'
_delta_log

as expected.

@emanueledomingo emanueledomingo added the bug Something isn't working label May 26, 2023
@emanueledomingo emanueledomingo changed the title Cannot read partitions with special characters (including space) Cannot read partitions with special characters (including space) with pyarrow >= 11 May 26, 2023
@emanueledomingo
Author

Tested with:

  • datafusion: 28.0.0
  • deltalake: 0.10.1
  • pyarrow: 13.0.0

And the issue still persists.

@giacomorebecchi
Contributor

@wjones127 thanks for working on this issue and releasing a patch of the python package!

However, I fear that the issue should be reopened. In particular, I just tested it with:

  • datafusion: 28.0.0
  • deltalake: 0.10.2
  • pyarrow: 12.0.0
and the issue still persists, although the content of the tmp folder is now:
$ ls -1 tmp/
'animals=Brittle%20Stars'
'animals=Centipede'
'animals=Flamingo'
'animals=Horse'
_delta_log

In particular, the following line of code still outputs 0 instead of 1:

dt_table.to_pyarrow_table(partitions=[("animals", "=", "Brittle Stars")]).num_rows

while, if we remove the partitions kwarg, we get:

dt_table.to_pyarrow_table()
FileNotFoundError
Traceback (most recent call last)
File ~/mambaforge/envs/my-env/lib/python3.10/site-packages/deltalake/table.py:575, in DeltaTable.to_pyarrow_table(self, partitions, columns, filesystem, filters)
    573 if filters is not None:
    574     filters = _filters_to_expression(filters)
--> 575 return self.to_pyarrow_dataset(
    576     partitions=partitions, filesystem=filesystem
    577 ).to_table(columns=columns, filter=filters)

File ~/mambaforge/envs/my-env/lib/python3.10/site-packages/pyarrow/_dataset.pyx:546, in pyarrow._dataset.Dataset.to_table()

File ~/mambaforge/envs/my-env/lib/python3.10/site-packages/pyarrow/_dataset.pyx:3449, in pyarrow._dataset.Scanner.to_table()

File ~/mambaforge/envs/my-env/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/mambaforge/envs/my-env/lib/python3.10/site-packages/pyarrow/_fs.pyx:1551, in pyarrow._fs._cb_open_input_file()

File ~/mambaforge/envs/my-env/lib/python3.10/site-packages/deltalake/fs.py:22, in DeltaStorageHandler.open_input_file(self, path)
     15 def open_input_file(self, path: str) -> pa.PythonFile:
     16     """
     17     Open an input file for random access reading.
     18 
     19     :param source: The source to open for reading.
     20     :return:  NativeFile
     21     """
---> 22     return pa.PythonFile(DeltaFileSystemHandler.open_input_file(self, path))

FileNotFoundError: Object at location /home/ubuntu/repos/experiments/resources/animals/animals=Brittle Stars/0-d2225f30-656e-40ef-a797-f4985aec7342-0.parquet not found: No such file or directory (os error 2)
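The traceback asks for a path containing "animals=Brittle Stars", while the directory on disk is "animals=Brittle%20Stars". One plausible reading is that the reader decodes the path before opening it. The sketch below reproduces only that mismatch with the standard library; the file name part-0.parquet is hypothetical and nothing here calls deltalake or pyarrow:

```python
import tempfile
from pathlib import Path
from urllib.parse import quote

with tempfile.TemporaryDirectory() as root:
    # Directory as it appears on disk in the listing (percent-encoded).
    on_disk = Path(root) / "animals=Brittle%20Stars"
    on_disk.mkdir()
    (on_disk / "part-0.parquet").touch()  # hypothetical file name

    # Path in the shape the traceback reports (decoded): not found.
    requested = Path(root) / "animals=Brittle Stars" / "part-0.parquet"
    decoded_exists = requested.exists()

    # Re-encoding the directory name (keeping '=') matches the disk layout.
    encoded = Path(root) / quote("animals=Brittle Stars", safe="=") / "part-0.parquet"
    encoded_exists = encoded.exists()

print(decoded_exists, encoded_exists)
```

Under this assumption, the write path and the read path would need to agree on whether partition directory names are stored encoded or decoded.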

I hope this helps, many thanks!
