Fail to read delta table on mounted disk #1189

Closed
kuksag opened this issue Feb 28, 2023 · 6 comments · Fixed by #1311
Labels
binding/python Issues for the Python package bug Something isn't working

kuksag commented Feb 28, 2023

Environment

Delta-rs version:
0.7.0 (latest main)

Binding:
Python

Environment:

  • Cloud provider: N/A (mounted server disk)
  • OS: *NIX
  • Other:

Bug

What happened:
I'm trying to read a Delta table that is located on a local server.

If we run the following code:

from deltalake import DeltaTable
path = 'file:///mnt/path/to/table/'
DeltaTable(path).to_pyarrow_table()

We will get the following error:

Traceback
Traceback (most recent call last):
  File "/path/to/main.py", line 2, in <module>
    dt.to_pyarrow_table()
  File "/path/to/deltalake/table.py", line 401, in to_pyarrow_table
    return self.to_pyarrow_dataset(
  File "pyarrow/_dataset.pyx", line 369, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2818, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1551, in pyarrow._fs._cb_open_input_file
  File "/path/to/deltalake/fs.py", line 22, in open_input_file
    return pa.PythonFile(DeltaFileSystemHandler.open_input_file(self, path))
deltalake.PyDeltaTableError: Object at location /mnt/network_expansion/path/to/table/part-00000-aaaaaa-1111-2222-bbbbb-333333333.c000.snappy.parquet not found: No such file or directory (os error 2)

The *.parquet file is present.

What you expected to happen:

Successful read

How to reproduce it:

More details:

Worth noting:

  1. If we run similar code, but written in Rust, it will work just fine.
  2. The full network path contains characters that get URL-encoded.
  3. If we try to keep the scheme (file://) in the URL path, as suggested in this issue, the code makes more progress (it finds and reads all *.parquet files), but then fails with a SIGABRT.

From (1), my guess is that the broken logic is somewhere in the deltalake Python wrapper because, looking at the traceback, the frame
File "pyarrow/_fs.pyx", line 1551, in pyarrow._fs._cb_open_input_file is a call from PyArrow into a filesystem handler, which here is DeltaFileSystemHandler.
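
One way to check whether percent-encoding is involved (point 2 above) is to test a few variants of the path reported in the error message against the filesystem. The snippet below is only a debugging sketch using the Python standard library; `path_variants` is a hypothetical helper and is not part of deltalake.

```python
import os
from urllib.parse import quote, unquote


def path_variants(reported: str) -> dict:
    """Return {label: (candidate_path, exists_on_disk)} for a reported path.

    Hypothetical debugging helper: it only probes the filesystem and
    does not call into deltalake or pyarrow.
    """
    candidates = {
        # The path exactly as it appears in the error message.
        "as reported": reported,
        # The same path with one layer of percent-encoding removed.
        "percent-decoded": unquote(reported),
        # The same path with one extra layer of percent-encoding added.
        "percent-encoded": quote(reported, safe="/"),
    }
    return {label: (p, os.path.exists(p)) for label, p in candidates.items()}
```

If only the "percent-decoded" variant exists on disk, the reader was most likely handed a path that had been encoded one time too many.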

@kuksag kuksag added the bug Something isn't working label Feb 28, 2023
@wjones127 wjones127 added the binding/python Issues for the Python package label Mar 1, 2023
kuksag (Author) commented Mar 2, 2023

Apparently, the current version (0.7.0) is unable to read a table, locally or over a network, if the path contains characters such as spaces, which get percent-encoded. This issue is related to this issue.

@wjones127, I attempted some hotfixes for my use case, which can be found in that PR. Please let me know if you would like me to open this PR in the main repo, as I would be eager to help.

The reason it is not working currently is that the str from the user gets converted to a Url representation, where all "bad" characters get encoded. Then, when DeltaObjectStore::try_new is called, the URL gets converted to a Path object, where all "bad" characters get encoded again.
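
The double encoding described above can be reproduced in isolation with the Python standard library. This is only an illustration of the effect: `encode_once` is a hypothetical stand-in for each of the two conversion steps, not the actual `Url` or `Path` code in delta-rs, and the sample path is made up.

```python
from urllib.parse import quote, unquote


def encode_once(path: str) -> str:
    """Percent-encode a path, keeping '/' as the separator.

    Hypothetical stand-in for one str -> URL (or URL -> path) conversion.
    """
    return quote(path, safe="/")


raw = "/mnt/my table/part-0.parquet"  # made-up example path
once = encode_once(raw)    # the space becomes %20
twice = encode_once(once)  # the '%' itself is encoded, giving %2520

# Decoding once only undoes the second encoding, so the result still
# contains %20 and does not match the file that exists on disk.
still_encoded = unquote(twice)
```

After the second pass the path names a file that does not exist, which matches the "No such file or directory" error even though the original file is present.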

My use case is a little more complex, as the path to the mounted folder does not contain any "bad" characters. However, when this path is expanded to some network server address, it contains URL-encoded characters. Therefore, I had to do something with the file:// scheme.

wjones127 (Collaborator) commented

That sounds like it needs to be cleaned up. Possibly related to #1079 too. I'll take a look at this soon.

wjones127 (Collaborator) commented

So I think the problem is that when we load a delta table we take the path (as a string) and convert it into a Url (which percent-encodes values). This makes sense for object stores, but not for local file systems. I think we need to think about a better way to handle this. cc @roeap

wjones127 (Collaborator) commented

I'm thinking we should instead use a type like:

enum TableUri {
    LocalPath(Path),
    RemoteUrl(Url)
}

What do you think of that @roeap?
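
To illustrate the idea behind the proposed enum from the Python side, here is a rough sketch that keeps local paths unencoded and leaves remote URLs as they are. All names (`LocalPath`, `RemoteUrl`, `parse_table_uri`) are hypothetical and mirror the Rust snippet above; this is not how deltalake is implemented, and it assumes POSIX-style paths (a Windows drive letter would be misread as a URL scheme).

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Union
from urllib.parse import unquote, urlparse


@dataclass(frozen=True)
class LocalPath:
    """A filesystem path kept in its raw, unencoded form."""
    path: PurePosixPath


@dataclass(frozen=True)
class RemoteUrl:
    """A URL for an object store, passed through untouched."""
    url: str


def parse_table_uri(uri: str) -> Union[LocalPath, RemoteUrl]:
    """Classify a user-supplied table location (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme == "":
        # Plain path: use it verbatim, with no percent-encoding at all.
        return LocalPath(PurePosixPath(uri))
    if parsed.scheme == "file":
        # file:// URL: decode once to recover the real on-disk path.
        return LocalPath(PurePosixPath(unquote(parsed.path)))
    # Any other scheme (s3://, gs://, ...) is treated as a remote URL.
    return RemoteUrl(uri)
```

With this split, percent-encoding would only ever apply to the remote branch, so a local path containing spaces reaches the filesystem unchanged.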

wjones127 (Collaborator) commented

Huh, actually it might be that object_store just won't support paths that contain spaces?

https://docs.rs/object_store/0.5.5/object_store/path/struct.Path.html#path-safety

But it seems like we should be able to support arbitrary characters above the root, as long as we are passing them along correctly?

https://github.com/apache/arrow-rs/blob/71ecc39f36c8f38a5fc93bc3878a607c831b2f12/object_store/src/local.rs#L1252

roeap (Collaborator) commented Apr 8, 2023

@wjones127 - sorry for taking so long, this somehow slipped my attention.

I think if we just want to support spaces in the table root, we should be able to get it to work, roughly as you described. As you mentioned, object store is quite strict about what paths should look like, but right now I have no feeling for how limiting that is, i.e. whether other writers can create something in an object store that we would not be able to read.

To me, all of this eventually points to handling absolute paths in the log as well, since that will require different path handling quite deep in the log handling. I have not really thought this problem through yet, and I haven't had an idea in that area that resonated either.

wjones127 added a commit that referenced this issue Apr 30, 2023
# Description

Continues work in #1274. Adds tests, and handles characters besides
spaces.

# Related Issue(s)

- closes #1189


# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Tomas Sedlak <tomas.sedlak@masterminds.sk>