Support get_backend_for_uri_with_options in Read Path #689

Closed
joshuarobinson opened this issue Jul 15, 2022 · 7 comments · Fixed by #799
Labels
binding/python (Issues for the Python package), enhancement (New feature or request)

Comments

joshuarobinson commented Jul 15, 2022

Description

Goal: I am trying to read a Delta table from a non-AWS S3 object store (like MinIO). I can read the table metadata, but when reading data I get an error suggesting that my endpoint_override is being ignored and requests are being sent to AWS instead.

My code looks like this:

from deltalake import DeltaTable

storage_options = {
    "AWS_ACCESS_KEY_ID": ACCESS_KEY,
    "AWS_SECRET_ACCESS_KEY": SECRET_KEY,
    "AWS_ENDPOINT_URL": ENDPOINT_URL,
    "AWS_REGION": "us-east-1",
}
dt = DeltaTable("s3://warehouse/tpcds_sf1000_dlt/item/", storage_options=storage_options)
print(dt.pyarrow_schema())    # SUCCESS
print(dt.to_pandas().head())  # FAILS

Looking through the code, I wonder if this line needs to use the with_options variant:

storage::get_backend_for_uri(table_uri).map_err(PyDeltaTableError::from_storage)?;

in order to pass through my configs?

Use Case

Delta on S3-compatible object stores like MinIO, Swift, FlashBlade, VAST, ECS, etc.

Related Issue(s)

joshuarobinson added the enhancement label Jul 15, 2022
wjones127 (Collaborator) commented:

Thanks for reporting this! Our current API is problematic in that we can't retrieve the storage backend once it's been created in the DeltaTable. We'll need to refactor a bit, probably changing the storage backend to use an Arc instead of an owned Box pointer.

For now, the workaround is to explicitly pass a configured PyArrow filesystem:

from deltalake import DeltaTable
from pyarrow.fs import S3FileSystem

storage_options = {
    "AWS_ACCESS_KEY_ID": ACCESS_KEY,
    "AWS_SECRET_ACCESS_KEY": SECRET_KEY,
    "AWS_ENDPOINT_URL": ENDPOINT_URL,
    "AWS_REGION": "us-east-1",
}
dt = DeltaTable("s3://warehouse/tpcds_sf1000_dlt/item/", storage_options=storage_options)

pa_fs = S3FileSystem(
    access_key=storage_options["AWS_ACCESS_KEY_ID"],
    secret_key=storage_options["AWS_SECRET_ACCESS_KEY"],
    endpoint_override=storage_options["AWS_ENDPOINT_URL"],
    region=storage_options["AWS_REGION"],
)
print(dt.pyarrow_schema())                    # SUCCESS
print(dt.to_pandas(filesystem=pa_fs).head())  # This *should* work

wjones127 added the binding/python label Jul 17, 2022
joshuarobinson (Author) commented:

Thanks @wjones127 for the quick response. This gets me a little further, but I think I then run afoul of differing path expectations. After making the change, I get the following error:

pyarrow.lib.ArrowInvalid: Expected an S3 object path of the form 'bucket/key...', got a URI: 's3://warehouse/tpcds_sf1000_dlt/item/part-00000-1805aaca-4959-40a2-87b1-c42385e763e8-c000.snappy.parquet'

I think the problem is coming from here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L364

That code expects a "bucket/key" path without a URI scheme (i.e., no "s3://" prefix), but I could be wrong.
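
To illustrate what pyarrow seems to want, here is a hypothetical helper (not something deltalake or pyarrow provides) that strips the scheme:

from urllib.parse import urlparse

def strip_s3_scheme(uri: str) -> str:
    # Turn "s3://bucket/key" into "bucket/key", the form S3FileSystem expects.
    parsed = urlparse(uri)
    if parsed.scheme == "s3":
        return parsed.netloc + parsed.path
    return uri

print(strip_s3_scheme("s3://warehouse/tpcds_sf1000_dlt/item/part-00000-1805aaca-4959-40a2-87b1-c42385e763e8-c000.snappy.parquet"))
# warehouse/tpcds_sf1000_dlt/item/part-00000-1805aaca-4959-40a2-87b1-c42385e763e8-c000.snappy.parquet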

Am I still missing something, or is this maybe just not quite functional yet?

If there are any other issues tracking the work needed, I'd be happy to take a look and see if I can contribute.

wjones127 (Collaborator) commented:

> Am I still missing something, or is this maybe just not quite functional yet?

Darn. I guess that workaround doesn't work. Thanks for reporting that failure too.

We don't yet have integration testing with object stores, but that's a high priority for the next few months.

> If there are any other issues tracking the work needed, I'd be happy to take a look and see if I can contribute.

Fixing #696 should make the workaround function. Fixing #690 should make this work without the workaround. You are welcome to submit a PR on either!

roeap (Collaborator) commented Jul 17, 2022

While working on #632 I was already moving the storage backend behind an Arc, so I opened a quick PR with just that change included (#697). It will not yet fix this issue, but it prepares the Rust side for the Python side to use a shareable reference to the store and avoid re-creating it.

roeap (Collaborator) commented Sep 6, 2022

@joshuarobinson with the release of 0.6.1 we now reuse the same backend store in the read path whenever possible. Could you check if that works for you?
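
A minimal check, reusing the placeholder credentials from the original repro, would be:

from deltalake import DeltaTable

storage_options = {
    "AWS_ACCESS_KEY_ID": ACCESS_KEY,
    "AWS_SECRET_ACCESS_KEY": SECRET_KEY,
    "AWS_ENDPOINT_URL": ENDPOINT_URL,
    "AWS_REGION": "us-east-1",
}
dt = DeltaTable("s3://warehouse/tpcds_sf1000_dlt/item/", storage_options=storage_options)

# With 0.6.1 the read path should reuse the store created above rather than
# constructing a fresh (default AWS) backend, so no filesystem= argument is needed.
print(dt.to_pandas().head())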

joshuarobinson (Author) commented Oct 10, 2022

Hi, I'm reopening this as I've tested against 0.6.1 and I'm still seeing the same behavior, both with and without manually specifying the pyarrow Filesystem.

I've installed it with pip install deltalake==0.6.1, and I can confirm that I can read the schema successfully but not the data files. In particular, the exception seems to happen when I try to convert a pyarrow dataset to a pyarrow table.

If I manually specify the S3FileSystem, the error is:

pyarrow.lib.ArrowInvalid: Expected an S3 object path of the form 'bucket/key...', got a URI: 's3://warehouse/tpcds_sf1000_dlt/item//part-00000-1805aaca-4959-40a2-87b1-c42385e763e8-c000.snappy.parquet'

but if I let it "auto-resolve" using the storage_options I originally passed when loading the Delta table, the error is different:

pyo3_runtime.PanicException: should be URL safe: EmptySegment { path: "/tpcds_sf1000_dlt/item//part-00000-1805aaca-4959-40a2-87b1-c42385e763e8-c000.snappy.parquet" }

Is it possible the problem is with the paths? I know that pyarrow's S3FileSystem expects paths without the s3:// prefix (since the scheme is implied).
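
Both errors also show a double slash after item/, which makes me suspect a path-join problem on top of the prefix issue. A toy illustration (not the library's actual join code):

table_uri = "s3://warehouse/tpcds_sf1000_dlt/item/"  # trailing slash
file_name = "part-00000-1805aaca-4959-40a2-87b1-c42385e763e8-c000.snappy.parquet"

# Joining with an unconditional separator reproduces the empty "//" segment
# seen in both error messages:
naive = table_uri + "/" + file_name
print(naive)  # ...item//part-00000-...

# Normalizing the trailing slash before joining avoids the empty segment:
fixed = table_uri.rstrip("/") + "/" + file_name
print(fixed)  # ...item/part-00000-...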

Sorry to be a bother @roeap, but can you confirm that the fixes in 0.6.1 should address this issue? If so, is it possible to get a pointer to an example or test case?

joshuarobinson (Author) commented:

Hi @roeap

I've just tried again with 0.6.2 and the issue is now fixed!

FYI, manually specifying the pyarrow filesystem still doesn't work, but letting Delta take care of it behind the scenes does work (and is the better option, IMO).
