
deltalake.PyDeltaTableError: Failed to read delta log object: Generic S3 error: Missing region #2308

Closed
isunli opened this issue May 30, 2023 · 8 comments · Fixed by #2315

isunli commented May 30, 2023

I am querying a Delta table from a SageMaker notebook. If I use all default arguments, like:

df = wr.s3.read_deltalake(uri, without_files=True)

then it returns the following error message:

---------------------------------------------------------------------------
PyDeltaTableError                         Traceback (most recent call last)
Cell In[11], line 3
      1 uri = 's3://atari-fdv-glue-bucket-beta-us-east-1-226832659959/431112eb/FDV_SNAP_SYNC/delta/ofa_ap/ap_invoices_all/table/'
      2 session = boto3.Session(boto3.session.Session(region_name='us-east-1'))
----> 3 df = wr.s3.read_deltalake(uri,
      4                           without_files = True
      5 )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/awswrangler/_utils.py:120, in check_optional_dependency.<locals>.decorator.<locals>.inner(*args, **kwargs)
    116     install_name = package_name if package_name is not None else name
    117     raise ModuleNotFoundError(
    118         f"Missing optional dependency '{name}'. " f"Use pip awswrangler[{install_name}] to install it."
    119     )
--> 120 return func(*args, **kwargs)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/awswrangler/s3/_read_deltalake.py:87, in read_deltalake(path, version, partitions, columns, without_files, use_threads, boto3_session, s3_additional_kwargs, pyarrow_additional_kwargs)
     84 arrow_kwargs = _data_types.pyarrow2pandas_defaults(use_threads=use_threads, kwargs=pyarrow_additional_kwargs)
     85 storage_options = _set_default_storage_options_kwargs(boto3_session, s3_additional_kwargs)
     86 return (
---> 87     deltalake.DeltaTable(
     88         table_uri=path,
     89         version=version,
     90         storage_options=storage_options,
     91         without_files=without_files,
     92     )
     93     .to_pyarrow_table(partitions=partitions, columns=columns)
     94     .to_pandas(**arrow_kwargs)
     95 )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/deltalake/table.py:122, in DeltaTable.__init__(self, table_uri, version, storage_options, without_files)
    109 """
    110 Create the Delta Table from a path with an optional version.
    111 Multiple StorageBackends are currently supported: AWS S3, Azure Data Lake Storage Gen2, Google Cloud Storage (GCS) and local URI.
   (...)
    119                       DeltaTable will be loaded with a significant memory reduction.
    120 """
    121 self._storage_options = storage_options
--> 122 self._table = RawDeltaTable(
    123     str(table_uri),
    124     version=version,
    125     storage_options=storage_options,
    126     without_files=without_files,
    127 )
    128 self._metadata = Metadata(self._table)

PyDeltaTableError: Failed to read delta log object: Generic S3 error: Missing region

Version I am using:

3.1.1

EDIT:
The following works; I need to set the AWS region manually:

df = wr.s3.read_deltalake(
    uri,
    s3_additional_kwargs={"AWS_REGION": "us-east-1"},
)
@jaidisido (Contributor)

When using S3, DeltaTable requires AWS-specific storage options. We attempt to pass some based on the boto3 session, but the AWS region is not one of them.

IMO this issue is better addressed by the Delta Lake team, so I will reference it there.
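
For reference, the region can also be supplied directly to deltalake through its storage options. A minimal sketch, assuming a placeholder bucket and region:

import deltalake

# Passing AWS_REGION explicitly through storage_options, which deltalake
# forwards to its S3 backend, avoids the "Missing region" error.
dt = deltalake.DeltaTable(
    "s3://my-bucket/my-table/",
    storage_options={"AWS_REGION": "us-east-1"},
)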

isunli (Author) commented May 31, 2023

Sounds good. Maybe we can add this to the docs somewhere? I had to read through the code to figure out which argument to pass in to add the region information.

roeap commented Jun 1, 2023

Hi - coming from the issue in delta-rs to get some context :).

I seem to remember we discussed whether we should just default to us-east-1 as the default region, since the rusoto crate used that default back then (this was pre AWS Rust SDKs ..).

I also vaguely remember reading some docs somewhere saying that this is no longer a recommended default region, since the number of regions has grown significantly, but I may also be completely off ..

Do you know if there is a reasonable default for a region parameter that would serve most users? Otherwise I would think that choosing some more or less arbitrary default region is not something we would want to adopt, and one would just have to live with passing it in as a parameter 😄.

isunli (Author) commented Jun 1, 2023

Yeah, I agree, passing in a default parameter might not be a good choice at this moment. We should provide instructions on how to pass the region to the read_deltalake method in the AWS SDK for pandas docs/tutorials. (I can submit a quick PR if you feel it is appropriate.)

roeap commented Jun 1, 2023

> if you feel it is appropriate.

absolutely do!

@jaidisido (Contributor)

@isunli and @roeap, I believe I understand the underlying issue a bit better after some testing. In short, it comes down to the difference between how the boto3 and deltalake packages obtain the AWS region information.

In awswrangler the region is managed by the underlying boto3 session. boto3 obtains the AWS region by looking at several locations in a particular order: 1. the boto3 Config object, 2. OS environment variables (e.g. AWS_DEFAULT_REGION), 3. the .aws/config file. If no region is found, us-east-1 is NOT set by default; a NoRegionError exception is raised instead.
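
To illustrate the resolution order, a small sketch (region values are placeholders):

import boto3

# An explicit region argument wins over the environment and ~/.aws/config.
print(boto3.Session(region_name="us-east-1").region_name)  # "us-east-1"

# With no explicit region, boto3 falls back to AWS_DEFAULT_REGION and then
# the .aws/config file; if nothing is configured, region_name is None and
# creating a client raises NoRegionError.
print(boto3.Session().region_name)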

Correct me if I am wrong @roeap, but the deltalake package only looks at the OS environment variables, not at the .aws/config file.
As an example, I first ran this snippet without setting an OS environment variable:

import deltalake
import pandas as pd
import pyarrow as pa

path = "s3://bucket/test_delta_lake/"
# Credentials are passed explicitly, but AWS_REGION is deliberately not set
# here, and no OS environment variable provides it either.
storage_options = {
    "AWS_ACCESS_KEY_ID": "foo",
    "AWS_SECRET_ACCESS_KEY": "bar",
    "AWS_SESSION_TOKEN": "baz",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "TRUE",
}

deltalake.write_deltalake(
    table_or_uri=path,
    data=pa.Table.from_pandas(df=pd.DataFrame({"c0": [1, 2, 3]})),
    storage_options=storage_options,
    mode="append",
    overwrite_schema=True,
    schema=pa.schema([("c0", pa.int64())]),
)

It raised: deltalake.PyDeltaTableError: Failed to read delta log object: Generic S3 error: Missing region

After setting the AWS_REGION OS env variable, it passed.
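
For completeness, this is the change that made the snippet above pass; the region value is a placeholder:

import os

# Setting the region in the environment before calling deltalake is enough,
# since its S3 backend reads AWS_REGION from the OS environment.
os.environ["AWS_REGION"] = "us-east-1"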

I see three options here:

  1. The deltalake package relies not only on the OS environment variables but also obtains AWS creds/details from the .aws/config file, like boto3 does
  2. We obtain the region from the boto3 session and pass it along in storage_options
  3. Delegate it completely to the user

We can easily implement option 2 on our end, but the underlying issue won't be fixed in the deltalake package.
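
A rough sketch of what option 2 could look like (this is not the actual awswrangler helper; the function name is hypothetical):

import boto3

def _with_region(storage_options: dict, session: boto3.Session) -> dict:
    # Hypothetical helper: copy the caller's storage options and inject the
    # region that boto3 has already resolved, so deltalake never has to
    # discover it on its own.
    opts = dict(storage_options)
    if session.region_name and "AWS_REGION" not in opts:
        opts["AWS_REGION"] = session.region_name
    return opts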

@jaidisido jaidisido linked a pull request Jun 2, 2023 that will close this issue
@jaidisido jaidisido changed the title wr.s3.read_deltalake won't work deltalake.PyDeltaTableError: Failed to read delta log object: Generic S3 error: Missing region Jun 2, 2023
roeap commented Jun 3, 2023

Just realized that in the object_store 0.6.1 release, reading the region from the profile is fixed, and with that it will also be available in deltalake once we update. Not sure if that would have implications here, since boto3 gets the information from the same source?

@jaidisido (Contributor)

With the PR fix we would always pass AWS_REGION, obtained from the boto3 session, to the underlying deltalake object, so we should be OK.
