
Add AWS Credentials parsing from file #2117

Closed
Shershebnev opened this issue Jan 24, 2024 · 10 comments
Labels
binding/rust Issues for the Rust crate enhancement New feature or request storage/aws AWS S3 storage related
Comments

@Shershebnev

Environment

Delta-rs version:

$ pip show deltalake
Name: deltalake
Version: 0.15.1
Summary: Native Delta Lake Python binding based on delta-rs with Pandas integration
Home-page: https://github.com/delta-io/delta-rs
Author: Qingping Hou <dave2008713@gmail.com>, Will Jones <willjones127@gmail.com>
Author-email: Qingping Hou <dave2008713@gmail.com>, Will Jones <willjones127@gmail.com>
License: Apache-2.0
Location: ...
Requires: pyarrow, pyarrow-hotfix
Required-by: 

Binding:
Python
Environment:

  • Cloud provider: AWS (Ubuntu)
  • OS: MacOS
  • Other:

Bug

What happened:
It seems that credentials are not correctly read from the ~/.aws/credentials and ~/.aws/config files. Just as in #1416, I'm getting OSError: Generic S3 error: Missing region when trying to read from S3.

On MacOS locally:
Setting AWS_DEFAULT_REGION fixes this, but then it tries to retrieve instance metadata from http://169.254.169.254/latest/api/token, which obviously fails when not running on an AWS instance:
OSError: Generic S3 error: Error after 10 retries in 6.409805791s, max_retries:10, retry_timeout:180s, source:error sending request for url (http://169.254.169.254/latest/api/token): error trying to connect: tcp connect error: Host is down (os error 64)

On AWS instance:
Setting only AWS_DEFAULT_REGION results in OSError: Generic S3 error: Client error with status 403 Forbidden: <?xml version="1.0" encoding="UTF-8"?>

In both cases setting everything through env variables fixes the problem, e.g. AWS_DEFAULT_REGION=... AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=... python. Other tools like boto3 have no problem using credentials stored in the default location:

$ python3.9
Python 3.9.16 (main, Aug  3 2023, 01:00:02) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> master_table_df = pl.scan_delta("s3://REDACTED.delta").select("audio_type", "parallel_id").collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/.local/lib/python3.9/site-packages/polars/io/delta.py", line 263, in scan_delta
    dl_tbl = _get_delta_lake_table(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/polars/io/delta.py", line 306, in _get_delta_lake_table
    dl_tbl = deltalake.DeltaTable(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/deltalake/table.py", line 396, in __init__
    self._table = RawDeltaTable(
OSError: Generic S3 error: Missing region
>>> import boto3
>>> ecr = boto3.client("ecr")
>>> 

What you expected to happen:
Credentials are properly read from the default locations ~/.aws/credentials and ~/.aws/config
How to reproduce it:
Install deltalake and try to read from S3 while having credentials set in the default files. See the example above with polars and deltalake.

@Shershebnev Shershebnev added the bug Something isn't working label Jan 24, 2024
@ion-elgreco ion-elgreco added the binding/rust Issues for the Rust crate label Jan 25, 2024
@r3stl355
Contributor

As far as I know, deltalake expects all the AWS parameters to be defined in the environment, exactly as you noted @Shershebnev.

The ~/.aws/* files on Linux/Mac are managed by the AWS CLI, and boto3, being an AWS SDK, piggybacks on it to get the profile configs. If we implemented the same in deltalake, we would have to either take a dependency on the AWS CLI (or some sub-component of it), which, if possible at all, would just be another dependency to maintain, or hard-code "knowledge" of the possible locations of these files, which would have to be OS-specific and would also create a dependency on the corresponding file format.

To summarize, I don't see this as something that should be prioritized, but if there is strong support for implementing this I can have a stab.
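The "known locations" approach described above can be sketched in a few lines. This is a minimal sketch, not the deltalake implementation: it assumes the AWS CLI's INI file format and the standard AWS_SHARED_CREDENTIALS_FILE override, and the function name load_aws_profile is hypothetical. Using os.path.expanduser keeps the default path portable across Linux, macOS, and Windows home directories.

```python
import configparser
import os


def load_aws_profile(profile: str = "default") -> dict:
    """Read an access key pair from the AWS CLI's shared credentials file.

    Honors the AWS_SHARED_CREDENTIALS_FILE override, falling back to the
    conventional ~/.aws/credentials location. The file is plain INI, so the
    stdlib configparser handles it; no dependency on the AWS CLI is needed.
    """
    path = os.environ.get(
        "AWS_SHARED_CREDENTIALS_FILE",
        os.path.join(os.path.expanduser("~"), ".aws", "credentials"),
    )
    parser = configparser.ConfigParser()
    parser.read(path)
    if profile not in parser:
        raise KeyError(f"profile {profile!r} not found in {path}")
    section = parser[profile]
    return {
        "AWS_ACCESS_KEY_ID": section["aws_access_key_id"],
        "AWS_SECRET_ACCESS_KEY": section["aws_secret_access_key"],
    }
```

A real implementation would also need to merge in ~/.aws/config (for region and named profiles with a "profile " prefix), which is where the format dependency mentioned above comes in.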

@ion-elgreco
Collaborator

@r3stl355 Polars, for example, parses these files to grab the credentials; we could likely take inspiration from that implementation.

@r3stl355
Contributor

Yes @ion-elgreco, it looks like Polars is using the second approach I mentioned: looking into the specific config and credentials files it "knows" may exist. However, it uses hard-coded paths like "~/.aws/credentials" which, I believe, will break on Windows, hence the need to handle OS-specific file systems as I mentioned.

@ion-elgreco ion-elgreco added enhancement New feature or request and removed bug Something isn't working labels Jan 27, 2024
@ion-elgreco ion-elgreco changed the title AWS Credentials parsing error Add AWS Credentials parsing from file Jan 27, 2024
@rtyler
Member

rtyler commented Jan 30, 2024

Some of this will go away with #1601, FWIW; right now there's kind of a hodge-podge of configuration possibilities between object_store and some of the rusoto crates we depend on.

@rtyler rtyler added the storage/aws AWS S3 storage related label Feb 1, 2024
@rtyler rtyler added this to the Rust v0.18 milestone Feb 6, 2024
@mrocklin

mrocklin commented Feb 8, 2024

I don't see this as something that should be prioritized but If there is a strong support for implementing this I can have a stab

Speaking from a Dask perspective, I'd certainly like to throw weight behind this. We find that people commonly use .aws directories in their home directories, so the ROI on looking for and parsing those files should be fairly high. It's a common practice. With regard to Windows machines, I'm not sure what the convention is there, but I suspect it's fairly similar; hopefully addressing that convention as well is an easy switch.

In the meantime, can I ask what mechanisms are available to specify AWS credentials? Is it just environment variables? Is there something people can do to specify these programmatically in the meantime?

@wjones127
Collaborator

In the meantime, can I ask what mechanisms are available to specify AWS credentials?

Environment variables or passing to storage_options parameter:

>>> storage_options = {"AWS_ACCESS_KEY_ID": "THE_AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY": "THE_AWS_SECRET_ACCESS_KEY"}
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0", storage_options=storage_options)

https://delta-io.github.io/delta-rs/usage/loading-table/
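As a stopgap until file parsing lands, one pattern is to let boto3 resolve credentials through its full provider chain (shared files, SSO, container/instance metadata) and hand the result to deltalake via storage_options. A rough sketch under some assumptions: boto3 must be installed separately (it is not a deltalake dependency), and the helper name storage_options_from_boto3 is hypothetical. The session parameter is injectable purely so the helper can be exercised without real AWS credentials.

```python
def storage_options_from_boto3(session=None) -> dict:
    """Resolve AWS credentials with boto3's provider chain and reshape them
    into the storage_options dict that DeltaTable accepts.

    `session` defaults to a fresh boto3.Session; pass a stand-in for testing.
    """
    if session is None:
        import boto3  # assumed available; install with `pip install boto3`

        session = boto3.Session()
    # Freeze the credentials so the key/secret/token triple is consistent
    # even if the underlying provider refreshes mid-call.
    creds = session.get_credentials().get_frozen_credentials()
    opts = {
        "AWS_ACCESS_KEY_ID": creds.access_key,
        "AWS_SECRET_ACCESS_KEY": creds.secret_key,
    }
    if creds.token:
        # Temporary credentials (assumed roles, SSO, ECS) need the token too.
        opts["AWS_SESSION_TOKEN"] = creds.token
    if session.region_name:
        opts["AWS_REGION"] = session.region_name
    return opts
```

Note that temporary credentials resolved this way expire, so for long-lived processes the dict would need to be rebuilt periodically.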

@jrbourbeau

Thanks for the example showing AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY being used @wjones127. That was useful

The system I'm running on uses AWS_CONTAINER_CREDENTIALS_FULL_URI for managing AWS credentials (https://docs.aws.amazon.com/sdkref/latest/guide/feature-container-credentials.html). I tried passing AWS_CONTAINER_CREDENTIALS_FULL_URI via storage_options but unfortunately it didn't work (I got access denied errors when trying to write a deltatable). It'd be great if other authentication options like AWS_CONTAINER_CREDENTIALS_FULL_URI were supported.
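For reference, the container credentials mechanism is just an HTTP endpoint returning JSON, so as a workaround the temporary credentials can be fetched manually and passed through storage_options. A hedged sketch based on the response fields documented in the AWS SDK reference guide (AccessKeyId, SecretAccessKey, Token); the helper name is hypothetical, and error handling is omitted for brevity.

```python
import json
import os
import urllib.request


def storage_options_from_container_endpoint() -> dict:
    """Fetch temporary credentials from the ECS/container credential endpoint
    and reshape them into deltalake storage_options.

    Prefers AWS_CONTAINER_CREDENTIALS_FULL_URI; falls back to the relative
    URI against the ECS task metadata address 169.254.170.2.
    """
    uri = os.environ.get("AWS_CONTAINER_CREDENTIALS_FULL_URI")
    if uri is None:
        relative = os.environ["AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"]
        uri = "http://169.254.170.2" + relative
    request = urllib.request.Request(uri)
    token = os.environ.get("AWS_CONTAINER_AUTHORIZATION_TOKEN")
    if token:
        # Some deployments require this header on the credentials endpoint.
        request.add_header("Authorization", token)
    with urllib.request.urlopen(request, timeout=5) as resp:
        payload = json.load(resp)
    return {
        "AWS_ACCESS_KEY_ID": payload["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": payload["SecretAccessKey"],
        "AWS_SESSION_TOKEN": payload["Token"],
    }
```

These credentials also carry an Expiration timestamp, so a long-running writer would need to re-fetch them before they lapse.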

@mrocklin

FWIW, I think the system referred to above is used on AWS machines that have an IAM role attached and use that rather than AWS secret keys in environment variables. Such systems know to read a locally available endpoint to get access tokens.

@danieldiamond

danieldiamond commented Feb 16, 2024

Great to see that this is a recent thread. I've gone down a rabbit hole trying to determine whether IAM roles can be used with delta-rs (see also these arrow-rs issues: apache/arrow-rs#4556 and apache/arrow-rs#4238).

I'm trying to use delta-rs with an IAM role attached to an ECS task, and I find it very hard to believe that you can't (and that you have to use AWS keys).

Can you confirm that you cannot use IAM roles to write delta lake tables to S3?

+1 to the points above

@rtyler
Member

rtyler commented Dec 1, 2024

I'm reviving this old thread to clean it up! I believe we have corrected this behavior, since the deltalake-aws crate now uses the AWS SDK itself for all AWS-related credential resolution, including passing the access key/secret key through to the object_store crate. 🤞

@rtyler rtyler closed this as completed Dec 1, 2024