Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow reading datasets from private s3 bucket #74

Merged
merged 4 commits into from
May 14, 2021

Conversation

katxiao
Copy link
Contributor

@katxiao katxiao commented May 10, 2021

Add functionality to download datasets from a private s3 bucket.

  • Add aws_key and aws_secret arguments to run function and use them on the download dataset.
  • boto3 needs to be used instead of urllib.
  • Also adapt unit tests as necessary if any of them fail.

✅ Verified that e2e flow works when pulling data from a public s3 bucket
✅ Verified that e2e flow works when pulling data from a private s3 bucket

@CLAassistant
Copy link

CLAassistant commented May 10, 2021

CLA assistant check
All committers have signed the CLA.

@katxiao katxiao force-pushed the pull-private-datasets-args branch from 313c547 to f6f504d Compare May 10, 2021 22:40
@katxiao katxiao requested a review from csala May 10, 2021 22:46
@katxiao katxiao marked this pull request as ready for review May 10, 2021 22:47
Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good, but I think that we can get rid of the old urllib approach and not require the user to pass the bucket name including the s3:// prefix on it.

sdgym/datasets.py Outdated Show resolved Hide resolved
sdgym/datasets.py Outdated Show resolved Hide resolved
@katxiao katxiao force-pushed the pull-private-datasets-args branch 4 times, most recently from a4ee416 to 5df83bf Compare May 11, 2021 23:13
Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks almost ready, but there is a situation that should be fixed to allow system-wide credentials to be used.

sdgym/datasets.py Outdated Show resolved Hide resolved
sdgym/datasets.py Outdated Show resolved Hide resolved
@katxiao katxiao force-pushed the pull-private-datasets-args branch 2 times, most recently from 726bb4e to f7c2700 Compare May 12, 2021 18:01
@katxiao katxiao changed the title Allow reading datasets from private s3 bucket Allow reading datasets from private s3 bucket (PR 1/3) May 12, 2021
Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one last code style change (line break after if/else block) and this is ready to go :-)

sdgym/datasets.py Show resolved Hide resolved
@katxiao katxiao force-pushed the pull-private-datasets-args branch from f7c2700 to 4724aee Compare May 13, 2021 17:08
Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@katxiao katxiao changed the title Allow reading datasets from private s3 bucket (PR 1/3) Allow reading datasets from private s3 bucket May 14, 2021
@katxiao katxiao merged commit 900ec9d into master May 14, 2021
@katxiao katxiao deleted the pull-private-datasets-args branch May 14, 2021 18:38
@csala csala added the internal The issue doesn't change the API or functionality label May 20, 2021
@csala csala added this to the 0.3.1 milestone May 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal The issue doesn't change the API or functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants