-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add google Application Default Credentials to download #376
Conversation
2d8bb88
to
2562442
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed PR description. It was super useful. And thanks for creating a PR to improve the user experience. The approach looks ok to me. Posted some minor comments.
Can you also share a snapshot that things works when user set HMAC and Service credentials (one at a time) ?
docs/source/how_to_guides/configure_cloud_storage_credentials.md
Outdated
Show resolved
Hide resolved
I've used the script hidden below to demonstrate that; I'm mocking away gce credentials for demonstrating HMAC and only set HMAC credentials for HMAC. Note also the error if none of them are available. I'm running this on a google cloud instance, which has gce credentials available. demo scriptimport os
import tempfile
import unittest.mock
import google.auth
from streaming import StreamingDataset
GCS_KEY = "YOURKEY"
GCS_SECRET = "YOUR_SECRET"
def available_methods() -> dict:
available, unavailable = set(), set()
if 'GCS_KEY' in os.environ and 'GCS_SECRET' in os.environ:
available.add("HMAC")
else:
unavailable.add("HMAC")
# Using the default google checkers here.
checkers = {
"GOOGLE_APPLICATION_CREDENTIALS": google.auth._default._get_explicit_environ_credentials(),
"gcloud sdk credentials": google.auth._default._get_gcloud_sdk_credentials(),
"gae credentials": google.auth._default._get_gae_credentials(),
"gce credentials": google.auth._default._get_gce_credentials()
}
for name, result in checkers.items():
if result[0] is None:
unavailable.add(name)
else:
available.add(name)
method_results = {"available": available, "unavailable": unavailable}
return method_results
def read_one_sample():
remote_dir = "YOUR_DIR"
tmpdir = tempfile.mkdtemp()
dataset = StreamingDataset(remote=remote_dir, local=tmpdir, split=None, shuffle=True)
dl = dataset
sample = next(iter(dl))
print(f"{type(sample)=}")
del dataset
def main():
with unittest.mock.patch("google.auth._default._get_gce_credentials", lambda **kwargs: (None, None)):
print("1. Using HMAC")
os.environ['GCS_KEY'] = GCS_KEY
os.environ['GCS_SECRET'] = GCS_SECRET
print(available_methods())
read_one_sample()
del os.environ['GCS_KEY']
del os.environ['GCS_SECRET']
print("---------------------------\n\n")
print("2. Using GCE credentials")
print(available_methods())
read_one_sample()
print("---------------------------\n\n")
print("3. No method available")
with unittest.mock.patch("google.auth._default._get_gce_credentials", lambda *args, **kwargs: (None, None)):
print(available_methods())
read_one_sample()
if __name__ == "__main__":
main() |
Thanks! |
Description of changes:
#315 introduced GCS authentication for service accounts as an alternative to HMAC credentials. However,
google.auth
provides thegoogle.auth.default
function that uses Application Default Credentials to obtain credentials. See https://cloud.google.com/docs/authentication/provide-credentials-adc for more details. This does the following (from their docs):Point 1 was introduces in #315, so this is essentially a superset of the service account authentication introduced in that PR.
This is important for us because we use the Compute Engine credentials to avoid mounting secrets in the container which have unrestricted lifetime. I assume others are in the same situation as this is the recommended and more secure way.
Thanks to @b-chu for #315!
Backwards-Incompatible Change
There is one backwards-incompatible change in the order of HMAC/Service Account.
Previously, streaming prioritized a service account with an explicitly set
GOOGLE_APPLICATION_CREDENTIALS
environment variable over the HMAC credentials, i.e.GOOGLE_APPLICATION_CREDENTIALS
Now, this prioritizes HMAC over
GOOGLE_APPLICATION_CREDENTIALS
(the test has been changed correspondingly). This is because thedefault
authentication includes explicitGOOGLE_APPLICATION_CREDENTIALS
and we'd have to either (a) special-case outside ofdefault
authentication or (b) could not use HMAC if any of thedefault
authentication methods in 2.-4. above were set. That means the order isGOOGLE_APPLICATION_CREDENTIALS
with 2.-4. provided by
google.auth.default
. This order allows us to explicitly set either HMAC or explicit service account credentials through environment variables, and use App Engine or Compute Engine if these are available.I feel this backwards-incompatible change is acceptable because it only applies if both HMAC and
GOOGLE_APPLICATION_CREDENTIALS
are set at the same time. It also hasn't been part of any release yet, so we can expect adoption ofGOOGLE_APPLICATION_CREDENTIALS
isn't widespread.If you disagree, I'll special-case it.
Issue #, if available:
Merge Checklist:
Put an
x
without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.General
Docs Screenshot
Tests
pre-commit
on my change. (check out thepre-commit
section of prerequisites)cc @fellhorn