Add google Application Default Credentials to download #376

Merged: 8 commits, Aug 14, 2023
37 changes: 25 additions & 12 deletions docs/source/how_to_guides/configure_cloud_storage_credentials.md
@@ -95,43 +95,56 @@ export S3_ENDPOINT_URL='https://<accountid>.r2.cloudflarestorage.com'

### MosaicML platform

-For [MosaicML platform](https://www.mosaicml.com/cloud) users, follow the steps mentioned in the [Google Cloud Storage](https://mcli.docs.mosaicml.com/en/latest/secrets/gcp.html) MCLI doc on how to configure the cloud provider credentials.
+For [MosaicML platform](https://www.mosaicml.com/cloud) users, follow the steps mentioned in the [Google Cloud Storage](https://docs.mosaicml.com/projects/mcli/en/latest/resources/secrets/gcp.html) MCLI doc on how to configure the cloud provider credentials.


-### GCP Service Account Credentials Mounted as Environment Variables
+### GCP User Auth Credentials Mounted as Environment Variables

+Streaming dataset supports [GCP user credentials](https://cloud.google.com/storage/docs/authentication#user_accounts) or [HMAC keys for User account](https://cloud.google.com/storage/docs/authentication/hmackeys). Users must set their GCP `user access key` and GCP `user access secret` in the run environment.

-Users must set their GCP `account credentials` to point to their credentials file in the run environment.
+From the Google Cloud console, navigate to `Google Storage` > `Settings (Left vertical pane)` > `Interoperability` > `Service account HMAC` > `User account HMAC` > `Access keys for your user account` > `Create a key`.

````{tabs}
```{code-tab} py
import os
-os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'KEY_FILE'
+os.environ['GCS_KEY'] = 'AKIAIOSFODNN7EXAMPLE'
+os.environ['GCS_SECRET'] = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
```

```{code-tab} sh
-export GOOGLE_APPLICATION_CREDENTIALS='KEY_FILE'
+export GCS_KEY='AKIAIOSFODNN7EXAMPLE'
+export GCS_SECRET='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
```
````

-### GCP User Auth Credentials Mounted as Environment Variables

-Streaming dataset supports [GCP user credentials](https://cloud.google.com/storage/docs/authentication#user_accounts) or [HMAC keys for User account](https://cloud.google.com/storage/docs/authentication/hmackeys). Users must set their GCP `user access key` and GCP `user access secret` in the run environment.
+### GCP Application Default Credentials

-From the Google Cloud console, navigate to `Google Storage` > `Settings (Left vertical pane)` > `Interoperability` > `Service account HMAC` > `User account HMAC` > `Access keys for your user account` > `Create a key`.
+Streaming dataset supports the use of Application Default Credentials (ADC) to authenticate you with Google Cloud. When
+no HMAC keys are given (see above), it will attempt to authenticate using ADC. This will, in order, check
+
+1. a key-file whose path is given in the `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
+2. a key-file in the Google cloud configuration directory.
+3. the Google App Engine credentials.
+4. the GCE Metadata Service credentials.
+
+See the [Google Cloud Docs](https://cloud.google.com/docs/authentication/provide-credentials-adc) for more details.
+
+To explicitly use the `GOOGLE_APPLICATION_CREDENTIALS` (point 1 above), users must set their GCP `account credentials`
+to point to their credentials file in the run environment.

````{tabs}
```{code-tab} py
import os
-os.environ['GCS_KEY'] = 'AKIAIOSFODNN7EXAMPLE'
-os.environ['GCS_SECRET'] = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
+os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'KEY_FILE'
```

```{code-tab} sh
-export GCS_KEY='AKIAIOSFODNN7EXAMPLE'
-export GCS_SECRET='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
+export GOOGLE_APPLICATION_CREDENTIALS='KEY_FILE'
```
````
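
As a rough illustration of the precedence this documentation change describes (HMAC keys first, ADC otherwise), the selection can be sketched as below. The helper name `select_gcs_auth` and its string return values are hypothetical and not part of the streaming package; the sketch only mirrors the environment-variable checks visible in this PR.

```python
from typing import Mapping


def select_gcs_auth(env: Mapping[str, str]) -> str:
    """Return which GCS auth path would be taken for a given environment.

    Hypothetical helper for illustration only; not part of the streaming package.
    """
    # HMAC keys take precedence whenever both variables are present.
    if 'GCS_KEY' in env and 'GCS_SECRET' in env:
        return 'hmac'
    # Otherwise defer to Application Default Credentials, whose first lookup
    # source is the GOOGLE_APPLICATION_CREDENTIALS key file.
    return 'adc'
```

Passing `os.environ` as `env` would report the path for the current process. Note that setting `GOOGLE_APPLICATION_CREDENTIALS` alongside both HMAC variables still yields `'hmac'` under this ordering.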


## Oracle Cloud Storage

### MosaicML platform
24 changes: 15 additions & 9 deletions streaming/base/storage/download.py
@@ -13,6 +13,12 @@

 BOTOCORE_CLIENT_ERROR_CODES = {'403', '404', 'NoSuchKey'}

+GCS_ERROR_NO_AUTHENTICATION = """\
+Either set the environment variables `GCS_KEY` and `GCS_SECRET` or use any of the methods in \
+https://cloud.google.com/docs/authentication/external/set-up-adc to set up Application Default Credentials. See also \
+https://docs.mosaicml.com/projects/mcli/en/latest/resources/secrets/gcp.html.
+"""
+

 def download_from_s3(remote: str, local: str, timeout: float) -> None:
     """Download a file from remote AWS S3 (or any S3 compatible object store) to local.
@@ -158,19 +164,19 @@ def download_from_gcs(remote: str, local: str) -> None:
         remote (str): Remote path (GCS).
         local (str): Local path (local filesystem).
     """
+    from google.auth.exceptions import DefaultCredentialsError
     obj = urllib.parse.urlparse(remote)
     if obj.scheme != 'gs':
         raise ValueError(
             f'Expected obj.scheme to be `gs`, instead, got {obj.scheme} for remote={remote}')

-    if 'GOOGLE_APPLICATION_CREDENTIALS' in os.environ:
-        _gcs_with_service_account(local, obj)
-    elif 'GCS_KEY' in os.environ and 'GCS_SECRET' in os.environ:
+    if 'GCS_KEY' in os.environ and 'GCS_SECRET' in os.environ:
         _gcs_with_hmac(remote, local, obj)
     else:
-        raise ValueError(f'Either GOOGLE_APPLICATION_CREDENTIALS needs to be set for ' +
-                         f'service level accounts or GCS_KEY and GCS_SECRET needs to be ' +
-                         f'set for HMAC authentication')
+        try:
+            _gcs_with_service_account(local, obj)
+        except (DefaultCredentialsError, EnvironmentError):
+            raise ValueError(GCS_ERROR_NO_AUTHENTICATION)


def _gcs_with_hmac(remote: str, local: str, obj: urllib.parse.ParseResult) -> None:
@@ -215,11 +221,11 @@ def _gcs_with_service_account(local: str, obj: urllib.parse.ParseResult) -> None
         local (str): Local path (local filesystem).
         obj (ParseResult): ParseResult object of remote path (GCS).
     """
+    from google.auth import default as default_auth
     from google.cloud.storage import Blob, Bucket, Client

-    service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS']
-    gcs_client = Client.from_service_account_json(service_account_path)
-
+    credentials, _ = default_auth()
+    gcs_client = Client(credentials=credentials)
     blob = Blob(obj.path.lstrip('/'), Bucket(gcs_client, obj.netloc))
     blob.download_to_filename(local)

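The fallback in `download_from_gcs` above catches credential-lookup failures and re-raises them as a `ValueError` with a setup hint. A minimal, dependency-free sketch of that pattern follows. The names `load_adc_or_raise`, `loader`, and `NO_AUTH_MESSAGE` are hypothetical; `loader` stands in for `google.auth.default`, and the explicit `from err` chaining is a variation for illustration, not what the PR code does.

```python
from typing import Any, Callable, Tuple, Type

# Placeholder text; the PR uses its own GCS_ERROR_NO_AUTHENTICATION constant.
NO_AUTH_MESSAGE = 'No GCS credentials found; set GCS_KEY/GCS_SECRET or configure ADC.'


def load_adc_or_raise(loader: Callable[[], Tuple[Any, Any]],
                      expected_errors: Tuple[Type[BaseException], ...]) -> Any:
    """Call ``loader`` and convert credential-lookup failures into ValueError.

    ``loader`` is assumed to return a ``(credentials, project_id)`` pair,
    as ``google.auth.default`` does.
    """
    try:
        credentials, _project = loader()
    except expected_errors as err:
        # Chain the original error so the root cause stays in the traceback.
        raise ValueError(NO_AUTH_MESSAGE) from err
    return credentials
```

With the real library installed, a call would presumably look like `load_adc_or_raise(google.auth.default, (DefaultCredentialsError, EnvironmentError))`, matching the exception tuple used in this PR.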
24 changes: 12 additions & 12 deletions streaming/base/storage/upload.py
@@ -14,7 +14,8 @@

 import tqdm

-from streaming.base.storage.download import BOTOCORE_CLIENT_ERROR_CODES
+from streaming.base.storage.download import (BOTOCORE_CLIENT_ERROR_CODES,
+                                             GCS_ERROR_NO_AUTHENTICATION)

 __all__ = [
     'CloudUploader',
@@ -270,14 +271,7 @@ def __init__(self,
                  keep_local: bool = False,
                  progress_bar: bool = False) -> None:
         super().__init__(out, keep_local, progress_bar)
-
-        if 'GOOGLE_APPLICATION_CREDENTIALS' in os.environ:
-            from google.cloud.storage import Client
-
-            service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS']
-            self.gcs_client = Client.from_service_account_json(service_account_path)
-            self.authentication = GCSAuthentication.SERVICE_ACCOUNT
-        elif 'GCS_KEY' in os.environ and 'GCS_SECRET' in os.environ:
+        if 'GCS_KEY' in os.environ and 'GCS_SECRET' in os.environ:
             import boto3

             # Create a session and use it to make our client. Unlike Resources and Sessions,
@@ -292,9 +286,15 @@ def __init__(self,
             )
             self.authentication = GCSAuthentication.HMAC
         else:
-            raise ValueError(f'Either GOOGLE_APPLICATION_CREDENTIALS needs to be set for ' +
-                             f'service level accounts or GCS_KEY and GCS_SECRET needs to ' +
-                             f'be set for HMAC authentication')
+            from google.auth import default as default_auth
+            from google.auth.exceptions import DefaultCredentialsError
+            from google.cloud.storage import Client
+            try:
+                credentials, _ = default_auth()
+                self.gcs_client = Client(credentials=credentials)
+                self.authentication = GCSAuthentication.SERVICE_ACCOUNT
+            except (DefaultCredentialsError, EnvironmentError):
+                raise ValueError(GCS_ERROR_NO_AUTHENTICATION)

         self.check_bucket_exists(self.remote)  # pyright: ignore

18 changes: 18 additions & 0 deletions tests/test_download.py
@@ -98,6 +98,18 @@ def test_download_from_gcs(self, remote_local_file: Any):
             download_from_gcs(mock_remote_filepath, tmp.name)
             assert os.path.isfile(tmp.name)

+    @patch('google.auth.default')
+    @patch('google.cloud.storage.Client')
+    @pytest.mark.usefixtures('gcs_service_account_credentials')
+    @pytest.mark.parametrize('out', ['gs://bucket/dir'])
+    def test_download_service_account(self, mock_client: Mock, mock_default: Mock, out: str):
+        with tempfile.NamedTemporaryFile(delete=True, suffix='.txt') as tmp:
+            credentials_mock = Mock()
+            mock_default.return_value = credentials_mock, None
+            download_from_gcs('gs://bucket_file', tmp.name)
+            mock_client.assert_called_once_with(credentials=credentials_mock)
+            assert os.path.isfile(tmp.name)
+
     @pytest.mark.usefixtures('gcs_hmac_client', 'gcs_test', 'remote_local_file')
     def test_filenotfound_exception(self, remote_local_file: Any):
         with pytest.raises(FileNotFoundError):
@@ -110,6 +122,12 @@ def test_invalid_cloud_prefix(self, remote_local_file: Any):
             mock_remote_filepath, mock_local_filepath = remote_local_file(cloud_prefix='s3://')
             download_from_gcs(mock_remote_filepath, mock_local_filepath)

+    def test_no_credentials_error(self, remote_local_file: Any):
+        """Ensure we raise a value error correctly if we have no credentials available."""
+        with pytest.raises(ValueError):
+            mock_remote_filepath, mock_local_filepath = remote_local_file(cloud_prefix='gs://')
+            download_from_gcs(mock_remote_filepath, mock_local_filepath)
+

 def test_download_from_local():
     mock_remote_dir = tempfile.TemporaryDirectory()
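
The new test above stacks two `@patch` decorators and receives `mock_client` before `mock_default`. That ordering follows from how `unittest.mock.patch` applies stacked decorators: the decorator closest to the function supplies the first mock argument. A small self-contained sketch, using stdlib targets (`os.getpid`, `os.getcwd`) chosen only for illustration:

```python
import os
from unittest.mock import Mock, patch


@patch('os.getcwd')  # outermost decorator: supplies the second mock argument
@patch('os.getpid')  # innermost decorator: supplies the first mock argument
def call_both(mock_getpid: Mock, mock_getcwd: Mock):
    """Return the patched values so the argument order can be checked."""
    mock_getpid.return_value = 1234
    mock_getcwd.return_value = '/fake/dir'
    return os.getpid(), os.getcwd()


result = call_both()
```

If the parameter names were swapped, `os.getpid()` would return a bare `Mock` instead of `1234`, which is an easy mistake to miss when the mocks are not asserted on.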
26 changes: 15 additions & 11 deletions tests/test_upload.py
@@ -219,32 +219,36 @@ def test_hmac_authentication(self, mocked_requests: Mock, out: str):
         uploader = GCSUploader(out=out)
         assert uploader.authentication == GCSAuthentication.HMAC

-    @patch('streaming.base.storage.upload.GCSUploader.check_bucket_exists')
-    @patch('google.cloud.storage.Client.from_service_account_json')
+    @patch('google.auth.default')
+    @patch('google.cloud.storage.Client')
     @pytest.mark.usefixtures('gcs_service_account_credentials')
     @pytest.mark.parametrize('out', ['gs://bucket/dir'])
-    def test_service_account_authentication(self, mocked_requests: Mock, mock_client: Mock,
-                                            out: str):
+    def test_service_account_authentication(self, mock_client: Mock, mock_default: Mock, out: str):
+        mock_default.return_value = Mock(), None
         uploader = GCSUploader(out=out)
         assert uploader.authentication == GCSAuthentication.SERVICE_ACCOUNT

     @patch('streaming.base.storage.upload.GCSUploader.check_bucket_exists')
-    @patch('google.cloud.storage.Client.from_service_account_json')
+    @patch('google.auth.default')
+    @patch('google.cloud.storage.Client')
     @pytest.mark.usefixtures('gcs_service_account_credentials', 'gcs_hmac_credentials')
     @pytest.mark.parametrize('out', ['gs://bucket/dir'])
     def test_service_account_and_hmac_authentication(self, mocked_requests: Mock,
-                                                     mock_client: Mock, out: str):
+                                                     mock_default: Mock, mock_client: Mock,
+                                                     out: str):
+        mock_default.return_value = Mock(), None
         uploader = GCSUploader(out=out)
-        assert uploader.authentication == GCSAuthentication.SERVICE_ACCOUNT
+        assert uploader.authentication == GCSAuthentication.HMAC

     @pytest.mark.parametrize('out', ['gs://bucket/dir'])
     def test_no_authentication(self, out: str):
         with pytest.raises(
                 ValueError,
-                match=(f'Either GOOGLE_APPLICATION_CREDENTIALS needs to be set for'
-                       f' service level accounts or GCS_KEY and GCS_SECRET needs to be'
-                       f' set for HMAC authentication'),
-        ):
+                match=
+            (f'Either set the environment variables `GCS_KEY` and `GCS_SECRET` or use any of the methods in '
+             f'https://cloud.google.com/docs/authentication/external/set-up-adc to set up Application Default '
+             f'Credentials. See also https://docs.mosaicml.com/projects/mcli/en/latest/resources/secrets/'
+             f'gcp.html.')):
             _ = GCSUploader(out=out)


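`test_no_authentication` above passes the full error text to `pytest.raises(match=...)`. That argument is treated as a regular expression and applied with `re.search`, so metacharacters in a message are interpreted as pattern syntax, not literal text. The sketch below uses a hypothetical message (not the PR's constant) containing parentheses to show where an unescaped pattern stops matching and how `re.escape` avoids that.

```python
import re

# Hypothetical message for illustration; the parentheses are regex metacharacters.
MESSAGE = 'Either set `GCS_KEY` and `GCS_SECRET` or see https://example.com/set-up-adc (ADC).'

# Unescaped, "(ADC)" is parsed as a capture group, so the literal "(" in the
# text is never matched and the search fails.
unescaped_matches = re.search(MESSAGE, MESSAGE) is not None

# re.escape makes every metacharacter literal, so the text matches itself.
escaped_matches = re.search(re.escape(MESSAGE), MESSAGE) is not None
```

The message used in this PR contains only `.` as a metacharacter, which matches any character including a literal dot, so the unescaped form still passes there; escaping would simply make the check stricter.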