[Python] Disable soft delete policy when creating new default bucket. #31344

Merged: 4 commits, May 21, 2024
13 changes: 11 additions & 2 deletions sdks/python/apache_beam/io/gcp/gcsio.py
@@ -162,12 +162,21 @@ def get_bucket(self, bucket_name):
     except NotFound:
       return None
 
-  def create_bucket(self, bucket_name, project, kms_key=None, location=None):
+  def create_bucket(
+      self,
+      bucket_name,
+      project,
+      kms_key=None,
+      location=None,
+      soft_delete_retention_duration_seconds=0):
     """Create and return a GCS bucket in a specific project."""
 
     try:
       bucket = self.client.bucket(bucket_name)
+      bucket.soft_delete_policy.retention_duration_seconds = (
+          soft_delete_retention_duration_seconds)
       bucket = self.client.create_bucket(
-          bucket_or_name=bucket_name,
+          bucket_or_name=bucket,
           project=project,
           location=location,
       )
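For orientation, here is a minimal standalone sketch of what the new parameter does against the google-cloud-storage client directly, assuming google-cloud-storage >= 2.16.0 (where the bucket's soft delete policy is writable before creation); the helper name is illustrative and not part of Beam:

from google.cloud import storage


def create_bucket_without_soft_delete(project, bucket_name, location=None):
  """Create a GCS bucket with soft delete disabled (0s retention)."""
  client = storage.Client(project=project)
  bucket = client.bucket(bucket_name)
  # A retention of 0 seconds disables soft delete for the new bucket.
  bucket.soft_delete_policy.retention_duration_seconds = 0
  # Passing the Bucket object (rather than just its name) lets the policy
  # set above travel with the creation request.
  return client.create_bucket(
      bucket_or_name=bucket, project=project, location=location)

This is the same pattern the change above applies inside GcsIO.create_bucket: configure the policy on the local Bucket object, then create it in one call.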
40 changes: 40 additions & 0 deletions sdks/python/apache_beam/io/gcp/gcsio_integration_test.py
@@ -32,9 +32,11 @@
 import unittest
 import uuid
 
+import mock
 import pytest
 
 from apache_beam.io.filesystems import FileSystems
+from apache_beam.options.pipeline_options import GoogleCloudOptions
 from apache_beam.testing.test_pipeline import TestPipeline
 
 try:
@@ -141,6 +143,44 @@ def test_batch_copy_and_delete(self):
     self.assertFalse(
         result[1], 're-delete should not throw error: %s' % result[1])
 
+  @pytest.mark.it_postcommit
+  @mock.patch('apache_beam.io.gcp.gcsio.default_gcs_bucket_name')
Contributor (@Abacn) commented on the @mock.patch line:

It doesn't sound quite right that a PostCommit test needs a mock. And this mock isn't mocking a fake service; it is used to override the naming of the temp bucket. What happens if we do not hack it?

Also, this test does not run a pipeline, so should we configure it to run only on test-suites:direct:py3xx:postCommitIT? Presumably it currently runs on the Dataflow PostCommit IT suites, which is not quite right.

Collaborator (Author) @shunping replied on May 21, 2024:

Thanks for the review, @Abacn. Below are my responses.

> It doesn't sound quite right that a PostCommit test needs a mock. ... What happens if we do not hack it?

For a given project, default_gcs_bucket_name returns a fixed bucket name as the default. If we don't override it, we would need to create a dedicated project (other than apache-beam-testing, or whatever project the user provides when running this test) just to test this. Per the offline discussion with @damccorm, it seems a bit of an overkill to create a project and then remove it afterward for this test. Using a mock is admittedly a bit of a "hack", but it keeps the code clean. I am open to better suggestions.

> Also, this test does not run a pipeline ...

If you look at the other tests in gcsio_integration_test.py, they also exercise gcsio functionality with actual GCS operations, and they don't run any pipeline either.

+  def test_create_default_bucket(self, mock_default_gcs_bucket_name):
+    google_cloud_options = self.test_pipeline.options.view_as(
+        GoogleCloudOptions)
+    # overwrite kms option here, because get_or_create_default_gcs_bucket()
+    # requires this option unset.
+    google_cloud_options.dataflow_kms_key = None
+
+    import random
+    from hashlib import md5
+    # Add a random number to avoid collision if multiple test instances
+    # are run at the same time. To avoid too many dangling buckets if bucket
+    # removal fails, we limit the max number of possible bucket names in this
+    # test to 1000.
+    overridden_bucket_name = 'gcsio-it-%d-%s-%s' % (
+        random.randint(0, 999),
+        google_cloud_options.region,
+        md5(google_cloud_options.project.encode('utf8')).hexdigest())
+
+    mock_default_gcs_bucket_name.return_value = overridden_bucket_name
+
+    # remove the existing bucket with the same name as the default bucket
+    existing_bucket = self.gcsio.get_bucket(overridden_bucket_name)
+    if existing_bucket:
+      existing_bucket.delete()
+
+    bucket = gcsio.get_or_create_default_gcs_bucket(google_cloud_options)
Contributor commented on this line:

Just realized: if the upstream code changes and the mock is no longer effective, the following will delete the real default bucket. We should assert that the created bucket is the one with the injected name, to guard against deleting the real default bucket.

Collaborator (Author) replied:

Good idea. Added the check. PTAL.

+    self.assertIsNotNone(bucket)
+    self.assertEqual(bucket.name, overridden_bucket_name)
+
+    # verify soft delete policy is disabled by default in the default bucket
+    # after creation
+    self.assertEqual(bucket.soft_delete_policy.retention_duration_seconds, 0)
+    bucket.delete()
+
+    self.assertIsNone(self.gcsio.get_bucket(overridden_bucket_name))
 
 
 if __name__ == '__main__':
   logging.getLogger().setLevel(logging.INFO)
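As a companion to the assertion in the integration test above, here is a small standalone helper that checks whether an existing bucket has soft delete disabled. It is a sketch only, again assuming google-cloud-storage >= 2.16.0; the function name is illustrative and not part of the PR:

from google.cloud import storage


def soft_delete_disabled(project, bucket_name):
  """Return True if the bucket's soft delete retention is unset or 0 seconds."""
  client = storage.Client(project=project)
  bucket = client.get_bucket(bucket_name)
  policy = bucket.soft_delete_policy
  # retention_duration_seconds may be None when no policy has been set.
  return policy is None or not policy.retention_duration_seconds

The integration test performs essentially this check on the freshly created default bucket before deleting it.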
33 changes: 33 additions & 0 deletions sdks/python/apache_beam/io/gcp/gcsio_test.py
@@ -567,6 +567,39 @@ def test_headers(self, mock_get_service_credentials, mock_do_request):
     self.assertIn(beam_user_agent, actual_headers['User-Agent'])
     self.assertEqual(actual_headers['x-goog-custom-audit-job'], 'test-job-name')
 
+  @mock.patch('google.cloud._http.JSONConnection._do_request')
+  @mock.patch('apache_beam.internal.gcp.auth.get_service_credentials')
+  def test_create_default_bucket(
+      self, mock_get_service_credentials, mock_do_request):
+    from apache_beam.internal.gcp.auth import _ApitoolsCredentialsAdapter
+    mock_get_service_credentials.return_value = _ApitoolsCredentialsAdapter(
+        _make_credentials("test-project"))
+
+    gcs = gcsio.GcsIO(pipeline_options={"job_name": "test-job-name"})
+    # no HTTP request when initializing GcsIO
+    mock_do_request.assert_not_called()
+
+    import requests
+    response = requests.Response()
+    response.status_code = 200
+    mock_do_request.return_value = response
+
+    # create_bucket() is supposed to send only one HTTP request
+    gcs.create_bucket("test-bucket", "test-project")
+    mock_do_request.assert_called_once()
+    call_args = mock_do_request.call_args[0]
+
+    # Request data is specified as the fourth argument of
+    # google.cloud._http.JSONConnection._do_request
+    actual_request_data = call_args[3]
+
+    import json
+    request_data_json = json.loads(actual_request_data)
+    # verify soft delete policy is disabled by default in the bucket creation
+    # request
+    self.assertEqual(
+        request_data_json['softDeletePolicy']['retentionDurationSeconds'], 0)
 
 
 if __name__ == '__main__':
   logging.getLogger().setLevel(logging.INFO)
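For reference, this is the kind of JSON body the unit test above decodes. The real payload is produced by google-cloud-storage and carries additional fields, so this is an illustrative minimum rather than the verbatim request:

import json

# Hypothetical Buckets-insert body with soft delete disabled; a real request
# would also carry fields such as the bucket location.
example_request_data = json.dumps({
    "name": "test-bucket",
    "softDeletePolicy": {"retentionDurationSeconds": 0},
})

request_data_json = json.loads(example_request_data)
assert request_data_json['softDeletePolicy']['retentionDurationSeconds'] == 0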
2 changes: 1 addition & 1 deletion sdks/python/setup.py
@@ -437,7 +437,7 @@ def get_portability_package_data():
         'google-cloud-datastore>=2.0.0,<3',
         'google-cloud-pubsub>=2.1.0,<3',
         'google-cloud-pubsublite>=1.2.0,<2',
-        'google-cloud-storage>=2.14.0,<3',
+        'google-cloud-storage>=2.16.0,<3',
         # GCP packages required by tests
         'google-cloud-bigquery>=2.0.0,<4',
         'google-cloud-bigquery-storage>=2.6.3,<3',
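The floor bump from 2.14.0 to 2.16.0 presumably tracks the google-cloud-storage release that exposes the bucket soft delete policy used above. A quick way to sanity-check an installed environment (a sketch, assuming a standard installation):

from importlib.metadata import version

from google.cloud import storage

print('google-cloud-storage', version('google-cloud-storage'))
# Bucket.soft_delete_policy only exists in sufficiently new client releases.
assert hasattr(storage.Bucket, 'soft_delete_policy')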