
Support for existing storage containers in aws and gcp. #651

Merged
merged 8 commits into feature-bucket-reuse on Sep 8, 2022

Conversation

@tasdomas tasdomas (Contributor) commented Sep 2, 2022

This pull request contains a proposal to use gomock-generated mocks for testing interactions with
cloud APIs in unit tests.

Comment on lines +45 to +48
// Container stores the id of the container to be used.
Container string
// Path stores the subdirectory inside the container.
Path string
Member

Can we merge those two so users can specify either storage or storage/prefix with a single string?
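For illustration only (the exact key name was not settled at this point in the thread), the merged single-string form suggested here might look like:

  storage {
    container = "tpi-test/directory"
  }

with everything after the first slash treated as the path prefix inside the container.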

Contributor Author

I'm hesitant to do so; separating the two explicitly feels clearer, especially since we'll most likely also need to add a container-type field to support pre-allocated buckets in k8s.

Member (@0x2b3bfa0, Sep 5, 2022)

In Kubernetes, this field would probably take a user-specified StorageClass or PersistentVolume name, which is a single string. 🤔 What else could it take?

Contributor Author

What about, for example, a MinIO bucket?

Member

That's not native storage, strictly speaking, and would have the same consideration as a Google Cloud task using a Microsoft Azure storage account. It would be out of scope for us, and users would be responsible for synchronization as part of their code.

Comment on lines +49 to +51
// Config stores provider-specific configuration keys for accessing the pre-allocated
// storage container.
Config map[string]string
Member

Maybe too much flexibility? I'd assume that no special options are needed to connect and, if they were, users could just provide a full rclone connection string, removing the maintenance burden on our side.

Contributor Author

It depends on what we're aiming for. Rclone connection strings hardly make configuration easy.

The difference is between

  storage {    
    container = "tpi-test"
    container_path = "directory"
    container_opts {
      az_storage_account = "storage-account"
      az_storage_account_key = "storage-account-key"
    }
  }

and

  storage {
    container = ":azureblob,account='storage-account',key='storage-account-key':tpi-test/directory"
  }

I'm not sure the latter is easier, especially if we consider the target users of tpi.

Member

More precisely, the difference would be between

  storage {    
    container = "tpi-test"
    container_path = "directory"
    container_opts {
      az_storage_account = "storage-account"
      az_storage_account_key = "storage-account-key"
    }
  }

and

  storage = ":azureblob,account='storage-account',key='storage-account-key':tpi-test/directory"

Member

None of the variants sound particularly easy to me, although the latter is already well documented by rclone and, despite being a lengthy string, is fairly readable to my mind.

Member

Alternatively, we could perhaps provide a format that is 1:1 compatible with rclone strings. E.g.

storage = {
  type    = "azureblob"
  path    = "tpi-test/directory"

  options = {
    account = "storage-account"
    key     = "storage-account-key"
  }
}

would be directly translated into

:azureblob,account='storage-account',key='storage-account-key':tpi-test/directory

Still, it seems a bit too bulky, to my mind.

Member

I'm not sure the latter is easier, especially if we consider the target users

That's a double-edged reason: if our target users are on their own, they probably won't need to specify a custom storage and, if they do, it probably won't require any options beyond the name and path (prefix). 😸

Note that credentials usually match (or can be retrieved by using) the main credentials used to create resources; see #651 (comment) for details.

And those users who really need custom options on their connection strings are probably part of a team with opinions, and people having those opinions may as well read rclone documentation. 😈 Worst-case scenario, we can provide an interactive rclone string builder as part of our documentation.

Comment on lines +22 to +28
// StorageCredentials is an interface implemented by data sources and resources
// that provide access to cloud storage buckets.
type StorageCredentials interface {
// ConnectionString returns the connection string necessary to access
// an S3 bucket.
ConnectionString(ctx context.Context) (string, error)
}
Member

Can we assume that there is no need for separate credentials and the storage can be accessed with the same credentials provided to manage the rest of the resources? In cases where different credentials are needed (e.g. Microsoft Azure), they can be retrieved using the main credentials:

keys, err := s.Client.Services.StorageAccounts.ListKeys(ctx, s.Dependencies.ResourceGroup.Identifier, s.Identifier, "")

Contributor Author

So, using cloud credentials to retrieve Azure storage credentials will only work if the storage is owned by the same Azure account. But that is not necessarily the case. I was also thinking of supporting Azure SAS URLs, which allow granting access to storage containers by sharing a single URL (albeit a lengthy one).

The other side of the issue (and probably the main reason why I moved the connection string generation out to this interface) is separation of concerns. The container resource/data source is the right component to handle generation of the rclone connection string.
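For context on the SAS option mentioned above, a container SAS URL is a single string of roughly this shape (all values elided and purely illustrative):

  https://<storage-account>.blob.core.windows.net/<container>?sv=...&sr=c&sp=rl&se=...&sig=...

so the account, container, and access grant travel together in one URL, with no separate key to manage.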

Member

So, using cloud credentials to retrieve Azure storage credentials will only work if the storage is owned by the same Azure account.

What do you mean by account? 🤔

Contributor Author

By account I meant the subscription; sorry for not being clear enough.

Contributor Author

keys, err := s.Client.Services.StorageAccounts.ListKeys(ctx, s.Dependencies.ResourceGroup.Identifier, s.Identifier, "")

This only works if the storage account is in the same resource group. In our case (since we've just created the resource group) we'd need to pass the resource group that the already existing storage account belongs to.

Member

That seems like a reasonable requirement to me. If the storage account were in a different subscription, it would have to be treated as user-controlled storage (e.g. a DVC remote) and not as task storage.

I'd prefer to keep task storage as local and native as possible, and leave other kinds of data synchronization to specialized tools.

Contributor Author

OK, this still leaves the issue of the storage account and its access key.

Member

the issue of the storage account and its access key

Contributor Author

the issue of the storage account and its access key

Well, in order to access a blob container, you need a storage account. And to retrieve a storage account's key, you need to know its resource group, so the user needs to supply either (storage_account, key) or (resource_group, storage_account).
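In either case, what rclone ultimately needs is the account name and a key for the connection string format shown earlier; with (resource_group, storage_account), the key would first be fetched via the ListKeys call quoted above and then substituted into the same string. A minimal, purely illustrative sketch (function and value names are hypothetical, not from this PR):

package main

import "fmt"

// azureBlobConnectionString assembles an rclone connection string in the
// format discussed earlier in this thread; the feature branch introduces
// its own type for this.
func azureBlobConnectionString(account, key, container, path string) string {
	return fmt.Sprintf(":azureblob,account='%s',key='%s':%s/%s", account, key, container, path)
}

func main() {
	// With (storage_account, key) supplied directly by the user:
	fmt.Println(azureBlobConnectionString("storage-account", "storage-account-key", "tpi-test", "directory"))
	// With (resource_group, storage_account), the key would first be listed
	// via StorageAccounts.ListKeys and then passed in exactly the same way.
}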

Member

Thanks! I missed that comment.

Here are two proposed solutions:

  • Specify the storage account as part of a storage string and the key through an environment variable
  • Specify both the resource group and the storage account as part of the storage string
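Purely hypothetically, and only to make the two options concrete (no syntax was settled in this thread), they might be spelled along these lines:

  # Option 1: storage account in the string, key via an environment variable
  storage = "storage-account/tpi-test/directory"

  # Option 2: resource group and storage account both in the string
  storage = "resource-group/storage-account/tpi-test/directory"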

@@ -0,0 +1,58 @@
package resources_test
Member

Tests are nice, although mocking cloud infrastructure is rather risky. Maybe we can rely on rclone, so that we use functionality that is already tested on their side or can be tested locally with the same API as cloud resources?

Contributor Author

I don't see how mocking cloud APIs (not infrastructure) is riskier than having no tests at all.

My current proposal is twofold:

  1. Each resource defines limited interfaces (implemented by corresponding cloud APIs) that include only the methods actually used.
  2. We generate mocks for those interfaces and use them in tests to ensure conditions are handled correctly and API calls are made in the right order.
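For illustration only, a minimal sketch of that approach (the interface, helper, mock package, and module names below are hypothetical, not taken from this pull request, and the mocks package is produced by running mockgen via go generate):

// --- resources/bucket.go (hypothetical) ---
package resources

import "context"

//go:generate mockgen -source=bucket.go -destination=mocks/bucket.go -package=mocks

// BucketAPI is a deliberately narrow interface: it lists only the cloud calls
// this resource actually makes, so the real SDK client satisfies it implicitly
// and mockgen can generate a test double for it.
type BucketAPI interface {
	HeadBucket(ctx context.Context, bucket string) error
	GetBucketRegion(ctx context.Context, bucket string) (string, error)
}

// CheckBucket verifies that a pre-allocated bucket exists and returns its region.
func CheckBucket(ctx context.Context, api BucketAPI, name string) (string, error) {
	if err := api.HeadBucket(ctx, name); err != nil {
		return "", err
	}
	return api.GetBucketRegion(ctx, name)
}

// --- resources/bucket_test.go (hypothetical) ---
package resources_test

import (
	"context"
	"testing"

	"github.com/golang/mock/gomock"

	"yourmodule/resources"
	"yourmodule/resources/mocks" // produced by `go generate`
)

func TestCheckBucket(t *testing.T) {
	ctrl := gomock.NewController(t)
	defer ctrl.Finish()

	api := mocks.NewMockBucketAPI(ctrl)
	// Assert both that the calls are made with the expected arguments and
	// that they happen in the expected order, without touching a real cloud API.
	gomock.InOrder(
		api.EXPECT().HeadBucket(gomock.Any(), "tpi-test").Return(nil),
		api.EXPECT().GetBucketRegion(gomock.Any(), "tpi-test").Return("us-east-1", nil),
	)

	region, err := resources.CheckBucket(context.Background(), api, "tpi-test")
	if err != nil || region != "us-east-1" {
		t.Fatalf("unexpected result: %q, %v", region, err)
	}
}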

Member

If your proposal implies accounting for a considerable number of cloud-specific details (e.g. “limited interfaces” for S3 and EC2, others for Cloud Storage and Compute Engine, ...), I wonder if maintaining them will be any easier than testing the code against actual cloud APIs, even if the latter is noticeably slower.

Contributor Author

The thing with unit tests is that we can test various edge cases in a repeatable and deterministic way. While they are in no way a replacement for integration tests, covering those edge cases in integration tests would require either developing extensive test suites or working through exhaustive checklists, both of which take considerable time.

@0x2b3bfa0 0x2b3bfa0 (Member) left a comment

None of the review comments are specific to AWS or GCP. Feel free to merge into the feature branch. 🌳

@tasdomas tasdomas merged commit a731724 into feature-bucket-reuse Sep 8, 2022
@tasdomas tasdomas deleted the d011-bucket-reuse branch September 8, 2022 07:30
tasdomas added a commit that referenced this pull request Sep 10, 2022
* Update config schema.

* AWS support for existing S3 buckets.

* Existing bucket support for gcp.

* Add mocks and tests to existing bucket resource in aws.

* Update docs with new pre-allocated container fields.
tasdomas added a commit that referenced this pull request Oct 21, 2022
* Support for existing storage containers in aws and gcp. (#651)

* Update config schema.

* AWS support for existing S3 buckets.

* Existing bucket support for gcp.

* Add mocks and tests to existing bucket resource in aws.

* Update docs with new pre-allocated container fields.

* Support using pre-allocated blob containers in azure. (#660)

* AWS support for existing S3 buckets.

* Existing bucket support for gcp.

* Fix subdirectory in rclone remote.

* Blob container generates the rclone connection string.

* Introduce a type for generating rclone connection strings.

* Azure support for reusable blob containers.

* Update docs.

* Fix path prefix.

* Initialize s3 existing bucket with RemoteStorage struct.

* Update gcp and aws existing bucket data sources to align with the azure data source.

Use common.RemoteStorage to initialize the data sources.
Using rclone to verify storage during Read.
Remove aws s3 client mocks and tests that rely on them.

* Fix comment.

* K8s support for specifying an existing persistent volume claim (#661)

* K8s support for specifying an existing persistent volume claim.

Co-authored-by: Helio Machado <0x2b3bfa0+git@googlemail.com>

* Update use of Identifier struct.

* Combine container and container_path config keys.

* Update docs.

* Use split function from rclone.

Co-authored-by: Helio Machado <0x2b3bfa0+git@googlemail.com>