Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote S3 configs #116

Merged
merged 1 commit into from
Jul 24, 2024
Merged

Remote S3 configs #116

merged 1 commit into from
Jul 24, 2024

Conversation

pranavr12
Copy link
Contributor

@pranavr12 pranavr12 commented Jul 22, 2024

The remote s3 parameters can now be made configurable. Two types of styles can be used for referencing the s3 - PathStyle and VirtualHostStyle.
The default hostname template for pathStyle is: "s3.${region}.${domain}"
The default hostname template for virtualHostStyle is: "${bucket}.s3.${region}.${domain}"

The motivation behind this configuration: It may not always be the case where the region is always specified in the S3 URI (https://s3.testDomain.com) or the "s3" constant always remains "s3". The configuration can handle custom URIs such as:
https://s3a.testDomain.com/
https://s3a.testDomain.us-east.com/
https://s3a.us-east.testDomain.com/

Copy link

cla-bot bot commented Jul 22, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

{
this(VIRTUAL_HOST_BUILDER, true, Optional.empty());
this((bucket, region) ->
(bucket.isEmpty() ? "" : "%s.".formatted(bucket)) + "s3.%s.%s".formatted(region, remoteS3Config.getDomain()), remoteS3Config.getHttps(), remoteS3Config.getPort());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that could be interesting is to template the domain part of the config, so that it is not tied to AWS. The current implementation in this PR allows you to change the domain, but still assumes other systems will use the same <bucket>.s3.<region>.<domain> format which is not necessarily true.

Could we instead template this with 2 named arguments:

  • %(region)
  • %(bucket)

And have 2 configurations:

  • rawDomainTemplate (for cases without a bucket) - the AWS S3 one would be s3.%(region).amazonaws.com
  • bucketDomainTemplate (for cases with a bucket) - the AWS S3 one would be %(bucket).s3.%(region).amazonaws.com

We could then infer whether to load the PathStyle implementation or the VirtualHostStyle one based on whether we have both configs or only rawDomainTemplate

NB: I'm using %(whatever) encoding for placeholders, that's just the first thing that came to mind - whatever library we want to use for this may use a different format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the option of configuring the template option. But, I've removed the case where the bucket is empty. Would there be a scenario where the bucket is empty?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed this comment. Yes, there would be - for instance, listing buckets https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListBuckets.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added logic for empty buckets

pom.xml Outdated Show resolved Hide resolved
}

@Config("remoteS3.pathstyle.template")
public RemoteS3Config setPathStyleTemplate(String hostnameTemplate)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether we need 2 settings for this. In its present form, we would need to check to make sure that these 2 are mutually exclusive - a user should not configure both of these.

What about using a single template? That would suffice for both, with a flag to control whether we want virtual host style addressing or path style addressing.

So instead of remoteS3.virtualhoststyle.template and remoteS3.pathstyle.template we could have:

  • remoteS3.domain-template
  • remoteS3.use-virtual-host-addressing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note my original comment yesterday relating to 2 templates was not suggesting we have a template for virtual host addressing and one for path style addressing, but rather that we can have a template for "bucketed" operations and one for "unbucketed" operations.

Currently, this implementation I believe would not work with "unbucketed" operations if using virtual host addressing.

An operation like listing buckets (which is not a "bucketed" operation) would, with the default template, resolve the hostname to .s3.us-east-1.amazonaws.com (us-east-1 here just being some sample region) with a leading dot, which is not valid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, in the present form, PathStyleRemoteS3Facade uses remoteS3.pathstyle.template and VirtualHostStyleRemoteS3Facade uses remoteS3.virtualhoststyle.template. When both properties are configured, the binding in ServerModule determines the property that will be used. If the properties are not configured, the default values are used. Though a user should configure just one of these properties, there wouldn't be a problem if both are configured.

Currently, the RemoteS3Facade that is binded is controlled here - https://github.com/trinodb/aws-proxy/blob/main/trino-aws-proxy/src/main/java/io/trino/aws/proxy/server/TrinoAwsProxyServerModule.java#L119. We could try to leverage remoteS3.use-virtual-host-addressing property to bind the type of RemoteS3Facade. WDYT ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a RemoteS3Module to conditionally bind VirtualHostStyle or PathStyle.

@pranavr12 pranavr12 force-pushed the s3-configs branch 3 times, most recently from 7e079b0 to 1866965 Compare July 23, 2024 15:58
@pranavr12 pranavr12 force-pushed the s3-configs branch 2 times, most recently from a2f80dc to febc22c Compare July 24, 2024 04:19
@Randgalt Randgalt dismissed their stale review July 24, 2024 09:06

Dependency was removed

@mosiac1
Copy link
Contributor

mosiac1 commented Jul 24, 2024

Test failures are unrelated, see #115

@Randgalt Randgalt merged commit 06d743f into trinodb:main Jul 24, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants