Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions open-api/rest-catalog-open-api.py
Original file line number Diff line number Diff line change
Expand Up @@ -467,6 +467,14 @@ class AssertViewUUID(BaseModel):
uuid: str


class StorageCredential(BaseModel):
prefix: str = Field(
...,
description='Indicates a storage location prefix where the credential is relevant. Clients should choose the most specific prefix (by selecting the longest prefix) if several credentials of the same type are available.',
)
config: Dict[str, str]


class PlanStatus(BaseModel):
__root__: Literal['completed', 'submitted', 'cancelled', 'failed'] = Field(
..., description='Status of a server-side planning operation'
Expand Down Expand Up @@ -1195,6 +1203,11 @@ class LoadTableResult(BaseModel):
- `s3.session-token`: if present, this value should be used for as the session token
- `s3.remote-signing-enabled`: if `true` remote signing should be performed as described in the `s3-signer-open-api.yaml` specification

## Storage Credentials

Credentials for ADLS / GCS / S3 / ... are provided through the `storage-credentials` field.
Clients must first check whether the respective credentials exist in the `storage-credentials` field before checking the `config` for credentials.

"""

metadata_location: Optional[str] = Field(
Expand All @@ -1204,6 +1217,9 @@ class LoadTableResult(BaseModel):
)
metadata: TableMetadata
config: Optional[Dict[str, str]] = None
storage_credentials: Optional[List[StorageCredential]] = Field(
None, alias='storage-credentials'
)


class ScanTasks(BaseModel):
Expand Down Expand Up @@ -1311,11 +1327,19 @@ class LoadViewResult(BaseModel):

- `token`: Authorization bearer token to use for view requests if OAuth2 security is enabled

## Storage Credentials

Credentials for ADLS / GCS / S3 / ... are provided through the `storage-credentials` field.
Clients must first check whether the respective credentials exist in the `storage-credentials` field before checking the `config` for credentials.

"""

metadata_location: str = Field(..., alias='metadata-location')
metadata: ViewMetadata
config: Optional[Dict[str, str]] = None
storage_credentials: Optional[List[StorageCredential]] = Field(
None, alias='storage-credentials'
)


class ReportMetricsRequest(BaseModel):
Expand Down
32 changes: 32 additions & 0 deletions open-api/rest-catalog-open-api.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3103,6 +3103,21 @@ components:
uuid:
type: string

StorageCredential:
type: object
required:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add the region? Without it, clients might run into issues. Here’s why it would be helpful:

  1. Setting a default region on the client side doesn’t work well since clients could be accessing s3 files from different regions.
  2. Clients could try to figure out the region using get-bucket-location, but this means an extra API call. Plus, if the vended credentials don’t allow this call, it could cause more problems.

Adding the region would simplify things and enhance overall client reliability.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for including region.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite the other comment below - the region should be the same as the one for the S3FileIO configuration?

Copy link
Contributor Author

@nastra nastra Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flyrain the region is already sent back in the config of LoadTableResult/ LoadViewResult itself, so I'm not sure why we would want to make it part of S3Credentials? The region is for configuring the S3 client but the credentials are for configuring S3FileIO`, so it's a different layer where the region is being used

Copy link
Contributor

@danielcweeks danielcweeks Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the region is bound to the credential. I believe it's possible to build a policy that allows access to buckets in multiple regions, so region should be associated with the bucket, not the credentials. Therefore, we shouldn't include it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackye1995 it's not published yet because I wanted to get this proposal in first and otherwise it would deviate the discussion.
Here's a quick summary of how refreshing vended credentials would look like:

Refreshing vended credentials would entail adding a new /v1/{prefix}/namespaces/{namespace}/tables/{table}/credentials endpoint that returns a LoadCredentialsResponse, which carries

storage-credentials:
  type: array
  items:
    $ref: '#/components/schemas/Credential'

This new endpoint would be called by the mechanism that is offered by the respective storage provider. For example, GCS offers a OAuth2CredentialsWithRefresh which then effectively would call this new endpoint to refresh a vended credential.
Achieving the same thing for S3 would look different but effectively comes down to calling this credential endpoint too.

Copy link
Contributor

@jackye1995 jackye1995 Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, in that case I think this is a great step, it does make a lot of sense to have dedicated credentials refresh API, we were also thinking about proposing it later. We have implemented credentials vending in LakeFormation for a long time now, and last year S3 also offered access grant with similar feature. The user pattern has been proven that having a dedicated API for that makes a lot of sense since refreshing credentials is a much more frequent operation than loading table and can be done very quickly with a dedicated API, and from security perspective it is also a better posture. In that case I think what is proposed here makes sense, and I am more in favor of the current design than what Yufei suggests, because it makes it clear that these fields are related to that new API, and existing config map continues to work for configuring the FileIO, with the additional note that the new credentials field would override the credentials config in the config map.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nastra, thanks for the context. The suggest I provide is a minor structure change, which isn't needed in the new endpoint. I am OK with the current one if that's more align with the new endpoint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@flyrain yes the structure that is being proposed here is fully aligned with how credentials will be returned by the new credentials endpoint. I understand your reasoning for adding one additional structure on top, but I think we should only do that if we have a concrete use case that this would address (which I don't think we currently have).
Thanks again for your review @flyrain

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on recent discussions the feedback was that we don't want to have anything storage-specific in the OpenAPI spec (other than documenting the different storage configurations, which is handled by #10576). Therefore I've updated the PR and made it flexible enough so that we could still pass different credentials based on a given prefix. That should hopefully work for everyone.

- prefix
- config
properties:
prefix:
type: string
description: Indicates a storage location prefix where the credential is relevant. Clients should choose the most
specific prefix (by selecting the longest prefix) if several credentials of the same type are available.
config:
type: object
additionalProperties:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand there was a general agreement to treat vendor-specific credentials as simple property bags. However, would it make sense to define some common attributes (like expiry time) in this spec? I suppose those would be relevant to the follow-up effort of refreshing credentials on the fly. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that there could be some common attributes that we could extract, but that would lead to a mixture of having to look up things in a config map vs other properties.
Also what if storage providers have different semantics for how expiration is defined (expires-at-ms vs expires-in-ms vs ....). This would require additional calculations to convert from one expiration format to the other. Therefore I think it's better to keep everything in the config for now until we have a clear use case where it makes sense to extract this into a common attribute.

This isn't set in stone and we can always revisit this once we implement refreshing vended credentials for the different providers and see how helpful having a separate field for expiration would be. I'd like to unblock this proposal so that we can make progress on refreshing vended credentials

type: string

LoadTableResult:
description: |
Result used when a table is successfully loaded.
Expand All @@ -3129,6 +3144,11 @@ components:
- `s3.secret-access-key`: secret for credentials that provide access to data in S3
- `s3.session-token`: if present, this value should be used for as the session token
- `s3.remote-signing-enabled`: if `true` remote signing should be performed as described in the `s3-signer-open-api.yaml` specification

## Storage Credentials

Credentials for ADLS / GCS / S3 / ... are provided through the `storage-credentials` field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Servers have to provide concrete properties for vendor-specific credentials. Since this spec does not define them, how will servers know what properties to provide?

If servesr adapt to the java REST client impl. then the java client will become the "spec"... I'd prefer for Iceberg to provide additional guidelines (which do not have to be part of the REST Catalog spec) to clarify credential properties for known use cases. It might be necessary for this spec to define a sub-type property since some vendors support multiple auth methods for the same URI scheme.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the documentation of the respective FileIO properties is handled outside this PR and is in the scope of #10576. We'll be updating/rebasing #10576 once this PR here is in and that should make it clearer for servers which properties are available for FileIO implementations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimas-b does the documentation of the storage related properties address your comment about "additional guidelines"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#10576 appears to define credential properties in the REST spec, which is fine from my POV :) I assume after rebasing it will be talking about credential properties under the new config section introduced in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes exactly, we're trying to move the docs to this new credential section

Clients must first check whether the respective credentials exist in the `storage-credentials` field before checking the `config` for credentials.
type: object
required:
- metadata
Expand All @@ -3142,6 +3162,10 @@ components:
type: object
additionalProperties:
type: string
storage-credentials:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious why should return multiple credentials for one table? Any use cases? The current TableOps just support one fileIO and also noticed the AWS and GCS credential refresh handler just pick the first credential from the server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case for this is outlined in #10722 (comment).

type: array
items:
$ref: '#/components/schemas/StorageCredential'

ScanTasks:
type: object
Expand Down Expand Up @@ -3395,6 +3419,10 @@ components:

- `token`: Authorization bearer token to use for view requests if OAuth2 security is enabled

## Storage Credentials

Credentials for ADLS / GCS / S3 / ... are provided through the `storage-credentials` field.
Clients must first check whether the respective credentials exist in the `storage-credentials` field before checking the `config` for credentials.
type: object
required:
- metadata-location
Expand All @@ -3408,6 +3436,10 @@ components:
type: object
additionalProperties:
type: string
storage-credentials:
type: array
items:
$ref: '#/components/schemas/StorageCredential'

TokenType:
type: string
Expand Down