@dhruv-pratap
Contributor

Context: https://apache-iceberg.slack.com/archives/C029EE6HQ5D/p1666112648002419

At Netflix, we are trying to integrate PyIceberg 0.1.0 with our Iceberg REST catalog service and found a gap in the PyIceberg REST client that we need to address.
For background, at Netflix all client-server interaction happens over TLS, with client-side authentication enforced for security. The REST client inside PyIceberg currently uses the requests module to talk to the REST catalog service. This module allows the CA trust bundle to be set via an environment variable, but it offers no similar mechanism for client-side certificates; those have to be set programmatically when setting up a requests session.
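
To make the gap concrete, here is a minimal sketch of the programmatic step, assuming a requests.Session-like object. The helper name `apply_client_tls` and the file paths are hypothetical; `cert` and `verify` are the attributes requests.Session exposes for client certificates and CA verification. A stand-in object is used so the sketch runs without requests installed.

```python
from types import SimpleNamespace


def apply_client_tls(session, client_cert, client_key, ca_bundle=None):
    """Set mutual-TLS options on a requests.Session-like object.

    Hypothetical helper: `session` only needs `cert` and `verify`
    attributes, which is what requests.Session provides.
    """
    # requests accepts a (cert, key) tuple for client-side certificates.
    session.cert = (client_cert, client_key)
    # An explicit bundle wins; otherwise `verify` is left unchanged so that
    # requests keeps honouring REQUESTS_CA_BUNDLE from the environment.
    if ca_bundle is not None:
        session.verify = ca_bundle
    return session


# Stand-in for requests.Session(); the paths are placeholders.
session = apply_client_tls(
    SimpleNamespace(cert=None, verify=True),
    client_cert="/path/to/client.crt",
    client_key="/path/to/client.key",
)
```

In real use the first argument would be a `requests.Session()` instance.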

The following two approaches were discussed on the Slack thread:

  1. RestCatalog exposes requests.Session() and accepts a pre-configured Session object during initialization. This puts the onus on the PyIceberg consumer to configure the session with the correct auth mechanism, connection pooling, custom headers, etc., and keeps the RestCatalog client dumb: it simply uses the provided session for all API interaction. The current RestSpec effectively dictates OAuth as the auth mechanism, which might not be the case for every enterprise.
  2. The alternative is to define a new set of PyIceberg properties and add them to the current PyIceberg configuration spec to accept SSL configuration, for example catalog.rest.ssl.client.key, catalog.rest.ssl.client.cert, and catalog.rest.ssl.ca. This would require the fewest changes, but I could see this list growing over time as each new enterprise adoption requires its own customization.
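
As a rough sketch of approach 2, the function below translates flat properties into the `cert` and `verify` values that requests understands. The property names are taken from the example above and are illustrative only; the function name `ssl_options` is hypothetical, and the names that actually ship may differ.

```python
# Property names copied from the example in approach 2; illustrative only.
SSL_CLIENT_CERT = "catalog.rest.ssl.client.cert"
SSL_CLIENT_KEY = "catalog.rest.ssl.client.key"
SSL_CA = "catalog.rest.ssl.ca"


def ssl_options(properties):
    """Translate flat properties into requests-style `cert` and `verify` values."""
    options = {}
    cert = properties.get(SSL_CLIENT_CERT)
    key = properties.get(SSL_CLIENT_KEY)
    if cert and key:
        # Separate certificate and key files are passed as a tuple.
        options["cert"] = (cert, key)
    elif cert:
        # requests also accepts one file holding both certificate and key.
        options["cert"] = cert
    if properties.get(SSL_CA):
        # A path here replaces the default CA bundle used for verification.
        options["verify"] = properties[SSL_CA]
    return options


example = ssl_options(
    {
        SSL_CLIENT_CERT: "/path/to/client.crt",
        SSL_CLIENT_KEY: "/path/to/client.key",
        SSL_CA: "/path/to/ca.pem",
    }
)
```

Returning a plain dict keeps the parsing separate from the session, so the result can be applied with `setattr` on a session or passed as keyword arguments to a single request.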

After discussion with @samredai and @Fokko we agreed on approach #2, and this PR implements it.

@samredai samredai requested a review from Fokko October 19, 2022 16:52
@dhruv-pratap
Contributor Author

dhruv-pratap commented Oct 19, 2022

High-level summary of the changes in this PR:

  1. The REST catalog encapsulates a requests.Session.
  2. Configure the Session object with SSL config, if provided, during initialization.
  3. Configure the existing headers once in the Session object during initialization.
  4. Use that Session object for all further REST calls.
  5. Add unit tests for SSL config initialization, and enforce HTTP header checks in the existing REST mock tests.
  6. Add documentation for REST catalog client mTLS configuration.

…aders as well as it no longer needs to be a property.
Contributor

@Fokko Fokko left a comment

Thanks for working on this @dhruv-pratap! I like it a lot. Another approach we could also take is to have a function _create_session() -> Session that will set up the session. This would both set the SSL and the headers, and in the constructor, we would do:

self.session = self._create_session()

This way we have a single place where we set up the session, instead of two methods. WDYT?
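
A minimal sketch of what such a single setup method could look like. `RestCatalogSketch`, `StubSession`, and the property keys `client_cert`, `client_key`, and `ca_bundle` are hypothetical names for illustration, not the PyIceberg implementation, and OAuth token handling is left out entirely.

```python
class StubSession:
    """Offline stand-in exposing the attributes used from requests.Session."""

    def __init__(self):
        self.headers = {}
        self.cert = None
        self.verify = True


class RestCatalogSketch:
    """Hypothetical, trimmed-down catalog client; not the PyIceberg class."""

    def __init__(self, properties, session_factory=StubSession):
        self.properties = properties
        # In real use, session_factory would be requests.Session.
        self._session_factory = session_factory
        self.session = self._create_session()

    def _create_session(self):
        session = self._session_factory()
        # SSL first, so every later call over this session can use mTLS.
        cert = self.properties.get("client_cert")
        key = self.properties.get("client_key")
        if cert and key:
            session.cert = (cert, key)
        if self.properties.get("ca_bundle"):
            session.verify = self.properties["ca_bundle"]
        # Default headers, set once instead of on every request.
        session.headers.update(
            {
                "Content-Type": "application/json",
                "User-Agent": "PyIceberg/sketch",
            }
        )
        return session


catalog = RestCatalogSketch({"client_cert": "client.crt", "client_key": "client.key"})
```

Passing the factory in keeps the sketch runnable offline and makes the session easy to replace in unit tests.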

"X-Client-Version": ICEBERG_REST_SPEC_VERSION,
"User-Agent": f"PyIceberg/{__version__}",
}
def _set_session_headers(self):
Contributor

I like this a lot, much nicer than having to pass in the headers every time.

Contributor Author

Thanks for working on this @dhruv-pratap! I like it a lot. Another approach we could also take is to have a function _create_session() -> Session that will set up the session. This would both set the SSL and the headers, and in the constructor, we would do:

self.session = self._create_session()

This way we have a single place where we set up the session, instead of two methods. WDYT?

I did think about that initially, but the self._fetch_access_token(credential) REST call poses a chicken-and-egg problem: the session's SSL config must be in place before that REST call can be made, and only afterwards can the session's headers be enriched with the auth token. Let me give it a shot to see what that looks like and send an amendment.

Contributor Author

Suggested changes pushed in the most recent commit. Please review.

Contributor

@Fokko Fokko Oct 20, 2022

@dhruv-pratap I didn't think of the chicken-and-egg problem. The current version isn't working:

RESTError: HttpMediaTypeNotSupportedException: Content type 'application/json' not supported

It already sets the application/json header, but we need to set the application/x-www-form-urlencoded header for the signer:
https://github.com/apache/iceberg/blob/master/python/pyiceberg/catalog/rest.py#L237

We either need to set the headers after we fetch the token, or revert to the previous version.

Contributor Author

Sorry, I didn't have an OAuth environment to integration test it against.

Changed the order in the most recent commit: the HTTP headers are now set after fetching the OAuth token. The unit tests also now enforce the application/x-www-form-urlencoded content type for OAuth calls.
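
The ordering can be sketched roughly as follows. Everything here is an assumption for illustration: `StubSession` and `StubResponse` stand in for requests objects so the sketch runs offline, and the token URL, the `client_credentials` form fields, and the helper name `init_session` are placeholders rather than the PyIceberg code.

```python
class StubResponse:
    """Mimics the one requests.Response method the sketch uses."""

    def json(self):
        return {"access_token": "stub-token"}


class StubSession:
    """Records calls; stands in for requests.Session."""

    def __init__(self):
        self.headers = {}
        self.calls = []

    def post(self, url, data=None, headers=None):
        # Like requests, per-request headers override session-level ones.
        merged = {**self.headers, **(headers or {})}
        self.calls.append((url, data, merged))
        return StubResponse()


def init_session(session, token_url, client_id, client_secret):
    # 1. Fetch the token first, form-encoded, before any JSON default exists.
    response = session.post(
        token_url,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    token = response.json()["access_token"]
    # 2. Only then set the session-wide defaults, including the bearer token.
    session.headers.update(
        {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        }
    )
    return session


session = init_session(StubSession(), "https://catalog.example/oauth/tokens", "id", "secret")
```

Setting the content type explicitly on the token call also guards against a session-level JSON default leaking into it if the order ever changes again.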

Contributor

No problem at all. I've just checked and it works 👍🏻

@Fokko Fokko added this to the Python 0.2.0 release milestone Oct 20, 2022
dhruv-pratap and others added 3 commits October 20, 2022 11:54
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
@dhruv-pratap dhruv-pratap requested a review from Fokko October 20, 2022 17:16
…uth token call expects the Content-Type application/x-www-form-urlencoded. Enforce the content-type check for oauth calls in unit tests as well.
@Fokko Fokko merged commit bbe5765 into apache:master Oct 20, 2022
@Fokko
Contributor

Fokko commented Oct 20, 2022

This is awesome, thanks @dhruv-pratap 👍🏻

@dhruv-pratap dhruv-pratap deleted the client_side_ssl_auth_support branch October 20, 2022 18:55