@eric-maynard
Contributor

The CLI currently only supports the version of EXTERNAL catalogs that was present in 0.9.0. EXTERNAL catalogs now accept a range of federation-related configuration options. This PR updates the CLI to better match the REST API so that federated catalogs can be easily set up in the CLI.

@eric-maynard eric-maynard changed the title Add support for catalog federation in CLI Add support for catalog federation in the CLI Jun 20, 2025
HonahX previously approved these changes Jun 27, 2025
Contributor

@HonahX HonahX left a comment

Sorry for the lateness. Overall LGTM! All related properties in the spec seem to have a correct correspondence in the argument list.

--catalog-role-session-name (For authentication type SIGV4) The role session name to be used by the SigV4 protocol for signing requests
--catalog-external-id (For authentication type SIGV4) An optional external id used to establish a trust relationship with AWS in the trust policy
--catalog-signing-region (For authentication type SIGV4) Region to be used by the SigV4 protocol for signing requests
--catalog-signing-name (For authentication type SIGV4) The service name to be used by the SigV4 protocol for signing requests; the default signing name is "execute-api" if not provided
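
As a rough illustration of how the four SigV4 options above fit together, here is a minimal Python sketch that assembles them into an argument list. Only the flag names and the "execute-api" default come from the help text; the helper names `sigv4_cli_args` and `effective_signing_name` are hypothetical, and the rest of the command line (subcommand, catalog name, other required options) is deliberately left out because it is not shown here.

```python
from typing import List, Optional

# Default taken from the --catalog-signing-name help text.
DEFAULT_SIGNING_NAME = "execute-api"


def sigv4_cli_args(
    role_session_name: Optional[str] = None,
    external_id: Optional[str] = None,
    signing_region: Optional[str] = None,
    signing_name: Optional[str] = None,
) -> List[str]:
    """Return the SigV4-related flags as an argv fragment, skipping unset values."""
    options = {
        "--catalog-role-session-name": role_session_name,
        "--catalog-external-id": external_id,
        "--catalog-signing-region": signing_region,
        "--catalog-signing-name": signing_name,
    }
    args: List[str] = []
    for flag, value in options.items():
        if value is not None:
            args += [flag, value]
    return args


def effective_signing_name(signing_name: Optional[str]) -> str:
    """Signing name that applies when the flag is omitted (hypothetical helper)."""
    return signing_name if signing_name is not None else DEFAULT_SIGNING_NAME


# Placeholder values, for illustration only.
fragment = sigv4_cli_args(role_session_name="polaris-session", signing_region="us-west-2")
```

Omitted options are simply not passed, so the documented default for the signing name applies.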

[Non-blocker] I think it would be good to provide some examples of creating an external "Iceberg_rest"/"hadoop" catalog with various auth types, maybe in a follow-up.

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Jun 27, 2025
@eric-maynard eric-maynard requested a review from HonahX July 2, 2025 02:08
@eric-maynard eric-maynard merged commit eb6b6ad into apache:main Jul 2, 2025
11 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Jul 2, 2025
eric-maynard pushed a commit that referenced this pull request Jul 17, 2025
PRs #1925 and #1912 were merged around the same time.  This PR connects the two changes and enables the CLI to accept IMPLICIT authentication type. 

Since Hadoop federated catalogs rely purely on IMPLICIT authentication, the CLI parsing test has been updated to reflect the same.
snazy added a commit to snazy/polaris that referenced this pull request Nov 20, 2025
* Exclude unused dependency for polaris spark client dependency (apache#1933)

* enable ETag integration tests (apache#1935)

tests were added in 8b5dfa9 and afaict were supposed to get enabled after ec97c1b

* Fix Pagination for Catalog Federation (apache#1849)

Details can be found in this issue: apache#1848

* Update doc to fix docker build inconsistency issue (apache#1946)

* Simplify install dependency doc (apache#1941)

* Simplify getting started doc

* Simplify install dependency doc

* Minor words change

* Fix admin tool for quick start (apache#1945)

When attempting to use the `polaris-admin-tool.jar` to bootstrap a realm, the application fails with a `jakarta.enterprise.inject.UnsatisfiedResolutionException` because it cannot find a `javax.sql.DataSource` bean. Detail in apache#1943

This issue occurs because `quarkus.datasource.db-kind` is a build-time property in Quarkus. Its value must be defined during the application's build process to enable the datasource extension and generate the necessary CDI bean producer (ref: https://quarkus.io/guides/all-config#quarkus-datasource_quarkus-datasource-db-kind).

I think we only support postgres for now, so I set `quarkus.datasource.db-kind=postgresql`. This can be problematic if we later want to support data sources other than postgres. There are a couple of options for this, such as using multiple named datasources in the config at build time, but this may be out of scope for this PR. I am open to more discussion on this, but for the time being, it may be better to unblock people who are trying to use the quick start doc.

Sample output for the bootstrap container after the fix:
```
➜  polaris git:(1943) docker logs polaris-polaris-bootstrap-1
Realm 'POLARIS' successfully bootstrapped.
Bootstrap completed successfully.
```

* fix(build): Fix deprecation warnings in FeatureConfiguration (apache#1894)

* Fix NPE in listCatalogs (apache#1949)

listCatalogs is non-atomic. It first atomically lists all entities and then iterates through each one and does an individual loadEntity call. This causes an NPE when calling `CatalogEntity::new`.

I don't think it's ever useful for listCatalogsUnsafe to return null since the caller isn't expecting a certain length of elements, so I just filtered it there.
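
A minimal Python sketch of the pattern described above, assuming a list step followed by per-entity loads that can return nothing when an entity is dropped in between. The names `list_catalogs`, `list_ids`, and `load_entity` are hypothetical stand-ins, not the actual Polaris methods.

```python
from typing import Callable, Iterable, List, Optional


def list_catalogs(
    list_ids: Callable[[], Iterable[str]],
    load_entity: Callable[[str], Optional[dict]],
) -> List[dict]:
    """List IDs in one step, load each individually, and skip entities that vanished."""
    catalogs = []
    for entity_id in list_ids():
        entity = load_entity(entity_id)  # None if dropped after the listing
        if entity is not None:
            catalogs.append(entity)
    return catalogs


# Usage: "b" is dropped between the list call and the load call.
store = {"a": {"name": "a"}, "c": {"name": "c"}}
result = list_catalogs(lambda: ["a", "b", "c"], store.get)
```

Filtering at this point means callers never see a missing entity, which is acceptable because they do not rely on a particular result length.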

* Fix doc for sample log and default password (apache#1951)

Minor updates for the quick start doc:
1. update sample output to reflect the latest code
2. update default password to the right value
3. remove trailing space

* Optimize the location overlap check with an index (apache#1686)

The location overlap check for "sibling" tables (those which share a parent) has been a performance bottleneck since its introduction, but we haven't historically had a good way around this other than just disabling the check. 

<hr>

### Current Behavior

The current logic is that when we create a table, we list all sibling tables and check each and every one to ensure there is no location overlap. This results in O(N^2) checks when adding N tables to a namespace, quickly becoming untenable.
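
To make the quadratic growth concrete, here is a small Python sketch that counts the sibling comparisons under the simplifying assumption that each new table is checked once against every table already present in the namespace. The function name is hypothetical.

```python
def total_overlap_checks(n_tables: int) -> int:
    """Total sibling comparisons when tables are added one at a time to an empty namespace."""
    # The k-th table (0-based) is compared against the k tables already present.
    return sum(range(n_tables))


checks_for_5000 = total_overlap_checks(5000)  # 12,497,500 comparisons
```

Doubling the number of tables roughly quadruples the total, which is why the cost becomes untenable quickly.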

With the `CreateTreeDataset` [benchmark](https://github.com/eric-maynard/polaris-tools/blob/main/benchmarks/src/gatling/scala/org/apache/polaris/benchmarks/simulations/CreateTreeDataset.scala) I tested creating 5000 sibling tables using the current code:

It is apparent that latency increases over time. Runs took between 90 and 200+ seconds, and Polaris instances with a small memory allocation were prone to crashing due to OOMs:


### Proposed change

This PR adds a new persistence API, `hasOverlappingSiblings`, which if implemented can be used to directly check for the presence of siblings at the metastore layer.

This API is implemented for the JDBC metastore in a new schema version, and some changes are made to account for an evolving schema version now and in the future.

This implementation breaks a location down into components and queries for a sibling at each of those locations, so a new table at location `s3://bucket/root/n1/nA/t1/` will require checking for an entity with location `s3://bucket/`, `s3://bucket/root/`, `s3://bucket/root/n1/`, `s3://bucket/root/n1/nA/`, and finally `s3://bucket/root/n1/nA/t1/%`. All of this can be done in a single query which makes a single pass over the data. 
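
The decomposition described above can be sketched in Python as follows. This only illustrates how the candidate locations are derived; it is not the JDBC query itself, and the function name `overlap_candidates` is hypothetical.

```python
from typing import List


def overlap_candidates(location: str) -> List[str]:
    """Return each ancestor location plus a LIKE-style pattern for the location itself."""
    scheme, _, rest = location.partition("://")
    parts = [p for p in rest.split("/") if p]
    if not parts:
        raise ValueError(f"location has no path components: {location!r}")
    candidates = []
    prefix = scheme + "://"
    for part in parts[:-1]:
        prefix += part + "/"
        candidates.append(prefix)
    # The final entry matches the new location and anything nested under it.
    candidates.append(prefix + parts[-1] + "/%")
    return candidates


candidates = overlap_candidates("s3://bucket/root/n1/nA/t1/")
```

For the example location this yields the five entries listed above, ending with `s3://bucket/root/n1/nA/t1/%`.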

The query is optimized by the introduction of a new index over a new _location_ column.

With the changes enabled, I tested creating 5000 sibling tables:

Latency is stable over time, and runs consistently completed in less than 30 seconds. I did not observe any OOMs when testing with the feature enabled.

* Add SUPPORTED_EXTERNAL_CATALOG_AUTHENTICATION_TYPES feature configuration (apache#1931)

* Add SUPPORTED_FEDERATION_AUTHENTICATION_TYPES feature configuration

* Add unit tests

* Update Helm chart version (apache#1957)

* Remove the maintainer list in Helm Chart README (apache#1962)

* Use multi-lines instead of single line (apache#1961)

* Fix invalid sample script in CLI doc (apache#1964)

* Fix hugo blockquote (apache#1967)

* Fix hugo blockquote

* Add license header

* Fix lint rules (apache#1953)

* Mutable objects used for immutable values (apache#1596)

* fix: Only include project LICENSE and NOTICE in Spark Client Jar (apache#1950)

* Add Sushant as a collaborator (apache#1956)

* Adds missing Google Flatbuffers license information (apache#1968)

* fix: Typo in Spark Client Build File (apache#1969)

debugrmation

* Python code format (apache#1954)

* test(integration): refactor PolarisRestCatalogIntegrationTest to run against any cloud provider (apache#1934)

* Make Catalog Integration Test suite cloud native

* Fix admin tool doc (apache#1977)

* Fix admin tool doc

* Fix admin tool doc

* Update release-guide.md (apache#1927)

* Add relational-jdbc to helm (apache#1937)


### Motivation for the Change

Polaris needs to support relational-jdbc as the default persistence type for simpler database configuration and better cloud-native deployment experience.

### Description of the Status Quo (Current Behavior)

Currently, the Helm chart only supports eclipse-link persistence type as the default, which requires complex JPA configuration with persistence.xml files.

### Desired Behavior

- Add relational-jdbc persistence type support to Helm chart
- Use relational-jdbc as the default persistence type
- Inject JDBC configuration (username, password, jdbc_url) through Kubernetes Secrets as environment variables
- Maintain backward compatibility with eclipse-link

### Additional Details

- Updated persistence-values.yaml for CI testing
- Updated test coverage for relational-jdbc configuration
- JDBC credentials are injected via QUARKUS_DATASOURCE_* environment variables from Secret
- Secret keys: username, password, jdbc_url
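
A small Python sketch of the mapping from Secret keys to environment variables. The three Secret keys come from the description above; the exact variable names are an assumption based on Quarkus's usual property-to-environment naming (the description only says `QUARKUS_DATASOURCE_*`), and `datasource_env` is a hypothetical helper, not part of the chart.

```python
from typing import Dict

# Secret key -> environment variable. Variable names are assumed, not taken from the chart.
SECRET_KEY_TO_ENV = {
    "username": "QUARKUS_DATASOURCE_USERNAME",
    "password": "QUARKUS_DATASOURCE_PASSWORD",
    "jdbc_url": "QUARKUS_DATASOURCE_JDBC_URL",
}


def datasource_env(secret: Dict[str, str]) -> Dict[str, str]:
    """Translate Secret data into the environment seen by the container."""
    missing = sorted(k for k in SECRET_KEY_TO_ENV if k not in secret)
    if missing:
        raise KeyError(f"secret is missing keys: {missing}")
    return {env: secret[key] for key, env in SECRET_KEY_TO_ENV.items()}


# Placeholder values, for illustration only.
env = datasource_env(
    {"username": "polaris", "password": "changeme", "jdbc_url": "jdbc:postgresql://db:5432/polaris"}
)
```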

* Add CHANGELOG (apache#1952)

* Add rudimentary CHANGELOG.md

* Add the Jetbrains Changelog Gradle plugin to help managing CHANGELOG.md

* Share Polaris Community Meeting for 2025-06-26 (apache#1978)

* Correct javadoc text in generateOverlapQuery() (apache#1975)

* Fix javadoc warning: invalid input: '&'
* Correct javadoc text in generateOverlapQuery()

* Do not serialize null properties in the management model (apache#1955)

* Ignore null values in JSON output

* This may have an impact on existing clients, but it is not
  likely to be substantial because normally absent properties
  should be treated the same as having `null` values.

* This change enables adding new optional fields to the
  Management API while maintaining backward compatibility in
  the future: New properties will not be exposed to clients
  unless a value for them is explicitly set.
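
The effect on the JSON output can be illustrated with a short Python sketch. It mimics the behavior described above rather than the server's actual serializer configuration; `drop_nulls` and the payload field names are hypothetical.

```python
import json
from typing import Any


def drop_nulls(value: Any) -> Any:
    """Recursively remove object properties whose value is None."""
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        # List elements are kept as-is; only object properties are dropped.
        return [drop_nulls(v) for v in value]
    return value


payload = {"name": "my_catalog", "comment": None, "properties": {"a": "1", "b": None}}
serialized = json.dumps(drop_nulls(payload), sort_keys=True)
```

A client that treats an absent property the same as `null` sees no difference, and a newly added optional field stays invisible until it is set.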

* Add OpenHFT in Spark plugin LICENSE (apache#1979)

* Add additional unit and integration tests for etag functionality (apache#1972)

* Additional unit test for Etags

* Added a few corner case IT tests for testing etags with schema changes.

* Added IT tests to test changes after DDL and DML

* Add options to the bootstrap command to specify a schema file (apache#1942)

Instead of always using the hardcoded `schema-v1.sql` file, it would be nice if users could specify a file to bootstrap from. This is especially relevant after apache#1686 which proposes to add a new "version" of the schema.

* Added support for `s3a` scheme (apache#1932)

* Fix the sign failure (apache#1926)

* Fix doc to remove outdated note about fine-grained access controls support (apache#1983)

Minor update for the access control doc:

1. Remove the misleading section stating that privileges can only be granted at the catalog level. I've tested the fine-grained access controls and confirmed that privileges can be applied to an individual table in the catalog.

* Add support for catalog federation in the CLI (apache#1912)

The CLI currently only supports the version of EXTERNAL catalogs that was present in 0.9.0. EXTERNAL catalogs now accept a range of federation-related configuration options. This PR updates the CLI to better match the REST API so that federated catalogs can be easily set up in the CLI.

* fix: Remove db-kind in helm chart (apache#1987)

* Add a Spark session builder for the tests (apache#1985)

* Fix doc for CLI update (apache#1994)

PR for apache#1866

* Improve createPrincipal example in API docs (apache#1992)

In apache#1929 it was pointed out that the example in the Polaris docs suggests that users can provide a client ID during principal creation:

. . .


This PR attempts to fix this by adding an explicit example to the spec.

* Add doc for repair option (apache#1993)

PR for apache#1864

* Refactor relationalJdbc in helm (apache#1996)

* Add regression test coverage for Spark Client with package conf (apache#1997)

* Remove unnecessary `InputStream.close` call (apache#1982)

apache#1942 changed the way that the bootstrap init script is handled, but it added an extra `InputStream.close` call that shouldn't be needed after the BufferedReader [here](https://github.com/apache/polaris/pull/1942/files#diff-de43b240b5b5e07aba7e89f5515a417cefd908845b85432f3fcc0819911f3e2eR89) is closed. This PR removes that extra call.

* Materialize Realm ID for Session Supplier in JDBC (apache#1988)

It was discovered that the Session Supplier maps used in the MetaStoreManagerFactory implementations were passing in RealmContext objects to the supplier directly and then using the RealmContext objects to create BasePersistence implementation objects within the supplier. This supplier is cached on a per-realm basis in most MetaStoreManagerFactory implementations. RealmContext objects are request-scoped beans.

As a result, if any work is being done outside the scope of the request, such as during a Task, any calls to getOrCreateSessionSupplier for creating a BasePersistence implementation will fail as the RealmContext object is no longer available.

This PR will ensure for the JdbcMetaStoreManagerFactory that the Realm ID is materialized from the RealmContext and used inside the supplier so that the potentially deactivated RealmContext object does not need to be used in creating the BasePersistence object. Given that we are caching on a per-realm basis, this should not introduce any unforeseen behavior for the JdbcMetaStoreManagerFactory as the Realm ID must match exactly for the same supplier to be returned from the Session Supplier map.
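
The idea can be sketched in Python with a stand-in for the request-scoped object. The class and function below are simplified, hypothetical analogues of `RealmContext` and `getOrCreateSessionSupplier`, not the Java implementation.

```python
class RealmContext:
    """Stand-in for a request-scoped bean that becomes unusable after the request ends."""

    def __init__(self, realm_id: str) -> None:
        self._realm_id = realm_id
        self.active = True

    def realm_id(self) -> str:
        if not self.active:
            raise RuntimeError("RealmContext is no longer active")
        return self._realm_id


_suppliers = {}  # cached per realm ID


def get_or_create_session_supplier(ctx: RealmContext):
    realm_id = ctx.realm_id()  # materialize while the request is still active
    if realm_id not in _suppliers:
        # The closure captures the plain string, not the request-scoped object.
        _suppliers[realm_id] = lambda: f"persistence-for-{realm_id}"
    return _suppliers[realm_id]


ctx = RealmContext("POLARIS")
supplier = get_or_create_session_supplier(ctx)
ctx.active = False    # the request ends
session = supplier()  # still works, e.g. from a background task
```

Had the closure captured `ctx` and called `ctx.realm_id()` lazily, the last line would raise once the request scope is gone.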

* rebase/changes

* minor refactoring

* Last merged commit 8fa6bf2

---------

Co-authored-by: Yun Zou <yunzou.colostate@gmail.com>
Co-authored-by: Christopher Lambert <xn137@gmx.de>
Co-authored-by: Rulin Xing <xjdkcsq3@gmail.com>
Co-authored-by: MonkeyCanCode <yongzheng0809@gmail.com>
Co-authored-by: Alexandre Dutra <adutra@users.noreply.github.com>
Co-authored-by: Andrew Guterman <andrew.guterman1@gmail.com>
Co-authored-by: Eric Maynard <eric.maynard+oss@snowflake.com>
Co-authored-by: Pooja Nilangekar <poojan@umd.edu>
Co-authored-by: Yufei Gu <yufei@apache.org>
Co-authored-by: fabio-rizzo-01 <fabio.rizzocascio@jpmorgan.com>
Co-authored-by: Russell Spitzer <russell.spitzer@GMAIL.COM>
Co-authored-by: Sushant Raikar <sraikar@linkedin.com>
Co-authored-by: Jiwon Park <22048252+jparkzz@users.noreply.github.com>
Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@gmail.com>
Co-authored-by: JB Onofré <jbonofre@apache.org>
Co-authored-by: Sandhya Sundaresan <sandhya.sun100@gmail.com>
Co-authored-by: Pavan Lanka <planka@duck.com>
Co-authored-by: CG <cgpoh@users.noreply.github.com>
Co-authored-by: Adnan Hemani <adnan.h@berkeley.edu>
snazy added a commit to snazy/polaris that referenced this pull request Nov 20, 2025
* chore(deps): update dependency mypy to >=1.17, <=1.17.0 (apache#2114)

* Spark 3.5.6 and Iceberg 1.9.1 (apache#1960)

* Spark 3.5.6 and Iceberg 1.9.1

* Cleanup

* Add `pathStyleAccess` to AwsStorageConfigInfo (apache#2012)

* Add `pathStyleAccess` to AwsStorageConfigInfo

This change allows configuring the "path-style" access
mode in S3 clients (both in Polaris Servers and Iceberg
REST Catalog API clients).

This change is applicable both to AWS storage and to
non-AWS S3-compatible storage (apache#1530).
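
For readers unfamiliar with the two addressing modes, the following Python sketch shows the difference in URL shape. It is illustrative only: real S3 clients build these URLs internally, and the function name and the endpoint host `storage.example.com` are hypothetical.

```python
def s3_object_url(endpoint_host: str, bucket: str, key: str, path_style: bool) -> str:
    """Build an object URL in path-style or virtual-hosted-style form."""
    if path_style:
        # Bucket is the first path segment.
        return f"https://{endpoint_host}/{bucket}/{key}"
    # Bucket is a subdomain of the endpoint host.
    return f"https://{bucket}.{endpoint_host}/{key}"


path_url = s3_object_url("storage.example.com", "bucket", "root/t1/data.parquet", True)
vhost_url = s3_object_url("storage.example.com", "bucket", "root/t1/data.parquet", False)
```

Path-style access is often what S3-compatible services expect when per-bucket DNS names are not available.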

* Add TestFileIOFactory helper (apache#2105)

* Add FileIOFactory.wrapExisting helper

* fix(deps): update dependency gradle.plugin.org.jetbrains.gradle.plugin.idea-ext:gradle-idea-ext to v1.2 (apache#2125)

* fix(deps): update dependency boto3 to v1.39.7 (apache#2124)

* Abstract polaris-runtime-service tests for all persistence implementations (apache#2106)

The NoSQL persistence implementation has to run the Iceberg table & view catalog plus the Polaris specific tests as well. Reusing existing tests is beneficial to avoid a lot of code duplication.

This change moves the actual tests to `Abstract*` classes and refactors the existing tests to extend those. The NoSQL persistence work extends the same `Abstract*` classes but runs with different Quarkus test profiles.
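
The same structure can be sketched with Python's `unittest`, purely as an analogy to the Java `Abstract*` classes: shared test bodies live in a base class, and each concrete class only selects the backend. All class names here are hypothetical, and the two dictionary types merely stand in for different persistence implementations.

```python
import io
import unittest
from collections import OrderedDict


class AbstractStoreTest:
    """Shared test bodies; not a TestCase itself, so it is never collected on its own."""

    def make_store(self):
        raise NotImplementedError

    def test_put_then_get(self):
        store = self.make_store()
        store["t1"] = "s3://bucket/root/t1/"
        self.assertEqual(store["t1"], "s3://bucket/root/t1/")


class InMemoryStoreTest(AbstractStoreTest, unittest.TestCase):
    def make_store(self):
        return {}


class OrderedStoreTest(AbstractStoreTest, unittest.TestCase):
    def make_store(self):
        return OrderedDict()


def run_all() -> unittest.TestResult:
    """Run the shared tests once per concrete backend and return the result."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite(
        loader.loadTestsFromTestCase(cls) for cls in (InMemoryStoreTest, OrderedStoreTest)
    )
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Adding another backend then costs one small subclass rather than a copy of every test.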

* Add IMPLICIT authentication support to the CLI (apache#2121)

PRs apache#1925 and apache#1912 were merged around the same time.  This PR connects the two changes and enables the CLI to accept IMPLICIT authentication type. 

Since Hadoop federated catalogs rely purely on IMPLICIT authentication, the CLI parsing test has been updated to reflect the same.

* feat(helm): Add support for external authentication (apache#2104)

* fix(deps): update dependency org.apache.iceberg:iceberg-bom to v1.9.2 (apache#2126)

* fix(deps): update quarkus platform and group to v3.24.4 (apache#2128)

* fix(deps): update dependency boto3 to v1.39.8 (apache#2129)

* fix(deps): update dependency io.smallrye.config:smallrye-config-core to v3.13.3 (apache#2130)

* Add newIcebergCatalog helper (apache#2134)

creation of `IcebergCatalog` instances was quite redundant as tests
mostly use the same parameters most of the time.

also remove an unused field in 2 other tests.

* Add server and client support for the new generic table `baseLocation` field (apache#2122)

* Use Makefile to simplify setup and commands (apache#2027)

* Use Makefile to simplify setup and commands

* Add targets for minikube state management

* Add podman support and spark plugin build

* Add version target

* Update README.md for Makefile usage and relation to the project

* Fix nit

* Package polaris client as python package (apache#2049)

* Package polaris client as python package

* Package polaris client as python package

* Change owner to spark when copying files from local into Dockerfile

* CI: Address failure from accessing GH API (apache#2132)

CI sometimes fails with this failure:
```
* What went wrong:
Execution failed for task ':generatePomFileForMavenPublication'.
> Unable to process url: https://api.github.com/repos/apache/polaris/contributors?per_page=1000
```

The intermittently failing request fetches the list of contributors to be published in the "root" POM. Unauthenticated GH API requests have an hourly(?) limit of 60 requests per source IP. Authenticated requests have a much higher rate limit. We do have a GitHub token available in every CI run, which can be used in GH API requests. This change adds the `Authorization` header to the failing GH API request to leverage the higher rate limit, so that CI fails less often.
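
The actual change lives in the Gradle build, but the idea can be sketched in Python: attach the token to the request only when one is available. The environment variable name `GITHUB_TOKEN` and the helper name are assumptions, and the request is only constructed here, not sent.

```python
import os
import urllib.request
from typing import Optional

CONTRIBUTORS_URL = "https://api.github.com/repos/apache/polaris/contributors?per_page=1000"


def contributors_request(token: Optional[str]) -> urllib.request.Request:
    """Build the contributors request, authenticated when a token is provided."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(CONTRIBUTORS_URL, headers=headers)


# Falls back to an unauthenticated request when no token is set.
request = contributors_request(os.environ.get("GITHUB_TOKEN"))
```

Keeping the unauthenticated fallback means local builds without a token behave as before.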

* fix(deps): update dependency com.nimbusds:nimbus-jose-jwt to v10.4 (apache#2139)

* fix(deps): update dependency com.diffplug.spotless:spotless-plugin-gradle to v7.2.0 (apache#2142)

* fix(deps): update dependency software.amazon.awssdk:bom to v2.32.4 (apache#2146)

* fix(deps): update dependency org.xerial.snappy:snappy-java to v1.1.10.8 (apache#2138)

* fix(deps): update dependency org.junit:junit-bom to v5.13.4 (apache#2147)

* fix(deps): update dependency boto3 to v1.39.9 (apache#2137)

* fix(deps): update dependency com.fasterxml.jackson:jackson-bom to v2.19.2 (apache#2136)

* Python client: add support for endpoint, sts-endpoint, path-style-access (apache#2127)

This change adds support for endpoint, sts-endpoint, path-style-access to the Polaris Python client.

Amends apache#1913 and apache#2012

* Remove PolarisEntityManager.getCredentialCache (apache#2133)

`PolarisEntityManager` itself is not using the `StorageCredentialCache` but just hands it out via `getCredentialCache`.
the only caller of `getCredentialCache` is `FileIOUtil.refreshAccessConfig`, which in turn is only called by `DefaultFileIOFactory` and `IcebergCatalog`.

note that in a follow-up we will likely be able to remove `PolarisEntityManager` usage completely from `IcebergCatalog`.

additional cleanups:
- use `StorageCredentialCache` injection in tests (but we need to invalidate all entries on test start)
- remove unused `UserSecretsManagerFactory` from `PolarisCallContextCatalogFactory`

* chore(deps): update registry.access.redhat.com/ubi9/openjdk-21-runtime docker tag to v1.22-1.1752676419 (apache#2150)

* fix(deps): update dependency com.diffplug.spotless:spotless-plugin-gradle to v7.2.1 (apache#2152)

* fix(deps): update dependency boto3 to v1.39.10 (apache#2151)

* chore: fix class reference in the javadoc of TableLikeEntity (apache#2157)

* fix(deps): update dependency commons-codec:commons-codec to v1.19.0 (apache#2160)

* fix(deps): update dependency boto3 to v1.39.11 (apache#2159)

* Last merged commit 395459f

---------

Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: Yong Zheng <yongzheng0809@gmail.com>
Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@gmail.com>
Co-authored-by: Christopher Lambert <xn137@gmx.de>
Co-authored-by: Pooja Nilangekar <poojan@umd.edu>
Co-authored-by: Alexandre Dutra <adutra@apache.org>
Co-authored-by: Yun Zou <yunzou.colostate@gmail.com>

2 participants