Contributor

@XN137 XN137 commented Aug 11, 2025

BasePersistence.listEntities has 3 variants:

Page<EntityNameLookupRecord> listEntities(..., PageToken);

Page<EntityNameLookupRecord> listEntities(..., Predicate<PolarisBaseEntity>, PageToken);

<T> Page<T> listEntities(..., Predicate<PolarisBaseEntity>, Function<PolarisBaseEntity, T>, PageToken);

The 1st method exists to return only the subset of entity properties required to build an EntityNameLookupRecord.

The 3rd method supports a predicate and a transformer function on the underlying PolarisBaseEntity, which means it has to load all entity properties.

The 2nd method is odd: it accepts a full Predicate<PolarisBaseEntity>, so it has to load all entity properties under the hood for filtering, but then throws most of them away to return an EntityNameLookupRecord.
This explains why implementations of the 2nd method usually just forward to the 3rd method.
Any performance benefit of returning an EntityNameLookupRecord is lost.

As it turns out, the 2nd method is only used because methods 1 and 3 don't support passing a PolarisEntitySubType parameter to filter down the retrieved data.
Note that the sub type property is available from both PolarisBaseEntity and EntityNameLookupRecord.

By adding this parameter, the 2nd method can go away completely.
We can even push the sub type filtering down into the queries of some of our persistence implementations.
Other existing implementations are free to decide whether to push it down as well or to filter the query results in memory.
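To make the two options concrete, here is a minimal sketch of push-down versus in-memory sub type filtering. Everything in it is a simplified stand-in rather than the actual Polaris code: `SubType`, `NameRecord`, `filterInMemory`, and `buildQuery` are hypothetical names, and the table and column names in the SQL string are assumptions chosen for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

class SubTypeFilterSketch {
  // Hypothetical stand-in for PolarisEntitySubType; ANY_SUBTYPE means "do not filter".
  enum SubType { ANY_SUBTYPE, ICEBERG_TABLE, ICEBERG_VIEW }

  // Hypothetical stand-in for EntityNameLookupRecord: only the light-weight properties.
  record NameRecord(long id, String name, SubType subType) {}

  // In-memory option: the query returns all candidates and the sub type is checked afterwards.
  static List<NameRecord> filterInMemory(List<NameRecord> fetched, SubType wanted) {
    return fetched.stream()
        .filter(r -> wanted == SubType.ANY_SUBTYPE || r.subType() == wanted)
        .collect(Collectors.toList());
  }

  // Push-down option: the sub type becomes part of the WHERE clause.
  // Table and column names are assumptions, not the real schema.
  static String buildQuery(SubType wanted) {
    String sql =
        "SELECT id, name, sub_type_code FROM entities WHERE parent_id = ? AND type_code = ?";
    return wanted == SubType.ANY_SUBTYPE ? sql : sql + " AND sub_type_code = ?";
  }

  public static void main(String[] args) {
    List<NameRecord> fetched =
        List.of(
            new NameRecord(1L, "t1", SubType.ICEBERG_TABLE),
            new NameRecord(2L, "v1", SubType.ICEBERG_VIEW));
    System.out.println(filterInMemory(fetched, SubType.ICEBERG_VIEW));
    System.out.println(buildQuery(SubType.ICEBERG_VIEW));
  }
}
```

Either option only needs the sub type itself, which is why neither requires loading the full entity.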

Note that since no TransactionalPersistence implementation in the codebase provides an optimized variant of method 1, we can have a default method in the interface that forwards to method 3.
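The default-method idea could look roughly like the sketch below. It is not the actual interface: `Persistence`, `FullEntity`, `NameRecord`, and `InMemoryPersistence` are hypothetical stand-ins, and paging (`PageToken`/`Page`) is left out to keep it short.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class DefaultForwardingSketch {
  // Hypothetical stand-in for PolarisEntitySubType.
  enum SubType { ANY_SUBTYPE, ICEBERG_TABLE, ICEBERG_VIEW }

  // Hypothetical stand-in for PolarisBaseEntity: carries all properties.
  record FullEntity(long id, String name, SubType subType, String properties) {}

  // Hypothetical stand-in for EntityNameLookupRecord.
  record NameRecord(long id, String name, SubType subType) {}

  interface Persistence {
    // "Method 3": filter and transform the fully loaded entity.
    <T> List<T> listEntities(
        SubType subType, Predicate<FullEntity> filter, Function<FullEntity, T> transformer);

    // "Method 1": without an optimized variant, simply forward to method 3.
    default List<NameRecord> listEntities(SubType subType) {
      return listEntities(
          subType, e -> true, e -> new NameRecord(e.id(), e.name(), e.subType()));
    }
  }

  // Toy implementation that only provides method 3 and inherits the default.
  static final class InMemoryPersistence implements Persistence {
    private final List<FullEntity> entities;

    InMemoryPersistence(List<FullEntity> entities) {
      this.entities = entities;
    }

    @Override
    public <T> List<T> listEntities(
        SubType subType, Predicate<FullEntity> filter, Function<FullEntity, T> transformer) {
      return entities.stream()
          .filter(e -> subType == SubType.ANY_SUBTYPE || e.subType() == subType)
          .filter(filter)
          .map(transformer)
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) {
    Persistence p =
        new InMemoryPersistence(
            List.of(
                new FullEntity(1L, "t1", SubType.ICEBERG_TABLE, "{}"),
                new FullEntity(2L, "v1", SubType.ICEBERG_VIEW, "{}")));
    System.out.println(p.listEntities(SubType.ICEBERG_TABLE));
  }
}
```

An implementation that can fetch only the lookup-record columns would override the one-argument method instead of inheriting the default.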

@XN137 XN137 force-pushed the add-PolarisEntitySubType-param-to-BasePersistence.listEntities branch from 6e415e6 to 71dd71c Compare August 11, 2025 13:23
@XN137 XN137 marked this pull request as ready for review August 11, 2025 15:46
resolver.getParentId(),
entityType,
filter,
entitySubType,
Contributor Author

Note that, due to this change, calls to PolarisMetaStoreManager.listEntities such as this one:

private Page<TableIdentifier> listTableLike(
    PolarisEntitySubType subType, Namespace namespace, PageToken pageToken) {
  PolarisResolvedPathWrapper resolvedEntities = resolvedEntityView.getResolvedPath(namespace);
  if (resolvedEntities == null) {
    // Illegal state because the namespace should've already been in the static resolution set.
    throw new IllegalStateException(
        String.format("Failed to fetch resolved namespace '%s'", namespace));
  }
  List<PolarisEntity> catalogPath = resolvedEntities.getRawFullPath();
  ListEntitiesResult listResult =
      getMetaStoreManager()
          .listEntities(
              getCurrentPolarisContext(),
              PolarisEntity.toCoreList(catalogPath),
              PolarisEntityType.TABLE_LIKE,
              subType,
              pageToken);

no longer have to load the full PolarisBaseEntity to apply the Predicate<PolarisBaseEntity>, so they can actually take advantage of optimized implementations of Page<EntityNameLookupRecord> listEntities in BasePersistence that load only the required properties.

Contributor

@adutra adutra left a comment

Awesome improvement!

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Aug 12, 2025
Contributor

@dennishuo dennishuo left a comment

LGTM, thanks!

I'd also like to adjust the transformer/filter version of listEntities so that it reads less like an overload and more like a totally distinct method, but perhaps that fits better in your other PR #2290.

@Nonnull PolarisEntityType entityType,
@Nonnull PolarisEntitySubType entitySubType,
@Nonnull PageToken pageToken) {
// TODO: only fetch the properties required for creating an EntityNameLookupRecord
Contributor

To help us remember, can we also add to this TODO: "and update the underlying CONSTRAINT clause to make the implicit index INCLUDE the subTypeCode"?

Contributor Author

I'll take a stab at this as an immediate follow-up, or file it as a separate issue where I'll note this additional "index coverage" requirement.

Contributor Author

filed the ticket: #2352

@eric-maynard eric-maynard merged commit ee04df4 into apache:main Aug 13, 2025
12 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Aug 13, 2025
@XN137 XN137 deleted the add-PolarisEntitySubType-param-to-BasePersistence.listEntities branch August 14, 2025 14:12
snazy added a commit to snazy/polaris that referenced this pull request Nov 20, 2025
* fix(deps): update dependency io.projectreactor.netty:reactor-netty-http to v1.2.9 (apache#2326)

* Add getting-started example with external authentication (apache#2244)

* chore(deps): update quay.io/keycloak/keycloak docker tag to v26.3.2 (apache#2331)

* fix(deps): update immutables to v2.11.3 (apache#2333)

* JWTBroker: move error message (apache#2330)

This change moves the `LOGGER.error` call when a token cannot be verified from `verify()` to `generateFromToken()`.

On the token generation path, this should be a no-op; however, on the authentication path, this log message was excessive, especially when using mixed authentication since a failure to decode a token is perfectly normal when the token is from an external IDP.

* Let CI archive html test reports (apache#2327)

When debugging CI test failures, it's much more convenient to be
able to download the HTML report than the XML reports (as the
latter require you to find the right file/failure manually).

* Make S3 `roleARN` optional (apache#2329)

Fixes apache#2325

* Remove spotbugs-annotations (apache#2320)

We don't seem to be running spotbugs/findbugs in our build, so depending
on the annotations is not necessary.

also fix name of common-codec lib.

* Remove redundant locations when constructing access policies (apache#2149)

Iceberg tables can technically store data across any number of paths, but Polaris currently uses 3 different locations for credential vending:
1. The table's base location
2. The table's `write.data.path`, if set
3. The table's `write.metadata.path`, if set

This was intended to capture scenarios where e.g. (2) is not a child path of (1), so that the vended credentials can still be valid for reading the entire table. However, there are systems that seem to always set (2) and (3), such as:

1. `s3://my-bucket/base/iceberg`
2. `s3://my-bucket/base/iceberg/data`
3. `s3://my-bucket/base/iceberg/metadata`

In such cases the extra paths (e.g. extra resources in the AWS Policy) are redundant. In one such case, these redundant paths caused the policy to exceed the maximum allowable 2048 characters.

This PR removes redundant paths -- those that are the child of another path -- from the list of accessible locations tracked for a given table and does some slight refactoring to consolidate the logic for extracting these paths from a TableMetadata.
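As an illustration of what "removing paths that are the child of another path" can mean, here is a minimal sketch based on plain string prefixes. It is an assumption about the general technique, not the code from that PR; `RedundantLocationsSketch` and `removeRedundant` are hypothetical names, and the sketch returns locations with a trailing slash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class RedundantLocationsSketch {
  // Keeps only locations that are not underneath another location in the list.
  static List<String> removeRedundant(List<String> locations) {
    // Normalize to a trailing slash so ".../iceberg" is not treated as a parent
    // of ".../iceberg2", and drop exact duplicates.
    List<String> normalized =
        locations.stream()
            .map(l -> l.endsWith("/") ? l : l + "/")
            .distinct()
            .collect(Collectors.toList());

    List<String> kept = new ArrayList<>();
    for (String candidate : normalized) {
      // A candidate is redundant if some other location is a prefix of it.
      boolean coveredByOther =
          normalized.stream()
              .anyMatch(other -> !other.equals(candidate) && candidate.startsWith(other));
      if (!coveredByOther) {
        kept.add(candidate);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    System.out.println(
        removeRedundant(
            List.of(
                "s3://my-bucket/base/iceberg",
                "s3://my-bucket/base/iceberg/data",
                "s3://my-bucket/base/iceberg/metadata")));
  }
}
```

With the three example locations above, only the base location survives, so the resulting policy lists one resource instead of three.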

* Remove CallContext from IcebergPropertiesValidation (apache#2338)

it is sufficient to pass the `RealmConfig`.
same applies to helpers in `PolarisEndpoints`.

* Add entitySubType param to BasePersistence.listEntities (apache#2317)

* Add PyIceberg example (apache#2315)

It is not obvious how to connect PyIceberg to a Polaris catalog.

This PR clears that up by providing an example in the getting-started section of the documentation.

* fix(docs): fix some broken url. (apache#2335)

* fix(docs): fix entity doc API links. (apache#2316)

* fix(deps): update dependency io.netty:netty-codec-http2 to v4.2.4.final (apache#2342)

* NoSQL: Misc ports

* Adopt to the state of apache#2131 (OSS NoSQL PR / idgen)
* Track "base locations" and use an index to detect conflicts (via PolarisMetaStoreManager.hasOverlappingSiblings). Feature must be enabled in the Polaris config. Implementation prepared for intentional overlaps. Backwards compatible, except for checks against already existing tables.
* Cosmetic changes (bunch of)

* Some more adoptions from OSS

... based on a `git diff` against the OSS `persistence-nosql` PR branch.

* Last merged commit 4c23eb7

---------

Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: Alexandre Dutra <adutra@apache.org>
Co-authored-by: Christopher Lambert <xn137@gmx.de>
Co-authored-by: Eric Maynard <eric.maynard+oss@snowflake.com>
Co-authored-by: Frederic Khayat <61949371+FredKhayat@users.noreply.github.com>
Co-authored-by: Yujiang Zhong <42907416+zhongyujiang@users.noreply.github.com>