@collado-mike
Contributor

It's helpful to be able to look at the Table entities in persistence and immediately know the state of the current snapshot, current schema, partition scheme, etc. without having to load the whole TableMetadata. This copies some of the primitive properties from the TableMetadata structure and sticks them into the Table PolarisEntity as internalProperties.

Currently, no properties are stored other than the metadata file location, the parent namespace, and the last notification timestamp. As written, we'll drop any other properties that are added to the internalProperties map. I went with this approach to avoid the properties map always being additive, but I'm happy to hear if folks think we should instead always copy the previous map and then overwrite properties.
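
To make the trade-off concrete, here is a minimal Java sketch of the two options described above: rebuilding the map from scratch on every commit versus overlaying the new values on the previous map. The class name, method names, and property keys are hypothetical placeholders for illustration, not the actual Polaris identifiers.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: names and keys are assumptions, not Polaris code. */
final class InternalPropertiesSketch {
  // Hypothetical property keys, declared once as constants.
  static final String METADATA_LOCATION = "metadata-location";
  static final String CURRENT_SNAPSHOT_ID = "current-snapshot-id";
  static final String CURRENT_SCHEMA_ID = "current-schema-id";

  /** Option 1: build a fresh map; any key not written here is dropped. */
  static Map<String, String> rebuild(String metadataLocation, long snapshotId, int schemaId) {
    Map<String, String> fresh = new HashMap<>();
    fresh.put(METADATA_LOCATION, metadataLocation);
    fresh.put(CURRENT_SNAPSHOT_ID, Long.toString(snapshotId));
    fresh.put(CURRENT_SCHEMA_ID, Integer.toString(schemaId));
    return fresh;
  }

  /** Option 2: copy the previous map, then overwrite; old keys are never removed. */
  static Map<String, String> overlay(
      Map<String, String> previous, String metadataLocation, long snapshotId, int schemaId) {
    Map<String, String> merged = new HashMap<>(previous);
    merged.putAll(rebuild(metadataLocation, snapshotId, schemaId));
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> previous = Map.of("stale-key", "x", CURRENT_SCHEMA_ID, "0");
    String location = "s3://bucket/table/v2.metadata.json";
    // The fresh map has no trace of the old extra key.
    System.out.println(rebuild(location, 42L, 1).containsKey("stale-key")); // false
    // The overlay keeps the old extra key and updates the schema id.
    System.out.println(overlay(previous, location, 42L, 1).containsKey("stale-key")); // true
  }
}
```

With the first option, a key that stops being written disappears on the next commit; with the second, it lingers until something removes it explicitly.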

@eric-maynard
Contributor

Hi @collado-mike, what would be the utility of persisting just these fields? I think in order to optimize loadTable / commit, we would really need to store the entire metadata somewhere.

Member

@snazy snazy left a comment

The change looks reasonable to me, and the code looks good.
Would just prefer to use constants to avoid typos at call sites.

@collado-mike
Contributor Author

> Hi @collado-mike, what would be the utility of persisting just these fields? I think in order to optimize loadTable / commit, we would really need to store the entire metadata somewhere.

This doesn't aim to optimize those APIs, but simply adds the fields for other possible utilities - e.g., within Polaris, knowing whether the schema has changed. Now, it's true that this could be used to help optimize the loadTable API, but the intention for this PR is not to fully accomplish that.

Contributor

@dimas-b dimas-b left a comment

LGTM 👍

PolarisEntitySubType.ICEBERG_TABLE, tableIdentifier, newLocation)
PolarisEntitySubType.ICEBERG_TABLE,
tableIdentifier,
Map.of(),
Contributor

nit: If we're using the builder pattern, why have rich constructor parameters? Why not call .setABC()?

In this case the Map is empty, so this parameter is redundant?

Contributor Author

The problem with setting the internalProperties map via the builder methods is that we have helper methods, such as setMetadataLocation, that aren't fields themselves but modify entries in the map. If we call the following, we're fine:

builder
    .setInternalProperties(props)
    .setMetadataLocation(newLocation)
    .build();

but if we reverse the order and call the following, we're broken:

builder
    .setMetadataLocation(newLocation)
    .setInternalProperties(props)
    .build();

It's not obvious from the caller's perspective, but setMetadataLocation modifies the underlying map, then the setInternalProperties call completely overwrites the map, losing the value set in the previous call.

With the existing constructor, it's impossible to order the setting of the metadataLocation and the internalProperties via the builder methods. If we used the builder for all properties all the time, it would be fine, but because we pass in the newLocation parameter as a constructor arg, if we set the internalProperties field using the builder method, we lose the location parameter we just passed in.
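
A minimal, self-contained Java sketch of the ordering hazard described above. OrderSensitiveBuilder is a hypothetical stand-in for the entity builder, and the "metadata-location" key is an assumed name; only the two setter names come from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the entity builder; not the actual Polaris class. */
final class OrderSensitiveBuilder {
  static final String METADATA_LOCATION_KEY = "metadata-location"; // assumed key name

  private Map<String, String> internalProperties = new HashMap<>();

  /** Helper with no field of its own: it writes an entry into the shared map. */
  OrderSensitiveBuilder setMetadataLocation(String location) {
    internalProperties.put(METADATA_LOCATION_KEY, location);
    return this;
  }

  /** Replaces the whole map, discarding entries written by earlier helper calls. */
  OrderSensitiveBuilder setInternalProperties(Map<String, String> props) {
    internalProperties = new HashMap<>(props);
    return this;
  }

  Map<String, String> build() {
    return Map.copyOf(internalProperties);
  }

  public static void main(String[] args) {
    Map<String, String> props = Map.of("parent-namespace", "ns1");
    String newLocation = "s3://bucket/table/v2.metadata.json";

    Map<String, String> fine =
        new OrderSensitiveBuilder()
            .setInternalProperties(props)
            .setMetadataLocation(newLocation)
            .build();
    Map<String, String> broken =
        new OrderSensitiveBuilder()
            .setMetadataLocation(newLocation)
            .setInternalProperties(props)
            .build();

    System.out.println(fine.containsKey(METADATA_LOCATION_KEY)); // true
    System.out.println(broken.containsKey(METADATA_LOCATION_KEY)); // false
  }
}
```

Nothing in the method signatures hints that the second ordering silently loses the location, which is why the constructor parameter is used instead.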

Contributor

@dimas-b dimas-b Oct 3, 2025

Good point, TIL 🤔 However, having this kind of side effect in the codebase is pretty risky from a long-term maintenance perspective, IMHO. Would it be reasonable to refactor the builders / related code to allow for more intuitive usage (in another PR, of course)?

Contributor Author

Yeah, I was thinking about making metadataLocation and other settable map entries into distinct fields in the Builder so that they can just be added to the map in the build call, but... yeah, a future PR
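
One way that idea could look, as a rough sketch rather than the planned refactoring: the location is held in its own builder field and only written into the map inside build(), so the order of the setter calls no longer matters. OrderInsensitiveBuilder and the "metadata-location" key are hypothetical names used for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a builder with distinct fields; not the actual Polaris class. */
final class OrderInsensitiveBuilder {
  static final String METADATA_LOCATION_KEY = "metadata-location"; // assumed key name

  private Map<String, String> internalProperties = new HashMap<>();
  private String metadataLocation; // held separately until build(); null means unset

  OrderInsensitiveBuilder setMetadataLocation(String location) {
    this.metadataLocation = location;
    return this;
  }

  OrderInsensitiveBuilder setInternalProperties(Map<String, String> props) {
    this.internalProperties = new HashMap<>(props);
    return this;
  }

  /** Merges the dedicated field into a copy of the map at the very end. */
  Map<String, String> build() {
    Map<String, String> merged = new HashMap<>(internalProperties);
    if (metadataLocation != null) {
      merged.put(METADATA_LOCATION_KEY, metadataLocation);
    }
    return Map.copyOf(merged);
  }

  public static void main(String[] args) {
    Map<String, String> props = Map.of("parent-namespace", "ns1");
    String newLocation = "s3://bucket/table/v2.metadata.json";

    Map<String, String> first =
        new OrderInsensitiveBuilder()
            .setInternalProperties(props)
            .setMetadataLocation(newLocation)
            .build();
    Map<String, String> second =
        new OrderInsensitiveBuilder()
            .setMetadataLocation(newLocation)
            .setInternalProperties(props)
            .build();

    System.out.println(first.equals(second)); // true
  }
}
```

In this sketch the dedicated field takes precedence over a same-named entry in the supplied map; whether that precedence is the right one would be a decision for the follow-up PR.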

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Oct 2, 2025
@eric-maynard
Contributor

In what API is it useful to specifically know if the schema has changed? We already have etag support for loadTable to know if any metadata has changed.

@collado-mike
Contributor Author

> In what API is it useful to specifically know if the schema has changed? We already have etag support for loadTable to know if any metadata has changed.

There aren't any APIs that use this information today. As stated in the PR description, the intention is to be able to look at the persistence layer to understand the state of the table without loading the entire TableMetadata structure from cloud storage. As an example, JB recently brought up working on the UI - it would be nice if the UI could print some information about the table without loading the entire metadata - last-updated-ms, table-uuid, and others. For our case, we'd like to be able to see the current schema and snapshot state without having to load all the TableMetadata for every table.

In part, this helps move us toward being able to support TableMetadata caching, which you reference above, but this is useful in itself and doesn't need to wait on that implementation.

Contributor

@HonahX HonahX left a comment

LGTM!

catalog.buildTable(TABLE, SCHEMA).create();
catalog.loadTable(TABLE).newFastAppend().appendFile(FILE_A).commit();
Table afterAppend = catalog.loadTable(TABLE);
EntityResult schemaResult =
Contributor

Suggested change
- EntityResult schemaResult =
+ EntityResult namespaceResult =

[nit] Let's use namespace here to be consistent with Polaris' terminology.

}

@Test
public void testTableInternalPropertiesStoredOnCommit() {
Contributor

[Non-blocker] Since we are testing the extraction of metadata fields, it would be good if we could add a test for, or parametrize this test with, different format versions (1, 2, 3). Technically the TableMetadata should be format-version agnostic, but a sanity check would be useful in case there's some upstream bug : )

@eric-maynard
Contributor

> this is useful in itself

I'm still confused about this point. In the example you gave, the UI wants to load a table schema and these fields could help do that. How would that work? What API would the UI call?

Wouldn't it make sense to design the API first before making persistence changes?

Contributor

@eric-maynard eric-maynard left a comment

I'm not sure it makes sense to make entity changes that aren't visible to any API without any clear pathway to making them so.

@github-project-automation github-project-automation bot moved this from Ready to merge to PRs In Progress in Basic Kanban Board Oct 5, 2025
@collado-mike
Contributor Author

> Wouldn't it make sense to design the API first before making persistence changes?

What API? Not every field stored in the persistence layer serves a public-facing API. For example, we store cleanup task id in the internal properties field and it is never exposed through a public-facing API. If we serve TableMetadata from persistence rather than cloud storage, there will be zero impact to any public-facing API. The UI may or may not solely rely on existing Polaris management APIs. Or we may build custom APIs strictly for the UI and query the database directly. internalProperties is specifically used to serve internal purposes. I don't understand your question about which public-facing API needs this information.

@collado-mike
Contributor Author

@eric-maynard, what's the justification for blocking this PR? I didn't merge on Friday even though I had received approval, as I was addressing your questions, so I'm not really sure why it's blocked. I see no concern about the change being unsafe or incorrect. Is there a technical concern here?

Member

@snazy snazy left a comment

Thanks for adding the constants, +1!

@snazy
Member

snazy commented Oct 15, 2025

@collado-mike mind deleting the https://github.com/apache/polaris/tree/mcollado-table-metadata-properties branch?

@eric-maynard eric-maynard dismissed their stale review October 17, 2025 19:08

Moved to ML discussion

@dimas-b
Contributor

dimas-b commented Oct 22, 2025

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Oct 22, 2025
@collado-mike collado-mike merged commit 07edd30 into apache:main Oct 23, 2025
16 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Oct 23, 2025
snazy added a commit to snazy/polaris that referenced this pull request Nov 20, 2025
* Update Quarkus Platform and Group to v3.28.4 (apache#2786)

* Update dependency org.testcontainers:testcontainers-bom to v2.0.1 (apache#2830)

* Build/polaris-core: Remove outdated `constraint`s (apache#2818)

The `:polaris-core` build scripts contain (soft) version constraints for some dependencies with a vague reason "Vulnerability detected in ..." (concrete CVE/reason not mentioned) referencing specific dependency versions. The mentioned versions are all quite outdated, and some are not even transitively referenced. Hence, this removes those constraints, as they no longer seem relevant.

Effective dependency versions can be inspected via `./gradlew :polaris-core:dependencies --configuration runtimeClasspath`.

* Add Community Meetings for 2025-10-02 and 2025-10-16 (apache#2832)

* Update docker.io/prom/prometheus Docker tag to v3.7.1 (apache#2834)

* testcontainers v2: tackle deprecation warnings (apache#2835)

* Add findPrincipalById helper (apache#2810)

* Add findPrincipalById helper

this simplifies frequent usage of the lower level `loadEntity` api (similar to the
existing `findPrincipalByName` helper)

* [Python] Add more tests cases for policy CLI (apache#2831)

* Update dependency software.amazon.awssdk:bom to v2.35.10 (apache#2840)

* Update dependency ch.qos.logback:logback-classic to v1.5.20 (apache#2839)

* Reproducible builds: make parent pom content reproducible (apache#2826)

The parent pom contains the `<developer>` and `<contributor>` elements. The former is populated from ASF people information including role information (champion, mentor, chair, (P)PMC member, committer). The latter is retrieved from a GitHub API endpoint, ordered by number of contributions. The latter list in particular is prone to vary between builds, which makes the parent pom non-reproducible, as the locally built one is likely to differ from the one built by the release manager (staged artifact).

This change removes both lists, leaving a single static `<developer>` entry pointing to `https://polaris.apache.org/community/`. Related build-script code has been updated and no longer retrieves people information.

* Log root cause exceptions in mappers (apache#2837)

Fix `IcebergExceptionMapper` and `PolarisExceptionMapper` to pass exceptions as "cause" to the logger (as opposed to unreferenced log parameters).

* Remove credential flag from `StorageAccessProperty.CLIENT_REGION` (apache#2838)

`CLIENT_REGION` is not a credential value, which is in line with
Iceberg's `VendedCredentialsProvider` code.

Cf. apache/iceberg#11389

* CI: Let all workflows use GitHub's docker.io mirror (apache#2841)

* Correct template rendering for authentication options (apache#2808)

* Correct template rendering for authentication options

* Added tpl back

* Increase javadoc visibility in `:polaris-async-vertx` (apache#2745)

This is to fix javadoc error: `No public or protected classes found to document`

* Update slack invite url (apache#2846)

* Remove unused ConcurrentLinkedQueueWithApproximateSize (apache#2849)

* Merge AwsCloudWatchConfiguration and QuarkusAwsCloudWatchConfiguration (apache#2848)

For some reason, these two classes weren't properly merged when the runtime-service and service-common modules were merged. This PR fixes that.

This PR also adds some examples of AWS Cloud Watch configuration to the default application.properties file.

* Move TestPolarisEventListener to test fixtures (apache#2850)

* Update dependency com.google.cloud:google-cloud-storage-bom to v2.59.0 (apache#2857)

* Update actions/stale digest to e46bbab (apache#2856)

* Servcie: Remove a duplicated config (apache#2854)

* Update docker.io/prom/prometheus Docker tag to v3.7.2 (apache#2858)

* Update Quarkus Platform and Group to v3.28.5 (apache#2859)

* Update dependency com.google.errorprone:error_prone_core to v2.43.0 (apache#2860)

* Add --no-sts to CLI (apache#2855)

* Add --no-sts to CLI

Following up on apache#2672, add new `--no-sts` option to CLI to allow
configuring `stsUnavailable` in `AwsStorageConfigInfo`

* Use AccessConfigProvider.getAccessConfig in DefaultFileIOFactory (apache#2852)

* CLI: Remove the trailing comma (apache#2863)

* Update dependency pip-licenses-cli to v3 (apache#2842)

* Update dependency pip-licenses-cli to v3

* Update pip-licenses-cli version format

* Fix pip-licenses-cli version specification

---------

Co-authored-by: Yong Zheng <yongzheng0809@gmail.com>

* Update quay.io/keycloak/keycloak Docker tag to v26.4.2 (apache#2868)

* Bump main to 1.3.0-SNAPSHOT (apache#2870)

* Add properties from TableMetadata into Table entity internalProperties (apache#2735)

* Add properties from TableMetadata into Table entity internalProperties

* Made table properties constants and pulled out static utility method

* Update dependency io.smallrye:jandex to v3.5.1 (apache#2872)

* Fix exec flags on getting-started scripts (apache#2878)

* Add `+x` to script source files
* Remove (unnecessary) `chmod` from docs

* Update plugin jcstress to v0.9.0 (apache#2882)

* Update registry.access.redhat.com/ubi9/openjdk-21-runtime Docker tag to v1.23-6.1761164966 (apache#2874)

* Update dependency openapi-generator-cli to v7.16.0 (apache#2703)

* Update Gradle to v9 (apache#2226)

* Update Gradle to v9

* adopt gradlew

---------

Co-authored-by: Robert Stupp <snazy@snazy.de>

* Last merged commit 7892540

---------

Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: JB Onofré <jbonofre@apache.org>
Co-authored-by: Christopher Lambert <xn137@gmx.de>
Co-authored-by: Nuoya Jiang <98131931+NuoyaJiang@users.noreply.github.com>
Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@gmail.com>
Co-authored-by: Yong Zheng <yongzheng0809@gmail.com>
Co-authored-by: Honah (Jonas) J. <honahx@apache.org>
Co-authored-by: Alexandre Dutra <adutra@apache.org>
Co-authored-by: Yufei Gu <yufei@apache.org>
Co-authored-by: Nuoya Jiang <98131931+CodingBangboo@users.noreply.github.com>
Co-authored-by: Michael Collado <40346148+collado-mike@users.noreply.github.com>