Contributor

@adutra adutra commented May 23, 2025

As discussed in the ML, this PR introduces two flags to enable the inclusion of realm ID tags in API and HTTP metrics.

They are both disabled by default.

There is also a new safeguard: if the cardinality of realm IDs in HTTP metrics goes above a configurable threshold (100 by default), a warning is printed and no more HTTP metrics will be recorded. (Quarkus has a similar safeguard for URI tags in HTTP metrics.)
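The safeguard described above can be pictured with a small, self-contained sketch. This is not the code from the PR: the class name `RealmIdCardinalityGuard` and its methods are hypothetical, the warning goes to `System.err` instead of a real logger, and the choice to keep admitting realm IDs that were seen before the limit was reached is an assumption that the description does not spell out.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical guard; names are illustrative and not taken from the Polaris code base. */
final class RealmIdCardinalityGuard {
  private final int maxCardinality;
  private final Set<String> seenRealmIds = ConcurrentHashMap.newKeySet();
  private final AtomicBoolean warned = new AtomicBoolean(false);

  RealmIdCardinalityGuard(int maxCardinality) {
    this.maxCardinality = maxCardinality;
  }

  /** Returns true if a metric tagged with this realm ID may be recorded. */
  boolean tryAdmit(String realmId) {
    if (seenRealmIds.contains(realmId)) {
      return true; // assumption: realm IDs admitted earlier keep being recorded
    }
    synchronized (this) {
      if (seenRealmIds.contains(realmId)) {
        return true; // another thread admitted it in the meantime
      }
      if (seenRealmIds.size() >= maxCardinality) {
        // Warn only once, then silently reject further new realm IDs.
        if (warned.compareAndSet(false, true)) {
          System.err.println(
              "WARN: realm ID cardinality limit (" + maxCardinality + ") reached");
        }
        return false;
      }
      seenRealmIds.add(realmId);
      return true;
    }
  }

  /** Number of distinct realm IDs admitted so far. */
  int observedCardinality() {
    return seenRealmIds.size();
  }

  public static void main(String[] args) {
    RealmIdCardinalityGuard guard = new RealmIdCardinalityGuard(2);
    for (String realm : new String[] {"realm-a", "realm-b", "realm-a", "realm-c"}) {
      System.out.println(realm + " -> " + guard.tryAdmit(realm));
    }
  }
}
```

With a limit of 2, the third distinct realm ID is rejected while the two realm IDs already seen remain admitted. The set only grows up to the configured limit, which is what bounds memory use.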

Contributor

@dimas-b dimas-b left a comment

Changes LGTM 👍

Re: the conventional commits message, I suppose this may need to be flagged as a breaking change. From my POV, metrics constitute an API of the server too (an API for observing metric values). No longer reporting realm ID tags is akin to removing a property from a JSON response. WDYT?

Contributor

nit: this reads a bit odd to me. It feels like all API metrics are disabled. How about enabledInApiMetrics?

Comment on lines 48 to 58
Member

I think you can eliminate the overrides with

@ConfigProperty(name = "polaris.metrics.realm-id-tag.api-metrics-enabled")
boolean apiMetricsEnabled;
@ConfigProperty(name = "polaris.metrics.realm-id-tag.http-metrics-enabled")
boolean httpMetricsEnabled;

and read those fields in the overridden functions.

Member

snazy commented May 23, 2025

> Re: the conventional commits message, I suppose this may need to be flagged as a breaking change. From my POV, metrics constitute an API of the server too (an API for observing metric values). No longer reporting realm ID tags is akin to removing a property from a JSON response. WDYT?

Worth an explicit mention, yea

adutra added 2 commits May 23, 2025 16:00
BREAKING CHANGE: The realm_id metric tag is not emitted by default anymore. To retrieve the previous behavior, add
@adutra adutra force-pushed the realm-id-cardinality-mitigation branch from 8f430d5 to cb48a73 Compare May 23, 2025 14:00
Contributor Author

adutra commented May 23, 2025

> Re: the conventional commits message, I suppose this may need to be flagged as a breaking change. From my POV, metrics constitute an API of the server too (an API for observing metric values). No longer reporting realm ID tags is akin to removing a property from a JSON response. WDYT?

Modified the initial commit to include a BREAKING CHANGE line. Is there another place to announce that kind of change?

snazy previously approved these changes May 23, 2025
Member

@snazy snazy left a comment

LGTM

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 23, 2025
dimas-b previously approved these changes May 23, 2025
Contributor

@dimas-b dimas-b left a comment

nit: maybe add BREAKING CHANGE to the PR description too?

@adutra adutra dismissed stale reviews from dimas-b and snazy via 5825411 May 23, 2025 15:21
Member

@snazy snazy left a comment

LGTM!
High-cardinality values in metric tags are a source of memory issues. Disabling the realm-id metric tags by default is the right approach because it protects the system out of the box. The docs clearly point out that enabling them can put the stability of Polaris at risk, and the safeguard of 100 (default) distinct realm IDs adds another level of protection. +1 on the "safety first" approach.

Comment on lines +64 to +67
crashing the server, the number of unique realm IDs in HTTP request metrics is limited to 100 by
default. If the number of unique realm IDs exceeds this value, a warning will be logged and no more
HTTP request metrics will be recorded. This threshold can be changed by setting the
`polaris.metrics.realm-id-tag.http-metrics-max-cardinality` property.
Contributor

To clarify, this only applies when `polaris.metrics.realm-id-tag.enable-in-http-metrics` is set?

Contributor Author

Yes.
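The interaction confirmed here, that the cardinality threshold only matters once the HTTP realm ID tag is enabled, can be sketched as a small decision function. Everything in this block is hypothetical (`HttpRealmTagPolicy`, `Decision`, `decide`), and it assumes that previously seen realm IDs continue to be recorded after the limit is hit. The actual PR may structure this differently.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper; names are illustrative and not taken from the Polaris code base. */
final class HttpRealmTagPolicy {
  enum Decision {
    RECORD_WITHOUT_REALM_TAG,
    RECORD_WITH_REALM_TAG,
    DROP
  }

  private final boolean enableInHttpMetrics;
  private final int maxCardinality;
  private final Set<String> seenRealmIds = new HashSet<>();

  HttpRealmTagPolicy(boolean enableInHttpMetrics, int maxCardinality) {
    this.enableInHttpMetrics = enableInHttpMetrics;
    this.maxCardinality = maxCardinality;
  }

  synchronized Decision decide(String realmId) {
    if (!enableInHttpMetrics) {
      // Flag off: no realm tag is attached, so the threshold is never consulted.
      return Decision.RECORD_WITHOUT_REALM_TAG;
    }
    if (seenRealmIds.contains(realmId)) {
      return Decision.RECORD_WITH_REALM_TAG; // assumption: known realms keep being recorded
    }
    if (seenRealmIds.size() >= maxCardinality) {
      return Decision.DROP; // limit reached: metrics for new realm IDs are not recorded
    }
    seenRealmIds.add(realmId);
    return Decision.RECORD_WITH_REALM_TAG;
  }

  public static void main(String[] args) {
    HttpRealmTagPolicy disabled = new HttpRealmTagPolicy(false, 1);
    HttpRealmTagPolicy enabled = new HttpRealmTagPolicy(true, 1);
    for (String realm : new String[] {"realm-a", "realm-b"}) {
      System.out.println(
          realm + ": disabled=" + disabled.decide(realm) + ", enabled=" + enabled.decide(realm));
    }
  }
}
```

With the flag off, every request is recorded without a realm tag regardless of how many realms exist. With the flag on and a limit of 1, the second distinct realm ID is dropped.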

@eric-maynard eric-maynard merged commit 97d16d3 into apache:main May 28, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board May 28, 2025
@adutra adutra deleted the realm-id-cardinality-mitigation branch May 28, 2025 15:46
snazy added a commit to snazy/polaris that referenced this pull request Jun 13, 2025
* main: Update dependency org.postgresql:postgresql to v42.7.6 (apache#1697)

* main: Update helm/chart-testing-action action to v2.7.0 (apache#1700)

* main: Update gradle/actions digest to 8379f6a (apache#1696)

* main: Update medyagh/setup-minikube action to v0.0.19 (apache#1698)

* feat(metrics): Mitigate potential performance issues with realm_id tag (apache#1662)

As discussed in the ML, this PR introduces two flags to enable the inclusion of realm ID tags in API and HTTP metrics.

They are both disabled by default.

There is also a new safeguard: if the cardinality of realm IDs in HTTP metrics goes above a configurable threshold (100 by default), a warning is printed and no more HTTP metrics will be recorded. (Quarkus has a similar safeguard for URI tags in HTTP metrics.)

* Site/contributing: add recommendations for working with PRs (apache#1625)

This change updates the PR guidelines on the "Contributing" web page after [this discussion](https://lists.apache.org/thread/kfxo3cqmw3pgrpgtgqvqpwvn46yw8q7h).

Also change `gradlew test` to `gradlew check` in the README, following the intent (all tests, including ITs)

* main: Update dependency com.azure:azure-sdk-bom to v1.2.35 (apache#1703)

* main: Update dependency com.adobe.testing:s3mock-testcontainers to v4.4.0 (apache#1705)

* main: Update dependency boto3 to v1.38.24 (apache#1702)

* Keep generated RSA-key-pair for JWT token broker on heap (apache#1661)

Polaris allows using RSA key pairs for the JWT token broker. The recommended way is to [generate the RSA key pair](https://github.com/apache/polaris/blob/d8b862b13914d526ee147dc0e359bfc9c1e319ad/site/content/in-dev/unreleased/configuring-polaris-for-production.md?plain=1#L61-L66) and configure the location of the key files.

However, if only `polaris.authentication.token-broker.type=rsa-key-pair` is configured, without the `public/private-key-pair` options, Polaris generates the keys and stores them in `/tmp` under random file names (using `Files.createTempFile()`); this happens for each (matching) realm. Each Polaris startup generates new key pairs for each of those realms. It is practically impossible to associate the files with a realm. There is already a [production readiness check](https://github.com/apache/polaris/blob/d8b862b13914d526ee147dc0e359bfc9c1e319ad/quarkus/service/src/main/java/org/apache/polaris/service/quarkus/config/ProductionReadinessChecks.java#L118-L166) to warn users about this behavior.

Because the files cannot be associated with a realm, they are of little use and bring no advantage over keeping these "ephemeral RSA key pairs" on heap. This PR changes the code to keep these "ephemeral key pairs" on heap instead of writing them to the file system. Since the same code path is used for key pairs _provided_ by the user (via the `public/private-key-pair` config options), that code path now reads those files only once rather than every time the private/public key is needed.

* Merge JPA module with EclipseLink Module (apache#1718)

* Create LICENSE and NOTICE for "single" distribution (apache#1694)

* Fix a failing task with the release profile (apache#1693)

* Remove unused adminDocs artifact (apache#1749)

* Production readiness for Persistence (apache#1707)

* Fixes for direct usage of client_secret apache#1756

When the spec was upgraded and the python client regenerated from it, clientSecret was made a password, which means calling str on it directly yields a redacted string like ******. In the initial PR to change the python client and fix regtests, some existing usage of client_secret was not changed.

* main: Update dependency org.junit:junit-bom to v5.13.0 (apache#1760)

* main: Update dependency org.testcontainers:testcontainers-bom to v1.21.1 (apache#1748)

* fix: Improve reliability of metrics tests (apache#1763)

CI sometimes fails with errors like "http_server_requests_seconds not found"
in the reported metrics.

This looks like a race between the Quarkus metrics producer and the
tests asking for these metrics.

This change adds a time-limited retry loop until the expected metrics
are available, before proceeding with other assertions.

Note: in normal cases the loop finishes fast because the metrics are
available. The two-minute timeout would apply only when the expected
metrics fail to be produced at all.

* Fix test_spark_credentials_s3_exception_on_metadata_file_deletion (apache#1759)

* Regenerate bundled spec & Regenerate Python client (apache#1751)

I ran these commands from main:

```
redocly bundle spec/polaris-catalog-service.yaml -o spec/generated/bundled-polaris-catalog-service.yaml
./gradlew regeneratePythonClient
```

I didn't realize before that some Python types are generated from the bundled spec, so some of the fixes from apache#1347 didn't get properly applied before.

* main: Update dependency boto3 to v1.38.27 (apache#1714)

* NoSQL: bump Weld/Junit5 (fixes a bug that surfaces w/ JUnit 5.13)

* NoSQL: Let some more tests leverage Jandex

* Info: Last merged commit b7aac72

---------

Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: Alexandre Dutra <adutra@users.noreply.github.com>
Co-authored-by: Yufei Gu <yufei@apache.org>
Co-authored-by: JB Onofré <jbonofre@apache.org>
Co-authored-by: Prashant Singh <35593236+singhpk234@users.noreply.github.com>
Co-authored-by: Eric Maynard <eric.maynard+oss@snowflake.com>
Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@gmail.com>
Co-authored-by: gh-yzou <167037035+gh-yzou@users.noreply.github.com>