
Implement pagination for list APIs #273


Closed
wants to merge 29 commits

Conversation

@eric-maynard (Contributor) commented Sep 6, 2024

Description

The Apache Iceberg REST catalog spec describes how APIs like listNamespaces can support pagination, but Polaris does not currently implement it.

This PR implements pagination for listNamespaces, listViews, and listTables. It also introduces a framework that could be used to implement pagination for other APIs like listCatalogs.

Fixes #147

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Added tests in IcebergCatalogTest.

Manual testing confirms that the page-size limit applies, and that page tokens can be used to continue a listing operation:

curl -X GET "http://localhost:8181/api/catalog/v1/quickstart_catalog/namespaces" \
     -H "Authorization: Bearer principal:root;realm:default-realm" \
     -H "Content-Type: application/json" \
     && echo
{"namespaces":[["ns1"],["ns2"],["ns3"]],"next-page-token":null}

. . .

curl -X GET "http://localhost:8181/api/catalog/v1/quickstart_catalog/namespaces" \
     -H "Authorization: Bearer principal:root;realm:default-realm" \
     -H "Content-Type: application/json" \
     --get \
     --data "pageSize=1" \
     && echo
{"namespaces":[["ns1"]],"next-page-token":"cG9sYXJpczoxOjE="}

. . .

curl -X GET "http://localhost:8181/api/catalog/v1/quickstart_catalog/namespaces" \
     -H "Authorization: Bearer principal:root;realm:default-realm" \
     -H "Content-Type: application/json" \
     --get \
     --data "pageToken=cG9sYXJpczoxOjE=" \
     --data "pageSize=1" \
     && echo
{"namespaces":[["ns2"]],"next-page-token":"cG9sYXJpczoyOjE="}

@dbosco left a comment:

I can't resolve my own feedback comments. Can you close them? Thanks.

@dbosco commented Sep 30, 2024

@eric-maynard let me know if you need any further help on this PR.

github-actions bot commented Dec 5, 2024

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@dimas-b (Contributor) commented Feb 28, 2025

What's the plan for this PR? Do we want to revive it? :)

@singhpk234 (Contributor) left a comment:

Thank you, Eric!
The idea makes sense to me; I will need some more time to take a deeper look.

} else {
  // In this case, we cannot push the filter down into the query. We must therefore remove
  // the page size limit from the PageToken and filter on the client side.
  // TODO Implement a generic predicate that can be pushed down into different metastores
Contributor:

+1

Comment on lines +478 to +480
List<ModelEntity> rawData =
    this.store.lookupFullEntitiesActive(
        localSession.get(), catalogId, parentId, entityType, unlimitedPageSizeToken);
Contributor:

[doubt] Say the table has 10k records and we want 10 of them, with a filter applied. Since the filter is applied client-side, if the client requests a page size of 2, we will fetch the 10k records 5 times.

I was facing a similar issue and had 2 options:

  1. Load all the entities in one shot
  2. keep triggering the queries in loop until we get the requested page results.

I am not sure which is the lesser poison pill to swallow.

Contributor Author:

Luckily, we do not use this filter very often at all... but yes, this situation is less than ideal.

If the metastore itself provided a lazily-fetched Iterator here instead of a List (presumably with its own pagination/cache built into the client), we should definitely prefer to use that. But I think the right place for that logic is in the metastore itself and not at this layer, so I went with the simplest approach. I think we need a solution for predicate push-down in place eventually anyway.

There is also a third option: just return more results than the client asked for. This is actually what we do today, since we don't respect the page token.

@eric-maynard (Contributor Author) commented Mar 28, 2025:

Looking into this a little more, EclipseLink does support TypedQuery.getResultStream, so maybe it's as simple as just refactoring lookupFullEntitiesActive to return a Stream instead of a List. I'm not sure how well that actually performs though; we'd want to benchmark it.

edit: 😔
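
A rough sketch of that refactor, assuming a JPA 2.2+ TypedQuery (imports use the jakarta namespace; the method shape and parameters are illustrative, not the actual lookupFullEntitiesActive signature):

import jakarta.persistence.EntityManager;
import jakarta.persistence.TypedQuery;
import java.util.stream.Stream;

public class StreamingLookupSketch {
  // Returns a lazy Stream instead of a materialized List. Whether getResultStream
  // actually streams from the database depends on the JPA provider and JDBC driver,
  // hence the benchmarking caveat above. The caller must consume the stream while
  // the session is still open.
  Stream<ModelEntity> lookupFullEntitiesActive(
      EntityManager session, long catalogId, long parentId, int typeCode) {
    TypedQuery<ModelEntity> query =
        session.createQuery(
            "SELECT m FROM ModelEntity m WHERE m.catalogId = :catalogId "
                + "AND m.parentId = :parentId AND m.typeCode = :typeCode",
            ModelEntity.class);
    return query
        .setParameter("catalogId", catalogId)
        .setParameter("parentId", parentId)
        .setParameter("typeCode", typeCode)
        .getResultStream();
  }
}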

@singhpk234 (Contributor) commented Mar 31, 2025:

Does listing now need to be transactionally consistent? Consider, for example: we need to list records, and some record gets added while we are listing. What is the expected behaviour?

For example, the Snowflake-style ID generator that the NoSQL impl uses, or the ID generator that I am writing, just makes sure the entity ID is unique; it doesn't guarantee the IDs are incremental. Am I missing something?

@eric-maynard (Contributor Author) commented Mar 31, 2025:

I don't think this needs to be transactionally consistent. You can miss an entity, but you will never get the same entity twice.

Entity ID is guaranteed to always be unique.

I suppose there is a case where you transactionally add some table and drop some other table that was already returned in an existing page -- the list results in aggregate will be inconsistent with that transaction. But I think this is outside the realm of what we need to support for the time being. I would probably not want to fail either the list operation or the table update operation to ensure consistency when clients who care about transactional consistency have the option to list without a page token.

Member:

I disagree that inconsistent (duplicate or omitted) results are acceptable.

Member:

I also have a strong concern that there's more data processed here than absolutely necessary.

Contributor Author:

There are no duplicate or omitted results here if we define omitted as "not present in the results of a list when the entity was in fact present through the duration of the list". Results are only omitted when they are deleted in the middle of a list operation; that is, the entity is gone by the time we reach the page that should contain it.

@singhpk234 (Contributor) left a comment:

Thanks @eric-maynard, this is much needed for practical use cases! I think the approach overall looks good to me.

I have some open, orthogonal questions; maybe I am missing some context:

  1. We change BasePersistence / PolarisMetaStoreManager for this; are all stakeholders on board with that?
  2. Are entity IDs assumed to be monotonically increasing? Is this a requirement from Polaris? As far as I am aware, entity_id just needs to be unique.


PolarisConfiguration.<Boolean>builder()
    .key("LIST_PAGINATION_ENABLED")
    .catalogConfig("list-pagination.enabled")
    .description("If set to true, pagination for APIs like listTables is enabled")
Contributor:

Should we be explicit about which APIs we will support pagination for?

@@ -183,4 +183,12 @@ protected FeatureConfiguration(
"How many times to retry refreshing metadata when the previous error was retryable")
.defaultValue(2)
.buildFeatureConfiguration();

public static final PolarisConfiguration<Boolean> LIST_PAGINATION_ENABLED =
Contributor:

Can this be configured at the client level?

Contributor Author:

It should be there at both levels. The client can choose to call with or without a page token. But we need a feature flag on the server side.

Member:

But we need a feature flag on the server side.

What's the reason for that? Whether an Iceberg client supports pagination is clear from the request.

@eric-maynard (Contributor Author) commented Apr 1, 2025:

As the feature is somewhat large in scope, I think it makes sense to feature-flag it off. There are regression considerations; imagine you deploy Polaris with a metastore implementation which you later find performs poorly when page tokens are used. It could make sense to disable page token support for the time being to improve performance, even if clients are providing page tokens.

Do you think this should rather be a behavior-change flag?
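
For illustration, a rough sketch of what that server-side gate could look like; LIST_PAGINATION_ENABLED comes from the quoted config above, while the configuration lookup and the fromRequest/get helpers are assumed shapes, not actual Polaris APIs:

// Sketch: when the flag is off, the server ignores the client's pageToken/pageSize
// and lists everything; the response then carries "next-page-token": null, which
// spec-compliant clients treat as the last page.
boolean paginationEnabled =
    configurationStore.getConfiguration(callContext, FeatureConfiguration.LIST_PAGINATION_ENABLED);
PageToken token =
    paginationEnabled
        ? PageToken.fromRequest(pageTokenParam, pageSizeParam) // hypothetical helper
        : ReadEverythingPageToken.get();                       // hypothetical factory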

* @return list of tasks to be completed
*/
@Nonnull
EntitiesResult loadTasks(@Nonnull PolarisCallContext callCtx, String executorId, int limit);
EntitiesResult loadTasks(
    @Nonnull PolarisCallContext callCtx, String executorId, PageToken pageToken);
Contributor:

[orthogonal] Are we OK with changing this interface?
cc @dennishuo

@@ -29,6 +29,7 @@ dependencies {
implementation(project(":polaris-api-management-service"))
implementation(project(":polaris-api-iceberg-service"))
implementation(project(":polaris-api-catalog-service"))
implementation(project(":polaris-jpa-model"))
Contributor:

[doubt] Why do we need this here?

Contributor Author:

It's because ModelEntity is one of the types that EntityIdPageToken.updated is intended to support. It allows the metastore layer to update the existing page token with a List<ModelEntity>. If we can remove this dependency I'm okay with it. To be honest, this looked very different before the persistence refactor and this could be my misunderstanding when resolving conflicts.

Contributor:

The polaris-jpa-model module is only used by EclipseLink; what about the other persistence implementations, e.g., the JDBC one? I don't think the service module can depend on a specific persistence impl.

public class ReadEverythingPageToken extends PageToken {

  private ReadEverythingPageToken() {
    this.pageSize = Integer.MAX_VALUE;
Contributor:

[doubt] Do we have checks further down in the persistence layer to not push down a LIMIT when it is INT_MAX?

Contributor Author:

Currently we do actually push this limit down and apply it; it's up to each metastore how to implement this but both implementations seem to handle this okay.


public int pageSize;

public static final PageToken DONE = null;
Contributor:

Suggestion: NONE

@eric-maynard (Contributor Author) commented Mar 31, 2025:

Semantically, it's the content of the page-token in a response when there are no more pages; NONE sounds to me like it's intended to be used when there is no page token in the response at all. Of course, they are one and the same...

edit: One other thought -- when we have an Optional<PageToken>, it seems like there could be ambiguity between Optional.of(PageToken.NONE) and Optional.empty(). Maybe I only feel that way because Scala uses None for an empty option though

Contributor:

Minor: I'd suggest END, which seems a bit more natural than DONE

*/
public abstract class PageToken {

  public int pageSize;
Contributor:

Why is page size part of the page token? Why would constructing the next page from the token have to depend on this page size?

Also, page size may be restricted by the total serialized size, not just by the number of items.

@eric-maynard (Contributor Author) commented Mar 31, 2025:

It's mostly just convenient to pass them around together. It minimizes API changes, as you always need one if you need the other. It's also ergonomic to be able to do something like pageTokenBuilder.fromLimit and then treat the result just as you would a page token you got from a client.

There's also nothing inherently wrong about a page token that contains metadata about the requested page size. That said, PageToken could be a bit misleading if you're expecting it to 1:1 correspond to the page-token in the Iceberg spec. From that perspective it's more like a PageRequestDescriptor or something; I'm very open to changing the name and just reached for PageToken as the most obvious choice.

Also, page size may be restricted by the total serialized size, not just by the number of items.

Yeah the Iceberg spec just states:

an upper bound of the number of results that a client will receive

So I think int pageSize captures this, but in theory you could have different PageToken implementations that treat it differently. For example, you could have EntityIdPageTokenWithBytesLimit, where pageSize is treated as a number of bytes rather than a number of records. The metastore implementations that support this PageToken would have to be amended accordingly.

So far everything is based on entity count, so I didn't canonicalize this into the type structure, but I think refactoring it would make sense if we ever want such a PageToken (name pending).
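
To ground the discussion, a minimal sketch of the class shape being debated; pageSize, DONE, DEFAULT_PAGE_SIZE, and the updated hook all appear in this PR's diffs and comments, while the rest is illustrative:

import java.util.List;

// Sketch: the "token" carries both the resume cursor (in subclasses) and the
// requested page size, which is why PageRequestDescriptor is floated as a more
// honest name than PageToken.
public abstract class PageToken {
  public int pageSize;

  public static final PageToken DONE = null; // returned when there are no more pages
  public static final int DEFAULT_PAGE_SIZE = 1000;

  // Derive the token for the next page from the entities just served; subclasses
  // track their own cursor (greatest entity ID, offset, ...).
  public abstract PageToken updated(List<?> newData);
}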

Member:

I noticed this elsewhere as well. A page token is an opaque server-generated value; page size is a hint from the client. Two very different things.

Contributor Author:

"Opaque" means it can in fact contain the page size :)

But we are conflating two things. The real issue here is that we have a class called PageToken which does not represent the page-token parameter that the API receives/returns. It covers both that and page-size. If this is contentious, I suggest the alternative name PageRequestDescriptor. WDYT?

Contributor:

I'd prefer a name like PageRequest.

@eric-maynard (Contributor Author) commented May 4, 2025:

I'm a little apprehensive about PageRequest because it sounds similar to java.net.http.HttpRequest, and I think it might imply some CDI-related stuff (such as being RequestScoped). But maybe we can hash this out on the new interface-only PR.

/**
 * A {@link PageToken} implementation that tracks the greatest ID from either {@link
 * PolarisBaseEntity} or {@link ModelEntity} objects supplied in updates. Entities are meant to be
 * filtered during listing such that only entities with an ID greater than the ID of the token are
Contributor:

Do we assume that entities are loaded in the order of their IDs?

Contributor Author:

Metastores that use this PageToken implementation must enforce this.

Contributor Author:

If we want to add some new method to the metastore to canonicalize this check I think we can. The current check is similar to some other checks we have throughout the codebase that enforce a certain entity type is passed in.

Contributor:

Oops, I missed that this page token was specific to JPA 🤦

@eric-maynard (Contributor Author) commented Apr 1, 2025:

Yeah, it's a little brittle though. Related to your comment above about the varying semantics of pageSize, it could make sense to add a method to the metastore interface like supportedPageTokenTypes or something.

Member:

Noting: the order of values returned by database sequences is not necessarily strictly increasing. Aka: it is wrong to assume that each value retrieved from a database sequence is higher than any value retrieved before.

Contributor Author:

It's not an assumption; it's something the metastore has to enforce in order for this page token to be a valid implementation to use.

"SELECT m from ModelEntity m "
+ "where m.catalogId=:catalogId and m.parentId=:parentId and m.typeCode=:typeCode and m.id > :tokenId";

if (pageToken instanceof EntityIdPageToken) {
Contributor:

What if it is not an EntityIdPageToken?

Contributor Author:

Then we should have failed above.

Contributor:

Not quite... The check is an OR 🤔

Contributor:

I guess what I meant to say is that having logic conditional on the runtime type of the PageToken looks risky to me... and hard to evolve/maintain.

I'm totally fine with each metastore impl having its own PageToken implementation, but I do not think token polymorphism makes sense at this level.

@eric-maynard (Contributor Author) commented Apr 1, 2025:

Ah, I think I understand your comment better now. You are pointing out that depending on the page token type, the results may or may not get sorted by entity ID? That's true.

Not sorting is more of an optimization for the case where we have a ReadEverythingPageToken (i.e. no page token). We could remove this optimization and always sort, but I added it to improve performance as in the majority of cases we have no page token.

In fact, I am trying not to couple page token implementations with metastore implementations here. In theory, the treemap metastore could also work with EntityIdPageToken.

I agree that the runtime check is a little awkward -- although we have such runtime checks all over the place currently. Is there a way you think we can achieve this same behavior without one?

I'm really hesitant to just always sort here. On the other hand, I think adding methods to PageToken that are really only relevant to one implementation (e.g. requiresSortingByEntityId) does not make a ton of sense from the interface perspective. Further, I think flipping the dependency between the metastore and page token types (e.g. PageToken::rewriteEclipseLinkQueryAsNeeded) is not a good pattern either.

Contributor:

Thx for that extra insight. I need some time for a deeper review 😅 ⏳
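
For readers following this thread, a rough sketch of the conditional dispatch being debated; the query text mirrors the quoted diff, while the field and method names are assumptions:

import jakarta.persistence.EntityManager;
import jakarta.persistence.TypedQuery;

class ListQuerySketch {
  // Only EntityIdPageToken needs the id filter and id ordering; for
  // ReadEverythingPageToken the store skips both, which is the optimization
  // described above. The page-size limit is pushed down in either case.
  // The caller still binds :catalogId, :parentId, and :typeCode.
  TypedQuery<ModelEntity> buildListQuery(EntityManager session, PageToken pageToken) {
    String base =
        "SELECT m FROM ModelEntity m WHERE m.catalogId = :catalogId "
            + "AND m.parentId = :parentId AND m.typeCode = :typeCode";
    TypedQuery<ModelEntity> query;
    if (pageToken instanceof EntityIdPageToken) {
      query = session.createQuery(base + " AND m.id > :tokenId ORDER BY m.id", ModelEntity.class);
      query.setParameter("tokenId", ((EntityIdPageToken) pageToken).id); // assumed field name
    } else {
      query = session.createQuery(base, ModelEntity.class);
    }
    return query.setMaxResults(pageToken.pageSize); // Integer.MAX_VALUE for read-everything
  }
}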

@eric-maynard (Contributor Author) commented Apr 1, 2025

@singhpk234

We change the BasePersistence / PolarisMetaStoreManager for this, are all stakeholders onboard for this ?

I think these interfaces probably must change in order to support efficient pagination, but I leave it to the reviewers here to determine if the feature is worth the changes.

Are Entity-Ids assumed to be monotonically increasing ? is this a requirement from Polaris ? As entity_id just needs to unique from what i am aware.

They are not assumed to be monotonically increasing by this design, and it's not a requirement. In the EclipseLink store here, we sort by entity ID and return entities in that order because it's durable against renames, unlike sorting by identifier. But we do not care if some new table is created at a lower entity ID than the current "tail" of the paginated listing operation.

This does mean that the consistency semantics are somewhat nondeterministic from a user POV. That is, if you create a table during a paginated list operation it's not clear whether that table will show up in the list or not. I think this is actually okay though, as you are literally racing one operation against the other.

It's okay (if not ideal) to miss tables that are created after the listing operation is initiated. It's not okay to double-count a renamed table or to miss a table that is actually present during the entire list. The current PR satisfies these principles.

We could make this more deterministic/consistent by adding an explicit predicate to the list operation to filter out tables that were created after the listing operation started. Happy to change this and take it on as another principle to adhere to. My reasoning has been that if we happen to discover a table created during the list we might as well include it.

* are meant to be filtered during listing such that when a token with offset N is supplied, the
* first N records are omitted from the results.
*/
public class OffsetPageToken extends PageToken {
Member:

This mechanism can lead to duplicate results and/or omitted results.
The order of the entities returned is undefined - it can vary on every list() invocation.
Even if the order is (somewhat?) deterministic, an entity from a previous page that no longer exists will lead to a missing result; and vice versa, an entity added "in between" a previous page will lead to duplicate results.

@eric-maynard (Contributor Author) commented Apr 1, 2025:

Whether or not this is true depends on the metastore implementation. Just as with the other PageToken implementation, the responsibility falls on the metastore to use a PageToken implementation that enforces the correct semantics. See this comment for another example of this.

i.e. EntityIdPageToken is only usable by a metastore if it can enforce that the entities in one list operation can be made to always be > the entity ID described by a page token, and that this will result in no duplicates or omitted results. The same is true for OffsetPageToken, but with offsets instead of entity IDs.

Notably, OffsetPageToken is not really usable by the EclipseLink metastore right now but it's rather used for testing and just as an example.
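
For contrast, a minimal sketch of the offset variant, extending the PageToken shape sketched earlier; the offset field follows the quoted javadoc, the rest is illustrative:

import java.util.List;

// Sketch: the cursor is a row index. The next token skips everything already served,
// which is only safe if the metastore can guarantee a stable ordering across
// requests; otherwise pages can drop or duplicate entries, as noted above.
public class OffsetPageToken extends PageToken {
  public int offset;

  @Override
  public PageToken updated(List<?> newData) {
    OffsetPageToken next = new OffsetPageToken();
    next.offset = this.offset + newData.size();
    next.pageSize = this.pageSize;
    return next;
  }
}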



* A {@link PageToken} implementation for readers who want to read everything. The behavior when
* using this token should be the same as when reading without a token.
*/
public class ReadEverythingPageToken extends PageToken {
Member:

This mixes two different things:
A "page token" is meant as an (opaque) piece of data to indicate where the next page of results should start.
"Read everything" is a filter criterion, not a page token.

Contributor Author:

See this comment for details on why these things are packaged together and named the way they are. The proposed alternative name there is PageRequestDescriptor.



@Nonnull Function<PolarisBaseEntity, T> transformer,
@Nonnull PageToken pageToken) {
List<T> data;
if (entityFilter.equals(Predicates.alwaysTrue())) {
Member:

Does Predicates.alwaysTrue return something that can be compared, and will that hold true in the future?
Predicates are rather functions/lambdas - those are not meant to be compared with anything.

Contributor Author:

That's right. This is only meant to capture the predicate that's provided in the method overload that doesn't have a predicate, e.g. here. It is not meant to capture any predicate that's semantically "always true".
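
In other words, the equality check is really a sentinel check: Guava's Predicates.alwaysTrue() returns a shared singleton, so equals() matches only the value supplied by the predicate-less overload. A named sentinel would make that intent explicit (sketch, with assumed names):

import com.google.common.base.Predicate;
import com.google.common.base.Predicates;

class ListHelperSketch {
  // MATCH_ALL is the same singleton the predicate-less overload supplies, so a
  // reference comparison suffices; an arbitrary predicate that happens to always
  // return true will (intentionally) not take the pushed-down branch.
  static final Predicate<Object> MATCH_ALL = Predicates.alwaysTrue();

  static boolean canPushDownPagination(Predicate<Object> entityFilter) {
    return entityFilter == MATCH_ALL;
  }
}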


import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PageTokenTest {
Member:

I wish there were much better test coverage of all the code and use cases that are changed in this PR.

@eric-maynard (Contributor Author) commented Apr 1, 2025:

Yeah, this is fair. My understanding was that we may get quite a bit of testing for paginated listing "for free" from the Iceberg tests. Is that true, after all?

.description(
    "If set to true, pagination for APIs like listTables is enabled. The APIs that"
        + " currently support pagination are listTables, listViews, and listNamespaces.")
.defaultValue(false)
Contributor:

+1 on false by default.

@flyrain (Contributor) left a comment:

Thanks @eric-maynard for working on it.

A cursor-based approach like EntityIdPageToken is admittedly more involved, but it gives us better consistency, stability, and performance at scale, so I’d prefer we go with that. I’m also not sure we need to support multiple approaches at this point.

The current PR touches quite a few areas, which makes it a bit hard to review in one go. Would it make sense to split it?

  1. First PR: focus on the pagination interfaces and add support in the in-memory metastore, so we can align on the contract.
  2. Follow-up PRs: add JDBC/EclipseLink support once the interface is settled.

WDYT?

* A wrapper for a {@link List} of data and a {@link PageToken} that can be used to continue the
* listing operation that generated that data.
*/
public class PolarisPage<T> {
Contributor:

OK with either, but leaning toward Page, which is not only shorter but also consistent with the naming within the pagination package.

Comment on lines +28 to +29
public final PageToken pageToken;
public final List<T> data;
Contributor:

The public scope is generally fine here, but the concern is that we will be in a bad situation if we ever need to add any processing on either of the fields. I still recommend making them private and using getters.

Contributor:

Nit: data -> items or entities

@eric-maynard (Contributor Author) commented May 3, 2025:

a bad situation if we ever need to add any processing on either of the fields.

Can you say more about this? The fields are final, so using a getter is functionally the same as just using the member.

This is essentially a record, but it lives in a module that uses Java 11.

Contributor:

The public scope makes sense if they are final.
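
For reference, the class under discussion is essentially the following (the two public final fields appear in the quoted diff; the constructor is assumed):

import java.util.List;

// Sketch: an immutable pair of one page of results and the token needed to continue
// the listing; pageToken is PageToken.DONE (null) when this is the last page. A
// hand-rolled record, since the module targets Java 11.
public class PolarisPage<T> {
  public final PageToken pageToken;
  public final List<T> data;

  public PolarisPage(PageToken pageToken, List<T> data) {
    this.pageToken = pageToken;
    this.data = data;
  }
}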



public int pageSize;

public static final PageToken DONE = null;
public static final int DEFAULT_PAGE_SIZE = 1000;
Contributor:

Are we going to make it configurable at the server side? Not a blocker.


* filtered during listing such that only entities with an ID greater than the ID of the token are
* returned.
*/
public class EntityIdPageToken extends PageToken {
Contributor:

The JPA-model module should only be used by EclipseLink. We need to either find a common place to put this class, or duplicate it in the different persistence implementations.

* are meant to be filtered during listing such that when a token with offset N is supplied, the
* first N records are omitted from the results.
*/
public class OffsetPageToken extends PageToken {
Contributor:

Do we need an offset-based approach? I think we could use just one approach (the cursor-based one, like EntityIdPageToken) everywhere in Polaris.

Contributor Author:

The in-memory metastore doesn't necessarily sort on Entity ID, and in the future we could paginate things that don't have an entity ID.

Contributor:

The in-memory metastore doesn't necessarily sort on Entity ID

I think it's fine to sort on ID; the perf concern would be minor since the in-memory metastore won't be used at scale.

in the future we could paginate things that don't have an entity ID.

We could still use a cursor-based approach when entity id is missing. The offset approach isn't necessary.

Contributor Author:

The offset is the cursor. The cursor is based on row index.


@eric-maynard (Contributor Author):
@flyrain, please see #1528

github-project-automation bot moved this from PRs In Progress to Done in Basic Kanban Board on May 5, 2025
fivetran-kostaszoumpatianos pushed a commit to fivetran/polaris that referenced this pull request Jun 18, 2025
Successfully merging this pull request may close these issues.

[BUG] Pagination is not working for List namespaces
6 participants