@XJDKC
Member

@XJDKC XJDKC commented Nov 1, 2025

Currently, RESTCatalog allows users to replace components such as RESTClient, FileIO, AuthManager, and MetricsReporter (with the logic handled in RESTSessionCatalog).

However, one dependent component that remains non-injectable is RESTTableOperations.

This PR adds support for injecting custom implementations of table and view operations in RESTCatalog, enabling users to extend and customize REST catalog behavior more easily. It doesn't change any existing functionality.

This PR also allows users to pass in an ioBuilder. RESTSessionCatalog already accepts an ioBuilder to build the FileIO, but RESTCatalog doesn't expose this ability.

  public RESTCatalog(
      Function<Map<String, String>, RESTClient> clientBuilder,
      BiFunction<SessionCatalog.SessionContext, Map<String, String>, FileIO> ioBuilder,
      RESTOperationsFactory operationsFactory) {
    this(SessionCatalog.SessionContext.createEmpty(), clientBuilder, ioBuilder, operationsFactory);
  }

  public RESTCatalog(
      SessionCatalog.SessionContext context,
      Function<Map<String, String>, RESTClient> clientBuilder,
      BiFunction<SessionCatalog.SessionContext, Map<String, String>, FileIO> ioBuilder,
      RESTOperationsFactory operationsFactory) {
    this.sessionCatalog = new RESTSessionCatalog(clientBuilder, ioBuilder, operationsFactory);
    // .....
  }
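
For readers unfamiliar with the factory pattern this revision proposes, here is a minimal sketch. It does not use the real Iceberg classes: `FactoryDemoOps`, `FactoryDemoOperationsFactory`, and `FactoryDemoCatalog` are hypothetical stand-ins for RESTTableOperations, RESTOperationsFactory, and RESTCatalog, and their signatures are simplified assumptions. It only illustrates how a factory with a default method can keep the built-in behavior while letting callers swap in their own operations.

```java
// Stand-in for RESTTableOperations; the real class takes more arguments.
class FactoryDemoOps {
  private final String tablePath;

  FactoryDemoOps(String tablePath) {
    this.tablePath = tablePath;
  }

  String describe() {
    return "default ops for " + tablePath;
  }
}

// Stand-in for the proposed factory: the default method preserves built-in behavior.
interface FactoryDemoOperationsFactory {
  default FactoryDemoOps createTableOperations(String tablePath) {
    return new FactoryDemoOps(tablePath);
  }
}

// Stand-in for a catalog that receives the factory through its constructor.
class FactoryDemoCatalog {
  private final FactoryDemoOperationsFactory operationsFactory;

  FactoryDemoCatalog(FactoryDemoOperationsFactory operationsFactory) {
    this.operationsFactory = operationsFactory;
  }

  String loadTable(String tablePath) {
    return operationsFactory.createTableOperations(tablePath).describe();
  }
}

class FactoryInjectionSketch {
  public static void main(String[] args) {
    // No overrides: behaves like the built-in implementation.
    FactoryDemoCatalog standard =
        new FactoryDemoCatalog(new FactoryDemoOperationsFactory() {});

    // Custom factory that returns a subclass of the default operations.
    FactoryDemoCatalog custom =
        new FactoryDemoCatalog(
            new FactoryDemoOperationsFactory() {
              @Override
              public FactoryDemoOps createTableOperations(String tablePath) {
                return new FactoryDemoOps(tablePath) {
                  @Override
                  String describe() {
                    return "audited ops for " + tablePath;
                  }
                };
              }
            });

    System.out.println(standard.loadTable("db/t"));
    System.out.println(custom.loadTable("db/t"));
  }
}
```

The default method is what would keep existing callers unaffected: passing a factory that overrides nothing yields the same operations as before.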

@github-actions github-actions bot added the core label Nov 1, 2025
@XJDKC XJDKC force-pushed the rxing-rest-operations-builder branch from c5a8e9a to d849fce Compare November 1, 2025 16:53
* @param endpoints the set of supported REST endpoints
* @return a new RESTViewOperations instance
*/
default RESTViewOperations createViewOperations(
Contributor

How is FileIO handled for view operations?

Member Author

IIUC, FileIO is not required for view operations, because Iceberg views are logical objects that contain only metadata (SQL definitions, schemas, and versions) and do not read or write any physical files.

When a user runs a query against a view, the query engine expands the view's SQL definition, compiles it into a query plan, and resolves the underlying tables. At that point, the engine loads the actual table objects (which include TableOperations and FileIO) to read the physical data files.

@XJDKC
Member Author

XJDKC commented Nov 6, 2025

cc: @flyrain @stevenzwu @huaxingao Could you pls take a look when you get a chance? Thanks! 🙏

Contributor

@flyrain flyrain left a comment

Thanks @XJDKC for the change. Left some comments.

Comment on lines 77 to 82
  public RESTCatalog(
      Function<Map<String, String>, RESTClient> clientBuilder,
      BiFunction<SessionCatalog.SessionContext, Map<String, String>, FileIO> ioBuilder,
      RESTOperationsBuilder operationsBuilder) {
    this(SessionCatalog.SessionContext.createEmpty(), clientBuilder, ioBuilder, operationsBuilder);
  }
Contributor

Is this method necessary if we go with this(SessionCatalog.SessionContext.createEmpty(), clientBuilder); in line 68?

Member Author

RESTSessionCatalog allows us to pass in the ioBuilder, but RESTCatalog doesn't, so I added it to the RESTCatalog constructor as well.

  import org.apache.iceberg.util.LocationUtil;

- class RESTTableOperations implements TableOperations {
+ public class RESTTableOperations implements TableOperations {
Contributor

Do we need this scope change?

Member Author

If users only want to make small adjustments to RESTTableOperations (for example, injecting a custom header), they can simply provide a custom implementation that extends RESTTableOperations, without having to copy the entire class.

This makes it much easier for them to upgrade to newer Iceberg SDK versions without dealing with merge conflicts or duplicated code.

I'm okay with either approach here, don't have a strong preference. WDYT?
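
As a rough illustration of the "small adjustment" case described above, the sketch below uses simplified stand-in classes rather than the real Iceberg types. `BaseTableOps` plays the role of RESTTableOperations, and `headers()` is a hypothetical hook, not a method confirmed to exist on the real class. It only shows why a public, non-final class lets a subclass add one header without copying the rest of the implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Stand-in for RESTTableOperations; the real class has a different constructor and API.
class BaseTableOps {
  private final String tablePath;

  BaseTableOps(String tablePath) {
    this.tablePath = tablePath;
  }

  // Hypothetical hook: headers sent with every request for this table.
  protected Map<String, String> headers() {
    Map<String, String> headers = new HashMap<>();
    headers.put("Accept", "application/json");
    return headers;
  }

  // Simulates a metadata refresh by describing the request that would be sent.
  // TreeMap is used only to print the headers in a stable (sorted) order.
  String refresh() {
    return "GET " + tablePath + " " + new TreeMap<>(headers());
  }
}

// A platform-specific subclass overrides one method instead of copying the class.
class AuditedTableOps extends BaseTableOps {
  private final String auditId;

  AuditedTableOps(String tablePath, String auditId) {
    super(tablePath);
    this.auditId = auditId;
  }

  @Override
  protected Map<String, String> headers() {
    Map<String, String> headers = super.headers();
    headers.put("X-Audit-Id", auditId); // hypothetical custom header
    return headers;
  }
}

class HeaderInjectionSketch {
  public static void main(String[] args) {
    BaseTableOps ops = new AuditedTableOps("/v1/namespaces/db/tables/t", "job-42");
    System.out.println(ops.refresh());
  }
}
```

Because only `headers()` is overridden, any later change to the base class's `refresh()` logic is picked up by the subclass automatically, which is the upgrade benefit argued for above.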

  import org.apache.iceberg.view.ViewOperations;

- class RESTViewOperations implements ViewOperations {
+ public class RESTViewOperations implements ViewOperations {
Contributor

Do we need this scope change?

Member Author

Same as above

@ggershinsky
Contributor

ggershinsky commented Nov 11, 2025

A couple of questions wrt encrypted tables:

  1. What if the encryption.key-id table property is set (https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableProperties.java#L391) but a custom TO implementation ignores it? Do users expect a table to be encrypted if encryption.key-id is set? Should the implementors of custom TOs validate them by running the Iceberg unit tests (including TestTableEncryption)?

  2. The standard RESTTableOperations class, built into Iceberg, uses a safe approach to getting the metadata object (directly from the REST catalog server, never from the metadata.json file, which can be kept in untrusted storage). Can custom TO replacements behave differently in this respect?

If any of these points is a concern, then I believe it can be addressed just by adding a few lines to the RESTOperationsBuilder javadoc API comments. What do you think?

@gaborkaszab
Collaborator

Hey @XJDKC,

Just for my information, would you mind explaining a bit more about the motivation and a more concrete use-case where this is needed? Is there a particular functionality in RESTTableOperations that you miss and would be interested in using? My initial gut feeling tells me that exposing table ops and making it injectable is a bit wild. I'm wondering what others think, though.
Technically, if we want this to be injected, shouldn't we expect an interface from the API module as the input param, that is in turn implemented in the core module?

Just an additional nit is that this PR seems to add 2 different changes: A way to inject an IOBuilder through the RESTCatalog (currently only RESTSessionCatalog has this) and a way to inject a REST ops builder. Would it make sense to split these into 2 PRs and test them separately?

@XJDKC
Member Author

XJDKC commented Nov 11, 2025

Should the implementors of custom TOs validate them by running the Iceberg unit tests?

For a custom TableOperations (TO) implementation, the responsibility lies entirely with the implementer. They must ensure proper handling of encryption and any other security-sensitive logic.

That said, even for a custom implementation, I'd expect most of the core logic to remain unchanged and continue to rely on the unmanaged components provided by the Iceberg SDK.

We can add some comments or documentation notes to call this out explicitly, so that anyone implementing a custom TableOperations is aware of the encryption keys and understands the need to handle encryption properly (IIRC, in your PR, an additional param will be passed). This should help prevent misuse or accidental security gaps when extending the default implementation.

That said, if someone chooses to extend the default TO, they should take full responsibility for doing so safely. The same applies to the ClientBuilder: users may provide their own HttpClient (for example, to support custom logic such as a shared connection pool, PrivateLink, a proxy, or mTLS), and it's their responsibility to ensure it doesn't break core functionality.

The standard RESTTableOperations class, built into Iceberg, uses a safe approach to getting the metadata object (directly from the REST catalog server, never from the metadata.json file, which can be kept in untrusted storage). Can custom TO replacements behave differently in this respect?

As mentioned earlier in another thread, that's already possible even without this PR. Anyone can build their own library or copy the Iceberg SDK code and modify it as they wish. Iceberg is a specification, and the Apache Iceberg repository serves as a reference implementation, so we can't prevent developers from customizing it.

@XJDKC
Member Author

XJDKC commented Nov 11, 2025

would you mind explaining a bit more about the motivation and a more concrete use-case where this is needed? Is there a particular functionality in RESTTableOperations that you miss and would be interested in using?

There isn't a specific functionality missing, but some platforms (especially those not using Spark) often have platform-specific requirements, for example custom logic for accessing storage, adding table-level headers, logging, or auditing. The default RESTTableOperations isn't designed to accommodate these platform-specific behaviors, nor should it be. That's why we need to let users extend or replace it when necessary.

Technically, if we want this to be injected, shouldn't we expect an interface from the API module as the input param, that is in turn implemented in the core module?

I'm fine with either option; the key goal is to make it injectable. I simply followed the existing pattern in the codebase (e.g., the ClientBuilder).

Just an additional nit is that this PR seems to add 2 different changes: A way to inject an IOBuilder through the RESTCatalog (currently only RESTSessionCatalog has this) and a way to inject a REST ops builder. Would it make sense to split these into 2 PRs and test them separately?

You're right that this PR introduces two related changes, but both serve the same purpose: improving injectability and extensibility. I don't see a strong benefit in splitting them, since the changes are closely related and covered by the test.
That said, I don't have a strong preference. If others feel strongly about separating them for clarity or testing purposes, I'm happy to split the FileIO change into a separate PR.

Contributor

@stevenzwu stevenzwu left a comment

left a nit comment. this looks good to me.

@stevenzwu
Contributor

@nastra @amogh-jahagirdar can you also take a look?

* RESTSessionCatalog catalog = new RESTSessionCatalog(clientBuilder, ioBuilder, customFactory);
* </pre>
*/
public interface RESTOperationsFactory {
Contributor

@amogh-jahagirdar amogh-jahagirdar Nov 17, 2025

I've only taken a cursory glance at these changes, but if we really want to override RESTTableOperations, have we thought about just adding a protected API, protected RESTTableOperations newTableOps(TableIdentifier identifier), on RESTSessionCatalog, so that implementations would override newTableOps? I think with this approach we'd also have to change RESTCatalog to allow passing through a RESTSessionCatalog implementation.

I feel like the benefit of this approach is that we don't have to introduce a new factory whose APIs may need to change as people want to pass new things through to their table operations. Instead, in the inheritance-based approach, anything tied to the state of the RESTSessionCatalog can easily be passed through to the custom RESTTableOperations. Basically, I think I'm trying to make the argument that if you're going to override with custom table operations, you probably want to override RESTSessionCatalog and use that state to create a table ops instance?

Member Author

@XJDKC XJDKC Nov 17, 2025

Hey @amogh-jahagirdar, thanks for reviewing the PR. What you mentioned makes sense to me.

The reason I went with a builder/factory approach is mainly to stay consistent with the existing pattern, since we already have builders for components like RESTClient and FileIO.

That said, I'm fine with either direction. If we go with your suggested approach, we need a builder / factory for RESTSessionCatalog, or maybe a method called newSessionCatalog, and users need to:

  1. extend the RESTSessionCatalog to provide the custom TableOperations
  2. extend the RESTCatalog to provide the custom RESTSessionCatalog.

class RESTSessionCatalog {
  // xxx
  RESTTableOperations newTableOps(xxx) {
    // replace this with custom Table Operations
    RESTTableOperations ops =
        new RESTTableOperations(
            tableClient,
            paths.table(finalIdentifier),
            Map::of,
            tableFileIO(context, tableConf, response.credentials()),
            tableMetadata,
            endpoints);
    return ops;
  }
  // xxx
}

class RESTCatalog {
  // xxx
  public RESTCatalog(
      SessionCatalog.SessionContext context,
      Function<Map<String, String>, RESTClient> clientBuilder) {
    this.sessionCatalog = newSessionCatalog(clientBuilder, null);
    this.delegate = sessionCatalog.asCatalog(context);
    this.nsDelegate = (SupportsNamespaces) delegate;
    this.context = context;
    this.viewSessionCatalog = sessionCatalog.asViewCatalog(context);
  }

  RESTSessionCatalog newSessionCatalog(xxx) {
    // replace this with the custom RESTSessionCatalog that provides the custom Table Operations
    return new RESTSessionCatalog(clientBuilder, null); 
  }
  // xxx
}

Actually, I think that’s a cleaner solution and provides more flexibility overall. WDYT?

cc: @stevenzwu @flyrain
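
The two override points discussed in this thread can be sketched in a self-contained form. The classes below are hypothetical stand-ins (`DemoTableOps`, `DemoSessionCatalog`, `DemoCatalog`) with simplified signatures, not the real RESTTableOperations, RESTSessionCatalog, or RESTCatalog, so this is an illustration of the shape of the design rather than the code in the PR.

```java
// Stand-in for RESTTableOperations.
class DemoTableOps {
  String describe() {
    return "default table ops";
  }
}

// Stand-in for RESTSessionCatalog.
class DemoSessionCatalog {
  // Override point 1: subclasses decide which table operations to build.
  protected DemoTableOps newTableOps(String identifier) {
    return new DemoTableOps();
  }

  String loadTable(String identifier) {
    return identifier + " -> " + newTableOps(identifier).describe();
  }
}

// Stand-in for RESTCatalog.
class DemoCatalog {
  private final DemoSessionCatalog sessionCatalog;

  DemoCatalog() {
    // The overridable method runs before any subclass fields are initialized,
    // so an override should not depend on subclass state.
    this.sessionCatalog = newSessionCatalog();
  }

  // Override point 2: subclasses decide which session catalog to use.
  protected DemoSessionCatalog newSessionCatalog() {
    return new DemoSessionCatalog();
  }

  String loadTable(String identifier) {
    return sessionCatalog.loadTable(identifier);
  }
}

// User code: extend the session catalog to provide custom table operations.
class CustomSessionCatalog extends DemoSessionCatalog {
  @Override
  protected DemoTableOps newTableOps(String identifier) {
    return new DemoTableOps() {
      @Override
      String describe() {
        return "custom table ops for " + identifier;
      }
    };
  }
}

// User code: extend the catalog to provide the custom session catalog.
class CustomCatalog extends DemoCatalog {
  @Override
  protected DemoSessionCatalog newSessionCatalog() {
    return new CustomSessionCatalog();
  }
}

class OverridePointsSketch {
  public static void main(String[] args) {
    System.out.println(new DemoCatalog().loadTable("db.t"));
    System.out.println(new CustomCatalog().loadTable("db.t"));
  }
}
```

One thing worth keeping in mind with this shape: because the session catalog is created from the base constructor, the overriding method is invoked before the subclass constructor body runs.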

Contributor

I am fine with @amogh-jahagirdar's suggestion, with the two override points that @XJDKC described above.

Contributor

Thanks @amogh-jahagirdar. I think the factory approach does seem a bit awkward, since most of the state needed to build custom table operations already lives inside RESTSessionCatalog.

The inheritance-based approach (adding protected newTableOps / newViewOps in RESTSessionCatalog and allowing RESTCatalog to accept a custom session catalog) seems more idiomatic with the rest of the catalog implementations and keeps the extensibility point tighter and easier to maintain.

Not strongly opposed to either direction, but the factory pattern does feel less natural here.

Member Author

Thanks all for the valuable input. Let me revise my PR based on the inheritance-based approach!

Member Author

@XJDKC XJDKC Nov 18, 2025

Done!
I was wondering, would it make sense to mark (some of) the member variables of RESTSessionCatalog as protected, or alternatively, provide some protected getters for them? That would make subclassing and extending RESTSessionCatalog a bit cleaner, since currently these private members aren't accessible to subclasses.

  protected final Function<Map<String, String>, RESTClient> clientBuilder;
  protected final BiFunction<SessionContext, Map<String, String>, FileIO> ioBuilder;
  protected FileIOTracker fileIOTracker = null;
  protected AuthSession catalogAuth = null;
  protected AuthManager authManager;
  protected RESTClient client = null;
  protected ResourcePaths paths = null;
  protected SnapshotMode snapshotMode = null;
  protected Object conf = null;
  protected FileIO io = null;
  protected MetricsReporter reporter = null;
  protected boolean reportingViaRestEnabled;
  protected Integer pageSize = null;
  protected CloseableGroup closeables = null;
  protected Set<Endpoint> endpoints;

WDYT?
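
To make the getter alternative concrete, here is a small sketch of what protected accessors could look like. The class and member names (`AccessorDemoSessionCatalog`, `basePath`, `TenantSessionCatalog`) are invented for illustration and are not the real RESTSessionCatalog members. The point is only that fields can stay private while subclasses still read the state they need.

```java
// Stand-in for RESTSessionCatalog with two pieces of internal state.
class AccessorDemoSessionCatalog {
  private final String basePath;
  private final int pageSize;

  AccessorDemoSessionCatalog(String basePath, int pageSize) {
    this.basePath = basePath;
    this.pageSize = pageSize;
  }

  // Read-only access for subclasses; the fields themselves stay private,
  // and the accessors are final so their meaning cannot be changed.
  protected final String basePath() {
    return basePath;
  }

  protected final int pageSize() {
    return pageSize;
  }

  String listTablesRequest(String namespace) {
    return basePath + "/namespaces/" + namespace + "/tables?pageSize=" + pageSize;
  }
}

// Hypothetical subclass that reuses the parent's state through the accessors.
class TenantSessionCatalog extends AccessorDemoSessionCatalog {
  private final String tenant;

  TenantSessionCatalog(String basePath, int pageSize, String tenant) {
    super(basePath, pageSize);
    this.tenant = tenant;
  }

  @Override
  String listTablesRequest(String namespace) {
    return basePath() + "/tenants/" + tenant
        + "/namespaces/" + namespace + "/tables?pageSize=" + pageSize();
  }
}

class ProtectedAccessorSketch {
  public static void main(String[] args) {
    AccessorDemoSessionCatalog catalog = new TenantSessionCatalog("/v1", 50, "acme");
    System.out.println(catalog.listTablesRequest("db"));
  }
}
```

Compared with making the fields themselves protected, accessors keep mutable state from being reassigned by subclasses, at the cost of a few extra methods on the base class.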

@XJDKC
Member Author

XJDKC commented Nov 20, 2025

Hi @flyrain @amogh-jahagirdar, when you get a chance, could you please give this PR another review? Thanks! 🙏

Contributor

@flyrain flyrain left a comment

LGTM! Thanks @XJDKC !

8 participants