REST: Add Support for Custom Operations Builders in RESTCatalog #14465
core/src/main/java/org/apache/iceberg/rest/RESTOperationsBuilder.java
 * @param endpoints the set of supported REST endpoints
 * @return a new RESTViewOperations instance
 */
default RESTViewOperations createViewOperations(
How is FileIO handled for view operations?
IIUC, FileIO is not required for view operations, because Iceberg views are logical objects that contain only metadata (SQL definitions, schemas, and versions) and do not read or write any physical files.
When a user runs a query against a view, the query engine expands the view's SQL definition, compiles it into a query plan, and resolves the underlying tables. At that point, the engine loads the actual table objects (which include TableOperations and FileIO) to read the physical data files.
cc: @flyrain @stevenzwu @huaxingao Could you pls take a look when you get a chance? Thanks! 🙏
flyrain
left a comment
Thanks @XJDKC for the change. Left some comments.
public RESTCatalog(
    Function<Map<String, String>, RESTClient> clientBuilder,
    BiFunction<SessionCatalog.SessionContext, Map<String, String>, FileIO> ioBuilder,
    RESTOperationsBuilder operationsBuilder) {
  this(SessionCatalog.SessionContext.createEmpty(), clientBuilder, ioBuilder, operationsBuilder);
}
Is this method necessary if we go with this(SessionCatalog.SessionContext.createEmpty(), clientBuilder); in line 68?
Although RESTSessionCatalog lets us pass in an ioBuilder, RESTCatalog doesn't, so I added it to the RESTCatalog constructor as well.
core/src/main/java/org/apache/iceberg/rest/RESTOperationsBuilder.java
import org.apache.iceberg.util.LocationUtil;

-class RESTTableOperations implements TableOperations {
+public class RESTTableOperations implements TableOperations {
Do we need this scope change?
If users only want to make small adjustments to RESTTableOperations (for example, injecting a custom header), they can simply provide a custom implementation that extends RESTTableOperations, without having to copy the entire class.
This makes it much easier for them to upgrade to newer Iceberg SDK versions without dealing with merge conflicts or duplicated code.
I'm okay with either approach here, don't have a strong preference. WDYT?
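To make the "small adjustment" case concrete, here is a minimal, self-contained sketch of the subclassing pattern. It does not use the real Iceberg classes: `BaseTableOps`, `AuditedTableOps`, the `headers()` hook, and the `X-Request-Tag` header name are all hypothetical stand-ins, and the real RESTTableOperations has a different constructor and API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for RESTTableOperations; the real class differs.
class BaseTableOps {
  private final Map<String, String> baseHeaders;

  BaseTableOps(Map<String, String> baseHeaders) {
    this.baseHeaders = baseHeaders;
  }

  // Extension hook: subclasses may add to the headers sent with each request.
  protected Map<String, String> headers() {
    return new LinkedHashMap<>(baseHeaders);
  }

  // Stand-in for a REST call; returns the headers that would be sent.
  public Map<String, String> refresh() {
    return headers();
  }
}

// A small adjustment expressed as a subclass instead of a copy of the whole class.
class AuditedTableOps extends BaseTableOps {
  private final String requestTag;

  AuditedTableOps(Map<String, String> baseHeaders, String requestTag) {
    super(baseHeaders);
    this.requestTag = requestTag;
  }

  @Override
  protected Map<String, String> headers() {
    Map<String, String> merged = super.headers();
    merged.put("X-Request-Tag", requestTag); // hypothetical header name
    return merged;
  }
}
```

In this sketch the subclass only overrides one hook, so changes to the base class's request logic would be picked up on upgrade without any merging.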
import org.apache.iceberg.view.ViewOperations;

-class RESTViewOperations implements ViewOperations {
+public class RESTViewOperations implements ViewOperations {
Do we need this scope change?
Same as above
A couple of questions wrt encrypted tables,
If any of these points is a concern, then I believe it can be addressed just by adding a few lines to the RESTOperationsBuilder javadoc API comments. What do you think?
Hey @XJDKC, just for my information, would you mind explaining a bit more about the motivation and a more concrete use-case where this is needed? Is there a particular functionality in RESTTableOperations that you miss and would be interested in using? My initial gut feeling tells me that exposing table ops and making it injectable is a bit wild. I'm wondering what others think, though. One additional nit: this PR seems to bundle two different changes, a way to inject an IOBuilder through the RESTCatalog (currently only RESTSessionCatalog has this) and a way to inject a REST ops builder. Would it make sense to split these into two PRs and test them separately?
For a custom TableOperations (TO) implementation, the responsibility lies entirely with the implementer: they must ensure proper handling of encryption and any other security-sensitive logic. Even for a custom implementation, I'd expect most of the core logic to remain unchanged and continue to rely on the unmanaged components provided by the Iceberg SDK. We can add comments or documentation notes to call this out explicitly, so that anyone implementing a custom TableOperations is aware of the encryption keys and understands the need to handle encryption properly (IIRC, in your PR, an additional param will be passed). This should help prevent misuse or accidental security gaps when extending the default implementation. If someone chooses to extend the default TO, they should take full responsibility for doing so safely. The same applies to the ClientBuilder: users may provide their own HttpClient (for example, to support a shared connection pool, PrivateLink, a proxy, or mTLS), and it’s their responsibility to ensure it doesn’t break core functionality.
As mentioned earlier in another thread, that’s already possible even without this PR. Anyone can build their own library or copy the Iceberg SDK code and modify it as they wish. Iceberg is a specification, and the Apache Iceberg repository serves as a reference implementation; we can't prevent developers from customizing it.
There isn’t a specific functionality missing, but some platforms (especially those not using Spark) often have platform-specific requirements, for example custom logic for accessing storage, adding table-level headers, logging, or auditing. The default
You’re right that this PR introduces two related changes, but both serve the same purpose: improving injectability and extensibility. I don't see strong benefits in splitting them, since the changes are closely related and covered in the test.
stevenzwu
left a comment
Left a nit comment. This looks good to me.
core/src/main/java/org/apache/iceberg/rest/RESTOperationsFactory.java
@nastra @amogh-jahagirdar can you also take a look?
 * RESTSessionCatalog catalog = new RESTSessionCatalog(clientBuilder, ioBuilder, customFactory);
 * </pre>
 */
public interface RESTOperationsFactory {
I've only glanced at these changes, but if we really want to override RESTTableOperations, have we thought about just adding a protected API, protected RESTTableOperations newTableOps(TableIdentifier identifier), on RESTSessionCatalog, which implementations would then override? I think with this approach we'd also have to change RESTCatalog to allow passing through a RESTSessionCatalog implementation.
I feel like the benefit of this approach is that we don't have to introduce a new factory whose APIs may need to change as people want to pass new things through to their table operations. Instead, in the inheritance-based approach, anything tied to the state of the RESTSessionCatalog can easily be passed through to the custom RESTTableOperations. Basically, I'm trying to make the argument that if you're going to override with custom table operations, you probably want to override RESTSessionCatalog and use that state to create a table ops instance.
Hey @amogh-jahagirdar, thanks for reviewing the PR. What you mentioned makes sense to me.
The reason I went with a builder/factory approach is mainly to stay consistent with the existing pattern, since we already have builders for components like RESTClient and FileIO.
That said, I'm fine with either direction. If we go with your suggested approach, we need a builder / factory for RESTSessionCatalog or maybe a method called newSessionCatalog, and users need to
- extend the RESTSessionCatalog to provide the custom TableOperations
- extend the RESTCatalog to provide the custom RESTSessionCatalog.
class RESTSessionCatalog {
  // xxx
  RESTTableOperations newTableOps(xxx) {
    // replace this with custom Table Operations
    RESTTableOperations ops =
        new RESTTableOperations(
            tableClient,
            paths.table(finalIdentifier),
            Map::of,
            tableFileIO(context, tableConf, response.credentials()),
            tableMetadata,
            endpoints);
  }
  // xxx
}

class RESTCatalog {
  // xxx
  public RESTCatalog(
      SessionCatalog.SessionContext context,
      Function<Map<String, String>, RESTClient> clientBuilder) {
    this.sessionCatalog = newSessionCatalog(clientBuilder, null);
    this.delegate = sessionCatalog.asCatalog(context);
    this.nsDelegate = (SupportsNamespaces) delegate;
    this.context = context;
    this.viewSessionCatalog = sessionCatalog.asViewCatalog(context);
  }

  RESTSessionCatalog newSessionCatalog(xxx) {
    // replace this with the custom RESTSessionCatalog that provides the custom Table Operations
    return new RESTSessionCatalog(clientBuilder, null);
  }
  // xxx
}

Actually, I think that’s a cleaner solution and provides more flexibility overall. WDYT?
cc: @stevenzwu @flyrain
I am fine with @amogh-jahagirdar's suggestion with the two override points that @XJDKC described above.
Thanks @amogh-jahagirdar. I think the factory approach does seem a bit awkward; most of the state needed to build custom table operations already lives inside RESTSessionCatalog.
The inheritance-based approach (adding protected newTableOps / newViewOps in RESTSessionCatalog and allowing RESTCatalog to accept a custom session catalog) seems more idiomatic with the rest of the catalog implementations and keeps the extensibility point tighter and easier to maintain.
Not strongly opposed to either direction, but the factory pattern does feel less natural here.
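For readers following the thread, the two override points of the inheritance-based approach can be sketched roughly as below. This is a self-contained illustration under stated assumptions, not the code in the PR: `DemoTableOps`, `DemoSessionCatalog`, `DemoCatalog`, and the `Logging*` subclasses are hypothetical stand-ins for RESTTableOperations, RESTSessionCatalog, and RESTCatalog, and the real method signatures differ.

```java
// Hypothetical stand-ins; none of these are the real Iceberg classes.
class DemoTableOps {
  private final String path;

  DemoTableOps(String path) {
    this.path = path;
  }

  String describe() {
    return "default:" + path;
  }
}

class DemoSessionCatalog {
  private final String prefix; // catalog state that table operations depend on

  DemoSessionCatalog(String prefix) {
    this.prefix = prefix;
  }

  // Override point 1: subclasses decide which table operations to build.
  protected DemoTableOps newTableOps(String tableName) {
    return new DemoTableOps(tablePath(tableName));
  }

  // Catalog state is exposed to subclasses through a protected accessor.
  protected String tablePath(String tableName) {
    return prefix + "/" + tableName;
  }

  public final DemoTableOps loadTableOps(String tableName) {
    return newTableOps(tableName);
  }
}

class DemoCatalog {
  private final String prefix;
  private DemoSessionCatalog session; // created lazily, so no overridable call runs in the constructor

  DemoCatalog(String prefix) {
    this.prefix = prefix;
  }

  // Override point 2: subclasses decide which session catalog to use.
  protected DemoSessionCatalog newSessionCatalog(String catalogPrefix) {
    return new DemoSessionCatalog(catalogPrefix);
  }

  public final DemoTableOps loadTableOps(String tableName) {
    if (session == null) {
      session = newSessionCatalog(prefix);
    }
    return session.loadTableOps(tableName);
  }
}

// A custom stack built only by overriding the two hooks.
class LoggingTableOps extends DemoTableOps {
  LoggingTableOps(String path) {
    super(path);
  }

  @Override
  String describe() {
    return "logging:" + super.describe();
  }
}

class LoggingSessionCatalog extends DemoSessionCatalog {
  LoggingSessionCatalog(String prefix) {
    super(prefix);
  }

  @Override
  protected DemoTableOps newTableOps(String tableName) {
    return new LoggingTableOps(tablePath(tableName));
  }
}

class LoggingCatalog extends DemoCatalog {
  LoggingCatalog(String prefix) {
    super(prefix);
  }

  @Override
  protected DemoSessionCatalog newSessionCatalog(String catalogPrefix) {
    return new LoggingSessionCatalog(catalogPrefix);
  }
}
```

One design choice in this sketch is the lazy creation of the session catalog: calling an overridable method from a constructor would run the subclass override before the subclass's own fields are initialized, so the hook is deferred to first use.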
Thanks all for the valuable inputs, let me revise my PR based on the inheritance-based approach!
Done!
I was wondering, would it make sense to mark (some of) the member variables of RESTSessionCatalog as protected, or alternatively, provide some protected getters for them? That would make subclassing and extending RESTSessionCatalog a bit cleaner, since currently these private members aren't accessible to subclasses.
protected final Function<Map<String, String>, RESTClient> clientBuilder;
protected final BiFunction<SessionContext, Map<String, String>, FileIO> ioBuilder;
protected FileIOTracker fileIOTracker = null;
protected AuthSession catalogAuth = null;
protected AuthManager authManager;
protected RESTClient client = null;
protected ResourcePaths paths = null;
protected SnapshotMode snapshotMode = null;
protected Object conf = null;
protected FileIO io = null;
protected MetricsReporter reporter = null;
protected boolean reportingViaRestEnabled;
protected Integer pageSize = null;
protected CloseableGroup closeables = null;
protected Set<Endpoint> endpoints;

WDYT?
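The alternative mentioned above, protected getters instead of protected fields, might look like the following sketch. It is only an illustration: `GetterSessionCatalog` and `PagingSessionCatalog` are hypothetical stand-ins for RESTSessionCatalog and a user subclass, and only two of the listed members are modeled (with `String` standing in for `Endpoint`).

```java
import java.util.Set;

// Hypothetical stand-in for RESTSessionCatalog with private state and protected getters.
class GetterSessionCatalog {
  private Set<String> endpoints = Set.of(); // stays private; only the base class can reassign it
  private Integer pageSize = null;

  void initialize(Set<String> supportedEndpoints, Integer configuredPageSize) {
    this.endpoints = Set.copyOf(supportedEndpoints); // unmodifiable copy
    this.pageSize = configuredPageSize;
  }

  // Subclasses get read access without being able to replace the field.
  protected Set<String> endpoints() {
    return endpoints;
  }

  protected Integer pageSize() {
    return pageSize;
  }
}

// Hypothetical subclass that reads catalog state through the getters.
class PagingSessionCatalog extends GetterSessionCatalog {
  String summary() {
    return endpoints().size() + " endpoints, page size " + pageSize();
  }
}
```

Compared with making the fields themselves protected, getters keep reassignment inside the base class, at the cost of a little more boilerplate.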
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java
Hi @flyrain @amogh-jahagirdar, when you get a chance, could you please give this PR another review? Thanks! 🙏
flyrain
left a comment
LGTM! Thanks @XJDKC !
Currently, RESTCatalog allows users to replace components such as RESTClient, FileIO, AuthManager, and MetricsReporter (with the logic handled in RESTSessionCatalog).
- RESTClient: via builder
- FileIO: via builder (doesn't allow RESTCatalog to pass in the builder) or reflection
- AuthManager: via reflection
- MetricsReporter: via reflection
However, one dependent component that remains non-injectable is RESTTableOperations.
This PR adds support for injecting custom implementations of table and view operations in RESTCatalog, enabling users to extend and customize REST catalog behavior more easily. It doesn't change any functionality.
This PR also allows users to pass in an ioBuilder. RESTSessionCatalog does allow using an ioBuilder to build the FileIO; however, RESTCatalog doesn't use this ability.
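The builder-style injection described here can be illustrated with a small self-contained sketch. Everything in it is an assumption made for illustration: `DemoClient`, `DemoFileIO`, and `BuilderCatalog` are hypothetical stand-ins for RESTClient, FileIO, and the catalog, a plain `String` stands in for the session context, and the fallback used when no ioBuilder is supplied is invented (the description above only says the real fallback goes through reflection).

```java
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical stand-ins for RESTClient and FileIO.
interface DemoClient {
  String baseUri();
}

interface DemoFileIO {
  String scheme();
}

// Hypothetical catalog that accepts both builders through its constructors.
class BuilderCatalog {
  private final Function<Map<String, String>, DemoClient> clientBuilder;
  private final BiFunction<String, Map<String, String>, DemoFileIO> ioBuilder; // null means "use the default"
  private DemoClient client;
  private DemoFileIO io;

  // Existing-style constructor: only the client builder is injectable.
  BuilderCatalog(Function<Map<String, String>, DemoClient> clientBuilder) {
    this(clientBuilder, null);
  }

  // Added constructor: the FileIO builder can be injected as well.
  BuilderCatalog(
      Function<Map<String, String>, DemoClient> clientBuilder,
      BiFunction<String, Map<String, String>, DemoFileIO> ioBuilder) {
    this.clientBuilder = clientBuilder;
    this.ioBuilder = ioBuilder;
  }

  void initialize(String sessionId, Map<String, String> properties) {
    this.client = clientBuilder.apply(properties);
    this.io = ioBuilder != null ? ioBuilder.apply(sessionId, properties) : defaultIO(properties);
  }

  // Invented fallback for this sketch; the real catalog resolves FileIO differently.
  private static DemoFileIO defaultIO(Map<String, String> properties) {
    return () -> properties.getOrDefault("io-scheme", "file");
  }

  DemoClient client() {
    return client;
  }

  DemoFileIO io() {
    return io;
  }
}
```

A caller would pass lambdas for the builders, for example `new BuilderCatalog(props -> () -> props.get("uri"), (session, props) -> () -> "mem")`, and omit the second argument to keep the default FileIO.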