-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Core: REST Scan Planning Task Implementation #13400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java
Outdated
Show resolved
Hide resolved
3624aee to
c55e0e4
Compare
c55e0e4 to
0f225bf
Compare
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java
Outdated
Show resolved
Hide resolved
8f6e519 to
49a1392
Compare
49a1392 to
d752e0a
Compare
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still going through things but I do believe this works at least from a correctness perspective. I still would need to give some more thought as to cancellation, client/server backpressure and how this would fit in for engines which immediately start task consumption/execution during planning (like Trino)
|
Thank you for the review, presently i just made E2E machinery work, with later parser changes, i think in addition to the points you mentioned, i was thinking of more also from the POV :
|
52b3957 to
e3de068
Compare
|
Update : Working on refactoring this a bit more.
My understanding was ParallelIterable could be helpful here as its aware of the both consumer and the producer, iceberg/core/src/main/java/org/apache/iceberg/util/ParallelIterable.java Lines 200 to 205 in be03c99
Agree need to think this more thoroughly also from the server POV, from my cursory reading of Trino source code (I am fairly new to it) https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitManager.java#L149 split generation and consuming it, should mostly work, as this seems like this is built in engine itself. for engine which needs all the splits computed first we have no choice but to consume everything. |
540a4d6 to
08b1ce4
Compare
core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java
Outdated
Show resolved
Hide resolved
69d6ac1 to
b85d8e6
Compare
40c9501 to
01d4a24
Compare
core/src/test/java/org/apache/iceberg/rest/RESTCatalogTestInfrastructure.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
Outdated
Show resolved
Hide resolved
da3b951 to
a68476c
Compare
| return this.planningBehavior == null ? new PlanningBehavior() {} : planningBehavior; | ||
| } | ||
|
|
||
| protected void setPlanningBehavior(PlanningBehavior behavior) { | ||
| this.planningBehavior = behavior; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine for now, we can address in a follow on. Not a blocker, just a nit of mine.
core/src/test/java/org/apache/iceberg/rest/RESTCatalogTestInfrastructure.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
Outdated
Show resolved
Hide resolved
### What changes were proposed in this pull request? Plan scan API supports access control ### Why are the changes needed? Fix: #9337 ### Does this PR introduce _any_ user-facing change? No need. ### How was this patch tested? Iceberg client don't support remote plan scan now. Now we can't test it yet. You can see apache/iceberg#13400
core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java
Outdated
Show resolved
Hide resolved
nastra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been mostly focusing on the tests and I've opened singhpk234#273 against your branch @singhpk234. The one remaining piece I haven't reviewed so far is the ScanTasksIterable (due to its complexity) but all of the other changes look good to me
1aeb23b to
a977a83
Compare
core/src/main/java/org/apache/iceberg/rest/ScanTasksIterable.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/rest/TestRESTScanPlanning.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/rest/ScanTaskIterable.java
Outdated
Show resolved
Hide resolved
| import org.slf4j.Logger; | ||
| import org.slf4j.LoggerFactory; | ||
|
|
||
| class ScanTasksIterable implements CloseableIterable<FileScanTask> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked out the latest code locally, I think we're very close just some final simplifications, published a PR
singhpk234#274 with my edits. Let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the thorough feedback Amogh, i responded to the some of the feedbacks and addressed rest in my recent commit ! please let know what do you think considering what i had it in mind.
core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java
Outdated
Show resolved
Hide resolved
|
|
||
| @Override | ||
| public CloseableIterable<FileScanTask> planFiles() { | ||
| Long startSnapshotId = context().fromSnapshotId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of multiple planFile calls what do you think cancelling the previous plan?
TableScan scan = table.newScan();
scan.planFiles(); // Plan created
scan.planFiles(); // Plan created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good callout, i think scan needs to be aware of the all the active plans, also i am thinking this will be problematic on concurrent request, let me sleep over it to think more !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is really something we need to worry about. If we put aside remote scan planning, and just take the client side planning case, these are 2 fundamentally different issuances of scanning work (e.g. in the client side case, we'd read manifests, sure on the second time maybe manifests are cached so we avoid the I/O but it's still doing the work and do the work of figuring out the tasks.). I think we just work from that semantic of the API.
07c2b78 to
bea48d5
Compare
bea48d5 to
63929bb
Compare
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @singhpk234 , I think at this point this looks reasonable enough to get in, and I know other changes are depending on this, causing rebase challenges. I still feel like with some reasonable assumptions we can simplify ScanTaskIterable further but that's something we can discuss in follow on, and the current implementation looks reasonable without any races/deadlocks.
I'd say let's give it another day or so before merging for @nastra or @danielcweeks if they want to take another look.
🥞 Stacked PR
Please use this link for viewing incremental changes.
Current Stack status:
* Add Rest Model and Parsers [Merged]
About the change
[1] Add routes for the scan Planning API
[2] Adds RestTable and RestTableScan which can call the scan endpoint
[3] ScanIterable which uses ParallelIterable to fetch scan tasks
Testing
Added unit testing for the E2E req response loop