
Conversation

@singhpk234
Contributor

@singhpk234 singhpk234 commented Jun 26, 2025

🥞 Stacked PR

Please use this link for viewing incremental changes.

Current Stack status:

About the change

[1] Add routes for the scan Planning API
[2] Adds RestTable and RestTableScan which can call the scan endpoint
[3] ScanIterable which uses ParallelIterable to fetch scan tasks

Testing

Added unit tests for the end-to-end request/response loop.

@github-actions github-actions bot added the core label Jun 26, 2025
@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch from 3624aee to c55e0e4 Compare June 27, 2025 15:23
@singhpk234 singhpk234 closed this Jun 27, 2025
@singhpk234 singhpk234 reopened this Jun 27, 2025
@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch from c55e0e4 to 0f225bf Compare July 11, 2025 23:50
@singhpk234 singhpk234 closed this Jul 12, 2025
@singhpk234 singhpk234 reopened this Jul 12, 2025
@singhpk234 singhpk234 marked this pull request as ready for review July 13, 2025 18:46
@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch 2 times, most recently from 8f6e519 to 49a1392 Compare August 15, 2025 23:57
@singhpk234 singhpk234 marked this pull request as draft August 15, 2025 23:57
@singhpk234 singhpk234 marked this pull request as ready for review August 17, 2025 09:56
@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch from 49a1392 to d752e0a Compare August 17, 2025 20:22
@amogh-jahagirdar amogh-jahagirdar changed the title Part 2: Integrate Scan Planning to Core Core: REST Scan Planning Task Implementation Aug 18, 2025
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment


I'm still going through things, but I do believe this works, at least from a correctness perspective. I still need to give some more thought to cancellation, client/server backpressure and how this would fit in for engines which immediately start task consumption/execution during planning (like Trino).

@singhpk234
Contributor Author

singhpk234 commented Aug 18, 2025

Thank you for the review. At present I have just made the E2E machinery work with the later parser changes. In addition to the points you mentioned, I was also thinking about this from the following points of view:

  1. Can the server force the client to call the scan planning API?
  2. Can the server tell the client what the interval between fetch-scan-tasks calls should be? (I think this is what you refer to as backpressure in your comment above.)
  3. I keep coming back to the feature "[CORE] Support file filtering based on schema #4842". It may be orthogonal,
    but with schema evolution we certainly don't want to send back data file objects that certainly don't have the column.

@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch from 52b3957 to e3de068 Compare August 29, 2025 19:03
@singhpk234
Contributor Author

singhpk234 commented Sep 4, 2025

Update: working on refactoring this a bit more.

client/server backpressure and how this would fit in for engines which immediately start task consumption/execution during planning (like Trino)

My understanding was that ParallelIterable could be helpful here, as it is aware of both the consumer and the producer
and handles backpressure via yields:

// If the consumer is processing records more slowly than the producers, the producers will
// eventually fill the queue and yield, returning continuations. Continuations and new tasks
// are started by checkTasks(). The check here prevents us from restarting continuations or
// starting new tasks before the queue is emptied. Restarting too early would lead to tasks
// yielding very quickly (CPU waste on scheduling).
if (!queue.isEmpty()) {

Agreed, I need to think this through more thoroughly, also from the server POV.

From my cursory reading of the Trino source code (I am fairly new to it), https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitManager.java#L149 , generating splits while consuming them should mostly work, as this seems to be built into the engine itself. For engines that need all the splits computed first, we have no choice but to consume everything.
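
For readers following the backpressure point above, here is a minimal sketch of the general idea. It is not Iceberg's ParallelIterable: that class yields and resumes continuations, whereas this sketch swaps in a plain bounded `ArrayBlockingQueue` whose `put` blocks the producer when the consumer falls behind. The class name `BackpressureSketch`, the `run` method, and the `DONE` sentinel are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch only: a bounded queue standing in for the yield-based
// backpressure discussed in the thread. Not Iceberg's ParallelIterable.
class BackpressureSketch {
  private static final int DONE = -1; // sentinel marking the end of the stream

  // Pushes `total` tasks through a queue of size `capacity` while a slow
  // consumer drains it. Returns the largest queue depth the consumer observed.
  static int run(int total, int capacity, List<Integer> consumed) {
    BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);

    Thread producer = new Thread(() -> {
      try {
        for (int i = 0; i < total; i++) {
          queue.put(i); // blocks while the queue is full: this is the backpressure
        }
        queue.put(DONE);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    producer.setDaemon(true); // do not keep the JVM alive if the consumer fails
    producer.start();

    int maxDepth = 0;
    try {
      while (true) {
        maxDepth = Math.max(maxDepth, queue.size());
        int task = queue.take();
        if (task == DONE) {
          break;
        }
        consumed.add(task);
        Thread.sleep(1); // simulate a consumer that is slower than the producer
      }
      producer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while consuming", e);
    }
    return maxDepth;
  }

  public static void main(String[] args) {
    List<Integer> consumed = new ArrayList<>();
    int maxDepth = run(50, 4, consumed);
    System.out.println("consumed=" + consumed.size() + " maxQueueDepth=" + maxDepth);
  }
}
```

The producer can never run more than `capacity` tasks ahead of the consumer, which is the property an engine that consumes tasks during planning would rely on. An engine that needs every split up front simply drains the queue to the end.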

@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/part-2-core-integ branch 2 times, most recently from 540a4d6 to 08b1ce4 Compare September 9, 2025 23:52
@singhpk234 singhpk234 force-pushed the feature/part-2-core-integ branch 2 times, most recently from 69d6ac1 to b85d8e6 Compare September 16, 2025 22:39
@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/part-2-core-integ branch from 40c9501 to 01d4a24 Compare November 27, 2025 00:54
@github-actions github-actions bot added the build label Nov 30, 2025
@singhpk234 singhpk234 requested a review from nastra December 1, 2025 14:20
@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/part-2-core-integ branch 2 times, most recently from da3b951 to a68476c Compare December 3, 2025 01:28
Comment on lines +597 to +610
return this.planningBehavior == null ? new PlanningBehavior() {} : planningBehavior;
}

protected void setPlanningBehavior(PlanningBehavior behavior) {
this.planningBehavior = behavior;
Contributor


I think this is fine for now, we can address in a follow on. Not a blocker, just a nit of mine.
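
As a side note on the pattern in the quoted lines (`new PlanningBehavior() {}` as the fallback), here is a minimal sketch of how an interface made only of default methods can be instantiated with an empty anonymous class and overridden selectively. The real `PlanningBehavior` methods are not shown in this thread, so `SketchBehavior`, `SketchAdapter`, `asyncPlanning()` and `tasksPerPage()` are hypothetical stand-ins.

```java
// Hypothetical stand-in for PlanningBehavior; the real interface's methods
// are not shown in this thread, so both knobs below are invented.
interface SketchBehavior {
  default boolean asyncPlanning() {
    return false; // hypothetical knob
  }

  default int tasksPerPage() {
    return 100; // hypothetical knob
  }
}

// Hypothetical holder mirroring the getter/setter shape in the quoted diff.
class SketchAdapter {
  private SketchBehavior behavior; // stays null until someone overrides it

  SketchBehavior behavior() {
    // An empty anonymous implementation inherits every default method.
    return behavior == null ? new SketchBehavior() {} : behavior;
  }

  void setBehavior(SketchBehavior behavior) {
    this.behavior = behavior;
  }
}
```

A test could then override a single knob, for example `adapter.setBehavior(new SketchBehavior() { @Override public int tasksPerPage() { return 1; } })`, and leave the rest at their defaults.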

FANNG1 pushed a commit to apache/gravitino that referenced this pull request Dec 3, 2025
### What changes were proposed in this pull request?

Plan scan API supports access control

### Why are the changes needed?

Fix: #9337 

### Does this PR introduce _any_ user-facing change?

No need.

### How was this patch tested?

The Iceberg client doesn't support remote scan planning yet, so this can't be
tested for now. See apache/iceberg#13400
Contributor

@nastra nastra left a comment


I've been mostly focusing on the tests and I've opened singhpk234#273 against your branch @singhpk234. The one remaining piece I haven't reviewed so far is the ScanTasksIterable (due to its complexity) but all of the other changes look good to me

@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/part-2-core-integ branch 2 times, most recently from 1aeb23b to a977a83 Compare December 3, 2025 22:41
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ScanTasksIterable implements CloseableIterable<FileScanTask> {
Contributor

@amogh-jahagirdar amogh-jahagirdar Dec 8, 2025


I checked out the latest code locally. I think we're very close, with just some final simplifications left. I published a PR,
singhpk234#274, with my edits. Let me know what you think!

Contributor Author


Thank you for the thorough feedback, Amogh. I responded to some of the feedback and addressed the rest in my recent commit. Please let me know what you think, considering what I had in mind.


@Override
public CloseableIterable<FileScanTask> planFiles() {
Long startSnapshotId = context().fromSnapshotId();
Contributor


In the case of multiple planFiles calls, what do you think about cancelling the previous plan?

TableScan scan = table.newScan();
scan.planFiles(); // Plan created
scan.planFiles(); // Plan created

Contributor Author


Good callout. I think the scan needs to be aware of all the active plans. I also think this will be problematic for concurrent requests. Let me sleep on it and think more!

Contributor

@amogh-jahagirdar amogh-jahagirdar Dec 9, 2025


I don't think this is really something we need to worry about. If we put aside remote scan planning and just take the client-side planning case, these are two fundamentally different issuances of scanning work (e.g. in the client-side case we'd read manifests; sure, the second time the manifests may be cached so we avoid the I/O, but it still does the work of figuring out the tasks). I think we just work from that semantic of the API.
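
To make that semantic concrete, here is a minimal sketch in which every `planFiles()` call is an independent issuance of planning work, so closing one plan never touches another. It assumes nothing about the actual RestTableScan code: `IndependentPlansSketch`, `Plan`, `Scan`, and the `plan-N` ids are hypothetical names for illustration only.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: each planFiles() call yields its own plan handle.
// None of these types exist in Iceberg; they only illustrate the semantic.
class IndependentPlansSketch {

  // Hypothetical handle for one issuance of planning work.
  static final class Plan implements AutoCloseable {
    final String id;
    private boolean cancelled;

    Plan(String id) {
      this.id = id;
    }

    boolean isCancelled() {
      return cancelled;
    }

    @Override
    public void close() {
      cancelled = true; // cancels only this plan, never a sibling
    }
  }

  // Hypothetical scan: it keeps no registry of active plans.
  static final class Scan {
    private final AtomicInteger counter = new AtomicInteger();

    Plan planFiles() {
      // A fresh plan per call; earlier plans are neither reused nor cancelled.
      return new Plan("plan-" + counter.incrementAndGet());
    }
  }

  public static void main(String[] args) {
    Scan scan = new Scan();
    Plan first = scan.planFiles();
    Plan second = scan.planFiles();
    first.close();
    System.out.println(first.id + " cancelled=" + first.isCancelled());
    System.out.println(second.id + " cancelled=" + second.isCancelled());
  }
}
```

Under this design the scan stays stateless with respect to outstanding plans, which also sidesteps the concurrent-request concern raised earlier in the thread.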

@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/part-2-core-integ branch from 07c2b78 to bea48d5 Compare December 9, 2025 17:03
@sfc-gh-prsingh sfc-gh-prsingh force-pushed the feature/part-2-core-integ branch from bea48d5 to 63929bb Compare December 9, 2025 17:06
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment


Thanks @singhpk234, I think at this point this looks reasonable enough to get in, and I know other changes depend on this, which is causing rebase challenges. I still feel that with some reasonable assumptions we can simplify ScanTasksIterable further, but that's something we can discuss in a follow-on; the current implementation looks reasonable, without any races or deadlocks.

I'd say let's give it another day or so before merging for @nastra or @danielcweeks if they want to take another look.



5 participants