-
Notifications
You must be signed in to change notification settings - Fork 2.9k
OpenAPI: Add min-rows-requested field to PlanTableScanRequest #14565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4402,6 +4402,14 @@ components: | |
| description: | ||
| Expression used to filter the table data | ||
| $ref: '#/components/schemas/Expression' | ||
| min-rows-requested: | ||
| description: | ||
| The minimum number of rows requested for the scan. This is used as a hint | ||
| to the server to not have to return more rows than necessary. It is not required | ||
| for the server to return that many rows since the scan may not produce that | ||
| many rows. The server can also return more rows than requested. | ||
|
Comment on lines
+4407
to
+4410
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. wondering if we should bake assertion that it should not |
||
| type: integer | ||
| format: int64 | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe this should be int32 because the LIMIT that Spark pushes down is also an int (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/read/SupportsPushDownLimit.html)
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I would keep this as int64; yes it's super unlikely to want to limit beyond int max, but I believe the protocol should be open and then the client implementation can be bounded however it likes.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. agree with Amogh here
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. int64 SG ! just checked PG support limit as int64
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. good point, that makes total sense to me to keep this as int64 |
||
| case-sensitive: | ||
| description: Enables case sensitive field matching for filter and select | ||
| type: boolean | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the name and description a bit confusing. For me, it is obvious that it might not return exactly the number of rows that are requested. For example, the table might be empty. I would much rather go with something that's analogue to SQL:
Of course, it might still return more rows since the Parquet file contains more rows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I look at things the other way since I feel like
limitconnotes a strict upper-bound (that is if there are sufficient matching rows) as how it means in SQL. Another idea:limit-hint?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @Fokko , what about
soft_limit?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to jump in on bike shedding how about
max_rows_requiredI think it lacks the ambiguity if someone isn't familiar with SQL and I think my issue with and hopefully makes it clearer that this is in fact an upper bound?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to pile on.
target-scan-task-numberThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nastra lower bound may not always be valid. e.g. maybe the table just don't have enough number of rows to satisfy the lower bound.
targetis probably a good name for the desired plan size. Wondering why not usingbytes(instead ofrows) to express the target, similar toread.split.target-size?Anyway, I would suggest
target-plan-size-rowsortarget-plan-size-bytes.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why we added wording for this case:
It is not required for the server to return that many rows since the scan may not produce that many rowsWe eventually want to push down the LIMIT from the engine to this property to indicate the number of rows. The server can then do its optimization and return at least that number of rows. As a final step, the engine would then apply the LIMIT on top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't think we want to do bytes, this is ultimately for allowing engines to have any easy way to express a limit so that we limit the amount of work that needs to be done on a server.
IMO, after going through this thread, I think I still find the most compelling option to be what I proposed earlier,
limit-hint. I think that this naming best indicates to a client what the intent is ("To express a limit to the server") and that it's a hint. I also found out that there's some prior art on this naming here as it looks like the Delta Protocol naming for this concept is also called limitHint. https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md#request-bodyI'm also good with the current PR's
min-rows-requested.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misunderstood the purpose of this config. I thought it was for the target size of the
PlanTasks from the PlanTableScan, similar to theread.split.target-size.Didn't realize it is for the limit push down. I am fine with either
min-rows-requestedorlimit-hint. slight preference formin-rows-requestedsince it is more descriptive.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks everyone, I would suggest to keep it named as
min-rows-requestedas it precisely describes its purpose and intent.