[CT-3032] `dbt show --limit` should include the limit in the DWH query, to avoid long-running queries #8496
Comments
What should the expected behavior be if I run a command like this? `dbt show --inline "select * from my_table limit 500" --limit 10` Or like this? `dbt show --inline "select * from my_table limit 10" --limit 500`
@davidharting Agree! I think we'd need to use something like a subquery, to avoid the syntax error that would result here:

```
# no good
select * from my_table limit 500
limit 10;

# works
select * from (
  select * from my_table limit 500
) subquery
limit 10;
```

The alternative is trying to parse the query to see if the word `limit` is already in it.
Nice, that makes sense! So the expected behavior is simply that the query we are showing is always honored verbatim, and the limit is applied on an "outer layer" via the subquery mechanism above. Simple, predictable, explicit. Love it.
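The agreed behavior above can be sketched as a small wrapper. This is only an illustration of the subquery mechanism; `wrap_with_limit` is a hypothetical helper name, not dbt's actual API:

```python
def wrap_with_limit(sql: str, limit: int) -> str:
    """Honor the user's query verbatim and apply --limit on an outer
    layer, so an inner LIMIT never causes a syntax error."""
    # Hypothetical helper: the alias "subquery" mirrors the SQL example above.
    return f"select * from (\n{sql}\n) subquery\nlimit {limit}"

print(wrap_with_limit("select * from my_table limit 500", 10))
```

Because the inner query is never parsed or rewritten, both of the `--inline` examples above behave predictably: the inner `limit` runs as written and the outer `limit` caps the displayed rows.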
@davidharting Thanks for pointing this out. I've added writing up a test case to the acceptance criteria.
Perhaps nitpicking, but I think this is better characterized like this: `fetchmany` is the standard, and it is the default way that drivers have implemented this feature. It is also how `pandas.DataFrame.head()` works, and our docs are currently clear about this.

The way to get pushdown is to literally push the limit down into the SQL. My gut says we modify `Adapter.execute()` to apply the limit conditionally when one is passed, though I see a couple of challenges to this approach.

Last thing: this smells very much like an interface issue rather than a technical one. I don't blame folks for thinking that `--limit` would be pushed into the query itself.
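The "conditional in `Adapter.execute()`" idea could look roughly like the sketch below. The class, signature, and `_run` helper are illustrative assumptions, not dbt-core's real code:

```python
class SQLAdapter:
    """Hedged sketch of a limit-aware adapter, not dbt-core's actual class."""

    def execute(self, sql, fetch=False, limit=None):
        if fetch and limit is not None:
            # Push the limit down into the SQL instead of relying on the
            # driver's fetchmany(), which may pull everything client-side.
            sql = f"select * from (\n{sql}\n) subquery limit {limit}"
        return self._run(sql)

    def _run(self, sql):
        # Stand-in for real cursor execution; returns the SQL for inspection.
        return sql
```

The subquery wrapping means a user-supplied `limit` in `sql` is left untouched, matching the behavior agreed on earlier in the thread.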
I think your feedback is sort of two-fold here. I can't weigh in on the suggested implementation. But for dbt CLI users, I do still see this as a problem, if not a bug.
That is in fact news to me! This would represent a breaking change for the IDE using 1.2, as the strategy in IDE 1.1 (which used RPC) was different. It sounds like there are some technical justifications for why show may not be able to return rows here, but strictly in the context of the IDE I think we'd want to make sure preview works for all models.
I do not believe this is (or should be) the case! Have you found it to be otherwise? Incremental models support previewing their compiled SQL. To @davidharting's point: I strongly believe that dbt-core/adapters should be responsible for owning the pushdown of the limit. Update: @dave-connors-3 and @davidharting will be investigating this and getting to the bottom of why incremental models don't appear to be working in the IDE for preview with v1.6.
While I agree this definitely falls onto adapters, I do want to temper expectations here: for many queries and data warehouses, adding the limit into the query may not significantly impact the query run time. This is because many query engines don't push limit clauses down into subqueries or WITH statements; the limit is only applied to the result set of the query the limit clause was added to.

If the expensive part of the query lives inside a subquery or CTE, then the outer `limit 10` will have a negligible impact on the runtime. Not saying we shouldn't try to add the limit into the query, just that doing so is going to be an incremental improvement, not a complete fix.
Housekeeping
Short description
Most adapters (check impact matrix below) load the entire model into memory while running `dbt show`. The goal of this ticket is to avoid this behavior and only load the relevant rows from the model.
Acceptance criteria
- Templatize the `limit` into the query instead of using the `fetchmany` method for `dbt show`.
- Handle (and test) the case where a `limit` is already in the query.
- Make any needed changes to the adapter repos.
Impact to Other Teams
Adapter impact matrix
Will backports be required?
Backport to `1.6.latest`, `1.5.latest`.
Context
Original bug report in Slack: https://dbt-labs.slack.com/archives/C05FWBP9X1U/p1692909690843649.
Link to similar issue that was fixed in Core: #7390.
The root cause of the issue is that warehouse providers are implementing the `fetchmany` method in the DB-API 2.0 spec inefficiently, by loading the entire table into memory and filtering client-side. The long-term fix here is for them to become smarter in how they implement `fetchmany`, similar to what BigQuery has done: googleapis/google-cloud-python#9185.
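For reference, the DB-API `fetchmany` semantics can be seen with Python's built-in `sqlite3` driver: the SQL itself carries no `LIMIT`, and it is only the client-side call that caps how many rows are read. Whether the engine avoids computing the remaining rows depends entirely on the driver and warehouse implementation, which is exactly the inefficiency described above.

```python
import sqlite3

# In-memory database purely for demonstration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table t (x integer)")
cur.executemany("insert into t values (?)", [(i,) for i in range(10_000)])

cur.execute("select * from t")  # the SQL has no LIMIT clause
rows = cur.fetchmany(5)         # only caps what the client pulls
print(len(rows))                # 5
```

A driver that implements `fetchmany` naively will have already materialized all 10,000 rows by this point; pushing `limit 5` into the SQL is the only way to guarantee the engine sees the cap.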