I don't see this as a bug at the moment; we aim to use only one read_parquet call per data source to avoid reading the same columns more than once. This is a specialised example since we actually could separate the reads, but I think it's a bit of an edge case. I wouldn't spend too much time on this right now, although I agree with you that we could be smarter about it. The bigger problem is that splitting the read loses value if there are operations between read_parquet and the column restriction (like replace, shuffle, ...), since we would then run those operations twice. We can certainly improve this special case, but I'm not sure it would help us much in the grand scheme of things.
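For illustration, here is a minimal sketch of the scenario described above (the file path and the `replace` step are assumptions for the example, not taken from the issue): when an operation sits between read_parquet and the per-branch column selections, pushing the projections down into two separate reads would force that intermediate operation to run once per branch.

```python
import dask.dataframe as dd

df = dd.read_parquet("lineitem.parquet")  # hypothetical file

# An operation between the read and the column restrictions:
# splitting the read into one per branch would run this replace twice.
df = df.replace(0, 1)

left = df[["l_orderkey"]]   # branch 1 needs only l_orderkey
right = df[["l_suppkey"]]   # branch 2 needs only l_suppkey
```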
From a preliminary look at the optimized graph, one issue might be that we don't properly push projections into the parquet reads:
Snippet from the graph:
I'd expect `ReadParquet` to only read `['l_orderkey', 'l_suppkey']`. Combined with dask/dask-expr#854, this appears to be fairly catastrophic.
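A minimal sketch of how one might check this (the file path is an assumption; `optimize()` and `pprint()` are the dask-expr collection methods for inspecting the optimized expression tree):

```python
import dask.dataframe as dd

# Hypothetical reproduction: after optimization, the ReadParquet node
# should carry columns=['l_orderkey', 'l_suppkey'] rather than all columns.
df = dd.read_parquet("lineitem.parquet")  # path is an assumption
subset = df[["l_orderkey", "l_suppkey"]]
subset.optimize().pprint()  # inspect which columns ReadParquet plans to load
```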