I don't see this as a bug at the moment; we aim to use only one read_parquet call per data source to avoid reading the same columns more than once. This is a specialised example since we actually could separate the reads, but I think it's a bit of an edge case. I wouldn't spend too much time on this right now, although I agree with you that we could be smarter about it. The bigger problem is that splitting the read loses value if there are operations between read_parquet and the column restriction (like replace, shuffle, ...), since we would then run those operations twice. We can certainly improve this special case, but I'm not sure it would help us much in the grand scheme of things.
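For illustration, here is a minimal sketch of the scenario described above (the file path and the `replace` step are assumptions for the example, not taken from the issue): when an operation sits between read_parquet and the per-branch column selections, pushing the projections down into two separate reads would force that intermediate operation to run once per branch.

```python
import dask.dataframe as dd

df = dd.read_parquet("lineitem.parquet")  # hypothetical file

# An operation between the read and the column restrictions:
# splitting the read into one per branch would run this replace twice.
df = df.replace(0, 1)

left = df[["l_orderkey"]]   # branch 1 needs only l_orderkey
right = df[["l_suppkey"]]   # branch 2 needs only l_suppkey
```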
From a preliminary look at the optimized graph, one issue might be that we don't properly push projections into the parquet reads:
Snippet from the graph:
I'd expect `ReadParquet` to only read `['l_orderkey', 'l_suppkey']`. Combined with dask/dask-expr#854, this appears to be fairly catastrophic.
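A minimal sketch of how one might check this (the file path is an assumption; `optimize()` and `pprint()` are the dask-expr collection methods for inspecting the optimized expression tree):

```python
import dask.dataframe as dd

# Hypothetical reproduction: after optimization, the ReadParquet node
# should carry columns=['l_orderkey', 'l_suppkey'] rather than all columns.
df = dd.read_parquet("lineitem.parquet")  # path is an assumption
subset = df[["l_orderkey", "l_suppkey"]]
subset.optimize().pprint()  # inspect which columns ReadParquet plans to load
```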