You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
TPC-DS q72 is intentionally designed to use a bad join order that causes the initial join to produce billions of rows that are then filtered out by later joins. This query is great for testing database join reordering rules, but that isn't very relevant to Spark or Comet, since Spark does not perform join reordering by default. Theoretically, it can be enabled but requires statistics to be generated upfront, and this isn't the typical use case for Spark.
I would rather we spend time making Comet perform well for real-world use cases than spending too much time on the original version of q72.
Since we are not ever planning on running or publishing official TPC-DS benchmarks, I propose that we update our benchmarks derived from TPC-DS to add an option where we can choose to run the official version of q72 or a modified version that uses a sensible join order (which makes it run at least 10x faster and uses much less memory).
The purpose of our benchmarking is to show relative performance between Spark and Comet, so as long as we are running the same queries against both, I think this is fine as long as we document this modification in our benchmarking guide.
We should still retain the option to run the original query though, because we do want to make sure that we can run it in Comet, but I think it is wasteful to be running the original version in CI because of the system resources that it requires.
Here is a version of q72 that uses a more sensible join order. This could possibly be optimized farther.
select i_item_desc
,w_warehouse_name
,d1.d_week_seq
,sum(case when p_promo_sk is null then 1 else 0 end) no_promo
,sum(case when p_promo_sk is not null then 1 else 0 end) promo
,count(*) total_cnt
from catalog_sales
join date_dim d1 on (cs_sold_date_sk =d1.d_date_sk)
join customer_demographics on (cs_bill_cdemo_sk = cd_demo_sk)
join household_demographics on (cs_bill_hdemo_sk = hd_demo_sk)
join item on (i_item_sk = cs_item_sk)
join inventory on (cs_item_sk = inv_item_sk)
join warehouse on (w_warehouse_sk=inv_warehouse_sk)
join date_dim d2 on (inv_date_sk =d2.d_date_sk)
join date_dim d3 on (cs_ship_date_sk =d3.d_date_sk)
left outer join promotion on (cs_promo_sk=p_promo_sk)
left outer join catalog_returns on (cr_item_sk = cs_item_sk and cr_order_number = cs_order_number)
whered1.d_week_seq=d2.d_week_seqand inv_quantity_on_hand < cs_quantity
andd3.d_date>d1.d_date+5and hd_buy_potential ='501-1000'andd1.d_year=1999and cd_marital_status ='S'group by i_item_desc,w_warehouse_name,d1.d_week_seqorder by total_cnt desc, i_item_desc, w_warehouse_name, d_week_seq
LIMIT100;
Describe the potential solution
No response
Additional context
No response
The text was updated successfully, but these errors were encountered:
What is the problem the feature request solves?
TPC-DS q72 is intentionally designed to use a bad join order that causes the initial join to produce billions of rows that are then filtered out by later joins. This query is great for testing database join reordering rules, but that isn't very relevant to Spark or Comet, since Spark does not perform join reordering by default. Theoretically, it can be enabled but requires statistics to be generated upfront, and this isn't the typical use case for Spark.
I would rather we spend time making Comet perform well for real-world use cases than spending too much time on the original version of q72.
Since we are not ever planning on running or publishing official TPC-DS benchmarks, I propose that we update our benchmarks derived from TPC-DS to add an option where we can choose to run the official version of q72 or a modified version that uses a sensible join order (which makes it run at least 10x faster and uses much less memory).
The purpose of our benchmarking is to show relative performance between Spark and Comet, so as long as we are running the same queries against both, I think this is fine as long as we document this modification in our benchmarking guide.
We should still retain the option to run the original query though, because we do want to make sure that we can run it in Comet, but I think it is wasteful to be running the original version in CI because of the system resources that it requires.
Here is a version of q72 that uses a more sensible join order. This could possibly be optimized farther.
Describe the potential solution
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: