Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Common Subexpression Elimination optimizer rule #942

Open
Tracked by #391
andygrove opened this issue Sep 13, 2024 · 7 comments
Open
Tracked by #391

Implement Common Subexpression Elimination optimizer rule #942

andygrove opened this issue Sep 13, 2024 · 7 comments
Labels
enhancement New feature or request performance
Milestone

Comments

@andygrove
Copy link
Member

andygrove commented Sep 13, 2024

What is the problem the feature request solves?

When running TPC-H q1 in Spark/Comet, the expression l_extendedprice#21 * (1 - l_discount#22) appears twice in the query and currently gets evaluated twice. This could be optimized out so that it is only evaluated once. I was able to test this by manually rewriting the query.

Original Query

select
	l_returnflag,
	l_linestatus,
	sum(l_quantity) as sum_qty,
	sum(l_extendedprice) as sum_base_price,
	sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
	sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
	avg(l_quantity) as avg_qty,
	avg(l_extendedprice) as avg_price,
	avg(l_discount) as avg_disc,
	count(*) as count_order
from
	lineitem
where
	l_shipdate <= date '1998-12-01' - interval '68 days'
group by
	l_returnflag,
	l_linestatus
order by
	l_returnflag,
	l_linestatus;

Optimized Query

select
	l_returnflag,
	l_linestatus,
	sum(l_quantity) as sum_qty,
	sum(l_extendedprice) as sum_base_price,
	sum(foo) as sum_disc_price,
	sum(foo * (1 + l_tax)) as sum_charge,
	avg(l_quantity) as avg_qty,
	avg(l_extendedprice) as avg_price,
	avg(l_discount) as avg_disc,
	count(*) as count_order
from (select
          l_returnflag,
          l_linestatus,
          l_quantity,
          l_extendedprice,
          l_extendedprice * (1 - l_discount) as foo,
          l_tax,
          l_discount
  from lineitem
  where
	l_shipdate <= date '1998-12-01' - interval '68 days')
group by
	l_returnflag,
	l_linestatus
order by
	l_returnflag,
	l_linestatus;

Timings (Original)

13.752424478530884,
11.648030281066895,
11.35965895652771,
11.335061311721802,
11.383593797683716,
11.291191101074219,
11.31091046333313,
11.351991653442383,
11.32134222984314,
11.374904155731201

Timings (Optimized)

12.734412908554077,
10.684742212295532,
10.157625198364258,
10.06518030166626,
10.043614149093628,
9.986022233963013,
9.939271688461304,
9.925782918930054,
10.024176120758057,
10.018519401550293

Spark UI (Original)

Screenshot from 2024-09-13 12-16-22

Spark UI (Optimized)

Screenshot from 2024-09-13 12-15-51

Describe the potential solution

No response

Additional context

No response

@andygrove andygrove added enhancement New feature or request performance labels Sep 13, 2024
@viirya
Copy link
Member

viirya commented Sep 13, 2024

Good finding. I think this kind of optimization should be in Spark optimizer instead.

@viirya
Copy link
Member

viirya commented Sep 13, 2024

I remember Spark SQL has corresponding optimization rule. But not sure why it doesn't affect the query.

@andygrove
Copy link
Member Author

Related to this, it would be nice if we could improve the metrics for CometHashAggregate to show the time for evaluating the aggregate input expressions. I am not sure how much work that would be though.

@viirya
Copy link
Member

viirya commented Sep 15, 2024

Related to this, it would be nice if we could improve the metrics for CometHashAggregate to show the time for evaluating the aggregate input expressions. I am not sure how much work that would be though.

Sounds good. It should be added into DataFusion hash aggregate operator.

@andygrove andygrove changed the title Implement Common Subquery Elimination optimizer rule Implement Common Subexpression Elimination optimizer rule Sep 23, 2024
@andygrove
Copy link
Member Author

andygrove commented Sep 24, 2024

Good finding. I think this kind of optimization should be in Spark optimizer instead.

It would make sense for Spark to add this, but I think that it could also be beneficial for DataFusion to support this as a physical optimizer rule.

I filed apache/datafusion#12599

@eejbyfeldt
Copy link
Contributor

Spark has it, but not at the plan level. Instead they do it as part of their code generation: https://github.com/apache/spark/blob/v3.5.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1064-L1098C1

@andygrove
Copy link
Member Author

There is now a DataFusion PR to add this feature: apache/datafusion#13046

@andygrove andygrove added this to the 0.4.0 milestone Oct 22, 2024
@andygrove andygrove modified the milestones: 0.4.0, 0.5.0 Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
Development

No branches or pull requests

3 participants