[SPARK-22122][SQL] Use analyzed logical plans to count input rows in TPCDSQueryBenchmark #19344
Conversation
Test build #82167 has finished for PR 19344 at commit

@gatorsmile if you get time, please check this. Thanks.

ping
Why not use the plan that has already been analyzed?
The analyzer rule CTESubstitution will replace With nodes.
Oh, yeah. Since the original code does so, I just followed that logic. But the suggestion sounds good to me, so I'll update soon. Thanks.
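The point of this thread can be sketched with a toy model: Spark's CTESubstitution analyzer rule inlines WITH definitions, so a single traversal of the analyzed plan sees every base relation, including ones reachable only through a CTE. The classes below are hypothetical stand-ins for illustration, not Spark's actual LogicalPlan nodes.

```scala
// Toy plan nodes (NOT Spark's classes) modeling parsed vs. analyzed plans.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Join(left: Plan, right: Plan) extends Plan
case class With(child: Plan, ctes: Map[String, Plan]) extends Plan

// CTESubstitution-like rewrite: replace each reference to a CTE name with
// its definition, recursively, so no With node survives analysis.
def substituteCtes(plan: Plan): Plan = plan match {
  case With(child, ctes) =>
    def inline(p: Plan): Plan = p match {
      case Relation(n) if ctes.contains(n) => inline(substituteCtes(ctes(n)))
      case Join(l, r)                      => Join(inline(l), inline(r))
      case other                           => other
    }
    inline(substituteCtes(child))
  case Join(l, r) => Join(substituteCtes(l), substituteCtes(r))
  case other      => other
}

// Collect base relations from an already-substituted ("analyzed") plan.
def collectRelations(plan: Plan): Set[String] = plan match {
  case Relation(n)    => Set(n)
  case Join(l, r)     => collectRelations(l) ++ collectRelations(r)
  case With(child, _) => collectRelations(child) // unreachable after substitution
}

// A q2-like shape: a CTE over web_sales/catalog_sales joined with date_dim.
val parsed: Plan = With(
  Join(Relation("wscs"), Relation("date_dim")),
  Map("wscs" -> Join(Relation("web_sales"), Relation("catalog_sales"))))

val analyzed = substituteCtes(parsed)
println(collectRelations(analyzed)) // all three inputs, not just date_dim
```

Traversing the parsed plan would stop at the opaque alias wscs; traversing the substituted plan finds all three input relations, which is exactly why counting rows from the analyzed plan is more accurate.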
Force-pushed from 0df2663 to dd84919
If the plan is successfully analyzed, UnresolvedRelation should not exist
Test build #82280 has finished for PR 19344 at commit

Test build #82281 has finished for PR 19344 at commit
Force-pushed from 3f31048 to f7359af
@gatorsmile OK, fixed. Also, I checked that this code can collect all the relations.
Force-pushed from f7359af to 489f2a2
Test build #82301 has finished for PR 19344 at commit

Test build #82303 has finished for PR 19344 at commit
```scala
    case _ =>
  }
}
spark.sql(queryString).queryExecution.analyzed.map {
```
foreach
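The suggestion above reflects the usual Scala idiom: map allocates a result collection that is immediately discarded when the body runs only for its side effect, while foreach states that intent directly. A minimal, Spark-free illustration (the names below are stand-ins for the benchmark's variables):

```scala
import scala.collection.mutable

// Stand-in for the benchmark's mutable set of input relation names.
val queryRelations = mutable.Set[String]()
// Stand-in for the node names a plan traversal would visit.
val planNodes = Seq("date_dim", "web_sales", "web_sales")

// `map` here would build and throw away a Seq[Unit]; `foreach` performs
// the side effect (populating the set) without constructing a result.
planNodes.foreach(name => queryRelations.add(name))

println(queryRelations.toSet) // duplicates collapse in the set
```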
```scala
}
spark.sql(queryString).queryExecution.analyzed.map {
  case SubqueryAlias(name, _: LogicalRelation) =>
    queryRelations.add(name)
```
Add another case for HiveTableRelation
IIUC ditto; HiveTableRelation never happens here.
```scala
  }
}
spark.sql(queryString).queryExecution.analyzed.map {
  case SubqueryAlias(name, _: LogicalRelation) =>
```
Why not use LogicalRelation's catalogTable? Just throw an exception if it is None. I think this benchmark will not hit None.
I checked again, and I found we can't use catalogTable here because these TPCDS tables are local temporary views (IIUC, these tables are always transformed into SubqueryAlias(LocalRelation)).
Force-pushed from 6dfb004 to 00cfb21
Test build #82309 has finished for PR 19344 at commit

Test build #82311 has finished for PR 19344 at commit
Force-pushed from 00cfb21 to 5691cf6
Test build #82314 has finished for PR 19344 at commit

retest this please
```scala
  }
}
spark.sql(queryString).queryExecution.analyzed.foreach {
  case SubqueryAlias(name, _: LogicalRelation) =>
```
Could we add all three scenarios, even though the current code only uses temp views?
sure, will do
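A hedged sketch of what covering the three scenarios (data-source tables, Hive tables, and local temp views) might look like. The case classes below are toy stand-ins that merely echo Spark's class names; they are not Spark's API, and the exact match arms in the merged PR may differ.

```scala
// Toy stand-ins for the three relation shapes discussed in the review
// (NOT Spark's real classes).
sealed trait Node
case class LogicalRelation(table: String) extends Node   // data-source table
case class HiveTableRelation(table: String) extends Node // Hive table
case class LocalRelation(rows: Seq[Int]) extends Node    // temp view's data
case class SubqueryAlias(name: String, child: Node) extends Node

// One match arm per relation shape; anything else is not an input relation.
def relationName(node: Node): Option[String] = node match {
  case SubqueryAlias(name, _: LogicalRelation) => Some(name)
  case SubqueryAlias(name, _: LocalRelation)   => Some(name)
  case h: HiveTableRelation                    => Some(h.table)
  case _                                       => None
}

println(relationName(SubqueryAlias("date_dim", LocalRelation(Seq(1, 2)))))
```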
Test build #82329 has finished for PR 19344 at commit

Test build #82340 has finished for PR 19344 at commit
Force-pushed from cbac959 to 8d8a9ff
LGTM pending Jenkins

Test build #82341 has finished for PR 19344 at commit
btw, could we also add

These modified test cases do not follow the standard: Impala added extra (partition) predicates, so the perf results are misleading.

Thanks! Merged to master.

ok, thanks!
What changes were proposed in this pull request?
Since the current code ignores WITH clauses when collecting input relations in TPCDS queries, the per-row processing times reported by the benchmark are inaccurate. For example, in q2, this fix catches all the input relations: web_sales, date_dim, and catalog_sales (the current code catches date_dim only). About one-third of the TPCDS queries use WITH clauses, so I think this is worth fixing.

How was this patch tested?
Manually checked.
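For illustration, here is a simplified query in the spirit of q2 (this is not the actual TPCDS query text). The WITH clause hides web_sales and catalog_sales behind the alias wscs, so counting input rows from the unanalyzed plan, which stops at the opaque alias, would find date_dim alone.

```sql
-- Simplified sketch of the q2 pattern: three input relations,
-- two of them reachable only through the CTE.
WITH wscs AS (
  SELECT ws_sold_date_sk AS sold_date_sk, ws_ext_sales_price AS sales_price
  FROM web_sales
  UNION ALL
  SELECT cs_sold_date_sk, cs_ext_sales_price
  FROM catalog_sales
)
SELECT d_week_seq, SUM(sales_price)
FROM wscs JOIN date_dim ON sold_date_sk = d_date_sk
GROUP BY d_week_seq;
```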