Skip to content

Conversation

@xdcjie
Copy link

@xdcjie xdcjie commented May 29, 2018

What changes were proposed in this pull request?

Transform query do not have Project Node, query like:

select transform(a, b) using 'func' from e

and it' logic plan is:

'ScriptTransformation['a, 'b], func, [key#0, value#0]
+- 'UnresolvedRelation e (PlanTest.scala:97)

so that it will scan all the column data of relation.

In this PR, I propose to add Project Node for transform query, so that it scan required data by prune columns, for above transform query, it's logic plan will be:

'ScriptTransformation['a, 'b], func, [key#0, value#0]
+- 'Project ['a, 'b]
+- 'UnresolvedRelation e

it will scan only two column data of relation.

In summary, Add Project Node for transform query can reduce the time of scan and assemble data.(In our scenario, the relation(table) have 700 columns.)

How was this patch tested?

Modify existing test ("transform query spec")

@maropu
Copy link
Member

maropu commented May 29, 2018

@gatorsmile Can you trigger this?

// Add project.
val namedExpressions = expressions.map {
case e: NamedExpression => e
case e: Expression => UnresolvedAlias(e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: case e: _ => UnresolvedAlias(e)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type of expressions is Expression, so i think
case e: _ => UnresolvedAlias(e) and case e: Expression => UnresolvedAlias(e) is equivalent.
Did have other reasons to change this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a style issue.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style is updated, review this, please.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, "case e: _ => UnresolveAlias(e)" will occur "unbound wildcard type" complie error. I will revert to "case e: Expression => UnresolvedAlias(e)".

@maropu
Copy link
Member

maropu commented May 29, 2018

Could you add explain result differences with/without this pr in the description?

@xdcjie
Copy link
Author

xdcjie commented May 29, 2018

@maropu I updated the commet. In summary, with this pr can reduce the time of scan and assemble data. In our scenario, the relation(table) have 700 columns.

@xdcjie
Copy link
Author

xdcjie commented May 31, 2018

@maropu @gatorsmile Do you have any comment/suggestion for this PR? Thanks.

@xdcjie
Copy link
Author

xdcjie commented Jun 5, 2018

@maropu @gatorsmile @liancheng @HyukjinKwon Can you help me review this pr ? Thanks。 #

@maropu
Copy link
Member

maropu commented Jun 16, 2018

kindly pining @gatorsmile @HyukjinKwon @ueshin

@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Jul 11, 2018

Test build #92842 has finished for PR 21447 at commit 3997ceb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xdcjie
Copy link
Author

xdcjie commented Jul 12, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Jul 12, 2018

Test build #92916 has started for PR 21447 at commit 958888b.

@xdcjie
Copy link
Author

xdcjie commented Jul 13, 2018

Test build havs finished, be killed, why the result not shows here? retest this please!

@HyukjinKwon
Copy link
Member

Wait, why is it against branch-2.2 not master?

@SparkQA
Copy link

SparkQA commented Jul 16, 2018

Test build #93103 has finished for PR 21447 at commit 12b688d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xdcjie
Copy link
Author

xdcjie commented Jul 17, 2018

@HyukjinKwon Our project is based on the branch-2.2, we will merge the patch to local branch manually if it is against master. we prefer following community to local branch.

If you don't like this approach, I will close the PR and make a new PR to against master.

@HyukjinKwon
Copy link
Member

Yea, usually we do it for master first and then backport it to other branches to reduce the diff and master has the fix. I would appreciate it if you go in this way.

@gatorsmile
Copy link
Member

Sorry, the fix does not look good to me. We should let the optimizer add the project automatically.

@HyukjinKwon
Copy link
Member

@xdcjie, mind closing this one?

@xdcjie
Copy link
Author

xdcjie commented Jul 17, 2018

Ok!

@xdcjie xdcjie closed this Jul 17, 2018
@gatorsmile
Copy link
Member

@maropu Do you want to take this over and add such a project in ColumnPruning?

@xuanyuanking
Copy link
Member

I want to give a follow up PR and cc @gatorsmile @maropu for a review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants