Conversation

@dongjoon-hyun
Member

What changes were proposed in this pull request?

This PR implements the `stack` table generating function.

How was this patch tested?

Pass the Jenkins tests including new testcases.
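
For context on what the function does: `stack(n, v1, ..., vk)` separates the k data values into n rows, padding the last row with null when k is not a multiple of n. A minimal plain-Scala sketch of that layout (the helper name `stackRows` is illustrative only, not Spark's API):

```scala
// Sketch of stack's semantics (not Spark's actual implementation):
// distribute the k data values into n rows of ceil(k/n) columns,
// padding the final row with null when k is not a multiple of n.
def stackRows(n: Int, values: Seq[Any]): Seq[Seq[Any]] = {
  val numFields = math.ceil(values.length.toDouble / n).toInt
  (0 until n).map { row =>
    (0 until numFields).map { col =>
      val idx = row * numFields + col
      if (idx < values.length) values(idx) else null
    }
  }
}
```

For example, `stackRows(2, Seq("a", "b", "c"))` yields two rows, `("a", "b")` and `("c", null)`, matching the testcase discussed later in this review.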

@SparkQA

SparkQA commented Jul 3, 2016

Test build #61675 has finished for PR 14033 at commit 6de93a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Stack(children: Seq[Expression])

@dongjoon-hyun
Member Author

Rebased to resolve conflicts.

@dongjoon-hyun
Member Author

cc @rxin and @cloud-fan .

@SparkQA

SparkQA commented Jul 3, 2016

Test build #61693 has finished for PR 14033 at commit d6691f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Stack(children: Seq[Expression])

@SparkQA

SparkQA commented Jul 3, 2016

Test build #61696 has finished for PR 14033 at commit f02e1dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Stack(children: Seq[Expression])

Contributor

Since we override checkInputDataTypes here, ImplicitCastInputTypes is useless now.
We need to handle all type-check logic in checkInputDataTypes ourselves.

Member Author

Thank you for review again, @cloud-fan .
For this, I added type casting tests here.
https://github.com/apache/spark/pull/14033/files#diff-a2587541e08bf6e23df33738488d070aR30

Did I miss something there?

Contributor

E.g., what if the first argument is not int type? I'm also surprised that stack(1, 1.0, 2) works; we will cast 1.0 to int type, according to the definition of inputTypes.

Member Author

Oh, that was a misleading comment. The first argument, 1, is the number of rows; its type is checked by the type-checker.
The type of the first data argument, 1.0, determines the types of the following values.

Member Author

Should I revise the description to state more clearly that the first data value's type determines the rest?

Member Author

scala> sql("select stack(1.0,2,3)");
java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to java.lang.Integer

Contributor

We should throw AnalysisException instead of ClassCastException; the type checking is not working here.
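
A sketch of the kind of analysis-time guard being asked for, with stand-in types for Spark's `TypeCheckResult` (names and message text here are illustrative, not the PR's exact code):

```scala
// Stand-ins for Spark's type-check result types, for illustration only.
sealed trait TypeCheckResult
case object TypeCheckSuccess extends TypeCheckResult
case class TypeCheckFailure(message: String) extends TypeCheckResult

// The row-count argument must be an integer; reporting the mismatch from
// checkInputDataTypes surfaces it as an analysis-time failure instead of
// a runtime ClassCastException.
def checkNumRowsArg(dataTypeName: String): TypeCheckResult =
  if (dataTypeName == "IntegerType") TypeCheckSuccess
  else TypeCheckFailure(
    s"The number of rows must be a constant integer, but got $dataTypeName")
```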

Member Author

Oh, now I see what you mean! You're right, I missed that.
I'll add the logic and a testcase. Thank you again.

@SparkQA

SparkQA commented Jul 4, 2016

Test build #61705 has finished for PR 14033 at commit e21bdd9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 4, 2016

Test build #61707 has finished for PR 14033 at commit 6b82f6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 4, 2016

Test build #61709 has finished for PR 14033 at commit 3e3a794.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Can you check what the type coercion rule is for Hive? It looks to me that the values for the same column should have the same type, but not all values need to be the same type.

Member Author

Oh, I overlooked that. Yes, you're right: it should be applied per column.
Let me check and investigate more. Thank you for pointing that out.

Member Author

Interesting. Here is the result from Hive. I'll fix this PR to match Hive's behavior.

hive> select stack(2, 2.0, 3, 4, 5.0);
FAILED: UDFArgumentException Argument 1's type (double) should be equal to argument 3's type (int)

Member Author

Anyway, sorry for this. I completely misunderstood the behavior of this function.

@dongjoon-hyun
Member Author

Hi, @cloud-fan .
I rewrote the testcases and logic.
I squashed the commits into a single one because the diff had become messy.

@SparkQA

SparkQA commented Jul 4, 2016

Test build #61721 has finished for PR 14033 at commit 51738d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Stack(children: Seq[Expression])


private lazy val numRows = try {
children.head.eval().asInstanceOf[Int]
} catch {
Contributor

I don't think we need a try-catch here; it's guaranteed to be an int after the type checking.

Member Author

This is needed since numRows and numFields are used in elementSchema before checkInputDataTypes runs.

Contributor

elementSchema is a method; where do we call it before the type checking?

Member Author

Oh, indeed. Without that, all tests pass; elementSchema is not called beforehand.
During development, I thought I had found a case requiring it, but I must have been confused by some mixed cases.

@dongjoon-hyun
Member Author

Thank you for the fast review as always, @cloud-fan!
I fixed one of the three comments.

@SparkQA

SparkQA commented Jul 4, 2016

Test build #61729 has finished for PR 14033 at commit 4c5ed83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 5, 2016

Test build #61741 has finished for PR 14033 at commit 9d93ddc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 5, 2016

Test build #61742 has finished for PR 14033 at commit 8abcab5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case class Stack(children: Seq[Expression])
extends Expression with Generator with CodegenFallback {

private lazy val numRows = children.head.eval().asInstanceOf[Int]
Contributor

Can you explain a bit more about this? It would be good if we could express the logic more clearly using math.ceil or something.

Member Author

Sure. I will use math.ceil to make it clearer.
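
For reference, the arithmetic under discussion can be written directly with `math.ceil` (a sketch of the computation only; `childrenLength` counts the row-count argument plus the data values):

```scala
// numFields = ceil(k / n), where k is the number of data values
// (children.length - 1, excluding the row-count argument) and n is numRows.
// Using 1.0 forces floating-point division so ceil behaves as intended.
def numFields(childrenLength: Int, numRows: Int): Int =
  math.ceil((childrenLength - 1.0) / numRows).toInt
```

E.g., stack(2, "a", "b", "c") has 4 children and 2 rows, so each row gets ceil(3 / 2) = 2 fields, with the missing slot padded by null.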

@dongjoon-hyun
Member Author

Thank you so much as always, @cloud-fan!

case (e, index) => StructField(s"col$index", e.dataType)
})

override def eval(input: InternalRow): TraversableOnce[InternalRow] = {
Contributor

It's better to call toArray here, as we will access it by index in a loop.

Member Author

Oh, right!
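
The suggestion is a small performance point: indexing a `Seq` (possibly a `List`, with O(n) access per element) inside a loop is slower than indexing an `Array`. A minimal sketch of the pattern, with illustrative values:

```scala
// Materializing the Seq once up front makes each arr(i) lookup O(1),
// instead of potentially O(i) on a linear Seq such as List.
val values: Seq[Int] = List(10, 20, 30)
val arr: Array[Int] = values.toArray
val total = arr.indices.map(i => arr(i)).sum
```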


checkTuple(
Stack(Seq(2, "a", "b", "c").map(Literal(_))),
Seq(create_row("a", "b"), create_row("c", null)))
Contributor

Also add some test cases for type-checking failures.

Member Author

Sure!

Member Author

dongjoon-hyun commented Jul 5, 2016

Oh, we cannot test that here since this is Expression-level testing.

I remembered why I added the try-catch before (it has since been removed):

checkTuple(Stack(Seq(1.0).map(Literal(_))), Seq(create_row()))
...
java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Integer

Contributor

Member Author

Thank you for the pointer. Nice!

@cloud-fan
Contributor

LGTM except some minor comments, thanks for working on it!

@dongjoon-hyun
Member Author

It's my pleasure. I'm still learning Spark. :)

@dongjoon-hyun
Member Author

Hi, @cloud-fan .
I added toArray, but couldn't add type-checking tests in the expression-level suite.
I hope that is okay with you. If not, please let me know.

@SparkQA

SparkQA commented Jul 5, 2016

Test build #61759 has finished for PR 14033 at commit 9913dd2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Now the testcases are much stronger.

@SparkQA

SparkQA commented Jul 5, 2016

Test build #61772 has finished for PR 14033 at commit 6cef803.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

The master branch is broken now. I'll retry the test later.

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Jul 5, 2016

Test build #61769 has finished for PR 14033 at commit 7ed75b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 5, 2016

Test build #61777 has finished for PR 14033 at commit 6cef803.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* Separate v1, ..., vk into n rows. Each row will have k/n columns. n must be constant.
* {{{
Contributor

nit: double )

Member Author

Oops. Thank you, @tejasapatil !

@SparkQA

SparkQA commented Jul 6, 2016

Test build #61796 has finished for PR 14033 at commit 4445248.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private lazy val numFields = Math.ceil((children.length - 1.0) / numRows).toInt

override def checkInputDataTypes(): TypeCheckResult = {
if (children.length <= 1) {
Contributor

Can you also include the number of args passed, and the args themselves?

Member Author

What do you mean? Could you give an example of what you want?

Contributor

Sorry, I merged this PR before seeing your comments. Yes, including the number of args makes the error message friendlier, but it's not a big deal. @dongjoon-hyun, you can fix it in your next PR.

Member Author

Ah, now I understand. Sure, I'll do it later.
This pattern is common, so maybe we can fix all such error messages together in a single PR.

@asfgit asfgit closed this in d0d2850 Jul 6, 2016
@cloud-fan
Contributor

thanks, merging to master!

@dongjoon-hyun
Member Author

Thank you for merging, @cloud-fan !

asfgit pushed a commit that referenced this pull request Jul 8, 2016
This PR implements the `stack` table generating function.

Pass the Jenkins tests including new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14033 from dongjoon-hyun/SPARK-16286.

(cherry picked from commit d0d2850)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@dongjoon-hyun dongjoon-hyun deleted the SPARK-16286 branch July 20, 2016 07:42
@cloud-fan
Contributor

just found out that we didn't implement a type coercion rule for stack...

@dongjoon-hyun
Member Author

Oh, I'll look at that, @cloud-fan .

@dongjoon-hyun
Member Author

dongjoon-hyun commented Mar 10, 2017

Do you mean column-wise type coercion, to pass the following case (unlike Hive)?

hive> select stack(2,1,'a',3,2);
FAILED: UDFArgumentException Argument 2's type (string) should be equal to argument 4's type (int)

@cloud-fan
Contributor

Let's follow Hive; e.g., select stack(2, 1, null, 3, 2) can be processed.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Mar 10, 2017

Ah, I understand the case now. Hive allows NULL. I'll try to support the NULL case. Thanks, @cloud-fan.

hive> select stack(3, 1, 'a', 2, 'b', 3, 'c');
OK
1	a
2	b
3	c

hive> select stack(3, 1, 'a', 2, 'b', 3, null);
OK
1	a
2	b
3	NULL

hive> select stack(3, 1, 'a', 2, 'b', 3, 1);
FAILED: UDFArgumentException Argument 2's type (string) should be equal to argument 6's type (int)

hive> select stack(3, 1, NULL, 2, 'b', 3, 1);
FAILED: UDFArgumentException Argument 2's type (void) should be equal to argument 6's type (int)

@dongjoon-hyun
Member Author

I created the JIRA issue. I'll make a PR soon.
https://issues.apache.org/jira/browse/SPARK-19910
