[SPARK-16286][SQL] Implement stack table generating function #14033
Conversation
Test build #61675 has finished for PR 14033 at commit
Rebased to resolve conflicts.
cc @rxin and @cloud-fan.
Test build #61693 has finished for PR 14033 at commit
Test build #61696 has finished for PR 14033 at commit
Since we override checkInputDataTypes here, ImplicitCastInputTypes is useless now.
We need to take care of all the type-check logic in checkInputDataTypes ourselves.
Thank you for the review again, @cloud-fan.
For this, I added type casting tests here:
https://github.com/apache/spark/pull/14033/files#diff-a2587541e08bf6e23df33738488d070aR30
Did I miss something there?
E.g., what if the first argument is not of int type? I'm also surprised that stack(1, 1.0, 2) works; according to the definition of inputTypes, we would cast 1.0 to int type.
Oh, there is a misleading comment. The first argument, 1, is the number of rows; its type is checked by the type checker.
The type of the first data argument, 1.0, governs the following ones.
Should I reword the description, "the first data type rules", to be clearer?
scala> sql("select stack(1.0,2,3)");
java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to java.lang.Integer
We should throw an AnalysisException instead of a ClassCastException; the type checking is not working here.
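To make the suggestion concrete, a minimal sketch of such a check inside the Stack expression might look like this (my illustration of the idea, not the exact code in this PR; the error wording is made up):

import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.types.IntegerType

// Sketch: reject a non-integer (or non-constant) first argument during
// analysis, so the user gets an AnalysisException instead of a runtime
// ClassCastException like the one above.
override def checkInputDataTypes(): TypeCheckResult = {
  if (children.length <= 1) {
    TypeCheckResult.TypeCheckFailure(s"$prettyName requires at least 2 arguments.")
  } else if (children.head.dataType != IntegerType || !children.head.foldable) {
    TypeCheckResult.TypeCheckFailure("The number of rows must be a constant integer.")
  } else {
    TypeCheckResult.TypeCheckSuccess
  }
}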
Oh, now I see what you mean! You're right, I missed that.
I'll add the logic and a test case. Thank you again.
Test build #61705 has finished for PR 14033 at commit
Test build #61707 has finished for PR 14033 at commit
Test build #61709 has finished for PR 14033 at commit
Can you check what the type coercion rule for Hive is? It looks to me that the values within the same column should have the same type, but not all values need to have the same type.
Oh, I overlooked that. Yes, you're right: it should be applied per column.
Hmm, let me check and investigate more. Thank you for pointing that out.
Interesting. Here is the result from Hive. I'll fix this PR to match Hive.
hive> select stack(2, 2.0, 3, 4, 5.0);
FAILED: UDFArgumentException Argument 1's type (double) should be equal to argument 3's type (int)
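To sketch what a Hive-style, column-wise check could look like (assuming numRows and numFields are defined as elsewhere in this PR; illustrative, not the exact merged code):

// Inside checkInputDataTypes(), after the first-argument checks: every
// value must have the same type as the first value in its column,
// mirroring Hive's error above.
for (i <- 1 until children.length) {
  val ref = (i - 1) % numFields + 1  // the first argument of the same column
  if (children(i).dataType != children(ref).dataType) {
    return TypeCheckResult.TypeCheckFailure(
      s"Argument $i (${children(i).dataType}) should be equal to " +
      s"argument $ref's type (${children(ref).dataType})")
  }
}
TypeCheckResult.TypeCheckSuccess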
Anyway, sorry for this. I completely misunderstood the behavior of this function.
Hi, @cloud-fan.
Test build #61721 has finished for PR 14033 at commit
private lazy val numRows = try {
  children.head.eval().asInstanceOf[Int]
} catch {
I don't think we need a try-catch here; it's guaranteed to be an int after the type checking.
This is needed since numRows and numFields are used in elementSchema before checkInputDataTypes.
elementSchema is a method; where do we call it before the type checking?
Oh, indeed. Without that, all tests pass; elementSchema is not called before then.
During development, I thought I had found a case that needed it, but I must have been confused by some mixed cases.
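For context, the elementSchema being discussed is derived from the value expressions; a rough reconstruction from the diff fragments quoted later in this review (illustrative, not the verbatim merged code):

import org.apache.spark.sql.types.{StructField, StructType}

// The output schema has numFields columns named col0, col1, ..., typed
// after the first row of values. numRows and numFields must therefore be
// computable whenever elementSchema is evaluated.
override def elementSchema: StructType =
  StructType(children.tail.take(numFields).zipWithIndex.map {
    case (e, index) => StructField(s"col$index", e.dataType)
  })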
Thank you for the fast review as always, @cloud-fan!
Test build #61729 has finished for PR 14033 at commit
Test build #61741 has finished for PR 14033 at commit
Test build #61742 has finished for PR 14033 at commit
case class Stack(children: Seq[Expression])
  extends Expression with Generator with CodegenFallback {

private lazy val numRows = children.head.eval().asInstanceOf[Int]
Can you explain a bit more about this? It would be good if we could express the logic more clearly using math.ceil or something.
Sure. I will use math.ceil to make it clear.
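As a quick illustration of the arithmetic, with my own example values:

// For stack(2, "a", "b", "c"): children.length = 4 and numRows = 2, so
// numFields = ceil((4 - 1) / 2.0) = ceil(1.5) = 2; the three values fill
// rows ("a", "b") and ("c", null), padding the last row.
val numFields = math.ceil((4 - 1.0) / 2).toInt  // 2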
…dundant modula operation.
Thank you so much as always, @cloud-fan!
case (e, index) => StructField(s"col$index", e.dataType)
})

override def eval(input: InternalRow): TraversableOnce[InternalRow] = {
It's better to call toArray here, as we will access it by index in a loop.
Oh, right!
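A sketch of how eval might look after that change (assuming the row-major layout discussed above; not necessarily the exact merged code):

import org.apache.spark.sql.catalyst.InternalRow

// Materialize the evaluated values into an Array so the nested loops
// below get constant-time indexed access, padding missing fields with null.
override def eval(input: InternalRow): TraversableOnce[InternalRow] = {
  val values = children.tail.map(_.eval(input)).toArray
  for (row <- 0 until numRows) yield {
    val fields = new Array[Any](numFields)
    for (col <- 0 until numFields) {
      val index = row * numFields + col
      fields(col) = if (index < values.length) values(index) else null
    }
    InternalRow(fields: _*)
  }
}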
checkTuple(
  Stack(Seq(2, "a", "b", "c").map(Literal(_))),
  Seq(create_row("a", "b"), create_row("c", null)))
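Note what this test pins down: with three values and two rows, the missing field of the last row is filled with null.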
Also, add some test cases for type-checking failures.
Sure!
Oh, we cannot test that here, since this is Expression-level testing.
I remembered the reason why I added the try-catch before. (Currently, it is removed.)
+ checkTuple(Stack(Seq(1.0).map(Literal(_))), Seq(create_row()))
...
java.lang.Double cannot be cast to java.lang.Integer
java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Integer
You can follow this one: 85f2303#diff-e4663e57952b37150642b33b998715a8R94
Thank you for the pointer. Nice!
LGTM except some minor comments. Thanks for working on it!
It's my pleasure. I'm still learning Spark. :)
Hi, @cloud-fan.
Test build #61759 has finished for PR 14033 at commit
Now the test cases have become much stronger.
Test build #61772 has finished for PR 14033 at commit
The master branch is broken now. I'll retry the test later.
Retest this please
Test build #61769 has finished for PR 14033 at commit
Test build #61777 has finished for PR 14033 at commit
/**
 * Separate v1, ..., vk into n rows. Each row will have k/n columns. n must be constant.
 * {{{
nit: double )
Oops. Thank you, @tejasapatil!
Test build #61796 has finished for PR 14033 at commit
private lazy val numFields = Math.ceil((children.length - 1.0) / numRows).toInt

override def checkInputDataTypes(): TypeCheckResult = {
  if (children.length <= 1) {
Can you also include the number of args passed, and the args themselves?
What do you mean? Could you give an example of what you want?
Sorry, I merged this PR before seeing your comments. Yeah, including the number of args makes the error message friendlier, but it's not a big deal. @dongjoon-hyun, you can fix it in your next PR, by the way.
Ah, now I understand the meaning. Sure, someday later.
The pattern is common, so maybe we can fix all of these error messages together in a single PR.
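For the record, the suggestion amounts to something like this hypothetical wording (not code from this PR):

// Hypothetical failure message that also reports the argument count:
TypeCheckResult.TypeCheckFailure(
  s"$prettyName requires at least 2 arguments, but got ${children.length}.")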
Thanks, merging to master!
Thank you for merging, @cloud-fan!
This PR implements the `stack` table generating function. Pass the Jenkins tests including new test cases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14033 from dongjoon-hyun/SPARK-16286.

(cherry picked from commit d0d2850)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Just found out that we didn't implement a type coercion rule for `stack`.
Oh, I'll look at that, @cloud-fan.
Do you mean column-wise type coercion, to pass the following case (differently from Hive)?
Let's follow Hive, e.g.:
Ah, I understand the case. Hive allows NULL. I'll try to support the NULL case. Thanks, @cloud-fan.
hive> select stack(3, 1, 'a', 2, 'b', 3, 'c');
OK
1 a
2 b
3 c
hive> select stack(3, 1, 'a', 2, 'b', 3, null);
OK
1 a
2 b
3 NULL
hive> select stack(3, 1, 'a', 2, 'b', 3, 1);
FAILED: UDFArgumentException Argument 2's type (string) should be equal to argument 6's type (int)
hive> select stack(3, 1, NULL, 2, 'b', 3, 1);
FAILED: UDFArgumentException Argument 2's type (void) should be equal to argument 6's type (int)
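One way to mirror Hive's behavior above is to let a NULL literal (NullType) satisfy any column type while the non-null values must still agree; a minimal sketch of that idea (my assumption about the follow-up, not its actual code):

import org.apache.spark.sql.types.{DataType, NullType}

// A NULL value is acceptable in any column; otherwise the argument's type
// must match the type established for that column.
private def matchesColumn(actual: DataType, expected: DataType): Boolean =
  actual == NullType || actual == expected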
I created the JIRA issue. I'll make a PR soon.
What changes were proposed in this pull request?

This PR implements the `stack` table generating function.

How was this patch tested?

Pass the Jenkins tests including new test cases.
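For reference, a minimal usage example of the merged function; the col0/col1 names follow the elementSchema naming, and the expected rows follow the test case quoted above (the output shown is what I would expect, not a captured run):

// spark-shell
spark.sql("SELECT stack(2, 1, 2, 3)").show()
// +----+----+
// |col0|col1|
// +----+----+
// |   1|   2|
// |   3|null|
// +----+----+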