fix: Fallback to Spark for lpad/rpad for unsupported arguments & fix negative length handling #2630
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##               main    #2630      +/-   ##
============================================
+ Coverage     56.12%   59.17%    +3.04%
- Complexity      976     1447      +471
============================================
  Files           119      147       +28
  Lines         11743    13743     +2000
  Branches       2251     2360      +109
============================================
+ Hits           6591     8132     +1541
- Misses         4012     4388      +376
- Partials       1140     1223       +83
cc @coderfender since they were looking at this recently: #2099

Thank you for the mention @mbutrovich. @andygrove, let me know if I can help in any way to get this fixed soon.

It would be great if you could review the new test and make sure it covers everything the original tests covered. I believe that the underlying issues are resolved now.

Sure @andygrove

LGTM. Thank you for the prompt fix @andygrove

Moving to draft until #2635 is merged
    is_left_pad,
)?),
Some(string) => {
    if length < 0 {
It's better to put the happy path first in the if statement for compute-intensive parts, so the CPU doesn't eagerly execute instructions for the uncommon branch and then have to fall back.
Thanks. I have updated this.
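For context on the branch being discussed: a negative length is valid input to Spark's lpad/rpad and simply yields an empty string, and it is also the uncommon case compared to normal padding. A minimal spark-shell sketch (illustrative only, not taken from the PR's tests; the expected behavior is an assumption based on Spark's string padding semantics):

// Negative lengths should not error; Spark returns an empty string,
// so Comet needs to match that rather than padding or failing.
spark.sql("SELECT lpad('abc', -1, 'x') AS l, rpad('abc', -1, 'x') AS r").show()
// Both columns are expected to come back as empty strings.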
parthchandra left a comment:
lgtm.
val edgeCases = Seq(
  "é", // unicode 'e\\u{301}'
  "é", // unicode '\\u{e9}'
  "తెలుగు")
Out of curiosity, what makes this an edge case?
The first two were added in #772 to make sure Comet was consistent with Spark even though Rust and Java have different ways of representing unicode and graphemes.
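To make the difference concrete, here is a small illustration (mine, not from the PR): the two strings render the same but are encoded differently, so length-based padding treats them differently.

val precomposed = "\u00e9"   // single code point U+00E9
val combining = "e\u0301"    // 'e' followed by a combining acute accent
println(precomposed.length)  // 1
println(combining.length)    // 2 -- padding to a fixed length sees two characters here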
if (expr.str.isInstanceOf[Literal]) {
  return Unsupported(Some("Scalar values are not supported for the str argument"))
}
if (!expr.pad.isInstanceOf[Literal]) {
I don't know if we'll ever hit this. As far as I can see (in functions.lpad), Spark expects the pad argument to be a literal as well.
Spark doesn't require pad to be a literal:
scala> spark.sql("select a, lpad('foo', 6, a) from t1").show
+---+---------------+
| a|lpad(foo, 6, a)|
+---+---------------+
| $| $$$foo|
| @| @@@foo|
+---+---------------+
For reference, Spark's expression definition defaults pad to a literal:
case class StringRPad(str: Expression, len: Expression, pad: Expression = Literal(" "))
I suppose this is a good example of the benefit of fuzz testing (which is how this issue was discovered). The fuzzer will generate test cases that most developers would not consider. It does seem unlikely that anyone would want to use a column for the pad value, but I suppose it is possible that someone may have that requirement.
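As a concrete illustration of the kind of case a fuzzer surfaces (hypothetical, not from the PR's suite), the pad value can come from a column:

val df = Seq(("$", "foo"), ("@", "bar")).toDF("a", "str")
df.createOrReplaceTempView("t1")
// pad is a column reference here, which Spark accepts even though it is
// rare in hand-written queries
spark.sql("SELECT a, lpad(str, 6, a) FROM t1").show()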
// test all combinations of scalar and array arguments
for (str <- Seq("'hello'", "str")) {
  for (len <- Seq("6", "-6", "0", "len % 10")) {
👍
df.createOrReplaceTempView("t1")

// test all combinations of scalar and array arguments
for (str <- Seq("'hello'", "str")) {
The Spark doc says it also supports binary string input, e.g. unhex('aabb').
Good catch, thanks. That opens up another set of issues! 😭
I added separate tests for binary inputs.
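For reference, a quick spark-shell sketch of the binary case (illustrative; the exact output is an assumption, so verify against your Spark version):

// lpad/rpad also accept BINARY input, per the Spark docs referenced above;
// with binary input the pad argument should be binary as well.
spark.sql("SELECT hex(lpad(unhex('AABB'), 4, unhex('00'))) AS padded").show()
// Expect the value to be left-padded with zero bytes to a 4-byte result.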
hsiang-c left a comment:
LGTM
Which issue does this PR close?
Closes #2624
Closes #2631
Rationale for this change
Fix various bugs that caused queries to fail at runtime.
What changes are included in this PR?
Fall back to Spark for lpad/rpad when arguments are not supported natively, and fix handling of negative length values.
How are these changes tested?
New tests