@andygrove
Member

@andygrove andygrove commented Oct 22, 2025

Which issue does this PR close?

Closes #2624
Closes #2631

Rationale for this change

Fix several bugs that caused lpad/rpad queries to fail at runtime.

What changes are included in this PR?

  • Replace existing tests with one comprehensive test that is re-used for both lpad and rpad
  • Fix handling of negative length
  • Fall back to Spark for literal str argument
  • Fall back to Spark for non-literal pad argument
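
To make the negative-length item concrete, here is a minimal Rust sketch of character-based padding. It is not the code from this PR: `pad_chars` and its parameters are hypothetical names, and returning an empty string for non-positive lengths is my understanding of Spark's behavior rather than something stated in this thread. The guard runs before the cast because a negative `i32` converted with `as usize` becomes a very large value, which is one plausible route to an oversized allocation request.

```rust
/// Pads or truncates `s` to `length` characters (hypothetical helper).
/// Assumptions: a non-positive length yields an empty string, and an
/// empty pad string can only truncate, never extend.
fn pad_chars(s: &str, length: i32, pad: &str, is_left_pad: bool) -> String {
    // Guard first: `length as usize` on a negative value would wrap around.
    if length <= 0 {
        return String::new();
    }
    let target = length as usize;
    let char_count = s.chars().count();

    // Input already long enough (or nothing to pad with): truncate only.
    if char_count >= target || pad.is_empty() {
        return s.chars().take(target).collect();
    }

    // Repeat the pad string, cut to exactly the number of missing characters.
    let fill: String = pad.chars().cycle().take(target - char_count).collect();

    let mut out = String::with_capacity(s.len() + fill.len());
    if is_left_pad {
        out.push_str(&fill);
        out.push_str(s);
    } else {
        out.push_str(s);
        out.push_str(&fill);
    }
    out
}
```

With these assumptions, `pad_chars("foo", 6, "$", true)` gives `"$$$foo"` and `pad_chars("hello", -6, "x", true)` gives an empty string.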

How are these changes tested?

New tests

@codecov-commenter

codecov-commenter commented Oct 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.17%. Comparing base (f09f8af) to head (a265595).
⚠️ Report is 636 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2630      +/-   ##
============================================
+ Coverage     56.12%   59.17%   +3.04%     
- Complexity      976     1447     +471     
============================================
  Files           119      147      +28     
  Lines         11743    13743    +2000     
  Branches       2251     2360     +109     
============================================
+ Hits           6591     8132    +1541     
- Misses         4012     4388     +376     
- Partials       1140     1223      +83     

@andygrove andygrove changed the title from "fix: Fallback to Spark for lpad/rpad if pad argument is not a literal" to "fix: Fallback to Spark for lpad/rpad for unsupported arguments & fix negative length handling" on Oct 22, 2025
@andygrove andygrove marked this pull request as ready for review October 22, 2025 19:44
@mbutrovich
Contributor

cc @coderfender since they were looking at this recently: #2099

@coderfender
Contributor

Thank you for the mention, @mbutrovich. @andygrove, let me know if I can help in any way to get this fixed soon.

@andygrove
Member Author

> Thank you for the mention, @mbutrovich. @andygrove, let me know if I can help in any way to get this fixed soon.

It would be great if you could review the new test and make sure it covers everything the original tests covered. I believe that the underlying issues are resolved now.

@coderfender
Contributor

Sure @andygrove

@coderfender
Contributor

LGTM. Thank you for the prompt fix, @andygrove.

@andygrove andygrove marked this pull request as draft October 23, 2025 17:51
@andygrove
Member Author

Moving to draft until #2635 is merged

@andygrove andygrove marked this pull request as ready for review October 23, 2025 21:39
is_left_pad,
)?),
Some(string) => {
if length < 0 {
Contributor

It's better to put the happy path first in the if statement in compute-intensive parts, so the CPU doesn't eagerly execute instructions for the uncommon branch and then have to fall back.

Member Author

Thanks. I have updated this.
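
For readers following the thread, the reordering being discussed might look roughly like the sketch below. This is not the PR's code: `left_pad` and `pad_column` are hypothetical names, the single-character pad is a simplification, and returning an empty string for a negative length is an assumption. The only point is the branch order, with the expected non-negative case tested first.

```rust
/// Hypothetical helper: left-pads `s` with `pad` up to `target` characters,
/// truncating when `s` is already at least that long.
fn left_pad(s: &str, target: usize, pad: char) -> String {
    let count = s.chars().count();
    if count >= target {
        return s.chars().take(target).collect();
    }
    let mut out: String = std::iter::repeat(pad).take(target - count).collect();
    out.push_str(s);
    out
}

/// Applies `left_pad` to a nullable column, keeping nulls as nulls.
fn pad_column(values: &[Option<&str>], length: i32, pad: char) -> Vec<Option<String>> {
    values
        .iter()
        .map(|value| match value {
            None => None,
            Some(s) => {
                if length >= 0 {
                    // Happy path first: the common, valid length.
                    Some(left_pad(s, length as usize, pad))
                } else {
                    // Uncommon path: negative length (assumed to yield "").
                    Some(String::new())
                }
            }
        })
        .collect()
}
```

Whether the ordering has a measurable effect depends on the compiler and the data, so it is worth benchmarking rather than assuming.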

Contributor

@parthchandra parthchandra left a comment

lgtm.

val edgeCases = Seq(
"é", // unicode 'e\\u{301}'
"é", // unicode '\\u{e9}'
"తెలుగు")
Contributor

Out of curiosity, what makes this an edge case?

Member Author

The first two were added in #772 to make sure Comet was consistent with Spark even though Rust and Java have different ways of representing Unicode and graphemes.
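
A small Rust sketch of why those two strings are useful test inputs: both render as "é", but they differ in length depending on the unit being counted. `unit_counts` is a hypothetical helper for illustration only. Rust strings are UTF-8, while Java strings expose UTF-16 code units, so a padding implementation has to be deliberate about which unit its length argument refers to.

```rust
/// Returns (UTF-8 bytes, UTF-16 code units, Unicode scalar values) for `s`.
/// Hypothetical helper, using only the standard library.
fn unit_counts(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

/// The two "é" spellings from the test: decomposed and precomposed.
fn e_acute_forms() -> (&'static str, &'static str) {
    ("e\u{301}", "\u{e9}")
}
```

Here `unit_counts("e\u{301}")` is `(3, 2, 2)` and `unit_counts("\u{e9}")` is `(2, 1, 1)`. Counting user-perceived characters (grapheme clusters) would give 1 for both, but that needs a segmentation library outside the standard library and is not shown.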

if (expr.str.isInstanceOf[Literal]) {
return Unsupported(Some("Scalar values are not supported for the str argument"))
}
if (!expr.pad.isInstanceOf[Literal]) {
Contributor

I don't know if we'll ever hit this. As far as I can see (in functions.lpad), Spark expects the pad argument to be a literal as well.

Member Author

Spark doesn't require pad to be a literal:

scala> spark.sql("select a, lpad('foo', 6, a) from t1").show
+---+---------------+
|  a|lpad(foo, 6, a)|
+---+---------------+
|  $|         $$$foo|
|  @|         @@@foo|
+---+---------------+

Member Author

case class StringRPad(str: Expression, len: Expression, pad: Expression = Literal(" "))

Member Author

I suppose this is a good example of the benefit of fuzz testing (which is how this issue was discovered). The fuzzer will generate test cases that most developers would not consider. It does seem unlikely that anyone would want to use a column for the pad value, but I suppose it is possible that someone may have that requirement.
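
To illustrate what a column-valued pad means, the sketch below applies a different pad string per row to a scalar input, mirroring the `lpad('foo', 6, a)` query above. It is only an illustration: `lpad_with_pad_column` is a hypothetical name, and this PR falls back to Spark for this case rather than implementing it natively.

```rust
/// Left-pads the scalar `s` to `length` characters once per row, taking the
/// pad string for each row from `pads` (hypothetical helper).
/// Assumption: an empty pad string can only truncate, never extend.
fn lpad_with_pad_column(s: &str, length: usize, pads: &[&str]) -> Vec<String> {
    let have = s.chars().count();
    pads.iter()
        .map(|pad| {
            if have >= length || pad.is_empty() {
                return s.chars().take(length).collect::<String>();
            }
            // Repeat this row's pad until the missing characters are filled.
            let mut out: String = pad.chars().cycle().take(length - have).collect();
            out.push_str(s);
            out
        })
        .collect()
}
```

Calling `lpad_with_pad_column("foo", 6, &["$", "@"])` yields `["$$$foo", "@@@foo"]`, matching the two rows in the Spark output shown earlier in this thread.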


// test all combinations of scalar and array arguments
for (str <- Seq("'hello'", "str")) {
for (len <- Seq("6", "-6", "0", "len % 10")) {
Contributor

👍

df.createOrReplaceTempView("t1")

// test all combinations of scalar and array arguments
for (str <- Seq("'hello'", "str")) {
Contributor

@hsiang-c hsiang-c Oct 23, 2025

The Spark docs say it also supports binary string input, e.g. unhex('aabb').

Member Author

Good catch, thanks. That opens up another set of issues! 😭

Member Author

I added separate tests for binary inputs
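
Binary inputs need separate handling because arbitrary bytes are not necessarily valid UTF-8, so a string-based code path cannot simply be reused. A minimal sketch of byte-oriented padding, assuming the length counts bytes rather than characters (`lpad_bytes` is a hypothetical name, not a function from this PR):

```rust
/// Left-pads `input` to `length` bytes by repeating `pad`, truncating when
/// the input is already long enough (hypothetical helper).
/// Assumption: an empty pad can only truncate, never extend.
fn lpad_bytes(input: &[u8], length: usize, pad: &[u8]) -> Vec<u8> {
    if input.len() >= length || pad.is_empty() {
        return input[..length.min(input.len())].to_vec();
    }
    // Cycle through the pad bytes until the missing prefix is filled.
    let mut out: Vec<u8> = pad
        .iter()
        .copied()
        .cycle()
        .take(length - input.len())
        .collect();
    out.extend_from_slice(input);
    out
}
```

For example, padding the bytes `aa bb` to length 5 with the pad `11 22` produces `11 22 11 aa bb` under these assumptions.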

Contributor

@hsiang-c hsiang-c left a comment

LGTM

@andygrove andygrove merged commit 00922cf into apache:main Oct 24, 2025
102 checks passed

Development

Successfully merging this pull request may close these issues.

lpad/rpad fail with capacity overflow due to overflow
Fuzz test failure: lpad/rpad fail at runtime with "unsupported arguments"

7 participants