Add support for org.apache.spark.sql.catalyst.expressions.ArrayPosition [databricks] #12308

Open: ustcfy wants to merge 4 commits into base: branch-25.04


@ustcfy ustcfy commented Mar 11, 2025

Closes #5224

This PR adds support for org.apache.spark.sql.catalyst.expressions.ArrayPosition.
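
For readers unfamiliar with the expression, here is a rough reference model of what `array_position` computes in Spark SQL: the 1-based index of the first element equal to the key, 0 if the key is absent, and null if the array or the key is null. This is only a sketch in plain Scala; `ArrayPositionSketch` and its signature are hypothetical names for illustration, not code from this PR, and it does not model null elements inside the array or floating-point edge cases.

```scala
object ArrayPositionSketch {
  // Hypothetical reference model, not the plugin's implementation.
  // None stands in for a SQL null array or a SQL null key.
  def arrayPosition[T](arr: Option[Seq[T]], key: Option[T]): Option[Long] =
    for {
      a <- arr
      k <- key
    } yield a.indexOf(k) + 1L // indexOf returns -1 when absent, so the result is 0
}
```

For example, `ArrayPositionSketch.arrayPosition(Some(Seq(3, 2, 1)), Some(1))` yields `Some(3)`, and a missing key yields `Some(0)`.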

I ran a simple performance test five times:

// use big data gen
import org.apache.spark.sql.tests.datagen._
val dataTable = DBGen().addTable("data", "a array<byte>, b byte", 10000000)
dataTable.toDF(spark).write.mode("OVERWRITE").parquet("PERF")

// spark-rapids
val df = spark.read.parquet("PERF")
// `key` is a scalar
spark.time(df.selectExpr("max(array_position(a, 0))").show())
// `key` is a column
spark.time(df.selectExpr("max(array_position(a, b))").show())

Results:

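  • key is a scalar

    |     | Run 1  | Run 2  | Run 3  | Run 4  | Run 5  |
    |-----|--------|--------|--------|--------|--------|
    | GPU | 208 ms | 209 ms | 263 ms | 265 ms | 225 ms |
    | CPU | 898 ms | 985 ms | 996 ms | 953 ms | 868 ms |

  • key is a column

    |     | Run 1  | Run 2  | Run 3  | Run 4  | Run 5  |
    |-----|--------|--------|--------|--------|--------|
    | GPU | 212 ms | 237 ms | 220 ms | 211 ms | 221 ms |
    | CPU | 956 ms | 994 ms | 892 ms | 914 ms | 932 ms |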

These results show that GpuArrayPosition (on a TITAN RTX) is approximately 4 times faster than the CPU ArrayPosition (on 8 CPU cores).
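
The "approximately 4 times" figure can be checked from the run times listed above. The sketch below is just that arithmetic in plain Scala; `SpeedupSketch` and its member names are made up for illustration, and the speedup is taken as the ratio of mean CPU time to mean GPU time over the five runs.

```scala
object SpeedupSketch {
  // Run times in milliseconds, copied from the results above.
  val gpuScalar: Seq[Int] = Seq(208, 209, 263, 265, 225)
  val cpuScalar: Seq[Int] = Seq(898, 985, 996, 953, 868)
  val gpuColumn: Seq[Int] = Seq(212, 237, 220, 211, 221)
  val cpuColumn: Seq[Int] = Seq(956, 994, 892, 914, 932)

  def mean(xs: Seq[Int]): Double = xs.sum.toDouble / xs.size

  // Ratio of mean CPU time to mean GPU time.
  def speedup(cpu: Seq[Int], gpu: Seq[Int]): Double = mean(cpu) / mean(gpu)

  def main(args: Array[String]): Unit = {
    println(f"scalar key: ${speedup(cpuScalar, gpuScalar)}%.2fx") // about 4.02x
    println(f"column key: ${speedup(cpuColumn, gpuColumn)}%.2fx") // about 4.26x
  }
}
```

With only five runs and wall-clock timing from `spark.time`, these ratios are a rough indication rather than a precise benchmark.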

Signed-off-by: Yan Feng <fengyan_@mail.ustc.edu.cn>
@ustcfy ustcfy self-assigned this Mar 11, 2025
@firestarman firestarman changed the title from "Add support for org.apache.spark.sql.catalyst.expressions.ArrayPosition" to "Add support for org.apache.spark.sql.catalyst.expressions.ArrayPosition [databricks]" Mar 11, 2025
…ypeExtractors.scala

Co-authored-by: Liangcai Li <firestarmanllc@gmail.com>
Comment on lines 2707 to 2709
(in, conf, p, r) => new BinaryExprMeta[ArrayPosition](in, conf, p, r) {
  override def convertToGpu(lhs: Expression, rhs: Expression): GpuExpression =
    GpuArrayPosition(lhs, rhs)

Anonymous classes are unnecessarily difficult to debug when the conversion code throws; see #10838.

See https://github.com/NVIDIA/spark-rapids/pull/10839/files for an example.
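
To illustrate the general concern, here is a minimal, self-contained sketch comparing an anonymous subclass with a named one. My assumption is that the debugging difficulty comes from the compiler-synthesized class name that shows up in stack traces and logs. All types below (`ExprMetaSketch`, `ArrayPositionMetaSketch`, `NamedVsAnonymous`) are hypothetical stand-ins, not the real spark-rapids `BinaryExprMeta` API or the code from the linked PR.

```scala
// Hypothetical stand-in for a meta base class; not the real spark-rapids type.
abstract class ExprMetaSketch(val exprName: String) {
  def convertToGpu(): String
}

// Named subclass: its own name appears in getClass.getName and in stack frames.
class ArrayPositionMetaSketch extends ExprMetaSketch("ArrayPosition") {
  override def convertToGpu(): String = s"Gpu$exprName"
}

object NamedVsAnonymous {
  // Anonymous subclass: the JVM class name is synthesized by the compiler
  // from the enclosing scope, so it says little about which expression it handles.
  val anonymousMeta: ExprMetaSketch = new ExprMetaSketch("ArrayPosition") {
    override def convertToGpu(): String = s"Gpu$exprName"
  }

  val namedMeta: ExprMetaSketch = new ArrayPositionMetaSketch

  def main(args: Array[String]): Unit = {
    // Print both runtime class names to compare how they would read in a stack trace.
    println(anonymousMeta.getClass.getName)
    println(namedMeta.getClass.getName)
  }
}
```

Both objects behave identically; only the runtime class name differs, which is what matters when reading a stack trace from a failed conversion.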

ustcfy added 2 commits March 11, 2025 14:11
Signed-off-by: Yan Feng <fengyan_@mail.ustc.edu.cn>
Signed-off-by: Yan Feng <fengyan_@mail.ustc.edu.cn>

@firestarman firestarman left a comment

Looks good to me


ustcfy commented Mar 12, 2025

build

Successfully merging this pull request may close these issues.

[FEA]Support array_position
4 participants