Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace udf/udaf with Spark SQL #70

Merged
merged 1 commit into from
Feb 2, 2022
Merged

Replace udf/udaf with Spark SQL #70

merged 1 commit into from
Feb 2, 2022

Conversation

Jiaweihu08
Copy link
Member

Replace MaxWeightEstimation used in DoublePassOTreeDataAnalyzer.analyze with a simple Spark SQL code lit(1) / sum(lit(1.0) / col("normalizedWeight").

The latter is capable of leveraging vectorization, and reduce the time spent on GC.

During a testing, this change reduced the query time from 2.2h to 46mins, as shown in Stage 22 from the following image.
image

Test code:

val spark = SparkSession.builder().getOrCreate()

val delta = spark.read
  .format("delta")
  .load(
    "s3://qbeast-benchmarking-us-east-1/datasets/1000gb/delta/original-1tb-delta/store_sales")

val current_partitions = delta.rdd.getNumPartitions
val target_partitions = current_partitions * 4

// scalastyle:off println
println(s"Current num of partitions: $current_partitions")
println(s"Target num of partitions: $target_partitions")

val delta_repartitioned = delta.repartition(target_partitions)

val qid = QTableID("jiawei_rules")

val revision = SparkRevisionFactory
  .createNewRevision(
    qid,
    delta_repartitioned.schema,
    Map("columnsToIndex" -> "ss_item_sk,ss_ticket_number"))

val indexStatus = IndexStatus.empty(revision)

// scalastyle:on println
DoublePassOTreeDataAnalyzer.analyze(delta_repartitioned, indexStatus, false)

Cluster from EMR, 3 workers of c5.xlarge EC2 instances.

@Jiaweihu08 Jiaweihu08 requested a review from cugni February 2, 2022 11:07
Copy link
Member

@cugni cugni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much testing, for such a little change :) Great work 🔝

@cugni cugni merged commit 9f628ce into Qbeast-io:main Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants