Issue 321: Pre-commit hook #319

Merged: 8 commits, Jun 19, 2024


Member

@Jiaweihu08 Jiaweihu08 commented Apr 23, 2024

Pre-commit Hooks

This PR introduces pre-commit hooks for the qbeast format. A hook runs custom code just before a write or optimization is committed. More details are described in Issue #321.

Fixes #321 and #323

A hook should extend io.qbeast.spark.delta.hook.PreCommitHook and override the run method, which receives the sequence of Actions created by the operation. The method returns a Map[String, String], whose entries are used as tags in the transaction's CommitInfo.
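
To illustrate the contract described above, here is a minimal sketch of a hook that inspects the actions it receives and reports them as tags. The Action types, the PreCommitHook trait, and the PreCommitHookOutput alias are simplified stand-ins declared locally so the snippet compiles on its own; in a real project they would come from Delta and io.qbeast.spark.delta.hook, and Delta's real AddFile and RemoveFile carry many more fields. ActionCountHook and its tag keys are hypothetical names chosen for this example.

```scala
object ActionCountSketch {

  // Local stand-ins (assumptions) for Delta's Action types.
  sealed trait Action
  final case class AddFile(path: String) extends Action
  final case class RemoveFile(path: String) extends Action

  // Local stand-ins (assumptions) mirroring the hook API described in this PR.
  type PreCommitHookOutput = Map[String, String]

  trait PreCommitHook {
    val name: String
    def run(actions: Seq[Action]): PreCommitHookOutput
  }

  // Hypothetical hook: counts added and removed files and returns the counts as tags.
  class ActionCountHook extends PreCommitHook {

    override val name: String = "ActionCountHook"

    override def run(actions: Seq[Action]): PreCommitHookOutput = {
      val added = actions.count(_.isInstanceOf[AddFile])
      val removed = actions.count(_.isInstanceOf[RemoveFile])
      Map("addedFiles" -> added.toString, "removedFiles" -> removed.toString)
    }

  }

  def main(args: Array[String]): Unit = {
    val actions = Seq(AddFile("a.parquet"), AddFile("b.parquet"), RemoveFile("c.parquet"))
    println(new ActionCountHook().run(actions))
  }

}
```

Tag values are Strings, so numeric results have to be converted before they are returned.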

Hooks for Writes:

  1. You can use more than one hook, as shown in the example below with myHook1 and myHook2.
  2. For each hook you want to use, provide its class name with the option name: qbeastPreCommitHook.<custom-hook-name>.
  3. For hooks that take an initialization argument, add an option with the name qbeastPreCommitHook.<custom-hook-name>.arg. Currently, only one String argument is allowed for each hook.
(df
  .write
  .format("qbeast")
  .option("qbeastPreCommitHook.myHook1", classOf[SimpleHook].getCanonicalName)
  .option("qbeastPreCommitHook.myHook2", classOf[StatefulHook].getCanonicalName)
  .option("qbeastPreCommitHook.myHook2.arg", myStringHookArg)
  .save(pathToTable)
)
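
The option naming convention above can be sketched as follows. This is not the qbeast-spark implementation, only an illustration of how such options could be turned into hook instances through reflection. Action and PreCommitHook are local stand-ins, StatefulHook is a guess at what a hook with one String argument might look like, and HookInfo, parseHookOptions, and instantiate are hypothetical helpers. Because the classes here are nested inside an object, the sketch uses getName rather than getCanonicalName, since Class.forName needs the binary name for nested classes.

```scala
object HookOptionsSketch {

  // Local stand-ins (assumptions) for the Delta and qbeast types.
  trait Action
  type PreCommitHookOutput = Map[String, String]

  trait PreCommitHook {
    val name: String
    def run(actions: Seq[Action]): PreCommitHookOutput
  }

  class SimpleHook extends PreCommitHook {
    override val name: String = "SimpleHook"
    override def run(actions: Seq[Action]): PreCommitHookOutput = Map("clsName" -> name)
  }

  // Hypothetical hook that takes a single String constructor argument.
  class StatefulHook(arg: String) extends PreCommitHook {
    override val name: String = "StatefulHook"
    override def run(actions: Seq[Action]): PreCommitHookOutput = Map("arg" -> arg)
  }

  private val Prefix = "qbeastPreCommitHook."
  private val ArgSuffix = ".arg"

  // Hypothetical record: hook name, class name, and optional String argument.
  final case class HookInfo(name: String, clsFullName: String, arg: Option[String])

  // Keep the keys with the hook prefix and pair each with its optional ".arg" entry.
  def parseHookOptions(options: Map[String, String]): Seq[HookInfo] =
    options.toSeq
      .collect {
        case (key, cls) if key.startsWith(Prefix) && !key.endsWith(ArgSuffix) =>
          HookInfo(key.stripPrefix(Prefix), cls, options.get(key + ArgSuffix))
      }
      .sortBy(_.name)

  // Use the one-String constructor when an argument is given, otherwise the no-arg one.
  def instantiate(info: HookInfo): PreCommitHook = {
    val cls = Class.forName(info.clsFullName)
    val instance = info.arg match {
      case Some(a) => cls.getDeclaredConstructor(classOf[String]).newInstance(a)
      case None => cls.getDeclaredConstructor().newInstance()
    }
    instance.asInstanceOf[PreCommitHook]
  }

  def main(args: Array[String]): Unit = {
    val options = Map(
      "columnsToIndex" -> "value",
      "qbeastPreCommitHook.myHook1" -> classOf[SimpleHook].getName,
      "qbeastPreCommitHook.myHook2" -> classOf[StatefulHook].getName,
      "qbeastPreCommitHook.myHook2.arg" -> "myStringHookArg")
    val hooks = parseHookOptions(options).map(instantiate)
    // Merging the outputs with ++ is an assumption made for this sketch.
    println(hooks.map(_.run(Seq.empty)).reduce(_ ++ _))
  }

}
```

Unrelated options such as columnsToIndex are ignored because they lack the qbeastPreCommitHook. prefix.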

Hooks for Optimizations:

val qt = QbeastTable.forPath(spark, tablePath)
val options = Map(
  "qbeastPreCommitHook.myHook1" -> classOf[SimpleHook].getCanonicalName,
  "qbeastPreCommitHook.myHook2" -> classOf[StatefulHook].getCanonicalName,
  "qbeastPreCommitHook.myHook2.arg" -> "myStringHookArg"
)
qt.optimize(filesToOptimize, options)

Example:

  1. Create your custom hook
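// Package chosen to match the import used when writing data (io.test.hook.SimpleHook)
package io.test.hook

import io.qbeast.spark.delta.hook.PreCommitHook
import io.qbeast.spark.delta.hook.PreCommitHook.PreCommitHookOutput
import org.apache.spark.sql.delta.actions.Action

class SimpleHook extends PreCommitHook {

  override val name: String = "SimpleHook"

  // The returned entries end up as tags in the commit's CommitInfo
  override def run(actions: Seq[Action]): PreCommitHookOutput = {
    Map("clsName" -> "SimpleHook")
  }

}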
  2. Add the hook project jar to the spark-shell session
$SPARK_HOME350/bin/spark-shell \
--master "local[1]" \
--jars <path-to-qbeast-spark-jar>,<path-to-hook-jar> \
--packages io.delta:delta-spark_2.12:3.1.0 \
--conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
  3. Write data with a SimpleHook
import io.test.hook.SimpleHook
import spark.implicits._

val tmpDir = "/tmp/test"
val df = spark.sparkContext.range(0, 100).toDF()

(df
  .write
  .mode("append")
  .format("qbeast")
  .option("columnsToIndex", "value")
  .option("qbeastPreCommitHook.hook", classOf[SimpleHook].getCanonicalName)
  .save(tmpDir)
)
  4. Check results
cat /tmp/test/_delta_log/00000000000000000000.json | jq
{
  "commitInfo": {
    ...,
    "tags": {
      "clsName": "SimpleHook"
    },
    ...
  }
}

@Jiaweihu08 Jiaweihu08 requested review from cugni and osopardo1 April 23, 2024 08:46
@fpj
Contributor

fpj commented Apr 23, 2024

What's the issue number describing what issue this is resolving?

@osopardo1
Member

What's the issue number describing what issue this is resolving?

There's none. I will open one 👍

@osopardo1 osopardo1 changed the title from "Pre-commit hook" to "Issue 321: Pre-commit hook" Apr 23, 2024
Member

@osopardo1 osopardo1 left a comment

Few comments on the code:

@osopardo1
Member

The code seems ready to review, but we will need to provide a better description of the feature, alongside logging and documentation.

@osopardo1 osopardo1 marked this pull request as ready for review May 13, 2024 09:16
@Jiaweihu08 Jiaweihu08 requested a review from osopardo1 May 24, 2024 15:30
codecov bot commented May 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.44%. Comparing base (d38a94c) to head (eafe4cc).
Report is 6 commits behind head on main.

Current head eafe4cc differs from pull request most recent head 8d2900f

Please upload reports for the commit 8d2900f to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #319      +/-   ##
==========================================
+ Coverage   91.32%   91.44%   +0.11%     
==========================================
  Files          98       99       +1     
  Lines        2652     2688      +36     
  Branches      343      338       -5     
==========================================
+ Hits         2422     2458      +36     
  Misses        230      230              


Member

@osopardo1 osopardo1 left a comment

In general looks very good!! Just a few things on documentation/options..

@Jiaweihu08 Jiaweihu08 requested a review from osopardo1 June 18, 2024 12:42
@osopardo1 osopardo1 merged commit 523c7f1 into Qbeast-io:main Jun 19, 2024
1 check passed
@Jiaweihu08 Jiaweihu08 deleted the precommit-hooks branch June 19, 2024 10:44
Development

Successfully merging this pull request may close these issues.

Add Commit Hooks to write extra information within the same transaction
3 participants