@JiaqiWang18 (Contributor) commented Sep 30, 2025

What changes were proposed in this pull request?

Prototype PR for an explicit checkpoint location that follows the format below:

checkpoints-root/
      ├── myst/                            # Table "myst"
      │    ├── flow1/                      # Flow to myst
      │    │    ├── 0/                    # Versioned checkpoint (0)
      │    │    │    ├── commits/
      │    │    │    ├── offsets/
      │    │    │    └── sources/
      │    │    └── 1/                    # Versioned checkpoint (1)
      │    │
      │    └── flow2/                     # Another flow to myst
      │         ├── 0/                    # Versioned checkpoint (0)
      │         │    ├── commits/
      │         │    ├── offsets/
      │         │    └── sources/
      │         └── 1/                    # Versioned checkpoint (1)
      │
      └── mysink/                         # Sink "mysink"
            └── flowA/                     # Flow to mysink
                 ├── 0/                    # Versioned checkpoint (0)
                 │    ├── commits/
                 │    ├── offsets/
                 │    └── sources/
                 └── 1/                    # Versioned checkpoint (1)
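
To make the layout concrete, here is a minimal Scala sketch of how a versioned checkpoint path could be derived from it. The helper name and its parameters are hypothetical illustrations of the layout, not this PR's actual API:

// Hypothetical helper mirroring the layout above:
// <root>/<table-or-sink>/<flow>/<version>/{commits,offsets,sources}
def checkpointPath(
    storageRoot: String,  // e.g. "checkpoints-root"
    datasetName: String,  // table ("myst") or sink ("mysink") name
    flowName: String,     // e.g. "flow1"
    version: Int          // versioned checkpoint: 0, 1, ...
): String = s"$storageRoot/$datasetName/$flowName/$version"

Because the path is a pure function of names plus a version, a flow can presumably get a fresh checkpoint directory by bumping the version, without disturbing earlier checkpoints.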

Backend changes should be mostly finalized, and all tests should pass.

For the user-facing API, use a spark.sql.pipelines.storageRoot conf for now, as it is easier to revert.
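
Presumably this is set like any other Spark SQL conf; the key is taken from the description above, while the path value is purely illustrative:

// Assumed usage of the conf named in this PR; its exact semantics
// (default value, validation) are defined by the PR itself.
spark.conf.set("spark.sql.pipelines.storageRoot", "/mnt/pipelines/checkpoints-root")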

Why are the changes needed?

- To unblock development of sinks
- Path toward supporting multi-flow

Does this PR introduce any user-facing change?

Yes: it adds a new spark.sql.pipelines.storageRoot configuration for specifying the checkpoint storage root.

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

val resolvedGraph = resolveGraph()
if (context.fullRefreshTables.nonEmpty) {
  State.reset(resolvedGraph, context)
}
@JiaqiWang18 (author) commented:

With an explicit storage location for checkpoints, we shouldn't need to create the tables and obtain their paths beforehand; resolvedGraph should suffice.
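
A rough sketch of why resolvedGraph can suffice: each flow's checkpoint directory is now a pure function of the storage root and the names in the resolved graph, so a reset can operate on paths without materializing tables first. Everything below (the method, flows as (dataset, flow) name pairs, the Hadoop FileSystem calls) is illustrative rather than this PR's actual State.reset implementation:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical reset: derive each flow's checkpoint dir from names alone
// and delete it recursively; no table has to exist yet.
def resetCheckpoints(
    storageRoot: String,
    flows: Seq[(String, String)],  // (dataset name, flow name) pairs
    hadoopConf: Configuration): Unit = {
  flows.foreach { case (datasetName, flowName) =>
    val dir = new Path(s"$storageRoot/$datasetName/$flowName")
    val fs = dir.getFileSystem(hadoopConf)
    fs.delete(dir, true)  // removes all versioned checkpoints under the flow
  }
}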

case (true, _) => // Already performed reset for full-refresh MV/ST - no-op
case (false, true) => // Incremental refresh of an ST - no-op
case (false, false) => // Incremental refresh of an MV - truncate
  context.spark.sql(s"TRUNCATE TABLE ${table.identifier.quotedString}")
@JiaqiWang18 (author) commented:

This match exists to avoid calling TRUNCATE twice for full-refresh MVs.
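
For context, the surrounding logic reads as a tuple match roughly like the reconstruction below; the scrutinee names isFullRefresh and isStreamingTable are assumed from the case comments, not taken from the PR:

// Hypothetical reconstruction of the surrounding match.
(isFullRefresh, isStreamingTable) match {
  case (true, _) =>
    // Full refresh of an MV/ST: State.reset already handled it - no-op.
  case (false, true) =>
    // Incremental refresh of an ST: streaming appends - no-op.
  case (false, false) =>
    // Incremental refresh of an MV: recompute, so truncate first.
    context.spark.sql(s"TRUNCATE TABLE ${table.identifier.quotedString}")
}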
