[FLINK-31238] Use IngestDB to speed up Rocksdb rescaling recovery (rebased) #24031
Conversation
for (Pair<RegisteredStateMetaInfoBase, ExportImportFilesMetaData> entry :
        exportedCFAndMetaData) {
    ExportImportFilesMetaData cfMetaData = entry.getValue();
    // TODO: method files() doesn't exist in the RocksDB API
@mayuehappy Is it correct to remove this code? I could not find a files() method in the RocksDB API.
Thanks for rebasing the PR!
I've left some comments, PTAL.
Besides that, I think it makes sense to test the change for correctness using ITCase randomization, WDYT?
(see TestStreamEnvironment.randomizeConfiguration)
Resolved review threads (outdated unless noted otherwise):
- ...ksdb/src/main/java/org/apache/flink/contrib/streaming/state/EmbeddedRocksDBStateBackend.java (outdated)
- ...rc/main/java/org/apache/flink/contrib/streaming/state/RocksDBIncrementalCheckpointUtils.java (outdated)
- ...nd-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBOperationUtils.java (two threads, outdated)
- ...t/java/org/apache/flink/contrib/streaming/state/RocksIncrementalCheckpointRescalingTest.java (two threads)
- ...ava/org/apache/flink/contrib/streaming/state/restore/RocksDBIncrementalRestoreOperation.java (two threads, outdated)
- ...nd-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/restore/RocksDBHandle.java (two threads, outdated)
@rkhachatryan Can you take one more look? I have fixed all problems that I encountered in the original PR and also addressed your comments.
Thanks for updating the PR @StefanRRichter,
LGTM in general.
Regarding randomised testing: using TestStreamEnvironment.randomizeConfiguration seems a bit problematic because the RocksDB config keys are not visible in that module.
Do I understand correctly, that adding rocksdb module as a test dependency would create a cycle?
In that case, should we just use string constants to enable this feature randomly?
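The string-constant approach suggested above can be sketched without a compile-time dependency on the rocksdb module. The following is a minimal, self-contained model of the idea (the config key name and the `Map`-based configuration are illustrative assumptions, not the PR's actual code; Flink's real `TestStreamEnvironment.randomizeConfiguration` operates on a `Configuration` object):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Sketch: randomly enable the ingest-DB restore mode via a raw string key,
// so the test module needs no dependency on the rocksdb backend module.
public class RandomizeIngestDbFlag {
    // Hypothetical key, mirroring Flink's "state.backend.*" naming convention.
    static final String USE_INGEST_DB_KEY =
            "state.backend.rocksdb.use-ingest-db-restore-mode";

    static Map<String, String> randomizeConfiguration(Map<String, String> conf, Random rnd) {
        // Only randomize the flag if the test did not configure it explicitly.
        conf.putIfAbsent(USE_INGEST_DB_KEY, Boolean.toString(rnd.nextBoolean()));
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        randomizeConfiguration(conf, new Random());
        System.out.println(conf);
    }
}
```

The trade-off of string constants is that a rename of the real config key silently breaks the randomization, so a comment pointing at the owning option class would be advisable.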
.setEnableIncrementalCheckpointing(enableIncrementalCheckpointing)
.setUseIngestDbRestoreMode(useIngestDB);
I see that testCorrectMergeOperatorSet sets up the backend independently of this method - we could also add setUseIngestDbRestoreMode there.
if (exportedSstFiles != null && exportedSstFiles.length > 0) {
    resultOutput
            .computeIfAbsent(stateMetaInfo, (key) -> new ArrayList<>())
            .add(cfMetaData);
}
else cfMetaData.close();
?
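The suggestion is to close metadata objects for column families that exported no SST files, since they wrap native resources. A minimal, self-contained model of that pattern (the `CfMetaData` class below is a stand-in for RocksDB's `ExportImportFilesMetaData`, which is `AutoCloseable`; names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: group non-empty column-family exports, and eagerly close the
// metadata of empty exports so native handles are not leaked.
public class ExportGrouping {
    // Stand-in for org.rocksdb.ExportImportFilesMetaData.
    static class CfMetaData implements AutoCloseable {
        final List<String> files;
        boolean closed;
        CfMetaData(List<String> files) { this.files = files; }
        @Override public void close() { closed = true; }
    }

    static Map<String, List<CfMetaData>> group(Map<String, CfMetaData> exported) {
        Map<String, List<CfMetaData>> result = new HashMap<>();
        for (Map.Entry<String, CfMetaData> e : exported.entrySet()) {
            CfMetaData cfMetaData = e.getValue();
            if (!cfMetaData.files.isEmpty()) {
                result.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(cfMetaData);
            } else {
                // Suggested fix: free native resources of exports that will
                // never be imported.
                cfMetaData.close();
            }
        }
        return result;
    }
}
```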
One more thing, we should probably update the UI and show the new flag under …
@@ -488,6 +495,8 @@ private void rescaleClipIngestDB(
    List<ColumnFamilyHandle> tmpColumnFamilyHandles =
            tmpRestoreDBInfo.columnFamilyHandles;

    // Check if the data in all SST files referenced in the handle is within the
    // proclaimed key-groups range of the handle.
    if (RocksDBIncrementalCheckpointUtils.isSstDataInKeyGroupRange(
Here we need to decide what to check: is the key in the proclaimed range, or is there overlap between the state handles being checked?
For example, the proclaimed ranges are [1,5] and [6,10], but the actual ranges are [1,7] and [8,9]. Should export be possible in this case?
It is possible to import the case in your example, but the cost of detecting it is that we would need to compare all handles against each other, which we can only do after all DBs have been opened. I'm excluding this case on purpose, because I also assume these missed opportunities will be very rare. In particular, subtasks will over time compact their data down to the proclaimed range through normal compaction activity.
The reason I say it's rare: if the actual range is just [8,9], it means that while the handle [6,10] was created by a task with range [6,10], we never wrote a key that falls into key-group 6 or 7. Because of the hash-partitioning nature, that is highly unlikely unless the task was idle the whole time.
haha, that makes sense to me
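The per-handle eligibility check discussed above can be modeled with plain inclusive key-group ranges. This is a simplified, self-contained sketch (not Flink's actual `isSstDataInKeyGroupRange`, which inspects SST key boundaries): a handle's data can be imported directly only if its actual key range lies within its proclaimed range, and cross-handle overlap is deliberately left undetected since that would require comparing all opened DBs pairwise.

```java
// Minimal model of the per-handle check: key groups are inclusive int ranges.
public class KeyGroupRangeCheck {
    record Range(int start, int end) {
        boolean contains(Range other) {
            return start <= other.start && other.end <= end;
        }
    }

    static boolean isSstDataInProclaimedRange(Range proclaimed, Range actualSstRange) {
        return proclaimed.contains(actualSstRange);
    }

    public static void main(String[] args) {
        // Actual data [1,7] spills over the proclaimed range [1,5]: not eligible,
        // even though the sibling handle's actual range [8,9] left [6,7] unused.
        System.out.println(isSstDataInProclaimedRange(new Range(1, 5), new Range(1, 7)));
        // Actual data [8,9] fits inside the proclaimed range [6,10]: eligible.
        System.out.println(isSstDataInProclaimedRange(new Range(6, 10), new Range(8, 9)));
    }
}
```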
// if we have remaining handles to restore, we will insert by copy with from temporary
// instances to base DB.
// if we have remaining unopened handles to restore, we will insert by copy via
// temporary instances to base DB.
In the old code, we called chooseTheBestStateHandleForInitial to choose the best state handle to initialize the base DB, because if we use the best handle as a temporary DB instead of the main DB, we may need to write a lot of data when copying. Can we maintain this logic in the new code?
When choosing which DB to export, we could prioritize the best handle to ensure that we will not copy it during the subsequent copying phase. Can this ensure that there is no regression compared to the old code?
Yes, I'm already working on adding that code again here.
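The heuristic being discussed can be sketched in a self-contained way: pick the state handle whose key-group range overlaps the target range the most, and open the base DB from it so its data never has to be copied. This is a simplified stand-in for Flink's `RocksDBIncrementalCheckpointUtils.chooseTheBestStateHandleForInitial` (the real method scores actual state handles, not bare ranges):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch: choose the handle with the largest key-group overlap with the
// target range as the base DB, minimizing data copied during restore.
public class BestHandleChoice {
    record Range(int start, int end) {
        // Size of the inclusive intersection with another range (0 if disjoint).
        int overlap(Range other) {
            return Math.max(0, Math.min(end, other.end) - Math.max(start, other.start) + 1);
        }
    }

    static Optional<Range> chooseBestHandle(List<Range> handleRanges, Range target) {
        return handleRanges.stream()
                .max(Comparator.comparingInt(r -> r.overlap(target)));
    }
}
```

For a target of [4,12] and handles [1,5], [6,10], [11,12], the overlaps are 2, 5, and 2, so [6,10] becomes the base DB and only the other two need to be exported/copied in.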
        stateHandle);
// Use range delete to clip the temp db to the target range of the backend
RocksDBIncrementalCheckpointUtils.clipDBWithKeyGroupRange(
Another point: if these handles do not overlap, can we first import them and only then clip them with deleteRange?
In the current code we first deleteRange and then export, so during export each DB generates a new small file containing the range-deletion tombstone. If we instead did a single deleteRange after importing, we could reduce the number of small files.
Makes sense to try that. Your code was doing the delete before the export, so I was also wondering about this but didn't bother to change it. Let me try that.
I changed the code accordingly. It didn't work at first because the import code didn't add the ColumnFamilyHandles to the DbHandle. Now it seems to work :)
Broadly speaking, LGTM modulo other reviewers' open issues.
Rebased and slightly refactored version of #23169.