Conversation

@ankitsol

The incremental backup collects all the WALs generated since the last full or incremental backup and processes them into HFiles for optimized restoration. With the new continuous backup feature, the necessary WALs are already available at the backup location and are used directly, eliminating the need to retrieve them from the source cluster.

The incremental backup command itself remains unchanged; the behavior depends on the previous backups:

- If previous backups are part of a continuous backup: the WALs already backed up in the backup location are used.
- If previous backups are not part of a continuous backup: it operates as a traditional incremental backup, collecting WALs from the source cluster.
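For example, assuming a backup root of /backup/root and a table named table1 (both hypothetical values), the same command covers both cases:

hbase backup create incremental /backup/root -t table1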

JIRA: https://issues.apache.org/jira/browse/HBASE-28990

vinayakphegde and others added 3 commits February 18, 2025 11:45
…p to External Storage (apache#6633)

* HBASE-28996: Implement Custom ReplicationEndpoint to Enable WAL Backup to External Storage

* fix spotless error
…ckup (apache#6710)

* HBASE-29025: Enhance the full backup command to support continuous backup

* add new check for full backup command regards to continuous backup flag

* minor fixes
@ankitsol ankitsol marked this pull request as ready for review March 19, 2025 05:35

Contributor

@vinayakphegde vinayakphegde left a comment

I was able to review the implementation (excluding the tests) and have shared a few comments; please take a look.

Regarding the handling of bulkloaded files in incremental backup:
We're still aiming to preserve the bulkloaded files in the source cluster and use those, instead of relying on the copies placed in the backup system via continuous backup, even when continuous backup is enabled. Let me know your thoughts on this approach.

* start time of continuous backup for that table.
* @throws IOException if an I/O error occurs while accessing the backup system table.
*/
public Map<TableName, Long> getContinuousBackupTableSet() throws IOException {
Contributor

where are we using this method?

throw new IOException(msg);
}
}
tableList = Lists.newArrayList(incrTableSet);
Contributor

Hi @rmdmattingly, I had a quick question for clarification — this is not related to the changes in this PR, but rather about the current incremental backup process.

It seems like we're using incrTableSet to determine the tables for incremental backup. Could you help us understand why we consider all tables marked for incremental backup instead of just the user-specified tables?

Understanding the reasoning behind this will help us better integrate the continuous backup functionality with the existing logic. Appreciate your insights — thank you!

Author

Reason: [HBASE-14038] The incremental backup is controlled by the 'incremental backup table set'.
For example, if the table set contains (table1, table2, table3), an incremental backup will back up the WALs that cover all the tables in the set.
This avoids copying the same set of WALs, which would likely be the case if you backed up table1 and then backed up table2 separately.
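To illustrate with hypothetical paths and table names: if table1 and table2 are served by the same region servers, their edits are interleaved in the same WAL files, so running

hbase backup create incremental /backup/root -t table1
hbase backup create incremental /backup/root -t table2

as two separate backups would convert largely the same WALs twice; backing up the whole incremental table set in one pass avoids that duplication.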

Contributor

@kgeisz kgeisz left a comment

This looks good to me. I just have some minor suggestions.

private List<String> getBackupLogs(long startTs) throws IOException {
// get log files from backup dir
String walBackupDir = conf.get(CONF_CONTINUOUS_BACKUP_WAL_DIR);
if (walBackupDir == null || walBackupDir.isEmpty()) {
Contributor

nit - You can use Strings.isNullOrEmpty() from org.apache.hbase.thirdparty.com.google.common.base:

Suggested change:
- if (walBackupDir == null || walBackupDir.isEmpty()) {
+ if (Strings.isNullOrEmpty(walBackupDir)) {

Comment on lines 773 to 776
public static long getReplicationCheckpoint(Connection conn) throws IOException {
// TODO this will be fixed in PR https://github.com/apache/hbase/pull/6717
return System.currentTimeMillis();
}
Contributor

nit: remove the lines above, or just rebase your commit on top of the merged #6717


import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.*;
Contributor

ditto for * , please change your IDE setup

import static org.apache.hadoop.hbase.backup.BackupRestoreConstants.*;
import static org.apache.hadoop.hbase.backup.replication.ContinuousBackupReplicationEndpoint.CONF_STAGED_WAL_FLUSH_INITIAL_DELAY;
import static org.apache.hadoop.hbase.backup.replication.ContinuousBackupReplicationEndpoint.CONF_STAGED_WAL_FLUSH_INTERVAL;
import static org.junit.Assert.*;
Contributor

ditto for * , please change your IDE setup

*/
package org.apache.hadoop.hbase.backup;

import static org.apache.hadoop.hbase.backup.BackupRestoreConstants.*;
Contributor

nit: please do not use the * wildcard for imports

import org.apache.hadoop.hbase.backup.impl.BackupSystemTable;
import org.apache.hadoop.hbase.backup.impl.BulkLoad;
import org.apache.hadoop.hbase.backup.util.BackupUtils;
import org.apache.hadoop.hbase.client.*;
Contributor

ditto for * , please change your IDE setup

Comment on lines 83 to 91
@Before
public void beforeTest() throws IOException {
super.beforeTest();
}

@After
public void afterTest() throws IOException {
super.afterTest();
}
Contributor

nit: I don't think you need to call these here; the parent class should run them for you. Please consider removing them.

@taklwu
Contributor

taklwu commented Jun 5, 2025

Can you check those failing tests of TestIncrementalBackup?

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 39s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ HBASE-28957 Compile Tests _
+1 💚 mvninstall 4m 32s HBASE-28957 passed
+1 💚 compile 0m 38s HBASE-28957 passed
-0 ⚠️ checkstyle 0m 16s /buildtool-branch-checkstyle-hbase-backup.txt The patch fails to run checkstyle in hbase-backup
+1 💚 spotbugs 0m 46s HBASE-28957 passed
+1 💚 spotless 1m 2s branch has no errors when running spotless:check.
-0 ⚠️ patch 1m 11s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 4m 22s the patch passed
+1 💚 compile 0m 41s the patch passed
-0 ⚠️ javac 0m 41s /results-compile-javac-hbase-backup.txt hbase-backup generated 2 new + 109 unchanged - 0 fixed = 111 total (was 109)
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 12s /buildtool-patch-checkstyle-hbase-backup.txt The patch fails to run checkstyle in hbase-backup
+1 💚 spotbugs 0m 58s the patch passed
+1 💚 hadoopcheck 16m 42s Patch does not cause any errors with Hadoop 3.3.6 3.4.0.
+1 💚 spotless 1m 11s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 14s The patch does not generate ASF License warnings.
42m 32s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6788/8/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #6788
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname Linux 13822d3ad071 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision HBASE-28957 / 64fd7b8
Default Java Eclipse Adoptium-17.0.11+9
Max. process+thread count 83 (vs. ulimit of 30000)
modules C: hbase-backup U: hbase-backup
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6788/8/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 45s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ HBASE-28957 Compile Tests _
+1 💚 mvninstall 3m 37s HBASE-28957 passed
+1 💚 compile 0m 26s HBASE-28957 passed
+1 💚 javadoc 0m 15s HBASE-28957 passed
+1 💚 shadedjars 7m 1s branch has no errors when building our shaded downstream artifacts.
-0 ⚠️ patch 7m 11s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 52s the patch passed
+1 💚 compile 0m 23s the patch passed
+1 💚 javac 0m 23s the patch passed
+1 💚 javadoc 0m 16s the patch passed
+1 💚 shadedjars 6m 53s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
-1 ❌ unit 23m 10s /patch-unit-hbase-backup.txt hbase-backup in the patch failed.
47m 46s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6788/8/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #6788
Optional Tests javac javadoc unit compile shadedjars
uname Linux bbb94d8941b5 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision HBASE-28957 / 64fd7b8
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6788/8/testReport/
Max. process+thread count 3762 (vs. ulimit of 30000)
modules C: hbase-backup U: hbase-backup
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6788/8/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@ankitsol
Author

ankitsol commented Jun 6, 2025

Can you check those failing tests of TestIncrementalBackup?

@taklwu TestIncrementalBackup has 2 flaky tests (TestIncBackupRestoreWithOriginalSplits and TestIncBackupRestore); other JIRAs, such as HBASE-28461, are already tracking this test.

I have run these tests without my change and hit the same failures.

Contributor

@taklwu taklwu left a comment

+1. You made a good point that TestIncrementalBackup is flaky: https://ci-hbase.apache.org/job/HBase-Flaky-Tests/job/master/24305/testReport/

@taklwu taklwu requested review from kgeisz and vinayakphegde June 6, 2025 16:09
Contributor

@anmolnar anmolnar left a comment

lgtm. But let me repeat my concern about unit tests real quick:
#7007 (review)

Contributor

@kgeisz kgeisz left a comment

The PR looks good to me. I only have a few minor comments.

assertEquals("Backup history should increase", before + 1, backups.size());
for (BackupInfo data : List.of(backups.get(0))) {
String backupId = data.getBackupId();
incrementalBackupid = backupId;
Contributor

nit: Why do we need incrementalBackupid? Can't we just define backupId in place of incrementalBackupid above on Line 123 and use it down below on Line 135?

assertEquals(expectedRowCount, TEST_UTIL.countRows(table1));
List<BulkLoad> bulkloadsTemp = systemTable.readBulkloadRows(List.of(table1));
assertEquals(1, bulkloadsTemp.size());
BulkLoad bulk7 = bulkloadsTemp.get(0);
Contributor

Is this variable supposed to be used somewhere? If not, it can probably just be deleted.

if (list != null) {
for (FileStatus fstat : list) {
if (fstat.isDirectory()) {
LOG.info("Found directory {}", Objects.toString(fstat.getPath()));
Contributor

nit: You can just use fstat.getPath() here.

Suggested change:
- LOG.info("Found directory {}", Objects.toString(fstat.getPath()));
+ LOG.info("Found directory {}", fstat.getPath());

LOG.info("Found directory {}", Objects.toString(fstat.getPath()));
files.addAll(listFiles(fs, root, fstat.getPath()));
} else {
LOG.info("Found file {}", Objects.toString(fstat.getPath()));
Contributor

nit: Same here.

Suggested change:
- LOG.info("Found file {}", Objects.toString(fstat.getPath()));
+ LOG.info("Found file {}", fstat.getPath());

// Gather the bulk loads being tracked by the system, which can be deleted (since their data
// will be part of the snapshot being taken). We gather this list before taking the actual
// snapshots for the same reason as the log rolls.
List<BulkLoad> bulkLoadsToDelete = backupManager.readBulkloadRows(tableList);
Contributor

I couldn’t quite figure out why this step is needed. Could you please explain the reasoning behind it?

Author

This is needed so that an incremental backup taken just after this full backup captures only bulkloads that occur after the full backup; the full backup deletes the existing bulkload entries from the system table, since their data is already covered by the snapshots it takes.

This change is actually part of https://github.com/apache/hbase/pull/6506/files#diff-18c753bdf4ce717d55d98646bf46723970d2bd4a470a815eb9512c7d94398274, which is merged; I added it here because it was necessary for the bulkload test I added.
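For illustration, a rough sketch of the intended ordering; readBulkloadRows appears in the diff above, but the deletion call and the BulkLoad accessor are assumed names for this sketch, not the exact merged code:

// 1. Gather the bulkload entries registered so far; their data will be
//    covered by the snapshots taken next.
List<BulkLoad> bulkLoadsToDelete = backupManager.readBulkloadRows(tableList);
// 2. Take the table snapshots (snapshot code elided).
// 3. Drop the consumed entries so the next incremental backup only sees
//    bulkloads registered after this full backup.
List<byte[]> rowKeys = new ArrayList<>();
for (BulkLoad bulkLoad : bulkLoadsToDelete) {
  rowKeys.add(bulkLoad.getRowKey()); // hypothetical accessor
}
backupManager.deleteBulkLoadedRows(rowKeys); // hypothetical deletion call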

Contributor

Okay, I have a few questions:

  1. Why is this needed? Can't we get the bulkloaded files from the backup location, just like we do with WAL files? Why not handle bulkload files the same way?
  2. Also, one thing I noticed: by default, when backup is enabled in the cluster, we start accumulating all HFiles (both store files and bulkloaded files) using BackupObserver. These files are registered under bulkLoadTableName to prevent cleaner chores from deleting them during incremental backups, etc. But in our case, we don't want this behavior, right? The whole idea is to back up files into the backup location, not retain them on the source cluster. I think we should enable BackupObserver only for the traditional (non-continuous) backup path. Otherwise, we're unnecessarily accumulating files in the source cluster.
  3. I also noticed this:
  • In this code, we read all WAL files since the last backup, convert them to HFiles, and store them in the backup location.
  • Again, in this line, we're reading bulkloaded files from the table and backing them up.
  • However, the list of bulkloaded files also includes regular store files (as seen here), which means we're potentially processing normal HFiles twice?
  4. Regarding "I added it here because it was necessary for the bulkload test I added": ideally, we shouldn't modify core logic only to support a specific test case. It might be better to adapt the test to the intended behavior.

@anmolnar @kgeisz @taklwu — please share your thoughts as well.

Author

@ankitsol ankitsol Jun 16, 2025

For points 1) and 2), I see two advantages to the approach coded in this PR. First, the behaviour is the same for both continuous and non-continuous incremental backup (i.e., bulkload files are used from the source cluster). Second, using source-cluster HFiles for the bulkload operation instead of backup HFiles reduces processing time during backup and the storage cost of the backup area. Backing up the bulkload files would also delay the backup of WALs, since backing up WALs and bulkload files is serial in execution.

The idea of backing up bulkload files was necessary when we were planning to use WALPlayer with bulkload-restore capability. Now I don't see any advantage in backing up bulkload files.

  3. BackupObserver.preCommitStoreFile() is invoked only for the bulkload operation, so bulkloaded HFiles are copied only once.

  4. This code actually resolves a bug in the handling of the bulkload operation; it is not a modification of core logic, although it might seem like one.

Contributor

I think the question is what happens if the source cluster/storage is gone and users need a DR?
We could still get data back from snapshots and WALs in the external storage. Is there a way to restore the bulkloads in such a situation?
Also, what happens if those bulkload HFiles are compacted? Do we track that somewhere?

Author

@ankitsol ankitsol Jun 17, 2025

@Kota-SH After a bulkload operation, the user is advised to take a backup; this is part of the continuous backup and PITR design (I think this is not yet documented; I will add a comment in the design doc to cover it).

Earlier we implemented WALPlayer with bulkload-restore functionality, but that could have resulted in timeouts or performance issues, so we dropped that WALPlayer modification and decided to ask the user to take a backup (ideally an incremental backup) after a bulkload operation so that restore is fast.

Author

@Kota-SH Regarding compaction, there is the preCommitStoreFile() hook, which registers these bulkloaded files in the backup system table and prevents their deletion during compaction.
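For illustration only, a minimal sketch of the kind of hook involved; this is not the actual BackupObserver code, and recordForBackup is a hypothetical stand-in:

import java.io.IOException;
import java.util.List;
import java.util.Optional;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessor;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.coprocessor.RegionObserver;
import org.apache.hadoop.hbase.util.Pair;

// Simplified sketch: register committed bulkload files so cleaner chores
// keep them around until an incremental backup has copied them.
public class BulkLoadTrackingObserver implements RegionCoprocessor, RegionObserver {

  @Override
  public Optional<RegionObserver> getRegionObserver() {
    return Optional.of(this);
  }

  @Override
  public void preCommitStoreFile(ObserverContext<RegionCoprocessorEnvironment> ctx,
    byte[] family, List<Pair<Path, Path>> pairs) throws IOException {
    TableName table = ctx.getEnvironment().getRegionInfo().getTable();
    for (Pair<Path, Path> pair : pairs) {
      // pair.getSecond() is the final committed path of the bulkloaded HFile.
      recordForBackup(table, pair.getSecond()); // hypothetical stand-in
    }
  }

  private void recordForBackup(TableName table, Path committedFile) {
    // Hypothetical: persist (table, committedFile) in the backup system table
    // so the backup HFile cleaner refuses to delete it until it is backed up.
  }
}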

Contributor

So, asking it differently: is this a one-way approach for continuous backup if we couple it with the HBASE-29003 optimization that reduces the additional bulkload HFiles on the source cluster?

And without this change, or, as @Kota-SH pointed out, if the source cluster/directory is not accessible, which backup can we use? In particular, can we still have incremental recovery?

My two cents on this approach, building on top of HBASE-29003, is that it seems reasonable; at least incremental backup already has this code change that uses the source cluster/storage. I recall the feedback on the design docs also suggested we work more closely with the logic of incremental backup, so that we can avoid introducing similar logic that in fact serves the same purpose.

Meanwhile, it's worth thinking about:
a. The original plan that copies all the bulkloaded HFiles in between incremental backups was too slow; other than this approach, do we have any alternative?
b. Are we 100% against continuous backup reading HFiles/bulkload HFiles from the source storage? HBASE-29003 made a very good point about storage usage, especially for HDFS use cases with 3 replicas.

Points 3 and 4 I assume @ankitsol already answered, so I don't have comments.

Contributor

Makes sense — thanks for the discussion, folks! 👍

@taklwu taklwu merged commit d19b335 into apache:HBASE-28957 Jun 20, 2025
1 check failed
anmolnar pushed a commit that referenced this pull request Jul 28, 2025

Signed-off-by: Tak Lon (Stephen) Wu <taklwu@apache.org>
Signed-off-by: Andor Molnár <andor@apache.org>
Reviewed by: Kota-SH <shanmukhaharipriya@gmail.com>
Reviewed by: Vinayak Hegde <vinayakph123@gmail.com>
Reviewed by: Kevin Geiszler <kevin.j.geiszler@gmail.com>

vinayakphegde pushed a commit to vinayakphegde/hbase that referenced this pull request Jul 29, 2025
vinayakphegde pushed a commit to vinayakphegde/hbase that referenced this pull request Jul 29, 2025
anmolnar pushed a commit that referenced this pull request Sep 11, 2025
anmolnar pushed a commit that referenced this pull request Nov 6, 2025