
HBASE-28697 Don't clean bulk load system entries until backup is complete #6089

Merged
merged 1 commit into apache:master from HBASE-28697-master on Sep 2, 2024

Conversation

rmdmattingly
Contributor

https://issues.apache.org/jira/browse/HBASE-28697

I've been thinking through the incremental backup order of operations, and I think we delete rows from the bulk loads system table too early and, consequently, make it possible to produce a "successful" incremental backup that is missing bulk loads.

To summarize the steps here, starting in IncrementalTableBackupClient#execute:

  1. We take an incremental backup of the WALs generated since the last backup
  2. We ensure any bulk loads done since the last backup are appropriately represented in the new backup by going through the system table and copying the appropriate files to the backup directory
  3. We delete all of the system table rows which told us about these bulk loads
  4. We generate a backup manifest and mark the backup as complete
  5. If we begin deleting the system table rows about these bulk loads (step 3) but fail before the backup is marked complete (step 4), then we'll be in a precarious spot. A retried incremental backup may succeed, but it would not know to persist the bulk-loaded files whose system table references we already deleted.

We could consider this issue an extension or replacement of https://issues.apache.org/jira/browse/HBASE-28084 in some ways, depending on what solution we land on. I think that we could fix this specific issue by reordering the bulk load table cleanup, but there will always be gotchas like this. Maybe it is simpler to require that the next backup be a full backup after any incremental failure.
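
To make the proposed reordering concrete, here is a heavily simplified sketch of what the tail of IncrementalTableBackupClient#execute could look like with the fix. This is an illustration under assumptions, not the actual implementation: only handleBulkLoad and completeBackup are names that appear in this PR, while backupInfo, incrementalCopy, and deleteBulkLoadedRows are illustrative stand-ins.

```java
// Simplified, illustrative sketch of the reordered flow; names other than
// handleBulkLoad/completeBackup are assumptions for this example.
protected void execute() throws IOException {
  // Steps 1-2: back up the WALs written since the last backup and copy any
  // bulk-loaded files recorded in the backup system table, keeping the rowkeys
  // of the system-table entries whose files were just copied.
  incrementalCopy(backupInfo);
  List<byte[]> bulkLoadedRows = handleBulkLoad(backupInfo.getTableNames());

  // Former step 4, moved up: write the manifest and mark the backup COMPLETE.
  completeBackup(backupInfo);

  // Former step 3, moved down: only now drop the bulk-load bookkeeping rows.
  // A failure here just leaves stale rows behind, so a later incremental backup
  // may copy some files twice instead of silently missing them.
  deleteBulkLoadedRows(bulkLoadedRows);
}
```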

cc @hgromer @ndimiduk @DieterDP-ng

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:---------:|--------:|:-------:|:--------|
| +0 🆗 | reexec | 0m 30s | | Docker mode activated. |
| _ | Prechecks | | | |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | hbaseanti | 0m 0s | | Patch does not have any anti-patterns. |
| _ | master Compile Tests | | | |
| +1 💚 | mvninstall | 3m 45s | | master passed |
| +1 💚 | compile | 0m 34s | | master passed |
| +1 💚 | checkstyle | 0m 12s | | master passed |
| +1 💚 | spotbugs | 0m 37s | | master passed |
| +1 💚 | spotless | 0m 51s | | branch has no errors when running spotless:check. |
| _ | Patch Compile Tests | | | |
| +1 💚 | mvninstall | 3m 29s | | the patch passed |
| +1 💚 | compile | 0m 32s | | the patch passed |
| +1 💚 | javac | 0m 32s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 10s | | the patch passed |
| +1 💚 | spotbugs | 0m 43s | | the patch passed |
| +1 💚 | hadoopcheck | 12m 18s | | Patch does not cause any errors with Hadoop 3.3.6 3.4.0. |
| +1 💚 | spotless | 0m 50s | | patch has no errors when running spotless:check. |
| _ | Other Tests | | | |
| +1 💚 | asflicense | 0m 10s | | The patch does not generate ASF License warnings. |
| | | 32m 23s | | |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6089/1/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | #6089 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 5a5e33cf3512 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / 5d872aa |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Max. process+thread count | 81 (vs. ulimit of 30000) |
| modules | C: hbase-backup U: hbase-backup |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6089/1/console |
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:---------:|--------:|:-------:|:--------|
| +0 🆗 | reexec | 0m 17s | | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 3s | | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
| _ | Prechecks | | | |
| _ | master Compile Tests | | | |
| +1 💚 | mvninstall | 4m 39s | | master passed |
| +1 💚 | compile | 0m 29s | | master passed |
| +1 💚 | javadoc | 0m 20s | | master passed |
| +1 💚 | shadedjars | 6m 14s | | branch has no errors when building our shaded downstream artifacts. |
| _ | Patch Compile Tests | | | |
| +1 💚 | mvninstall | 3m 6s | | the patch passed |
| +1 💚 | compile | 0m 25s | | the patch passed |
| +1 💚 | javac | 0m 25s | | the patch passed |
| +1 💚 | javadoc | 0m 17s | | the patch passed |
| +1 💚 | shadedjars | 5m 53s | | patch has no errors when building our shaded downstream artifacts. |
| _ | Other Tests | | | |
| +1 💚 | unit | 10m 26s | | hbase-backup in the patch passed. |
| | | 33m 16s | | |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6089/1/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile |
| GITHUB PR | #6089 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux ec260b53bacd 5.4.0-182-generic #202-Ubuntu SMP Fri Apr 26 12:29:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / 5d872aa |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6089/1/testReport/ |
| Max. process+thread count | 3174 (vs. ulimit of 30000) |
| modules | C: hbase-backup U: hbase-backup |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6089/1/console |
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

Member

@ndimiduk left a comment


This seems alright to me, but I'd appreciate hearing from another voice familiar with this code path. Specifically, are there semantic implications in other parts of the backup system from having completeBackup called before the deletes occur?

```diff
   */
-  @SuppressWarnings("unchecked")
-  protected Map<byte[], List<Path>>[] handleBulkLoad(List<TableName> sTableList)
-    throws IOException {
+  protected List<byte[]> handleBulkLoad(List<TableName> sTableList) throws IOException {
```
Member


Why is it okay to drop the table context of the rowkeys in the returned value? A rowkey is only meaningful in the context of its table (or region).

Contributor Author


We talked about this a bit offline. We can purge these rowkeys because they are only returned by handleBulkLoad if we have backed up the corresponding bulk loads in this backup.

Right now an inopportune failure would result in us missing bulk load data on subsequent incremental backups; with this change, an inopportune failure would instead result in us backing up duplicate files, which should be slightly wasteful but otherwise innocuous.
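
As a rough illustration of why bare rowkeys suffice here: all of the bulk-load bookkeeping rows live in the single backup system table, so deleting them never needs to know which user table a given bulk load targeted. Below is a minimal sketch of such a delete, assuming a hypothetical table name and helper class; the real bookkeeping presumably stays inside HBase's backup system table code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;

public final class BulkLoadRowCleaner {
  // Assumed name for the bulk-load bookkeeping table; illustrative only.
  private static final TableName SYSTEM_TABLE = TableName.valueOf("backup:system_bulk");

  /** Delete bookkeeping rows by rowkey alone; no per-user-table context is needed. */
  static void deleteBulkLoadRows(Connection conn, List<byte[]> rowKeys) throws IOException {
    try (Table table = conn.getTable(SYSTEM_TABLE)) {
      List<Delete> deletes = new ArrayList<>(rowKeys.size());
      for (byte[] row : rowKeys) {
        deletes.add(new Delete(row));
      }
      table.delete(deletes);
    }
  }
}
```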

Member


Is it worth having some backup consistency check that can detect and purge extra files? Or do we think that backups will cycle out and the redundancy will be dropped the next time a full backup is taken?
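
To sketch the kind of consistency check being floated here, purely as an illustration: walk the backup destination and purge any file that no manifest references. Building the referenced set from the backup manifests is assumed to happen elsewhere; only generic Hadoop FileSystem calls are used, and nothing here is the actual backup code.

```java
import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public final class BackupOrphanFileCheck {
  /**
   * Delete files under the backup root that are not referenced by any manifest.
   * How the 'referenced' set is built from the manifests is out of scope for this sketch.
   */
  static void purgeUnreferencedFiles(FileSystem fs, Path backupRoot, Set<String> referenced)
    throws IOException {
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(backupRoot, true);
    while (files.hasNext()) {
      Path file = files.next().getPath();
      if (!referenced.contains(file.toUri().getPath())) {
        // Orphaned by an earlier failed cleanup; drop it.
        fs.delete(file, false);
      }
    }
  }
}
```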

@rmdmattingly
Contributor Author

@DieterDP-ng any thoughts on this PR?

@ndimiduk merged commit ed6613e into apache:master on Sep 2, 2024
1 check passed
@ndimiduk deleted the HBASE-28697-master branch on September 2, 2024 08:38
ndimiduk pushed a commit to ndimiduk/hbase that referenced this pull request Sep 2, 2024
HBASE-28697 Don't clean bulk load system entries until backup is complete (apache#6089)

Co-authored-by: Ray Mattingly <rmattingly@hubspot.com>
ndimiduk pushed a commit that referenced this pull request Sep 2, 2024
HBASE-28697 Don't clean bulk load system entries until backup is complete (#6089)

Co-authored-by: Ray Mattingly <rmattingly@hubspot.com>
@DieterDP-ng
Contributor

> @DieterDP-ng any thoughts on this PR?

Sorry for the late reply. These changes look OK to me.

> This seems alright to me, but I'd appreciate hearing from another voice familiar with this code path. Specifically, are there semantic implications in other parts of the backup system from having completeBackup called before the deletes occur?

I'm not aware of any.
