HBASE-29255: Integrate backup WAL cleanup logic with the delete command #7007
Conversation
Force-pushed from c033241 to bdc0de5.
This looks good to me overall, aside from one nit comment.
One more thing: do you think a lot of these System.err/out.println() statements could be replaced with LOG.info()/LOG.error()? I know we want to give the user some feedback via the terminal, but many of these messages seem like they should go to the log (for example, the messages in BackupCommands.updateBackupTableStartTimes(), BackupCommands.deleteOldWALFiles(), etc.)
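For illustration, a minimal sketch of the suggested switch, assuming a helper like the one below (the class and method here are hypothetical; HBase uses slf4j loggers):

```java
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper illustrating the println-to-logger change.
public final class WalCleanupLogging {
  private static final Logger LOG = LoggerFactory.getLogger(WalCleanupLogging.class);

  static void reportDeletion(Path dirPath) {
    // Before: System.out.println("Deleting outdated WAL directory: " + dirPath);
    // After: parameterized logging goes to the log file, not the user's terminal
    LOG.info("Deleting outdated WAL directory: {}", dirPath);
  }
}
```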
```java
Configuration conf = getConf() != null ? getConf() : HBaseConfiguration.create();
String backupWalDir = conf.get(CONF_CONTINUOUS_BACKUP_WAL_DIR);

if (backupWalDir == null || backupWalDir.isEmpty()) {
```
nit - You can use Strings.isNullOrEmpty() from org.apache.hbase.thirdparty.com.google.common.base

```diff
- if (backupWalDir == null || backupWalDir.isEmpty()) {
+ if (Strings.isNullOrEmpty(backupWalDir)) {
```
Good point. We have a lot of println lines throughout the backup and restore code; let me create a new Jira to address this issue.
```java
  return;
}

try (Connection conn = ConnectionFactory.createConnection(conf);
```
NIT: avoid using a generic name like conn; use something specific like masterConn.
```java
  return;
}

try (Connection conn = ConnectionFactory.createConnection(conf);
```
This connection creation seems unnecessary; the superclass already has a connection open. Please verify whether you can reuse it.
True, we'll reuse that!
```java
// If WAL files of that day are older than cutoff time, delete them
if (dayStart + ONE_DAY_IN_MILLISECONDS - 1 < cutoffTime) {
  System.out.println("Deleting outdated WAL directory: " + dirPath);
  fs.delete(dirPath, true);
```
If there is an API to delete in batches, we should use it. Also, depending on the number of files you are deleting, this method can take a lot of time. Maybe we can make it asynchronous here. Please give it a thought.
> If there is an API to delete in batches, we should use it.

Yeah, I checked but couldn't find any API that supports batch deletion.

> Also, depending on the number of files you are deleting, this method can take a lot of time. Maybe we can make it asynchronous here.

About going async: it's a good idea, but it might add some complexity. We'd need to track whether the delete actually finished, retry on failure, and maybe notify the user when it's done.
So we should think about whether the added complexity is worth the gain. Also, right now all our backup and restore commands (full backup, incremental, restore) are synchronous anyway, and those can take hours.
I think async is definitely a good direction; it probably just makes sense to build a proper framework around it first, so we can handle retries, tracking, and notifications across the board. What do you think?
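For reference, a minimal sketch of what a naive asynchronous delete could look like, which also shows where the extra complexity lands (class and method names are hypothetical):

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Fire-and-forget variant: completion tracking, retries, and user
// notification all become the caller's problem.
final class AsyncWalDelete {
  private static final ExecutorService POOL = Executors.newSingleThreadExecutor();

  static CompletableFuture<Boolean> deleteAsync(FileSystem fs, Path dirPath) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        return fs.delete(dirPath, true); // recursive, same as the sync path
      } catch (IOException e) {
        throw new CompletionException(e);
      }
    }, POOL);
  }
}
```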
Let's build a job coordinator framework with ZooKeeper. We should build that outside the scope of this ticket, of course.
Sure, let me create a Jira for that.
Good point guys, but before going down this rabbit hole, please do some performance tests for justification. Try to delete 100, 10,000, and 1 million files in a single directory and share how much time it takes synchronously. Delete/unlink operations should be relatively quick on any filesystem, but let's see how it works with S3.
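A rough harness for that measurement might look like the sketch below (the class, file counts, and target path are placeholders; pass an s3a:// URI to test against S3):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Creates N empty files in one directory, then times a recursive delete.
public class DeleteBenchmark {
  public static void main(String[] args) throws Exception {
    Path dir = new Path(args.length > 0 ? args[0] : "/tmp/delete-bench");
    int fileCount = args.length > 1 ? Integer.parseInt(args[1]) : 10_000; // try 100, 10_000, 1_000_000
    FileSystem fs = dir.getFileSystem(new Configuration());

    fs.mkdirs(dir);
    for (int i = 0; i < fileCount; i++) {
      fs.create(new Path(dir, "wal-" + i)).close();
    }

    long start = System.nanoTime();
    fs.delete(dir, true); // same recursive delete the patch uses
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("Deleted " + fileCount + " files in " + elapsedMs + " ms");
  }
}
```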
abhradeepkundu left a comment:
One discussion point. One change request.
```java
 */
public void updateContinuousBackupTableSet(Set<TableName> tablesToUpdate, long newStartTimestamp)
  throws IOException {
  try (Table table = connection.getTable(tableName)) {
```
NIT: Add a null check for tablesToUpdate
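A minimal sketch of the suggested guard (where exactly it sits in the method body is an assumption):

```java
public void updateContinuousBackupTableSet(Set<TableName> tablesToUpdate, long newStartTimestamp)
    throws IOException {
  // Guard suggested in review: nothing to do for a null or empty input set
  if (tablesToUpdate == null || tablesToUpdate.isEmpty()) {
    return;
  }
  try (Table table = connection.getTable(tableName)) {
    // ... existing update logic ...
  }
}
```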
abhradeepkundu left a comment:
One more minor comment, but overall LGTM.
anmolnar left a comment:
Thanks @vinayakphegde, the patch looks good to me. However, I have the same criticism that I mentioned previously: unit tests are missing.
Since all of your helper methods are private, you cannot test them individually, so you need to set up an entire starship in your test case, call the command, and verify the output. This is end-to-end testing: you get a yes/no answer to the question of whether the function works. If the answer is yes, we're fine, but if it's no, you'll have no idea where the problem is and you have to debug.
Unit testing individual methods gives more detail about what's working and what's not.
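For illustration, the kind of focused test this enables, assuming the day-cutoff check were extracted into a testable helper (the helper and test names below are hypothetical):

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

// Hypothetical: with the cutoff predicate pulled out of the private
// helper, it can be verified without standing up a mini-cluster.
public class TestWalCutoff {
  private static final long ONE_DAY_IN_MILLISECONDS = 24L * 60 * 60 * 1000;

  static boolean isDayFullyExpired(long dayStart, long cutoffTime) {
    return dayStart + ONE_DAY_IN_MILLISECONDS - 1 < cutoffTime;
  }

  @Test
  public void deletesOnlyFullyExpiredDays() {
    long cutoff = 10 * ONE_DAY_IN_MILLISECONDS;
    assertTrue(isDayFullyExpired(8 * ONE_DAY_IN_MILLISECONDS, cutoff));
    assertFalse(isDayFullyExpired(10 * ONE_DAY_IN_MILLISECONDS, cutoff));
  }
}
```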
```java
/**
 * Updates the start time for continuous backups if older than cutoff timestamp.
 * @param sysTable Backup system table
 * @param cutoffTimestamp Timestamp before which WALs are no longer needed
 */
private void updateBackupTableStartTimes(BackupSystemTable sysTable, long cutoffTimestamp)
```
Hey @vinayakphegde, this is the function that led me to ask for clarification on why we need to update the start times of the continuous backups. Maybe you could add another line or two to the docstring here elaborating on why this is needed? That may make it clearer to others in the future.
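One possible expansion, as a suggestion only (the rationale below is my reading of the thread, not text from the patch):

```java
/**
 * Updates the start time for continuous backups if older than cutoff timestamp.
 * <p>
 * Once WALs older than the cutoff are deleted, a continuous backup whose
 * recorded start time predates the cutoff would reference WALs that no longer
 * exist. Advancing the start time to the cutoff keeps the backup metadata
 * consistent with the WALs actually retained.
 * @param sysTable Backup system table
 * @param cutoffTimestamp Timestamp before which WALs are no longer needed
 */
```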
Sure @anmolnar, I will try to add more unit tests to cover these methods.
hbase-backup/pom.xml (Outdated)
```xml
<dependency>
  <groupId>org.mockito</groupId>
  <artifactId>mockito-inline</artifactId>
  <version>4.11.0</version>
```
You don't need to set the version here, because it's already in the top-level pom.
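So the dependency would shrink to something like the following (assuming the version is managed in the parent pom's dependencyManagement):

```xml
<dependency>
  <groupId>org.mockito</groupId>
  <artifactId>mockito-inline</artifactId>
  <!-- version inherited from the top-level pom's dependencyManagement -->
</dependency>
```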
🎊 +1 overall. This message was automatically generated.

💔 -1 overall. This message was automatically generated.
anmolnar left a comment:
lgtm.
@vinayakphegde The test failures don't look good; you might want to keep an eye on them. They're all incremental-backup related, but they might have nothing to do with this patch.
@anmolnar all the failing tests are incremental-backup related. We are trying to fix those tests separately.
…nd (#7007)

* Store bulkload files in daywise bucket as well
* Integrate backup WAL cleanup logic with the delete command
* address the review comments
* address the review comments
* address the review comments
* add more unit tests to cover all cases
* address the review comments