Generalize work item definition in BackupEngineImpl #13228
Conversation
…epresenting the intended underlying action(s)
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
54fb17f to e7aa0bc
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Otherwise looks good. Just update the unexpected failure path.
@@ -595,8 +595,29 @@ class BackupEngineImpl {
       Temperature file_temp, RateLimiter* rate_limiter,
       std::string* db_id, std::string* db_session_id);

-  struct CopyOrCreateResult {
-    ~CopyOrCreateResult() {
+  struct WorkItemResult {
It looks like the current plan is to have WorkItemResult and WorkItem contain all the fields for the various types of work item. This is probably ok for the foreseeable future, but std::variant would likely be better if things become more fragmented.
Good callout!
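Purely to illustrate that suggestion (this is not code from the PR), a minimal sketch of what a variant-based result could look like if the per-type fields ever diverge; all type and field names below are hypothetical:

```cpp
#include <cstdint>
#include <string>
#include <variant>

// Hypothetical per-type result payloads; names are illustrative only.
struct CopyOrCreateResultFields {
  uint64_t size = 0;
  std::string checksum_hex;
  std::string db_id;
  std::string db_session_id;
};

struct ComputeChecksumResultFields {
  std::string checksum_hex;
};

// Instead of one flat struct accumulating the union of all fields, each
// work item type carries its own payload and the result holds exactly one.
struct WorkItemResult {
  // IOStatus io_status;  // common to all work item types, omitted here
  std::variant<CopyOrCreateResultFields, ComputeChecksumResultFields> fields;
};
```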
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.
76cbdcd to a774bfb
@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
    }
    work_item.result.set_value(std::move(result));
  } else {
    result.io_status = IOStatus::InvalidArgument(
I believe this result gets dropped. Maybe move the work_item.result.set_value(...) above to after the else block.
Fix coming up in #13238. Thanks for catching it, Peter!
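To illustrate the suggested restructuring (a simplified sketch, not the actual diff in #13238; the dispatch condition and enum value names are approximated):

```cpp
// Worker-thread dispatch, sketched: set_value is called once, after the
// if/else, so the InvalidArgument result from the unexpected-type branch
// is no longer dropped.
WorkItemResult result;
if (work_item.type == WorkItemType::CopyOrCreate) {
  // ... perform the copy/create and populate `result` ...
} else {
  result.io_status = IOStatus::InvalidArgument(
      "Unexpected work item type: " +
      std::to_string(static_cast<int>(work_item.type)));
}
work_item.result.set_value(std::move(result));
```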
@mszeszko-meta merged this pull request in f7b4216.
Summary: Follow-up to #13228. This fix is not a critical one, in the sense that the `else` branch is only meant to act as a guard in case a new work item type is introduced and scheduled but not handled. However, we're in control of the work item types, and currently we only support a single one (which has appropriate handling logic).

Pull Request resolved: #13238
Reviewed By: pdillinger
Differential Revision: D67512001
Pulled By: mszeszko-meta
fbshipit-source-id: 71e74b3dac388882dd3757871f500c334667fbd1
Summary: With this change we are adding native library support for incremental restores. When designing the solution we decided to follow a 'tiered' approach where users can pick one of three predefined, and for now mutually exclusive, restore modes (`kKeepLatestDbSessionIdFiles`, `kVerifyChecksum` and `kPurgeAllFiles` [default]) - trading write IO / CPU for the degree of certainty that the existing destination db files match the selected backup files' contents. The new mode option is exposed via the existing `RestoreOptions` configuration, which by this time has been well-baked into our APIs. The restore engine will consume this configuration and infer which of the existing destination db files are 'in policy' to be retained during restore.

### Motivation

This work is motivated by an internal customer who is running a write-heavy, 1M+ QPS service and is using RocksDB restore functionality to scale up their fleet. Given the already high QPS on their end, the additional write IO from restores as-is today contributes to prolonged spikes that lead the service to hit BLOB storage write quotas, which in turn slows down the pace of their scaling. See [T206217267](https://www.internalfb.com/intern/tasks/?t=206217267) for more.

### Impact

Enable faster service scaling by reducing the write IO footprint on BLOB storage (coming from restore) to the absolute minimum.

### Key technical nuances

1. According to prior investigations, the risk of collisions on [file #, db session id, file size] metadata triplets is low enough that we can confidently use them to uniquely describe a file and its *perceived* contents, which is the rationale behind the `kKeepLatestDbSessionIdFiles` mode. To find out more about the risks / tradeoffs of using this mode, please check the related comment in `backup_engine.cc`. This mode is only supported for SSTs, where we persist the `db_session_id` information in the metadata footer.
2. `kVerifyChecksum` mode requires a full blob / SST file scan (assuming the backup file has its `checksum_hex` metadata set appropriately; if not, an additional file scan of the backup file). While it saves us write IOs (if checksums match), it's still a fairly complex and _potentially_ CPU intensive operation.
3. We're extending the `WorkItemType` enum introduced in #13228 to accommodate a new simple request, `ComputeChecksum`, which will enable us to run 2) in parallel. This will become increasingly important as we move towards disaggregated storage, where holding up a sequence of checksum evaluations on a single lagging remote file scan would not be acceptable.
4. Note that it's necessary to compute the checksum on the restored file if the corresponding backup file and existing destination db file checksums didn't match.

### Test plan ✅

1. Manual testing using debugger ✅
2. Automated tests:
   * `./backup_engine_test --gtest_filter=*IncrementalRestore*` covering the following scenarios ✅
     * Full clean restore
     * Integration with the `exclude files` feature (with proper writes counting)
     * User workflow simulation: happy path with a mix of added new files and deleted original backup files
     * Existing db file corruptions and the difference in handling between `kVerifyChecksum` and `kKeepLatestDbSessionIdFiles` modes
   * `./backup_engine_test --gtest_filter=*ExcludedFiles*` ✅
     * Integrate existing test collateral with the newly introduced restore modes

Pull Request resolved: #13239
Reviewed By: pdillinger
Differential Revision: D67513875
Pulled By: mszeszko-meta
fbshipit-source-id: 273642accd7c97ea52e42f9dc1cc1479f86cf30e
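For orientation, a minimal usage sketch of how a tiered restore mode might be selected through `RestoreOptions`. Only the three mode names come from the summary above; the exact spelling of the mode field/enum on `RestoreOptions` is an assumption here, so treat this as illustrative rather than the final API:

```cpp
#include <string>

#include "rocksdb/env.h"
#include "rocksdb/utilities/backup_engine.h"

using namespace ROCKSDB_NAMESPACE;

IOStatus RestoreWithIncrementalMode(const std::string& backup_dir,
                                    const std::string& db_dir) {
  // Open a read-only backup engine over the existing backup directory.
  BackupEngineReadOnly* backup_engine = nullptr;
  IOStatus io_s = BackupEngineReadOnly::Open(
      Env::Default(), BackupEngineOptions(backup_dir), &backup_engine);
  if (!io_s.ok()) {
    return io_s;
  }

  RestoreOptions restore_options;
  // Assumed field/enum spelling; the summary only guarantees the mode names
  // kPurgeAllFiles (default), kKeepLatestDbSessionIdFiles, kVerifyChecksum.
  restore_options.mode = RestoreOptions::Mode::kKeepLatestDbSessionIdFiles;

  // Files already in the destination that are 'in policy' for the chosen
  // mode are retained instead of being re-copied from backup.
  io_s = backup_engine->RestoreDBFromLatestBackup(restore_options, db_dir,
                                                  /*wal_dir=*/db_dir);
  delete backup_engine;
  return io_s;
}
```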
Summary

This change refactors the existing `CopyOrCreateWorkItem` async task definition into a more generic one (`WorkItem`) with an assigned `type` indicative of the intended action. This allows us to reuse the existing, battle-tested async task initialization code to handle a wider range of incoming use cases in the B/R space.
Motivation

Historically, the two main use cases for `BackupEngineImpl`'s async work items were either creating a file in the backup workflow or copying files in the restore workflow. However, as we're now exploring opportunities in incremental restore (and potentially speeding up backup verification), we need the work item abstraction to be capable of processing different workflow types concurrently (computing a checksum comes to mind).

Test plan
Since this is a purely cosmetic change where behavior remains intact, the existing test collateral will suffice.