diff --git a/src/main/asciidoc/_chapters/backup_restore.adoc b/src/main/asciidoc/_chapters/backup_restore.adoc
index c2beca0b9aee..6e7c6a04c2b3 100644
--- a/src/main/asciidoc/_chapters/backup_restore.adoc
+++ b/src/main/asciidoc/_chapters/backup_restore.adoc
@@ -804,16 +804,28 @@ providing a comparable level of security. This is a manual step which users *mus
 [[br.technical.details]]
 == Technical Details of Incremental Backup and Restore
 
-HBase incremental backups enable more efficient capture of HBase table images than previous attempts at serial backup and restore
-solutions, such as those that only used HBase Export and Import APIs. Incremental backups use Write Ahead Logs (WALs) to capture
-the data changes since the previous backup was created. A WAL roll (create new WALs) is executed across all RegionServers to track
-the WALs that need to be in the backup.
-
-After the incremental backup image is created, the source backup files usually are on same node as the data source. A process similar
-to the DistCp (distributed copy) tool is used to move the source backup files to the target file systems. When a table restore operation
-starts, a two-step process is initiated. First, the full backup is restored from the full backup image. Second, all WAL files from
-incremental backups between the last full backup and the incremental backup being restored are converted to HFiles, which the HBase
-Bulk Load utility automatically imports as restored data in the table.
+HBase incremental backups enable more efficient capture of HBase table images than previous attempts
+at serial backup and restore solutions, such as those that only used HBase Export and Import APIs.
+Incremental backups use Write Ahead Logs (WALs) to capture the data changes since the
+previous backup was created. A WAL roll (create new WALs) is executed across all RegionServers
+to track the WALs that need to be in the backup.
+In addition to WALs, incremental backups also track bulk-loaded HFiles for tables under backup.
+
+Incremental backup gathers all WAL files generated since the last backup from the source cluster,
+converts them to HFiles in a `.tmp` directory under the `BACKUP_ROOT`, and then moves these
+HFiles to their final location under the backup root directory to form the backup image.
+It also reads bulk load records from the backup system table, forms the paths for the corresponding
+bulk-loaded HFiles, and copies those files to the backup destination.
+Bulk-loaded files are preserved (not deleted by cleaner chores) until they've been included in a
+backup (for each backup root).
+A process similar to the DistCp (distributed copy) tool is used to move the backup files to the
+target file system.
+
+When a table restore operation starts, a two-step process is initiated.
+First, the full backup is restored from the full backup image.
+Second, all HFiles from incremental backups between the last full backup and the incremental backup
+being restored (including bulk-loaded HFiles) are bulk loaded into the table using the
+HBase Bulk Load utility.
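+
+As an illustrative sketch of this workflow (the host, path, and table names below are placeholders,
+and the full option syntax for these commands is documented in the command reference earlier in this
+chapter), a full backup, a follow-up incremental backup, and a restore might look like:
+
+[source,bash]
+----
+# Take an initial full backup of table "t1" into the backup root directory
+hbase backup create full hdfs://namenode:8020/backup_root -t t1
+
+# Later, capture only the WAL edits and bulk-loaded HFiles accumulated since the previous backup
+hbase backup create incremental hdfs://namenode:8020/backup_root -t t1
+
+# Restore "t1" on a live cluster from a chosen backup image; the full image and any
+# intermediate incremental images are applied as part of the restore
+hbase restore hdfs://namenode:8020/backup_root <backup_id> -t t1
+----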
 
 You can only restore on a live HBase cluster because the data must be redistributed to complete the restore operation successfully.
 
@@ -872,8 +884,9 @@ data at the full 80MB/s and `-w` is used to limit the job from spawning 16 worke
 
 Like we did for full backups, we have to understand the incremental backup process to approximate its runtime and cost.
 
-* Identify new write-ahead logs since last full or incremental backup: negligible. Apriori knowledge from the backup system table(s).
+* Identify new write-ahead logs since the last full or incremental backup: negligible. Apriori knowledge from the backup system table(s).
 * Read, filter, and write "minimized" HFiles equivalent to the WALs: dominated by the speed of writing data. Relative to write speed of HDFS.
+* Read bulk load records from the backup system table, form the paths for bulk-loaded HFiles, and copy them to the backup destination.
 * DistCp the HFiles to the destination: <>.
 
 For the second step, the dominating cost of this operation would be the re-writing the data (under the assumption that a majority of the