backupccl: merge 'small' files #66856
Merged
miretskiy
approved these changes
Jun 24, 2021
Reviewed 5 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pbardea)
pbardea
approved these changes
Jun 24, 2021
dt force-pushed the backup-small branch 2 times, most recently from 3f98d26 to 22f6b76 on June 24, 2021 19:55
When backing up, each range is asked to export its data to the BACKUP storage destination. However, if a range contains very little data to back up, which is very often the case during an incremental backup when only a handful of rows in that range were modified, the resulting file may be very small. If a cluster has tens of thousands of ranges, having each write separate, small files produces a backup made up of tens of thousands of tiny files. Running such a backup every hour or more often can quickly produce millions of files. This adds up in storage costs, metadata and tracking overhead, etc.

This change adds a setting bulkio.backup.merge_file_size under which a range will _return_ the file it would have written to the backup storage destination instead of writing it. The backup process then merges the returned file with other returned files until it has a file of the desired target size.

Release note (ops change): the new setting bulkio.backup.merge_file_size allows BACKUP to buffer and merge smaller files to reduce the number of small individual files created by BACKUP.
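For readers who want a picture of the buffering described above, here is a minimal sketch in Go. It is not the code from this PR: `returnedFile`, `fileMerger`, `targetSize`, and the `flush` callback are hypothetical names, and the real change merges exported files rather than simply counting bytes. It only illustrates the idea of holding on to small returned files and writing one larger file once a target size is reached.

```go
package main

import "fmt"

// returnedFile is a hypothetical stand-in for a small export file that a
// range returned to the backup process instead of writing it to storage.
type returnedFile struct {
	span string
	data []byte
}

// fileMerger buffers returned files and hands them to flush as one batch
// once the buffered bytes reach targetSize (an assumed, illustrative knob).
type fileMerger struct {
	targetSize int
	buf        []returnedFile
	bufSize    int
	// flush stands in for "merge these files and write the result to the
	// backup storage destination".
	flush func(files []returnedFile, size int)
}

// add buffers one returned file and flushes if the target size is reached.
func (m *fileMerger) add(f returnedFile) {
	m.buf = append(m.buf, f)
	m.bufSize += len(f.data)
	if m.bufSize >= m.targetSize {
		m.flushBuffered()
	}
}

// flushBuffered writes out whatever is buffered; call it once at the end so
// a final undersized batch is not lost.
func (m *fileMerger) flushBuffered() {
	if len(m.buf) == 0 {
		return
	}
	m.flush(m.buf, m.bufSize)
	m.buf = nil
	m.bufSize = 0
}

func main() {
	var written []int
	m := &fileMerger{
		targetSize: 10,
		flush: func(files []returnedFile, size int) {
			written = append(written, size)
		},
	}
	// Four small returned files of 4, 4, 4 and 3 bytes.
	for _, n := range []int{4, 4, 4, 3} {
		m.add(returnedFile{span: "example", data: make([]byte, n)})
	}
	m.flushBuffered()
	fmt.Println(written) // prints [12 3]
}
```

In this sketch four small files become two written files: the first three are merged once they cross the 10-byte target, and the last one is flushed at the end of the backup.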
TFTR! bors r+
Build succeeded