Fix higher disk utilization regression #9204
Merged
Conversation
O_SYNC was added when writing TSM files to fix an issue where the final fsync at the end caused the process to stall. That ended up increasing disk utilization too much, so this change switches to using multiple fsyncs while writing the TSM file instead of O_SYNC or one large fsync at the end.
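As a rough illustration of the approach (a sketch, not the actual TSM writer code): instead of opening the file with O_SYNC or issuing one large fsync at the end, the writer syncs every few megabytes so dirty pages never build up. The 4 MB threshold and the writeWithPeriodicSync name are assumptions made for this example.

```go
package main

import (
	"io"
	"log"
	"os"
	"strings"
)

// syncEvery is an illustrative flush threshold; the real TSM writer's value
// and buffering differ.
const syncEvery = 4 << 20 // 4 MB

// writeWithPeriodicSync copies src to path, calling Sync whenever roughly
// syncEvery bytes have accumulated since the last sync. This avoids both
// O_SYNC (every write becomes synchronous) and a single large fsync at the
// end, which can stall the process while the kernel flushes everything.
func writeWithPeriodicSync(path string, src io.Reader) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, 64<<10)
	var sinceSync int64
	for {
		n, rerr := src.Read(buf)
		if n > 0 {
			if _, err := f.Write(buf[:n]); err != nil {
				return err
			}
			sinceSync += int64(n)
			if sinceSync >= syncEvery {
				if err := f.Sync(); err != nil {
					return err
				}
				sinceSync = 0
			}
		}
		if rerr == io.EOF {
			break
		}
		if rerr != nil {
			return rerr
		}
	}
	// Final sync covers anything written after the last periodic sync.
	return f.Sync()
}

func main() {
	if err := writeWithPeriodicSync("example.tsm", strings.NewReader("demo data")); err != nil {
		log.Fatal(err)
	}
}
```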
With the recent changes to compactions and snapshotting, the current default can create lots of small level 1 TSM files. This increases the default in order to create larger level 1 files and lower disk utilization.
The default max-concurrent-compactions setting allows up to 50% of cores to be used for compactions. When the number of cores is high (>8), this can lead to high disk utilization. Capping it at 4, combined with the larger snapshot sizes, seems to keep the compaction backlog reasonable without taxing the disks as much. Systems with lots of IOPS, RAM and CPU cores may want to increase these.
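A minimal sketch of that capping logic, and of how such a limit can be enforced with a counting semaphore; defaultCompactionLimit and the group loop below are illustrative, not the engine's actual code (only the clamp value of 4 comes from this PR):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// defaultCompactionLimit sketches the default described above: up to 50% of
// the cores, but never more than 4, so many-core boxes don't saturate their
// disks.
func defaultCompactionLimit() int {
	n := runtime.GOMAXPROCS(0) / 2
	if n < 1 {
		n = 1
	}
	if n > 4 {
		n = 4
	}
	return n
}

func main() {
	limit := defaultCompactionLimit()
	fmt.Println("max concurrent compactions:", limit)

	// A buffered channel works as a counting semaphore: each compaction
	// acquires a slot before doing IO and releases it when done, so at most
	// `limit` compactions touch the disks at any time.
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(group int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a compaction slot
			defer func() { <-sem }() // release it
			fmt.Println("compacting group", group)
		}(i)
	}
	wg.Wait()
}
```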
This runs the scheduler every 5s instead of every 1s, and reduces the scope of a level 1 plan.
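For reference, a 5-second planning loop might look like the sketch below; planCompactions is a hypothetical stand-in for the real planner call, not the engine's scheduler code:

```go
package main

import (
	"fmt"
	"time"
)

// planCompactions is a hypothetical stand-in for the engine's planning step.
func planCompactions() { fmt.Println("planning compactions") }

func main() {
	stop := time.After(12 * time.Second)

	// The scheduling interval was raised from 1s to 5s, so planning (and the
	// compactions it kicks off) happens less often.
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			planCompactions()
		case <-stop:
			return
		}
	}
}
```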
The disk-based temp index for writing a TSM file was used for all compactions other than snapshot compactions. That meant it was used even for smaller compactions that would not use much memory. An unintended side effect of this is higher disk IO when copying the index to the final file. This switches when to use the disk-based index based on the estimated size of the new index that will be written. The estimate isn't exact, but it seems to kick in at higher cardinality and for larger compactions, when it is necessary to avoid OOMs.
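A hedged sketch of that decision: estimate the size of the index the new TSM file will need, and only fall back to a disk-backed temporary index when the estimate crosses a threshold. The threshold, the per-entry cost, and the type names below are assumptions for illustration, not the actual tsm1 code:

```go
package main

import "fmt"

// These values are illustrative only; the engine derives its estimate from
// the keys being compacted and uses its own threshold.
const (
	bytesPerIndexEntry = 32               // rough per-series index cost
	diskIndexThreshold = 64 * 1024 * 1024 // spill to disk above ~64 MB
)

type indexWriter interface {
	Add(key []byte, minTime, maxTime int64)
}

// memIndex keeps the index in memory: cheap for small compactions, and it
// avoids the extra IO of copying a temp index into the final TSM file.
type memIndex struct{}

func (memIndex) Add(key []byte, minTime, maxTime int64) {}

// diskIndex spills the index to a temporary file: slower, but it avoids
// OOMs on high-cardinality or otherwise large compactions.
type diskIndex struct{}

func (diskIndex) Add(key []byte, minTime, maxTime int64) {}

// newIndexWriter chooses the index implementation from the estimated size of
// the index that will be written, as the change above describes.
func newIndexWriter(estimatedKeys int) indexWriter {
	if estimatedKeys*bytesPerIndexEntry >= diskIndexThreshold {
		return diskIndex{}
	}
	return memIndex{}
}

func main() {
	fmt.Printf("%T\n", newIndexWriter(10000))   // main.memIndex
	fmt.Printf("%T\n", newIndexWriter(5000000)) // main.diskIndex
}
```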
stuartcarnie
approved these changes
Dec 7, 2017
LGTM 👍
e-dard
added a commit
that referenced
this pull request
Jul 18, 2018
PR #9204 introduced a maximum default concurrent compaction limit of 4. The idea was to reduce IO utilisation on large systems with many cores and high write load. Often on these systems, disks were not scaled appropriately to the write volume, and while the write path could keep up, compactions would saturate disks. In #9225 work was done to reduce IO saturation by limiting the compaction throughput. To some extent, both #9204 and #9225 work towards solving the same problem. We have recently begun to notice larger clusters suffering from situations where compactions are not keeping up because the clusters have been scaled up but the limit of 4 has stayed in place. While users can manually override the setting, it seems more user friendly if we remove the limit by default, and set it manually in cases where compactions are causing too much IO on large boxes.
This fixes a performance issue in 1.4 that can be seen on systems with a higher number of cores (8+).
There were several factors involved:
O_SYNC was used when writing TSM files to avoid long process stalls when fsyncing the final file. This increases disk IO, so it has been switched to running a few fsyncs while writing the file, which avoids the stalls and reduces disk IO.
Fixes #9201
This is a test run of 1.4.2, this PR, and 1.3 while writing 1B values, 2.5M series and 5 concurrent writers. This was run on a c4.4xlarge (16 cores) w/ a gp2 1500/3000 IOPS EBS volume. The disk utilization is much higher in 1.4.2 for this workload. This PR is still slightly higher than 1.3, but not as drastically as 1.4.2.
Heap is in line w/ 1.4.2 and slightly lower than 1.3.
Write throughput is similar to 1.3 as well. Looks like it regressed in 1.4.
Compactions that are run are significantly reduced:
From looking at the types of compactions being run, it looks like the slightly higher disk IO that still remains might be due to running more full compactions (because files are bigger). That is something I'm going to look into further.
This PR also seems to keep up with the backlog of TSM files and keeps the overall data size smaller. I'm going to try a longer test run to see how this compares to 1.3.