Fix higher disk utilization regression #9204

Merged: 6 commits merged into master from jw-tsm-sync on Dec 7, 2017
Conversation

jwilder (Contributor) commented Dec 6, 2017

This fixes a performance issue in 1.4 that can be seen on systems with a higher number of cores (8+).

There were several factors involved:

  • Compactions run more aggressively in 1.4 and can increase disk utilization since more are running concurrently. The scheduling and planning are now dialed back to run less frequently.
  • The cache snapshot size was too small, and since compactions 1) run faster and 2) run more frequently, this creates lots of smaller TSM files that need to be compacted frequently, which increases disk IO. The cache snapshot size config setting was updated to a larger value to allow for fewer, larger snapshots.
  • TSM writing was switched to use O_SYNC to avoid long process stalls when fsyncing the final file. This increases disk IO, so it has been switched to running a few fsyncs while writing the file, which avoids the stalls and reduces disk IO.
  • The upper bound on concurrent compactions was too high when a larger number of cores is available. If a system has 16 cores, we would cap at 8, but that is likely too high for a system with fewer than 3000 IOPS.

Fixes #9201

This is a test run of 1.4.2, this PR, and 1.3 while writing 1B values across 2.5M series with 5 concurrent writers. It was run on a c4.4xlarge (16 cores) w/ a gp2 1500/3000 IOPS EBS volume.

The disk utilization is much higher in 1.4.2 for this workload. This PR is still slightly higher than 1.3, but not as drastically higher as 1.4.2.

[screenshot: disk utilization comparison]

Heap is in line w/ 1.4.2 and slightly lower than 1.3.

[screenshot: heap usage comparison]

Write throughput is similar to 1.3 as well. It looks like it regressed in 1.4.

[screenshot: write throughput comparison]

Compactions that are run are significantly reduced:

[screenshot: number of compactions run]

From looking at the types of compactions being run, it looks like the slightly higher disk IO that remains might be due to running more full compactions (because the files are bigger). That is something I'm going to look into further.

[screenshot: breakdown of compaction types]

This PR also seems to keep up with the backlog of TSM files and keeps the overall data size smaller. I'm going to try a longer test run to see how this compares to 1.3.

[screenshots: TSM file backlog and total data size]

Required for all non-trivial PRs
  • Rebased/mergable
  • Tests pass
  • CHANGELOG.md updated
  • Sign CLA (if not already signed)

O_SYNC was added when writing TSM files to fix an issue where the final fsync at the end caused the process to stall. This ends up increasing disk utilization too much, so this change switches to using multiple fsyncs while writing the TSM file instead of O_SYNC or one large fsync at the end.
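A minimal sketch of the periodic-fsync approach, assuming a chunked write path; the 4MB sync interval and function names are illustrative, not the actual tsm1 writer code:

```go
package main

import (
	"fmt"
	"os"
)

// syncEvery is how many bytes to write between fsyncs. Illustrative value;
// the threshold used by the real TSM writer may differ.
const syncEvery = 4 << 20 // 4MB

// writeWithPeriodicSync writes chunks to path, fsyncing every syncEvery
// bytes so the final fsync only has to flush the tail of the file instead
// of stalling while the whole file is flushed at once.
func writeWithPeriodicSync(path string, chunks [][]byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	var sinceSync int
	for _, c := range chunks {
		if _, err := f.Write(c); err != nil {
			return err
		}
		sinceSync += len(c)
		if sinceSync >= syncEvery {
			if err := f.Sync(); err != nil {
				return err
			}
			sinceSync = 0
		}
	}
	// Final sync covers only what was written since the last fsync.
	return f.Sync()
}

func main() {
	if err := writeWithPeriodicSync("example.tsm", [][]byte{[]byte("demo")}); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```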
With the recent changes to compactions and snapshotting, the current default can create lots of small level 1 TSM files. This increases the default in order to create larger level 1 files and reduce disk utilization.

The default max-concurrent-compactions setting allows up to 50% of cores to be used for compactions. When the number of cores is high (>8), this can lead to high disk utilization. Capping at 4, combined with larger snapshot sizes, seems to keep the compaction backlog reasonable without taxing the disks as much. Systems with lots of IOPS, RAM, and CPU cores may want to increase these.
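A minimal sketch of the capping logic, assuming a configured value of 0 means "use the default"; the function name and structure are illustrative rather than the actual engine code:

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultCompactionLimit returns how many compactions may run concurrently
// when max-concurrent-compactions is not set explicitly (configured == 0).
// Illustrative sketch, not the actual influxdb implementation.
func defaultCompactionLimit(configured int) int {
	if configured > 0 {
		return configured // an explicit operator setting wins
	}
	limit := runtime.GOMAXPROCS(0) / 2 // up to 50% of cores...
	if limit < 1 {
		limit = 1
	}
	if limit > 4 {
		limit = 4 // ...but capped at 4 so modest disks aren't saturated
	}
	return limit
}

func main() {
	fmt.Println("default compaction limit:", defaultCompactionLimit(0))
}
```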
This runs the scheduler every 5s instead of every 1s, as well as reducing the scope of a level 1 plan.
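As a rough illustration of the slower scheduling cadence, a ticker-driven loop like the following would run planning every 5s; the plan callback is a stand-in for the real compaction planner:

```go
package main

import (
	"fmt"
	"time"
)

// runScheduler invokes plan on a fixed interval until done is closed.
// The 5s interval matches the change described above; everything else
// here is an illustrative stand-in.
func runScheduler(interval time.Duration, done <-chan struct{}, plan func()) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			plan()
		case <-done:
			return
		}
	}
}

func main() {
	done := make(chan struct{})
	time.AfterFunc(12*time.Second, func() { close(done) })
	runScheduler(5*time.Second, done, func() { fmt.Println("planning compactions") })
}
```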
The disk-based temp index for writing a TSM file was used for compactions other than snapshot compactions. That meant it was used even for smaller compactions that would not use much memory. An unintended side-effect of this is higher disk IO when copying the index to the final file.

This switches when to use the disk-based index based on the estimated size of the new index that will be written. This isn't exact, but it seems to kick in at higher cardinalities and larger compactions, where it is necessary to avoid OOMs.
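A hedged sketch of that selection: estimate the size of the index that will be written and only fall back to the disk-backed temp index above a threshold. The threshold value and type names below are placeholders, not the actual tsm1 identifiers:

```go
package main

import "fmt"

// indexBuffer is the minimal behaviour shared by the two index
// implementations in this sketch.
type indexBuffer interface {
	Add(key string, offset int64)
}

// memIndex keeps index entries in memory; cheap for small compactions.
type memIndex struct{ entries map[string]int64 }

func (m *memIndex) Add(key string, offset int64) { m.entries[key] = offset }

// diskIndex stands in for the disk-backed temp index; the real writer
// would append entries to a temporary file here.
type diskIndex struct{ path string }

func (d *diskIndex) Add(key string, offset int64) { /* write-through omitted */ }

// maxInMemoryIndexSize is an illustrative cutoff; the real decision is
// based on the estimated size of the index to be written.
const maxInMemoryIndexSize = 64 << 20 // 64MB

// newIndexBuffer picks the in-memory index for small compactions and the
// disk-backed temp index only when the estimated index is large enough
// that holding it in memory would risk an OOM.
func newIndexBuffer(estimatedIndexSize int64, tmpPath string) indexBuffer {
	if estimatedIndexSize <= maxInMemoryIndexSize {
		return &memIndex{entries: make(map[string]int64)}
	}
	return &diskIndex{path: tmpPath}
}

func main() {
	idx := newIndexBuffer(1<<20, "index.tmp")
	idx.Add("cpu,host=a#!~#value", 0)
	fmt.Printf("using %T\n", idx)
}
```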
stuartcarnie (Contributor) left a comment
LGTM 👍

jwilder merged commit f250b64 into master on Dec 7, 2017 and deleted the jw-tsm-sync branch.
e-dard added a commit that referenced this pull request Jul 18, 2018
PR #9204 introduced a maximum default concurrent compaction limit of 4.
The idea was to reduce IO utilisation on large systems with many cores,
and high write load. Often on these systems, disks were not scaled
appropriately to the write volume, and while the write path could
keep up, compactions would saturate disks.

In #9225 work was done to reduce IO saturation by limiting the
compaction throughput. To some extent, both #9204 and #9225 work towards
solving the same problem.

We have recently begun to notice larger clusters suffering from
situations where compactions are not keeping up because the clusters have
been scaled up, but the limit of 4 has stayed in place. While users can
manually override the setting, it seems more user friendly to remove
the limit by default, and set it manually in cases where compactions are
causing too much IO on large boxes.
Successfully merging this pull request may close these issues: Investigate higher disk i/o utilization on 1.4.2 (#9201).