Fix higher disk utilization regression #9204

Merged: 6 commits merged into master from jw-tsm-sync on Dec 7, 2017
Conversation

jwilder (Contributor) commented Dec 6, 2017

This fixes a performance issue in 1.4 that can be seen on systems with a higher number of cores (8+).

There were several factors involved:

  • Compactions run more aggressively in 1.4 and can increase disk utilization since more are running concurrently. The scheduling and planning are now dialed back to run less frequently.
  • The cache snapshot size was too small, and since compactions 1) run faster and 2) run more frequently, this creates lots of smaller TSM files that need to be compacted frequently, which increases disk IO. The cache snapshot size config setting was updated to a larger value to allow for fewer, larger snapshots.
  • TSM writing was switched to use O_SYNC to avoid long process stalls when fsyncing the final file. This increases disk IO, so it has been switched to running a few fsyncs while writing the file, which avoids the stalls and reduces disk IO.
  • The upper bound on concurrent compactions was too high when a larger number of cores is available. If a system has 16 cores, we would cap at 8, but that is likely too high for a system with fewer than 3000 IOPS.

Fixes #9201

This is a test run of 1.4.2, this PR, and 1.3 while writing 1B values across 2.5M series with 5 concurrent writers. It was run on a c4.4xlarge (16 cores) w/ a gp2 1500/3000 IOPS EBS volume.

The disk utilization is much higher in 1.4.2 for this workload. This PR is still slightly higher than 1.3, but not as drastically higher as 1.4.2.

[screenshot: disk utilization comparison]

Heap is in line w/ 1.4.2 and slightly lower than 1.3.

[screenshot: heap usage comparison]

Write throughput is similar to 1.3 as well. It looks like it regressed in 1.4.

[screenshot: write throughput comparison]

Compactions that are run are significantly reduced:

[screenshot: number of compactions run]

From looking at the types of compactions being run, it looks like the slightly higher disk IO that remains might be due to running more full compactions (because the files are bigger). That is something I'm going to look into further.

[screenshot: breakdown of compaction types]

This PR also seems to keep up with the backlog of TSM files and keeps the overall data size smaller. I'm going to try a longer test run to see how this compares to 1.3.

[screenshots: TSM file backlog and total data size]

Required for all non-trivial PRs
  • Rebased/mergable
  • Tests pass
  • CHANGELOG.md updated
  • Sign CLA (if not already signed)

O_SYNC was added when writing TSM files to fix an issue where the final fsync at the end caused the process to stall. This ends up increasing disk utilization too much, so this change switches to using multiple fsyncs while writing the TSM file instead of O_SYNC or one large fsync at the end.
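A minimal sketch of the periodic-fsync approach, assuming a chunked write path; the 4MB sync interval and function names are illustrative, not the actual tsm1 writer code:

```go
package main

import (
	"fmt"
	"os"
)

// syncEvery is how many bytes to write between fsyncs. Illustrative value;
// the threshold used by the real TSM writer may differ.
const syncEvery = 4 << 20 // 4MB

// writeWithPeriodicSync writes chunks to path, fsyncing every syncEvery
// bytes so the final fsync only has to flush the tail of the file instead
// of stalling while the whole file is flushed at once.
func writeWithPeriodicSync(path string, chunks [][]byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	var sinceSync int
	for _, c := range chunks {
		if _, err := f.Write(c); err != nil {
			return err
		}
		sinceSync += len(c)
		if sinceSync >= syncEvery {
			if err := f.Sync(); err != nil {
				return err
			}
			sinceSync = 0
		}
	}
	// Final sync covers only what was written since the last fsync.
	return f.Sync()
}

func main() {
	if err := writeWithPeriodicSync("example.tsm", [][]byte{[]byte("demo")}); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```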
With the recent changes to compactions and snapshotting, the current default can create lots of small level 1 TSM files. This increases the default in order to create larger level 1 files and reduce disk utilization.

The default max-concurrent-compactions setting allows up to 50% of cores to be used for compactions. When the number of cores is high (>8), this can lead to high disk utilization. Capping at 4, combined with larger snapshot sizes, seems to keep the compaction backlog reasonable without taxing the disks as much. Systems with lots of IOPS, RAM, and CPU cores may want to increase these.
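A minimal sketch of the capping logic, assuming a configured value of 0 means "use the default"; the function name and structure are illustrative rather than the actual engine code:

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultCompactionLimit returns how many compactions may run concurrently
// when max-concurrent-compactions is not set explicitly (configured == 0).
// Illustrative sketch, not the actual influxdb implementation.
func defaultCompactionLimit(configured int) int {
	if configured > 0 {
		return configured // an explicit operator setting wins
	}
	limit := runtime.GOMAXPROCS(0) / 2 // up to 50% of cores...
	if limit < 1 {
		limit = 1
	}
	if limit > 4 {
		limit = 4 // ...but capped at 4 so modest disks aren't saturated
	}
	return limit
}

func main() {
	fmt.Println("default compaction limit:", defaultCompactionLimit(0))
}
```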
This runs the scheduler every 5s instead of every 1s, as well as reducing the scope of a level 1 plan.
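As a rough illustration of the slower scheduling cadence, a ticker-driven loop like the following would run planning every 5s; the plan callback is a stand-in for the real compaction planner:

```go
package main

import (
	"fmt"
	"time"
)

// runScheduler invokes plan on a fixed interval until done is closed.
// The 5s interval matches the change described above; everything else
// here is an illustrative stand-in.
func runScheduler(interval time.Duration, done <-chan struct{}, plan func()) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			plan()
		case <-done:
			return
		}
	}
}

func main() {
	done := make(chan struct{})
	time.AfterFunc(12*time.Second, func() { close(done) })
	runScheduler(5*time.Second, done, func() { fmt.Println("planning compactions") })
}
```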
The disk-based temp index for writing a TSM file was used for compactions other than snapshot compactions. That meant it was used even for smaller compactions that would not use much memory. An unintended side-effect of this is higher disk IO when copying the index to the final file.

This switches when to use the disk-based index based on the estimated size of the new index that will be written. This isn't exact, but it seems to kick in at higher cardinalities and larger compactions, where it is necessary to avoid OOMs.
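A hedged sketch of that selection: estimate the size of the index that will be written and only fall back to the disk-backed temp index above a threshold. The threshold value and type names below are placeholders, not the actual tsm1 identifiers:

```go
package main

import "fmt"

// indexBuffer is the minimal behaviour shared by the two index
// implementations in this sketch.
type indexBuffer interface {
	Add(key string, offset int64)
}

// memIndex keeps index entries in memory; cheap for small compactions.
type memIndex struct{ entries map[string]int64 }

func (m *memIndex) Add(key string, offset int64) { m.entries[key] = offset }

// diskIndex stands in for the disk-backed temp index; the real writer
// would append entries to a temporary file here.
type diskIndex struct{ path string }

func (d *diskIndex) Add(key string, offset int64) { /* write-through omitted */ }

// maxInMemoryIndexSize is an illustrative cutoff; the real decision is
// based on the estimated size of the index to be written.
const maxInMemoryIndexSize = 64 << 20 // 64MB

// newIndexBuffer picks the in-memory index for small compactions and the
// disk-backed temp index only when the estimated index is large enough
// that holding it in memory would risk an OOM.
func newIndexBuffer(estimatedIndexSize int64, tmpPath string) indexBuffer {
	if estimatedIndexSize <= maxInMemoryIndexSize {
		return &memIndex{entries: make(map[string]int64)}
	}
	return &diskIndex{path: tmpPath}
}

func main() {
	idx := newIndexBuffer(1<<20, "index.tmp")
	idx.Add("cpu,host=a#!~#value", 0)
	fmt.Printf("using %T\n", idx)
}
```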
stuartcarnie (Contributor) left a comment
LGTM 👍

jwilder merged commit f250b64 into master on Dec 7, 2017 and deleted the jw-tsm-sync branch.
e-dard added a commit that referenced this pull request Jul 18, 2018
PR #9204 introduced a maximum default concurrent compaction limit of 4.
The idea was to reduce IO utilisation on large systems with many cores,
and high write load. Often on these systems, disks were not scaled
appropriately to the write volume, and while the write path could
keep up, compactions would saturate disks.

In #9225 work was done to reduce IO saturation by limiting the
compaction throughput. To some extent, both #9204 and #9225 work towards
solving the same problem.

We have recently begun to notice larger clusters suffering from
situations where compactions are not keeping up because the clusters have
been scaled up, but the limit of 4 has stayed in place. While users can
manually override the setting, it seems more user friendly to remove
the limit by default, and set it manually in cases where compactions are
causing too much IO on large boxes.
Successfully merging this pull request may close these issues: Investigate higher disk i/o utilization on 1.4.2 (#9201).