This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Use concurrent.Writer in place of bufio.Writer #149

Open

wants to merge 9 commits into base: master

Conversation

alin-amana
Contributor

Use concurrent.Writer in place of bufio.Writer and remove all Flush() calls from under mutex protection, to reduce the likelihood of commits blocking on writes to slow/busy disks.

concurrent.Writer is a highly concurrent, thread-safe drop-in replacement for bufio.Writer. I wrote and tested it over the past week and have been running a Prometheus instance using it for WAL writes for the past couple of days. Essentially, it allows Flush() calls to safely proceed in parallel with writes and isolates concurrent writes from one another (isolation in the ACID sense, i.e. serialization of writes).
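Roughly, the intended usage is a direct constructor swap. A minimal sketch follows; the import path is a placeholder and the assumption that the API mirrors bufio.Writer's Write/Flush surface is illustrative, not gospel:

```go
package main

import (
	"log"
	"os"
	"sync"

	concurrent "github.com/example/concurrent" // placeholder import path
)

func main() {
	f, err := os.Create("segment.wal")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Same constructor shape as bufio.NewWriterSize(f, 8<<20), but Write and
	// Flush are assumed safe to call from multiple goroutines.
	w := concurrent.NewWriterSize(f, 8<<20)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Concurrent writes are serialized (isolated) internally.
			if _, err := w.Write([]byte("sample record\n")); err != nil {
				log.Println(err)
			}
		}()
	}

	// A Flush may proceed in parallel with the writes above; it does not hold
	// a lock that stalls them while the disk is busy.
	if err := w.Flush(); err != nil {
		log.Println(err)
	}

	wg.Wait()
	if err := w.Flush(); err != nil { // final flush after all writes complete
		log.Println(err)
	}
}
```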

This is a partial solution to scrapes and rule evals blocking on disk writes. It works perfectly as long as there is buffer space left in the writer, but will still block if the buffer fills up. A complete solution, one that avoided blocking on disk altogether, would periodically scan memory for new data points and log them to disk fully asynchronously (with regard to scrape and eval execution). Unfortunately I don't have the necessary understanding of Prometheus' internals to implement that instead.

I have been running a patched Prometheus beta.4 instance alongside a stock beta.4 instance on a particularly disk-challenged machine and have seen zero instances of scrape or eval blocking, as opposed to virtually hourly occurrences for the stock instance. I have also restarted the instance a number of times to check for possible WAL corruption and saw no log messages indicating any. The writer itself has excellent test coverage, from single-threaded tests to concurrent writes, flushes, errors and resets. It adds very little CPU overhead (less than 2x for in-memory writes compared to a naive mutex-protected implementation) and flushes as rarely as possible, except on explicit Flush() calls (which always flush).

… calls from under mutex protection to reduce the likelihood of commits blocking on slow disk.
@fabxc
Contributor

fabxc commented Sep 18, 2017

Thanks for putting in all the effort. Sounds very interesting. Will take it for a spin :)

@alin-amana
Contributor Author

I've just looked at the test failure and it's caused by my moving the Writer.Flush() call in SegmentWAL.cut() from the synchronous flow into the asynchronous goroutine. I didn't think it would cause issues in Prometheus (I might be wrong) but it definitely causes the test to be flaky, possibly because it may not actually flush before the bulk of the test executes. That may explain why the first sample it reads is (3, 4), the first entry in the second segment. Maybe closing the WAL would actually ensure the flush completes, as Close() does a flush of its own and they would be serialized.

@alin-amana
Contributor Author

Please ignore my suggestion to use Close(), it would only flush the last file and have no effect on the flakiness.

I can either put the Writer.Flush() call back into the synchronous path (which would be a shame, as I don't think it's necessary outside of tests -- though I might be wrong) or have SegmentWAL.cut() return a channel or some other synchronization mechanism so the caller may optionally wait for the full disk sync.
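For the second option, something along these lines is what I have in mind (a rough sketch only; the real cut() obviously does more, and the helper name and signature are purely illustrative):

```go
package wal

// cutAndFlushAsync sketches the idea: the flush of the old segment's writer
// runs off the hot path, and the returned channel lets callers (mostly tests)
// wait for it to complete if they need to.
func cutAndFlushAsync(old interface{ Flush() error }) <-chan error {
	flushed := make(chan error, 1)
	go func() {
		flushed <- old.Flush() // asynchronous with respect to the caller
		close(flushed)
	}()
	return flushed
}

// Caller side, e.g. in a test:
//
//	flushed := cutAndFlushAsync(prevSegmentWriter)
//	// ... carry on with the new segment ...
//	if err := <-flushed; err != nil {
//		// handle the flush error
//	}
```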

@alin-amana
Contributor Author

Oh BTW: I added the option to have concurrent.Writer automatically trigger an asynchronous flush whenever a specified fraction of the buffer is in use.

This could be used instead of, or in addition to, the periodic SegmentWAL flushes, and it should further reduce blocking on writes, particularly for large Prometheus instances, which are more likely to fill the 8 MB buffer within the 5-second interval between flushes and then have to wait synchronously for a flush. I have not included it in this PR (even though it would only be a one-line change) because it's a different feature and should be committed separately.
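For reference, the one-line change would look roughly like this (the constructor name, its signature and the 75% threshold are illustrative, not a final API):

```go
package wal

import (
	"os"

	concurrent "github.com/example/concurrent" // placeholder import path
)

// newSegmentWriter sketches the swap: an auto-flushing writer that kicks off
// a background flush once 75% of the 8 MB buffer is in use, instead of
// waiting for the periodic 5-second flush or for the buffer to fill up.
func newSegmentWriter(f *os.File) *concurrent.Writer {
	return concurrent.NewWriterAutoFlush(f, 8<<20, 0.75)
}
```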

@alin-amana
Contributor Author

Also BTW: if you prefer having the code in prometheus/tsdb or prometheus/common I'd be happy to copy it over. I have no strong feelings about doing it one way or the other, just thought it may be useful for others too so I started off with a separate repo.

@alin-amana
Contributor Author

I have been running one patched and one non-patched Prometheus instance side by side for over a week now. I have noticed no issues, apart from some WAL corruption before beta5, which occurred for both the patched and original builds.

Regarding consistency, i.e. reliably scraping and evaluating at configured intervals, it's a night and day difference. Here is the actual measured scraping interval of the official beta5 build:

[Graph: measured scrape interval, original build]

And here it is for the patched build:

[Graph: measured scrape interval, patched build]

The difference is orders of magnitude (1500% spikes/troughs vs. 2% spikes/troughs). Again, this only holds for small to medium Prometheus deployments that don't fill the 8 MB WAL writer buffer in 5 seconds, but it's much better than nothing.

@alin-amana
Contributor Author

I've merged in the latest updates so there are no conflicts. @fabxc, have you had a chance to take a look at it?

@krasi-georgiev
Contributor

@alin-amana is this still relevant after the WAL rewrite?

@alin-amana
Contributor Author

[Sorry, missed your comment in the mass of email.]

I have not had the time to look at the WAL rewrite, TBH, so I don't know what changes are in there that might make this change unnecessary.

But unless either:

(a) WAL writing has become completely asynchronous, meaning that writing samples to the TSDB will not block when the disk is 100% busy/unavailable for a couple of minutes; or
(b) flushing WAL writes never happens under a mutex lock (including flushing the writer, not merely the fileutil.Fdatasync call),

I think this change is still pertinent.

IIRC (this was already a year ago), the problem was that bufio.Writer flushes would at times block on disk, and on the disk-challenged VM I was playing with this condition could last for tens of seconds. Unfortunately bufio.Writer is not thread-safe, so flushes and writes have to be protected by a mutex, meaning that while a flush was blocked, no new samples could be committed to the TSDB, even though the buffer was mostly empty.
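To make the pattern concrete, this is the kind of wrapper I mean (simplified, not the actual tsdb code): every Write and every Flush contend for the same mutex, so a Flush stuck on a busy disk stalls all writers even when the buffer is nearly empty.

```go
package wal

import (
	"bufio"
	"sync"
)

// walWriter mimics the mutex-protected bufio.Writer pattern.
type walWriter struct {
	mtx sync.Mutex
	buf *bufio.Writer
}

func (w *walWriter) Write(p []byte) (int, error) {
	w.mtx.Lock() // blocked for the entire duration of any in-flight Flush
	defer w.mtx.Unlock()
	return w.buf.Write(p)
}

func (w *walWriter) Flush() error {
	w.mtx.Lock() // can hold the lock for tens of seconds on a slow disk
	defer w.mtx.Unlock()
	return w.buf.Flush()
}
```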

As noted before, this is not a perfect solution: it is still possible for scrapes and rule evals to block on disk given enough throughput (i.e. enough to fill the buffer before a flush can complete). Fully preventing blocking could be achieved either via dynamically sized buffers or by converting the WAL into a write-behind log.

@krasi-georgiev
Contributor

This sounds like a good improvement.
When you find the time, I would really appreciate it if you could have a look at the new WAL; if these changes are still relevant, please rebase and I will spend the time to review it.
