Use concurrent.Writer in place of bufio.Writer #149
base: master
Conversation
… calls from under mutex protection to reduce the likelihood of commits blocking on slow disk.
Thanks for putting in all the effort. Sounds very interesting. Will take it for a spin :)
I've just looked at the test failure: it's caused by my moving the Writer.Flush() call in SegmentWAL.cut() from the synchronous flow into the asynchronous goroutine. I didn't think it would cause issues in Prometheus (I might be wrong), but it definitely makes the test flaky, possibly because the WAL may not actually be flushed before the bulk of the test executes. That may explain why the first sample it reads is (3, 4), the first entry in the second segment. Maybe closing the WAL would ensure the flush completes, as Close() does a flush of its own and the two would be serialized.
Please ignore my suggestion to use … I can either put the …
Oh BTW: I added the option to have … This could be used instead of or in addition to the periodic …
…aller to wait on the asynchronous finishing of the current segment. Mostly for testing.
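The commits above add a way for the caller to wait on the asynchronous finishing of the current segment. As a rough illustration only (the real SegmentWAL API and field names are not reproduced here, so every identifier below is hypothetical), a cut() that hands back a done channel could look something like this:

```go
package walsketch

// Hypothetical, heavily simplified sketch -- not the actual prometheus/tsdb
// SegmentWAL. cut() starts a new segment and returns a channel that is
// closed once the old segment has been flushed and closed by a background
// goroutine, so a test (or any caller that needs durability) can wait for
// the asynchronous part to finish.

import (
	"bufio"
	"log"
	"os"
)

type segment struct {
	f *os.File
	w *bufio.Writer
}

type segmentWAL struct {
	cur *segment
}

func (w *segmentWAL) cut(nextPath string) (<-chan struct{}, error) {
	f, err := os.Create(nextPath)
	if err != nil {
		return nil, err
	}
	old := w.cur
	w.cur = &segment{f: f, w: bufio.NewWriter(f)}

	done := make(chan struct{})
	go func() {
		defer close(done)
		if old == nil {
			return // first segment, nothing to finish
		}
		// Flush and close the finished segment off the write path.
		if err := old.w.Flush(); err != nil {
			log.Println("flushing finished segment:", err)
		}
		if err := old.f.Close(); err != nil {
			log.Println("closing finished segment:", err)
		}
	}()
	return done, nil
}
```

A test reading the WAL back could then call done, _ := w.cut(path) followed by <-done before reading, instead of depending on the background flush having happened in time.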
Also BTW: if you prefer having the code in prometheus/tsdb or prometheus/common, I'd be happy to copy it over. I have no strong feelings about doing it one way or the other; I just thought it might be useful for others too, so I started off with a separate repo.
I have been running one patched and one non-patched Prometheus instance side by side for the last week or so and have noticed no issues, apart from some WAL corruption before beta5, which occurred for both the patched and the original builds. Regarding consistency, i.e. reliably scraping and evaluating at the configured intervals, it's a night and day difference. Comparing the actual measured scraping interval of the official beta5 build with that of the patched build, the difference is orders of magnitude (1500% spikes/troughs vs. 2% spikes/troughs). Again, this only holds for small to medium Prometheus deployments that don't fill the 8 MB WAL writer buffer in 5 seconds, but it's much better than nothing.
I've merged in the latest updates so there are no conflicts. @fabxc, have you had a chance to take a look at it?
@alin-amana is this still relevant after the WAL rewrite?
[Sorry, missed your comment in the mass of email.] I have not had the time to look at the WAL rewrite, TBH, so I don't know what kind of changes are there that might make this change unnecessary. But unless …, IIRC (this was already a year ago) the problem was that … As noted before, this is not a perfect solution, meaning it's still possible for scrapes and rule evals to block on disk given enough throughput (i.e. enough to fill the buffer before the flush can complete). Fully preventing blocking could be achieved either via dynamically sized buffers or by converting the WAL into a write-behind log.
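For context, here is a minimal sketch of the situation described above, with all names invented for illustration: when buffered writes and the periodic Flush() share a single mutex, a flush stalled on a slow disk holds up every commit queued behind it, which is what shows up as delayed scrapes and rule evals.

```go
package walsketch

// Simplified sketch of the problem, not the actual tsdb code.

import (
	"bufio"
	"io"
	"sync"
)

type lockedWAL struct {
	mtx sync.Mutex
	buf *bufio.Writer
}

func newLockedWAL(w io.Writer) *lockedWAL {
	return &lockedWAL{buf: bufio.NewWriterSize(w, 8*1024*1024)}
}

// log is what every commit (scrape or rule eval) ends up calling.
func (w *lockedWAL) log(entry []byte) error {
	w.mtx.Lock()
	defer w.mtx.Unlock()
	_, err := w.buf.Write(entry)
	return err
}

// flush runs periodically. While it holds the mutex and waits on the disk,
// every log() call above is blocked.
func (w *lockedWAL) flush() error {
	w.mtx.Lock()
	defer w.mtx.Unlock()
	return w.buf.Flush()
}
```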
This sounds like a good improvement.
Use concurrent.Writer in place of bufio.Writer and remove all Flush() calls from under mutex protection to reduce the likelihood of commits blocking writing to slow/busy disks.

concurrent.Writer is a highly concurrent, thread-safe drop-in replacement for bufio.Writer. I wrote and tested it over the past week and have been running a Prometheus instance using it for WAL writes for the past couple of days. Essentially it allows Flush() calls to safely proceed in parallel with writes, as well as isolating concurrent writes from one another (with isolation having the ACID meaning, i.e. serialization of writes).

This is a partial solution to scrapes and rule evals blocking on disk writes. It works perfectly as long as there is buffer space in the writer, but will still block if the buffer fills up. A complete solution that would avoid blocking on disk altogether would periodically scan the memory for new data points and log them to disk completely asynchronously (with regard to scrape and eval execution). Unfortunately I don't have the necessary understanding of Prometheus' internals to implement that instead.

I have been running a patched Prometheus beta.4 instance alongside a stock beta.4 instance on a particularly disk-challenged machine and have seen zero instances of scrape or eval blocking, as opposed to virtually hourly such occurrences for the stock instance. I have also restarted the instance a number of times to check for possible WAL corruption and saw no log messages mentioning it. The writer itself has excellent test coverage, from single-threaded tests to concurrent writes, flushes, errors and resets. It adds very little CPU overhead (less than 2x for in-memory writes compared to a naive mutex-protected implementation) and will attempt to flush as little as possible except for explicit Flush() calls (when it will always flush).
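A minimal sketch of the underlying idea follows; it illustrates the technique, not the actual alin-amana concurrent.Writer code. Flush() swaps out the full buffer under a short lock and performs the slow disk write outside of it, so concurrent Write() calls only contend on the swap, never on the disk, as long as there is buffer space left.

```go
package walsketch

import (
	"io"
	"sync"
)

// ConcurrentWriter is an illustrative buffered writer whose Flush can run
// in parallel with Writes. Not the real concurrent.Writer.
type ConcurrentWriter struct {
	mu   sync.Mutex // guards buf; never held across disk I/O
	out  io.Writer
	buf  []byte
	size int

	flushMu sync.Mutex // serializes flushes against one another
}

func NewConcurrentWriter(out io.Writer, size int) *ConcurrentWriter {
	return &ConcurrentWriter{out: out, size: size, buf: make([]byte, 0, size)}
}

// Write appends p to the in-memory buffer. It only blocks on the disk when
// the buffer is full, mirroring the "works as long as there is buffer space"
// caveat above.
func (w *ConcurrentWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	if len(w.buf)+len(p) > w.size {
		w.mu.Unlock()
		if err := w.Flush(); err != nil {
			return 0, err
		}
		w.mu.Lock()
	}
	w.buf = append(w.buf, p...)
	w.mu.Unlock()
	return len(p), nil
}

// Flush writes out the buffered data without holding the write lock during
// the (potentially slow) disk write.
func (w *ConcurrentWriter) Flush() error {
	w.flushMu.Lock()
	defer w.flushMu.Unlock()

	// Swap the buffer under the short lock ...
	w.mu.Lock()
	pending := w.buf
	w.buf = make([]byte, 0, w.size)
	w.mu.Unlock()

	if len(pending) == 0 {
		return nil
	}
	// ... and do the slow write with no lock that writers need.
	_, err := w.out.Write(pending)
	return err
}
```

A real implementation would additionally have to surface flush errors to later writers, reuse buffers instead of reallocating, and handle writes larger than the buffer; the sketch omits all of that for brevity.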