chore: Use our own metrics-rs Handle implementation #7014
Conversation
cc @tobz, if you have a minute, we'd love your input.
`read_histogram_with_clear`
I'm not sure I can entirely speak to the more philosophical questions in the description, at least not without thinking about it more, but: the non-clearing variant is meant for when you want a long-running bucket of values to calculate over fully, i.e. you always need to rehydrate some downstream data structure from t=0. Admittedly, this isn't super useful because of the unbounded growth and, honestly, could probably stand to be removed. :P In practice, under benchmark scenarios, there are still "quiescent" periods. As soon as you read/clear the histogram, and those blocks are no longer being read, they become eligible for reclamation. If you only have one "task" doing the read/clear, then you have no concurrent readers, thus no contention, and thus they're almost immediately reclaimed.
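To illustrate the clearing vs. non-clearing distinction in toy form (a sketch only: the real metrics-util storage is an atomic linked list of blocks with epoch-style reclamation, and `ToyHandle` here is not the library's actual API):

```rust
use std::sync::Mutex;

// Toy stand-in for the histogram handle under discussion; the real storage
// is an atomic linked list of fixed-size blocks, not a Mutex<Vec>.
struct ToyHandle {
    samples: Mutex<Vec<f64>>,
}

impl ToyHandle {
    // Non-clearing read: every sample since t=0 stays resident, so storage
    // grows without bound between polls.
    fn read_histogram(&self) -> Vec<f64> {
        self.samples.lock().unwrap().clone()
    }

    // Clearing read: drains the samples; in the real code the drained
    // blocks, having no remaining readers, become eligible for reclamation.
    fn read_histogram_with_clear(&self) -> Vec<f64> {
        std::mem::take(&mut *self.samples.lock().unwrap())
    }
}

fn main() {
    let h = ToyHandle { samples: Mutex::new(vec![1.0, 2.0, 3.0]) };
    assert_eq!(h.read_histogram().len(), 3); // storage still held
    assert_eq!(h.read_histogram_with_clear().len(), 3); // storage released
    assert!(h.read_histogram().is_empty());
}
```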
That's what I would have expected, but it's different from what I'm seeing in practice. With this PR as the base I run vector like so:

```bash
#!/usr/bin/env bash
set -o errexit
set -o pipefail
set -o nounset
# set -o xtrace

RUSTFLAGS="-g" cargo build --release --no-default-features --features "sources-stdin,sinks-http"
yes "msg" | valgrind --tool=massif ./target/release/vector --config vector-tohttp.toml
```

with config `vector-tohttp.toml`:

```toml
data_dir = "/tmp"

[sources.stdin]
type = "stdin"

[sinks.http]
type = "http"
inputs = ["stdin"]
compression = "none"
encoding.codec = "json"
healthcheck.enabled = false
batch.max_bytes = 128
uri = "http://127.0.0.1:8080/"
```

and still see steady memory growth in the same spot. Here's a massif dump: massif.out.1496054.zip

I'm going to upgrade metrics -- essentially pull in #6442 -- and see how that fares, in case you folks have fixed something upstream.
Hmm, yep. Even with metrics-rs fully upgraded in 4a6bee5 we still have a leak in the histogram: massif.out.1531813.zip

DHAT info: dhat.out.1543297.zip
I use macOS, and I failed to find a reasonably simple way to compile/install the viewer tooling. Looking at the textual data for the Massif output, though, I don't see any references to the suspect code. Not sure if you can post a screenshot from the DHAT viewer?
D'oh, my bad. Let me get you some screenshots.

@tobz FWIW I recognize that vector is a big project to debug in. I'm working on making a smaller reproducible example that mirrors our usage.
Extract from [vector](https://github.com/timberio/vector/) to help debug vectordotdev/vector#7014. Notably the version in this commit _does not_ display the issue we see in vector. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
@tobz alright, I've managed to reproduce in https://github.com/blt/metrics_histo. Notably the problem does not exist in blt/metrics_histo@85e0a91 but appears in blt/metrics_histo@170e02d, once the tracing integration is added. Here's the massif dump I've grabbed, and a screenshot. I will note that before the tracing integration was added memory use was still very high, but in spikes, like you'd expect with epoch reclamation: 160MB for the one -- very fast -- histogram in this example program.
Hmm, alright, that is interesting. I'll have to take a look tomorrow. At first blush, I'm struggling to imagine how enabling tracing integration would break block reclamation -- or maybe there's somehow a leak in the tracing integration that shows up incorrectly in Valgrind output? Definitely a head-scratcher.

It is, I agree. I'm going to read the related code paths in depth today. I think the signal we have out of valgrind these days is accurate, but it can be... indirect sometimes.
Alright, @tobz, an update. As I was working through the metrics-rs histogram code it dawned on me that vector, at least, probably doesn't need perfect sample capture. 0f1a773 pivots away from metrics_util's `Handle` toward something that's closer to vector's particular use case. I can make a ticket for metrics-rs with my findings -- shared here already -- if you'd like.
The throughput of this PR's build versus master is equivalent for our `read_histogram_with_clear` build.
Flamegraph: https://gist.github.com/blt/49858036c35f674a9aac7478089952f7

This is basically what we see from the master branch, further substantiating the performance parity. Test failures look like they'll be easy to repair and are mostly related to my adjusting the internal data model.
@tsantero ping, relevant to your interests.
This commit calls `read_histogram_with_clear` rather than `read_histogram`, as the latter leaves the histogram's underlying block storage in place between polls, leading to steady memory growth of vector. However, this does not arrest growth of the sample storage. The metrics-rs histogram code is quite complicated and it's believable to me that there's a leak in there, or that we're misusing metrics-rs histograms.

We might also consider, now that we're on the subject, whether metrics-rs's histogram is the correct model for us. There are two problems. First, the histogram stores every sample we send it in an atomic linked list of fixed-size blocks, implying that storage _will_ grow, bounded only by how often we poll. Second, reclamation of these fixed blocks is performed only when a quiescent period is available, in the manner of crossbeam-epoch. We should expect such periods to be rare in a loaded vector, though the present issue is something different. I _think_ a fixed array of atomic integers would be sufficient as a histogram for vector -- bounded buckets are cheap -- but this is still the germ of a thought (see the sketch below).

REF #6673 #6461 #6456

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
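As a sketch of that germ of a thought -- names, bucket bounds, and the `f64::INFINITY` catch-all here are illustrative assumptions, not vector's actual implementation -- a histogram as a fixed, boxed slice of atomic counters, with the bucket loop written in the same disjoint-bucket style as the code this PR eventually lands:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch: a histogram as a fixed, boxed slice of (upper bound, count)
// buckets. Recording is one atomic increment -- constant space, no sample
// storage, no epoch reclamation machinery.
pub struct BucketHistogram {
    buckets: Box<[(f64, AtomicU64)]>,
}

impl BucketHistogram {
    // `bounds` must be sorted ascending; a final f64::INFINITY bucket
    // catches everything above the largest explicit bound.
    pub fn new(bounds: &[f64]) -> Self {
        let buckets = bounds
            .iter()
            .chain(std::iter::once(&f64::INFINITY))
            .map(|&b| (b, AtomicU64::new(0)))
            .collect::<Vec<_>>()
            .into_boxed_slice();
        BucketHistogram { buckets }
    }

    pub fn record(&self, value: f64) {
        let mut prev_bound = f64::NEG_INFINITY;
        for (bound, bucket) in self.buckets.iter() {
            // Buckets are disjoint: each value lands in exactly one.
            if value > prev_bound && value <= *bound {
                bucket.fetch_add(1, Ordering::Relaxed);
                return;
            }
            prev_bound = *bound;
        }
    }
}

fn main() {
    let h = BucketHistogram::new(&[1.0, 10.0, 100.0]);
    for v in [0.5, 5.0, 50.0, 500.0] {
        h.record(v);
    }
    // Memory use stays fixed at four buckets no matter how many samples
    // arrive, trading perfect sample capture for bounded storage.
}
```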
This commit upgrades to the current metrics-rs. We know this causes a performance degradation in vector -- see metrics-rs/metrics#179 -- but we want to know if the observed leak is part of our older metrics-rs or is still present upstream. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
This commit follows on from my previous thoughts and replaces the metrics-rs handle with something that more closely matches our internal_metrics use case. Benefits include static memory use compared to the epoch mechanism in metrics-rs, though we do lose perfect sample capture. In my experiments this reduces vector's memory usage from 200MB to 6MB, controlling for the leak. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
LGTM, though you did suggest one more micro-optimization, and there is a new warning in benches/metrics_snapshot.rs
```rust
pub(crate) fn record(&mut self, value: f64) {
    let mut prev_bound = f64::NEG_INFINITY;
    for (bound, bucket) in self.buckets.iter_mut() {
        if value > prev_bound && value <= *bound {
```
You should be able to get rid of one of these comparisons. I think you were referring to this idea too, here: https://github.com/timberio/vector/pull/7014/files#r608864011
Hmm, well. After poking at this for a while I couldn't come up with any other implementation that makes a measurable difference in runtime, since I want to avoid assuming foreknowledge about bucket spacing. The logic strikes me as reasonably explicable, so I'll leave it alone.
Yes, it's unlikely to make a visible difference except in microbenchmarks.
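For the record, the suggested simplification would look something like the sketch below -- my reconstruction of the review suggestion, not the code that landed. Because the bounds ascend, the first bound that `value` does not exceed identifies the bucket, so the `prev_bound` comparison drops out:

```rust
// Sketch of the single-comparison variant discussed above; assumes the
// buckets are sorted ascending by bound with a final f64::INFINITY
// catch-all, matching the shape of the reviewed code.
pub(crate) struct Buckets {
    buckets: Vec<(f64, u64)>,
}

impl Buckets {
    pub(crate) fn record(&mut self, value: f64) {
        for (bound, bucket) in self.buckets.iter_mut() {
            // Bounds ascend, so the first bound `value` fits under is the
            // right bucket; no need to track the previous bound.
            if value <= *bound {
                *bucket += 1;
                return;
            }
        }
    }
}
```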
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
👏 nice work
This commit introduces our own metrics-rs `Handle` implementation, one that takes constant space per metric type. This significantly reduces our happy-path memory use under load and should resolve the memory leaks noted in the closing issues.

The leak culprit is the metrics-rs `Histogram` combined with tracing integration. Their implementation relies on an atomic linked list of fixed-size blocks to store samples. The storage required by this method grows until a quiescent period is found -- a period that happens regularly when metrics-rs is not integrated with tracing, and that does not happen at present for reasons unknown. In our testing we've seen a single histogram under heavy load consume approximately 200MB before being garbage collected, a burden vector doesn't need to bear, as our histogram needs are served by a simple counting-bucket arrangement. Our own histogram uses a fixed, boxed slice of atomic integers as buckets. This is the shape of the data that vector was exposing already, so we remove a little internal translation.

I believe this PR will significantly reduce the memory use of a loaded vector, leak notwithstanding. I am unsure of the performance impact as yet. I will update the PR comments with massif results and an all-up benchmark run.
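As a rough illustration of the constant-space shape described here -- the names and details are assumptions for the sketch, not the PR's literal code -- every variant of the handle is fixed-size once constructed, so per-metric memory stays static regardless of load:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative constant-space handle in the spirit of this PR.
pub enum Handle {
    // Monotonic counter.
    Counter(AtomicU64),
    // Gauge stored as f64 bits inside an AtomicU64.
    Gauge(AtomicU64),
    // Histogram as a fixed, boxed slice of (upper bound, count) buckets.
    Histogram(Box<[(f64, AtomicU64)]>),
}

impl Handle {
    pub fn increment_counter(&self, n: u64) {
        if let Handle::Counter(c) = self {
            c.fetch_add(n, Ordering::Relaxed);
        }
    }

    pub fn update_gauge(&self, value: f64) {
        if let Handle::Gauge(g) = self {
            g.store(value.to_bits(), Ordering::Relaxed);
        }
    }

    pub fn record_histogram(&self, value: f64) {
        if let Handle::Histogram(buckets) = self {
            // Buckets sorted ascending, ending with f64::INFINITY: the
            // first bound `value` fits under takes the increment.
            for (bound, count) in buckets.iter() {
                if value <= *bound {
                    count.fetch_add(1, Ordering::Relaxed);
                    return;
                }
            }
        }
    }
}
```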
REF #6673
Closes #6461 #6456
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>