chore: Use our own metrics-rs Handle implementation #7014

Merged
merged 14 commits into from Apr 8, 2021

Conversation

@blt (Contributor) commented Apr 5, 2021

This commit introduces our own metrics-rs Handle implementation, one that takes constant space per metric type. This significantly reduces our happy-path memory use under load and should resolve the memory leaks noted in the closing issues.

The leak culprit is the metrics-rs Histogram combined with the tracing integration. Its implementation relies on an atomic linked list of fixed-size blocks to store samples. The storage required by this method grows until a quiescent period is found; such periods happen regularly when metrics-rs is not integrated with tracing, but at present, for reasons unknown, they do not happen at all. In our testing we've seen a single histogram under heavy load consume approximately 200 MB before being garbage collected, a burden vector doesn't need to bear, since our own needs for histograms are served by a simple counting-bucket arrangement. Our own histogram uses a fixed, boxed slice of atomic integers as buckets. This is the shape of the data vector was already exposing, so we also get to remove a little internal translation.
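
For illustration, here is a minimal sketch of the counting-bucket arrangement described above: a fixed, boxed slice of atomic integers keyed by upper bounds, so storage stays constant no matter how many samples are recorded. The names and exact layout (BucketHistogram, the sum/count fields) are assumptions for this sketch, not necessarily what the PR's src/metrics/handle.rs does.

use std::sync::atomic::{AtomicU64, Ordering};

// Sketch only: fixed-size bucketed histogram with constant memory per metric.
pub struct BucketHistogram {
    // (upper bound, count) pairs, sorted by ascending bound; the last bound is
    // assumed to be f64::INFINITY so every sample lands in some bucket.
    buckets: Box<[(f64, AtomicU64)]>,
    count: AtomicU64,
    sum: AtomicU64, // running sum of samples, stored as f64 bits
}

impl BucketHistogram {
    pub fn new(bounds: &[f64]) -> Self {
        BucketHistogram {
            buckets: bounds
                .iter()
                .map(|b| (*b, AtomicU64::new(0)))
                .collect::<Vec<_>>()
                .into_boxed_slice(),
            count: AtomicU64::new(0),
            sum: AtomicU64::new(0),
        }
    }

    pub fn record(&self, value: f64) {
        // Bump exactly one bucket: the one whose (prev_bound, bound] range
        // contains the sample.
        let mut prev_bound = f64::NEG_INFINITY;
        for (bound, bucket) in self.buckets.iter() {
            if value > prev_bound && value <= *bound {
                bucket.fetch_add(1, Ordering::Relaxed);
            }
            prev_bound = *bound;
        }
        self.count.fetch_add(1, Ordering::Relaxed);
        // Accumulate the sum as f64 bits with a CAS loop.
        let mut prev = self.sum.load(Ordering::Relaxed);
        loop {
            let next = f64::to_bits(f64::from_bits(prev) + value);
            match self
                .sum
                .compare_exchange_weak(prev, next, Ordering::Relaxed, Ordering::Relaxed)
            {
                Ok(_) => break,
                Err(actual) => prev = actual,
            }
        }
    }
}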

I believe this PR will significantly reduce the memory use of a loaded vector, leak notwithstanding. I am as yet unsure of the performance impact. I will update the PR comments with massif results and an all-up benchmark run.

REF #6673
Closes #6461 #6456

Signed-off-by: Brian L. Troutwine brian@troutwine.us

@binarylogic (Contributor) commented:

cc @tobz, if you have a minute, we'd love your input.

@blt changed the title from "Call read_histogram_with_clear" to "chore: Call read_histogram_with_clear" on Apr 5, 2021
@tobz (Contributor) commented Apr 5, 2021

I'm not sure I can entirely speak to the more philosophical questions in the description, at least not without thinking about it more, but: read_histogram_with_clear is indeed the right choice when you collect the values individually and want to, essentially, mark them as read... which in turn, as Brian mentions, will allow crossbeam-epoch to lazily clean up old blocks and keep your memory usage fairly bounded.

The non-clearing variant is meant for when you want a long-running bucket of values to calculate over in full, i.e. you always need to rehydrate some downstream data structure from t=0. Admittedly, this isn't super useful because of the unbounded growth and, honestly, it could probably stand to be removed. :P

In practice, under benchmark scenarios, there are still "quiescent" periods. As soon as you read/clear the histogram, and those blocks are no longer being read, they become eligible for reclamation. If you only have one "task" doing the read/clear, then you have no concurrent readers, and thus no contention, and thus they're almost immediately reclaimed.

@blt (Contributor, Author) commented Apr 5, 2021

> In practice, under benchmark scenarios, there are still "quiescent" periods. As soon as you read/clear the histogram, and those blocks are no longer being read, they become eligible for reclamation. If you only have one "task" doing the read/clear, then you have no concurrent readers, and thus no contention, and thus they're almost immediately reclaimed.

That's what I would have expected, but it's different from what I'm seeing in practice. With this PR as the base I run vector like so:

#!/usr/bin/env bash

set -o errexit
set -o pipefail
set -o nounset
# set -o xtrace

RUSTFLAGS="-g" cargo build --release --no-default-features --features "sources-stdin,sinks-http"

yes "msg" | valgrind --tool=massif ./target/release/vector --config vector-tohttp.toml

with this config:

data_dir = "/tmp"

[sources.stdin]
type = "stdin"

[sinks.http]
type = "http"
inputs = ["stdin"]
compression = "none"
encoding.codec = "json"
healthcheck.enabled = false
batch.max_bytes = 128
uri = "http://127.0.0.1:8080/"

and still see steady memory growth in the same spot. Here's a massif dump: massif.out.1496054.zip

I'm going to upgrade metrics -- essentially pull in #6442 -- and see how that fares, in case you folks have fixed something upstream.

@blt (Contributor, Author) commented Apr 5, 2021

Hmm, yep. Even with metrics-rs fully upgraded in 4a6bee5 we still have a leak in the histogram: massif.out.1531813.zip

@blt (Contributor, Author) commented Apr 5, 2021

DHAT info: dhat.out.1543297.zip

@tobz (Contributor) commented Apr 6, 2021

I use macOS, and I both failed to find a reasonably simple way to compile/install massif-visualizer, and the DHAT online viewer doesn't seem to work for the above DHAT output: it complains about a missing 'mode' field or something. Long story short, I can't view them. :P

Looking at the textual data for the Massif output, though, I don't see any references to Block<T>, which is what I'd intuitively expect to see in there if it were related.

Not sure if you can post a screenshot from the DHAT viewer.

@blt (Contributor, Author) commented Apr 6, 2021

> I use macOS, and I both failed to find a reasonably simple way to compile/install massif-visualizer, and the DHAT online viewer doesn't seem to work for the above DHAT output: it complains about a missing 'mode' field or something. Long story short, I can't view them. :P

D'oh, my bad. Let me get you some screenshots.

[Screenshot from 2021-04-05 17-32-13]
[Screenshot: DHAT Viewer - dhat.out.1543297, at t-end (bytes)]

@blt (Contributor, Author) commented Apr 6, 2021

@tobz FWIW I recognize that vector is a big project to debug in. I'm working on making a smaller reproducible example that mirrors our usage.

blt added a commit to blt/metrics_histo that referenced this pull request Apr 6, 2021
Extract from [vector](https://github.com/timberio/vector/) to help debug
vectordotdev/vector#7014. Notably, the version in this commit
_does not_ display the issue we see in vector.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

@blt (Contributor, Author) commented Apr 6, 2021

@tobz alright, I've managed to reproduce the problem in https://github.com/blt/metrics_histo. Notably, the problem does not exist at blt/metrics_histo@85e0a91 but appears as of blt/metrics_histo@170e02d, once the tracing integration is added. Here's the massif dump I've grabbed:
massif.out.1584489.zip

Here's a screenshot:

[Screenshot from 2021-04-05 19-00-45]

I will note that before the tracing integration was added, memory use was still very high but came in spikes, as you'd expect with epoch reclamation: about 160 MB for the one -- very fast -- histogram in this example program.

@tobz (Contributor) commented Apr 6, 2021

Hmm, alright, that is interesting.

I'll have to take a look tomorrow. At first blush, I'm struggling to imagine how enabling tracing integration would break block reclamation, or maybe there's somehow a leak in the tracing integration that shows up incorrectly in Valgrind output? Definitely a head scratcher.

@blt (Contributor, Author) commented Apr 6, 2021

> Hmm, alright, that is interesting.
>
> I'll have to take a look tomorrow. At first blush, I'm struggling to imagine how enabling tracing integration would break block reclamation, or maybe there's somehow a leak in the tracing integration that shows up incorrectly in Valgrind output? Definitely a head scratcher.

It is, I agree. I'm going to read the related code paths in-depth today. I think the signal we have out of valgrind these days is accurate but it can be... indirect sometimes.

@blt (Contributor, Author) commented Apr 7, 2021

Alright, @tobz, an update. As I was working through the metrics-rs histogram code it dawned on me that vector, at least, probably doesn't need perfect sample capture. 0f1a773 pivots away from metrics_util's Handle toward something that's closer to vector's particular use case. I can make a ticket for metrics-rs with my findings -- shared here already -- if you'd like.
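
To make the shape of that change concrete, here is a rough sketch of what a vendored Handle with constant space per metric kind could look like. The enum and method names here are illustrative assumptions, not the actual contents of 0f1a773 or src/metrics/handle.rs.

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the fixed-bucket histogram sketched earlier in this thread.
pub struct BucketHistogram;

// Sketch of a handle whose every variant is constant-size, so memory no
// longer scales with the number of histogram samples recorded.
#[derive(Clone)]
pub enum Handle {
    Counter(Arc<AtomicU64>),
    Gauge(Arc<AtomicU64>), // gauge value stored as f64 bits
    Histogram(Arc<BucketHistogram>),
}

impl Handle {
    pub fn increment_counter(&self, by: u64) {
        if let Handle::Counter(c) = self {
            c.fetch_add(by, Ordering::Relaxed);
        }
    }

    pub fn update_gauge(&self, value: f64) {
        if let Handle::Gauge(g) = self {
            g.store(value.to_bits(), Ordering::Relaxed);
        }
    }
}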

@blt changed the title from "chore: Call read_histogram_with_clear" to "chore: Use our own metrics-rs Handle implementation" on Apr 7, 2021
@blt requested review from jszwedko, bruceg, JeanMertz and lukesteensen and removed the request for jszwedko on April 7, 2021 03:44
@blt marked this pull request as ready for review on April 7, 2021 03:45
@blt requested review from a team and pablosichert and removed the request for a team on April 7, 2021 03:45
@blt (Contributor, Author) commented Apr 7, 2021

Alright, good results so far. The massif output suggests that vector's memory use is much lower, at just over 6 MB:
[Screenshot from 2021-04-06 21-05-13]

massif.out.1866254.zip

@blt (Contributor, Author) commented Apr 7, 2021

The throughput of this PR's build versus master is equivalent for our stdin -> remap -> blackhole and file -> remap -> blackhole configurations, though it consumes much less memory. Note that this PR also includes an upgrade to the latest metrics-rs.

@blt (Contributor, Author) commented Apr 7, 2021

Flamegraph: https://gist.github.com/blt/49858036c35f674a9aac7478089952f7

This is basically what we see from master branch, further substantiating the performance parity. Test failures look like they'll be easy to repair and are mostly related to my adjusting the internal data model.

@blt (Contributor, Author) commented Apr 7, 2021

@tsantero ping, relevant to your interests.

Resolved review threads on src/event/metric.rs and src/metrics/handle.rs.
@blt requested a review from a team on April 7, 2021 17:52
blt added 9 commits April 7, 2021 14:51
This commit calls `read_histogram_with_clear` rather than `read_histogram`, as
the latter leaves the histogram's underlying block storage in place between
polls, which leads to steady memory growth in vector. However, this change does
not arrest the growth of the sample storage. The metrics-rs histogram code is
quite complicated and it's believable to me that there's a leak in there, or
that we're misusing metrics-rs histograms.

We might also consider, now that we're on the subject, whether metrics-rs's
histogram is the correct model for us. There are two problems. First, the
histogram stores every sample we send it in an atomic linked list of fixed-size
blocks, implying that storage _will_ grow, bounded only by how often we
poll. Second, the collection of these fixed blocks is performed only when a
quiescent period is available, in the manner of crossbeam-epoch. We should
expect these to be rare in a loaded vector, though the present issue is
something different. I _think_ a fixed array of atomic integers would be
sufficient as a histogram for vector -- bounded buckets are cheap -- but this is
still the germ of a thought.

REF #6673 #6461 #6456

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
This commit upgrades to the current metrics-rs. We know this causes a
performance degradation in vector -- see
metrics-rs/metrics#179 -- but we want to know whether the
observed leak is part of our older metrics-rs or is still present upstream.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
This commit follows on from my previous thoughts and replaces the metrics-rs
handle with something that more closely matches our internal_metrics use
case. Benefits include static memory use, compared to the epoch mechanism in
metrics-rs, though we do lose perfect sample capture.

In my experiments this reduces vector's memory usage from 200Mb to 6Mb,
controlling for the leak.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Resolved review threads on src/metrics/handle.rs.
blt added 3 commits April 7, 2021 15:29
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
@blt requested review from bruceg and jszwedko on April 8, 2021 01:37
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

@bruceg (Member) left a comment:

LGTM, though you did suggest one more micro-optimization, and there is a new warning in benches/metrics_snapshot.rs

pub(crate) fn record(&mut self, value: f64) {
let mut prev_bound = f64::NEG_INFINITY;
for (bound, bucket) in self.buckets.iter_mut() {
if value > prev_bound && value <= *bound {

Member commented:

You should be able to get rid of one of these comparisons. I think you were referring to this idea too here: https://github.com/timberio/vector/pull/7014/files#r608864011

@blt (Contributor, Author) replied:

Hmm, well. After poking at this for a while I couldn't come up with any other implementation that makes a measurable difference in runtime, since I want to avoid assuming foreknowledge about bucket spacing. The logic strikes me as reasonably explicable, so I'll leave it alone.

Member replied:

Yes, it's unlikely to make a visible difference except in microbenchmarks.
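
For reference, the single-comparison variant discussed in this thread would look roughly like the sketch below: with buckets sorted by ascending upper bound and an early exit, reaching bucket i already implies the sample exceeded every earlier bound, so the lower-bound check can be dropped. This is a hedged illustration of the idea, not the merged code, and as noted above the measurable difference is negligible outside microbenchmarks.

use std::sync::atomic::{AtomicU64, Ordering};

pub struct Buckets {
    // Sorted by ascending upper bound; the final bound is assumed to be
    // f64::INFINITY so every sample matches some bucket.
    buckets: Box<[(f64, AtomicU64)]>,
}

impl Buckets {
    pub fn record(&self, value: f64) {
        for (bound, bucket) in self.buckets.iter() {
            // One comparison per bucket: earlier buckets have already been
            // ruled out, so the `value > prev_bound` check is implied.
            if value <= *bound {
                bucket.fetch_add(1, Ordering::Relaxed);
                return;
            }
        }
    }
}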

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
@blt merged commit ddb16b1 into vectordotdev:master on Apr 8, 2021
@blt deleted the blt-metrics_mem_leak branch on April 8, 2021 17:45
@binarylogic (Contributor) commented:

👏 nice work

Successfully merging this pull request may close these issues:

Memory usage keeps increasing with Prometheus sink