
Investigate http -> pipelines -> blackhole #10144

Closed

blt opened this issue Nov 23, 2021 · 15 comments
Assignees
Labels
domain: performance - Anything related to Vector's performance
domain: vrl - Anything related to the Vector Remap Language
transform: pipelines - Anything `pipelines` transform related

Comments

@blt
Contributor

blt commented Nov 23, 2021

This issue is a record of, and parent to, the investigation of Vector's performance when configured as in this gist.

Inbound http_gen -- built from lading at SHA 0da91906d56acc899b829cea971d79f13e712e21 -- is configured like so:

worker_threads: 1
prometheus_addr: "0.0.0.0:9001"

targets:
  vector:
    headers: {}
    target_uri: "http://localhost:8282/"
    bytes_per_second: "500 Mb"
    parallel_connections: 10
    block_sizes: ["1Kb", "2Kb"]
    method:
      post:
        maximum_prebuild_cache_size_bytes: "256 Mb"
        variant:
          static:
            static_path: "/home/blt/projects/com/github/timberio/vector/investigation/input.log"

The referenced input.log can be found here.

@blt blt added the domain: performance, domain: vrl, and domain: pipelines labels Nov 23, 2021
@blt
Contributor Author

blt commented Nov 23, 2021

An attempt has been made to get the referenced vector.toml et al into a soak. This work is ongoing in #10142 but is blocked pending a minikube bug.

@blt
Contributor Author

blt commented Nov 23, 2021

Absent a CI soak we use the older methods from #8512 et al for measuring throughput. Until further notice vector is run like so:

> firejail --noprofile --rlimit-as=8g --cpu=0,1,2,3,4,5,6,7 ./target/release/vector --config ~/projects/com/github/timberio/vector/investigation/vector.toml

and http_gen is run like so:

> firejail --noprofile --cpu=8 ./target/release/http_gen --config-path ~/projects/com/github/timberio/vector/investigation/http_gen.yaml

Under these conditions we find:

SHA       http_gen generation (KiB/s)   http bytes processed (KiB/s)   blackhole receipt (KiB/s)
127f05b   27                            30                             163

Please note that these values are in kibibytes. Also be aware that throughput varied wildly; these are median values taken over a five-minute sample.

@blt
Contributor Author

blt commented Nov 23, 2021

Captured perf command:

> perf record --output=/tmp/vector-perf.data --compression-level=10 --call-graph=dwarf ./target/release/vector --config ~/projects/com/github/timberio/vector/investigation/vector.toml

Vector built like so:

> CARGO_PROFILE_RELEASE_DEBUG=2 cargo build --release

Unfortunately, even compressed, the data file is 1.5G and cannot be compressed further.

flamegraph
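For reference, the flamegraph above would typically be produced from the captured perf data with Brendan Gregg's FlameGraph scripts; the exact invocation used here isn't recorded in the thread, but it is along these lines:

> perf script -i /tmp/vector-perf.data | stackcollapse-perf.pl | flamegraph.pl > vector-flamegraph.svg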

Items of concern:

  • regex::exec::ExecBuilder::build appears as a hot spot, implying we are building regexes in a loop (see the sketch below).
  • Major cost centers show up in VRL; work is ongoing in VRL Performance #9811.
  • MatchDatadogQueryFn in VRL's stdlib is a major cost center and has been introduced recently.
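To illustrate the first point, here is a minimal, hypothetical sketch (not Vector's code) of the difference between compiling a regex per event and hoisting the compilation out of the hot path:

use regex::Regex;

// Hypothetical anti-pattern the flamegraph suggests: every call pays for regex
// compilation (ExecBuilder::build) because the pattern is rebuilt per event.
fn matches_per_event(line: &str, pattern: &str) -> bool {
    Regex::new(pattern).expect("invalid regex").is_match(line)
}

// Compiling once and reusing the compiled Regex keeps compilation out of the
// per-event loop entirely.
fn matches_precompiled(lines: &[&str], pattern: &str) -> usize {
    let re = Regex::new(pattern).expect("invalid regex");
    lines.iter().filter(|line| re.is_match(line)).count()
}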

@blt
Contributor Author

blt commented Nov 23, 2021

/cc @vladimir-dd

@vladimir-dd
Contributor

vladimir-dd commented Nov 23, 2021

Right, thanks @blt for running this test.
Yes, I'd expect MatchDatadogQueryFn to be one of the performance bottlenecks - it's called once per pipeline for each event (plus for the category processors), and I assume we are calling regex::exec::ExecBuilder::build from this function - it seems we build regexes at runtime, not during VRL compilation.

@blt
Contributor Author

blt commented Nov 23, 2021

Building on the last comment, @leebenson noted in Slack that match_datadog_query is caching regexes, per:

#[cached(
    type = "SizedCache<String, Regex>",
    create = "{ SizedCache::with_size(10) }",
    convert = r#"{ to_match.to_owned() }"#
)]
/// Returns compiled word boundary regex. Cached to avoid recompilation in hot paths.
fn word_regex(to_match: &str) -> Regex {
    Regex::new(&format!(
        r#"\b{}\b"#,
        regex::escape(to_match).replace("\\*", ".*")
    ))
    .expect("invalid wildcard regex")
}

#[cached(
    type = "SizedCache<String, Regex>",
    create = "{ SizedCache::with_size(10) }",
    convert = r#"{ to_match.to_owned() }"#
)]
/// Returns compiled wildcard regex. Cached to avoid recompilation in hot paths.
fn wildcard_regex(to_match: &str) -> Regex {
    Regex::new(&format!(
        "^{}$",
        regex::escape(to_match).replace("\\*", ".*")
    ))
    .expect("invalid wildcard regex")
}
It's unknown how many instances we have at runtime now, but a cache size of 10 does seem small for the posted config.
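As a toy illustration of why a size-10 cache can still thrash (this is a simplified model, not the `cached` crate's actual SizedCache): if the posted config rotates through more than ten distinct match strings per event, every lookup misses and the regex is rebuilt each time.

use std::collections::VecDeque;

// Toy model of a ten-entry cache keyed by match string, evicting oldest-first.
struct TinyCache {
    keys: VecDeque<String>,
    capacity: usize,
    misses: usize,
}

impl TinyCache {
    fn get_or_build(&mut self, pattern: &str) {
        if self.keys.iter().any(|k| k.as_str() == pattern) {
            return; // hit: reuse the compiled regex
        }
        self.misses += 1; // miss: this is where the regex would be recompiled
        if self.keys.len() == self.capacity {
            self.keys.pop_front();
        }
        self.keys.push_back(pattern.to_owned());
    }
}

fn main() {
    let mut cache = TinyCache { keys: VecDeque::new(), capacity: 10, misses: 0 };
    // Eleven distinct patterns cycled over 1,000 events: every single lookup misses.
    let patterns: Vec<String> = (0..11).map(|i| format!("service:app-{}", i)).collect();
    for _ in 0..1_000 {
        for p in &patterns {
            cache.get_or_build(p);
        }
    }
    println!("misses: {}", cache.misses); // 11000
}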

@blt
Contributor Author

blt commented Nov 23, 2021

From @JeanMertz in Slack:

I don't know the ins and outs here, but AOT compilation can happen in the compile step for functions:

fn compile(
    &self,
    _state: &state::Compiler,
    _ctx: &FunctionCompileContext,
    mut arguments: ArgumentList,
) -> Compiled {
    let value = arguments.required("value");
    let query_value = arguments.required_literal("query")?.to_value();

    // Query should always be a string.
    let query = query_value
        .try_bytes_utf8_lossy()
        .expect("datadog search query not bytes");

    // Compile the Datadog search query to AST.
    let node = parse(&query).map_err(|e| {
        Box::new(ExpressionError::from(e.to_string())) as Box<dyn DiagnosticError>
    })?;

    Ok(Box::new(MatchDatadogQueryFn { value, node }))
}

Take a look at the parse_grok function, which does AOT compilation of the grok pattern:
fn compile(
    &self,
    _state: &state::Compiler,
    _ctx: &FunctionCompileContext,
    mut arguments: ArgumentList,
) -> Compiled {
    let value = arguments.required("value");

    let pattern = arguments
        .required_literal("pattern")?
        .to_value()
        .try_bytes_utf8_lossy()
        .expect("grok pattern not bytes")
        .into_owned();
    let mut grok = grok::Grok::with_patterns();
    let pattern = Arc::new(
        grok.compile(&pattern, true)
            .map_err(|e| Box::new(Error::InvalidGrokPattern(e)) as Box<dyn DiagnosticError>)?,
    );

    let remove_empty = arguments
        .optional("remove_empty")
        .unwrap_or_else(|| expr!(false));

    Ok(Box::new(ParseGrokFn {
        value,
        pattern,
        remove_empty,
    }))
}

This only works if the argument you want to use for AOT compilation only takes literals, which does seem to be the case for match_datadog_query:
let query_value = arguments.required_literal("query")?.to_value();

So it seems all the ingredients are there to do as much as possible at the compile step, instead of at runtime for each event (assuming caching isn't working).
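A rough sketch of what that could look like (hypothetical names; this is not the actual match_datadog_query implementation): everything derivable from the literal query argument is built once at compile time, so the per-event path only matches.

use regex::Regex;

// Hypothetical compiled state: the wildcard regex is built once, from the
// literal "query" argument, during VRL compilation.
struct PrecompiledQuery {
    wildcard: Regex,
}

fn compile_query(to_match: &str) -> PrecompiledQuery {
    let wildcard = Regex::new(&format!(
        "^{}$",
        regex::escape(to_match).replace("\\*", ".*")
    ))
    .expect("invalid wildcard regex");
    PrecompiledQuery { wildcard }
}

impl PrecompiledQuery {
    // Per-event path: no Regex::new, just matching.
    fn matches(&self, candidate: &str) -> bool {
        self.wildcard.is_match(candidate)
    }
}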

@blt
Contributor Author

blt commented Dec 1, 2021

REF #10202

@blt
Contributor Author

blt commented Dec 1, 2021

REF #10189

@vladimir-dd
Contributor

REF #10200

@binarylogic binarylogic added the transform: pipelines label and removed the domain: pipelines label Dec 7, 2021
@blt blt self-assigned this Jan 31, 2022
@blt
Contributor Author

blt commented Jan 31, 2022

We are now investigating with the following configs run as above. Lading's version is e81610664b059422b774c770e741569c6ef8cf08.

We see from the soaks that both http_pipelines_blackhole and http_pipelines_no_grok_blackhole suffer similarly poor performance. In #11086 we investigated removing onig to get a clearer sense of allocation patterns in these configurations and got no real leads beyond what we already knew: our cloning of Event and vrl::Value is an allocation churn point.

For vector at 0736db5 I have the two offwaketime flamegraphs. The first is for http_pipelines_blackhole and the second for the no-grok variant:

vector-hpb
vector-hpb-no_grok

The differences between these two graphs are non-trivial, but considering how similar their runtimes are and that the no-grok variant is a chopped-down version of the grok variant, I suspect that addressing issues in no-grok is a fruitful path forward. PollStrategy::poll_streams, Fanout::start_send, and EventsSend::emit_metrics stand out as major culprits forcing vector off-CPU.

blt added a commit that referenced this issue Feb 1, 2022
This commit slims out unused dependencies from the crate, makes sure we don't use lazy_static, and makes other small changes. While investigating #10144 we noticed that off-cpu time was concentrated on BTreeMap clones but could not investigate their exact source as onig's small allocations -- a dependency of this crate -- drowned out the heaptrack results. Unfortunately onig performs far better than `fancy_regex` for our use case and swapping that crate out temporarily did not materially improve off-cpu time. We did learn that clones of `Value` and `Event` are the primary BTreeMap culprits, as suspected, but do not have any way to deal with that yet. An effort to insert `Cow` around these trees failed, as inserting the manual lifetime annotations is too big a task without some mechanical help.

This commit, should we choose to pursue this, sets us up to remove onig in the future. The `grok` interface as discussed [here](#11086 (comment)) is very thin and we might profitably upstream an aliasing mechanism into `fancy_regex`, itself a skin on top of `regex`.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
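For a sense of why those BTreeMap clones hurt, a generic illustration (using plain std types, not Vector's Event/Value): a deep clone reallocates every node and every string, while an Arc handle is a refcount bump, at the cost of the ownership rework described above.

use std::collections::BTreeMap;
use std::sync::Arc;

fn main() {
    let mut fields: BTreeMap<String, String> = BTreeMap::new();
    for i in 0..1_000 {
        fields.insert(format!("field_{}", i), "value".repeat(16));
    }

    // Deep clone: re-allocates every node and every key/value string, once per event.
    let _per_event_copy = fields.clone();

    // Shared handle: cloning the Arc is a single atomic increment, but the map is
    // now immutable unless something like copy-on-write is layered on top.
    let shared = Arc::new(fields);
    let _cheap_copy = Arc::clone(&shared);
}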
Signed-off-by: Vladimir Zhuk <vladimir.zhuk@datadoghq.com>
blt added a commit that referenced this issue Feb 1, 2022
One of the oddities of `Fanout` was the use of an `i` to index sinks. This was,
partially, preserved across polls but was not in general use when looping. It is
my understanding that `Fanout.i` was ultimately vestigial and any actual
indexing was reset each poll. I think, as a result, we would also repoll the
same sink multiple times when removals happened, which should be rare in
practice but was possible. I have extracted the vector and index munging into a
`Store` type. We should now no longer poll underlying sinks multiple times and
calling code does not have to munge indexes, although it is required to manually
advance/reset a 'cursor' because we're changing the shape of an iterator while
iterating it.

The primary difference here is the use of `swap_remove` instead of
`remove`. This saves a shift.

I expect no performance change here. I think, ultimately, this is a stepping
stone to getting the logic here very explicit so we can start to do broadcasting
in a way that is not impeded by slow receivers downstream.

REF #10144
REF #10912

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
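On the swap_remove point, the shift it saves looks like this in plain Vec terms (a generic illustration, not the Fanout code itself):

fn main() {
    let sinks = vec!["a", "b", "c", "d", "e"];

    // Vec::remove(1) is O(n): "c", "d", and "e" all shift left one slot.
    let mut shifted = sinks.clone();
    shifted.remove(1);
    assert_eq!(shifted, ["a", "c", "d", "e"]);

    // Vec::swap_remove(1) is O(1): the last element fills the hole, order is lost.
    let mut swapped = sinks.clone();
    swapped.swap_remove(1);
    assert_eq!(swapped, ["a", "e", "c", "d"]);
}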
blt added a commit that referenced this issue Feb 3, 2022
With regard to #10144, one thing we see very prominently in CPU time is the recursive call to determine the size of Event instances. This is something we could potentially keep updated but it's not clear how valuable that would be in practice. This commit -- by removing one prominent example -- is intended to figure that out.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue Feb 3, 2022
The problem with fanout is that it suffers from the slowest-receiver problem. We wait for every receiver to be available, then slam input into them. It's possible that if we buffer up some events we can reduce the gap time between sends. While this logic should ultimately be done in Fanout, for now, to prove out the idea, I'm adding a Buffer around the relevant fanouts.

REF #10144

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
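A rough sketch of the buffering idea using plain channels (purely illustrative; Vector's Fanout and Sink machinery works differently): accumulate a batch before handing events to the downstream senders, so the producer makes progress for a whole batch between stalls instead of potentially stalling on every single event.

use std::sync::mpsc::SyncSender;

struct BufferedFanout<T: Clone> {
    buffer: Vec<T>,
    capacity: usize,
    senders: Vec<SyncSender<T>>,
}

impl<T: Clone> BufferedFanout<T> {
    fn push(&mut self, event: T) {
        self.buffer.push(event);
        if self.buffer.len() >= self.capacity {
            self.flush();
        }
    }

    fn flush(&mut self) {
        for tx in &self.senders {
            for event in &self.buffer {
                // A slow receiver blocks here, but only once the batch is full,
                // not in the middle of accepting every upstream event.
                let _ = tx.send(event.clone());
            }
        }
        self.buffer.clear();
    }
}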
blt added a commit that referenced this issue Feb 4, 2022
I'm starting to experiment with the removal of `Sink` implementation for
`Fanout` for #10144. My in-flight work is starting to sprawl uncomfortably so
this is a small patch to remove reliance on one of the related `Sink` traits.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue Mar 5, 2022
Introduce a new route transform benchmark

This commit introduces a new benchmark for the route transform with the
understanding that we are, in some cases, bottlenecked through this
transform. See #11688 as an
example. Possible contributor to #10144.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
@blt
Contributor Author

blt commented Mar 9, 2022

Now that #11633 is correct we have an interesting lead on what's happening here. Consider this configuration versus this configuration. Both implement the same basic idea but the no_pipelines variant is the "unrolled" version of the former, that is, it does not use the pipelines construct. The no_pipelines variant is almost twice as fast as the pipelines variant -- hat tip to @lukesteensen for the idea to try this out -- and we get some interesting data on why the unrolled version is faster.

Consider the following graphs. I have run the unrolled variant first, followed by a pause, and then the pipelines version. Here we see throughput, which, aside from an initial high spike, shows that the unrolled version has roughly twice the throughput:

image (1)

Throughput is measured from lading, that is, the load-generating tool. When we inspect Vector's self-instrumentation we find an interesting pattern. On the left-hand side the total number of processed events in Transform instances is roughly the same as on the right, except that the right-hand side shows much higher internal Event traffic in the filter and noop components.

image (2)

Utilization bears this out further. On the left-hand side Vector's runtime is dominated by the transforms present in the configuration, especially those with complicated datadog_search conditions, presenting a valuable optimization target. On the right-hand side runtime is dominated by truthy/falsy components, a quirk of how filtering is achieved in pipelines:

image (3)

To explain this, the pipeline "filter" is not a single Transform instance but is actually two instances, the "truthy" side and the "falsy" side, with terminals on the negations of each boolean. This means every condition is run twice, Event instances are cloned twice, and so on. The noop is used as a coalescing terminal in the pipeline and shows up heavily in utilization as well, something that the "unrolled" variant is able to side-step entirely.

The pipeline mechanism in Vector as of this writing works by "expanding" sub-transforms into a sub-topology of the main topology and guides data flow through this sub-topology with implicit filter/expand/routing constructs of non-zero cost. Consider that if the pipelines filter were a single runtime component rather than two, the pipeline configuration would pay the cost of only a single condition check per filter per Event, rather than two. The expansion mechanism also awkwardly thwarts other optimizations that the rest of the Vector topology has picked up. Consider that PipelinesConfig does not turn into a Pipelines type implementing Transform. If it, say, implemented FunctionTransform then we could run multiple copies of it in parallel, improving CPU saturation. See #10265 for the kind of improvement this optimization brings.

As a next step I believe we should contrive to have PipelineFilterConfig implement TransformConfig::build and avoid the falsy/truthy penalty. More generally, PipelinesConfig should implement TransformConfig::build, and the resulting Transform instance will then be able to pick up other topology optimizations. I have a hunch that the expansion approach extant in the code base today should be gradually walked back and that we should not have TransformConfig instances that don't implement ::build. Ideally we can contrive to lose Noop entirely. As it is, the expansion mechanism used by pipelines means we lose optimizations applied to the broader topology, and optimizing the pipelines ends up being a specialized project.
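To make the falsy/truthy penalty concrete, a schematic sketch (not the actual pipeline code; matches_condition stands in for an arbitrary, possibly expensive, datadog_search check): the expanded form evaluates the condition in two components, each holding its own clone of the event, while a single filter evaluates once and routes the event to exactly one output.

fn matches_condition(event: &str) -> bool {
    event.contains("source:nginx")
}

// Pipelines today, conceptually: the event is cloned into a "truthy" component
// and a "falsy" component, and the condition is evaluated in each.
fn expanded_truthy_falsy(event: &str) -> (Option<String>, Option<String>) {
    let for_truthy = event.to_owned(); // clone handed to the truthy side
    let for_falsy = event.to_owned(); // clone handed to the falsy side
    let truthy = if matches_condition(&for_truthy) { Some(for_truthy) } else { None };
    let falsy = if !matches_condition(&for_falsy) { Some(for_falsy) } else { None };
    (truthy, falsy)
}

// A single filter component: one evaluation, one owned event, routed to exactly
// one of the two outputs.
fn single_filter(event: String) -> (Option<String>, Option<String>) {
    if matches_condition(&event) {
        (Some(event), None)
    } else {
        (None, Some(event))
    }
}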

blt added a commit that referenced this issue Mar 9, 2022
This commit is a mostly mechanical shuffle of our pipeline code, done while investigating #10144. I found the previous organization challenging to comprehend and this reads more clearly to me. I've tried to follow the section commenting style common to the rest of the transform code.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue Apr 13, 2022
chore: Add microbenchmarks to lib/datadog/grok  (#12172)

With reference to #10144 and in light of #11849 we now have an understanding
that http -> pipelines -> blackhole is significantly bottlenecked in
datadog-grok. Unfortunately most of our data indicates we're looking at regex
being the prime pain point. This commit does two things: introduces
micro-benchmarks for `datadog_grok::filters::keyvalue::apply_filter` --
unfortunately exposing `datadog_grok::filters` from the crate so we can
benchmark it -- and improves the performance of said function by +40% in the
micro when there is a field delimiter in place. Specifically, we remove the need
for nom-regex and avoid cloning a `regex::Regex` instance for each key and each
value in a field.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
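The shape of that fix, roughly (illustrative only, not the actual apply_filter code): build the delimiter regex once per field and borrow it in the per-key/per-value loop, rather than cloning a compiled Regex for every item.

use regex::Regex;

// The expensive shape: a compiled Regex is cloned for every key/value pair.
fn split_pairs_cloning(pairs: &[&str], delimiter: &Regex) -> usize {
    pairs
        .iter()
        .map(|kv| {
            let re = delimiter.clone(); // per-item clone of a compiled Regex
            re.split(kv).count()
        })
        .sum()
}

// The cheap shape: compile once, borrow everywhere.
fn split_pairs_borrowing(pairs: &[&str], delimiter: &Regex) -> usize {
    pairs.iter().map(|kv| delimiter.split(kv).count()).sum()
}

fn main() {
    let delimiter = Regex::new(r"\s*=\s*").expect("valid regex");
    let pairs = ["key = value", "other=thing"];
    assert_eq!(
        split_pairs_cloning(&pairs, &delimiter),
        split_pairs_borrowing(&pairs, &delimiter)
    );
}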
@blt
Contributor Author

blt commented Apr 19, 2022

As of 06e05be the main bottleneck per #12267 is the use of "enriched" mode in VRL's parse_user_agent. Resolving this -- by switching to "fast" mode -- suggests that the next areas of work are:

@JeanMertz
Contributor

Runtime regex compilation accounts for 9% of execution time, likely this. Call sites for these functions are in VRL's stdlib and src/conditions/datadog_search.rs.

To expand on this, the only function in VRL's stdlib using word_regex and wildcard_regex is match_datadog_query.

blt added a commit that referenced this issue Apr 20, 2022
Related to #10144 we're curious what the pipelines soak looks like if the remap
steps remain but are as cheap as possible. This soak is related to the no_grok
variant in that it's basically the flip of that.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue May 3, 2022
One of the major difficulties underlying #10144 is an inability for Vector to
saturate its cores. Exactly why is not totally clear, but we _are_ bottlenecking
right at the top in our soak tests by parsing json in VRL rather than using the
encoding feature of the source.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue May 16, 2022
* Combine pipeline stages

This commit adjusts pipeline expansion so that they are combined, rather than,
well, expanded. This means that the sub-transforms of a pipeline run in serial
but that each pipeline as a whole can run multiple copies of itself at
once. This also cleans up many low-priority tasks.

Resolves #11787
Resolves #11784
REF #10144

Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* Tidy up errors, fix bug in vector.toml pipeline config

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* test dings

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* temporarily disable http_datadog_filter_blackhole

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* PR feedback

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* PR feedback with regard to outputs

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* PR feedback

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* try 64 wide interior buffer

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

Co-authored-by: Brian L. Troutwine <brian@troutwine.us>
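As a toy mental model of the combined approach (a sketch, not the topology code): the stages inside one pipeline run serially for a given event, while several identical copies of that pipeline drain a shared queue concurrently.

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Stand-ins for a pipeline's sub-transforms; within one pipeline copy they
// always run in order.
fn stage_one(event: String) -> String {
    event.to_uppercase()
}

fn stage_two(event: String) -> String {
    format!("{} [processed]", event)
}

// Toy model: `copies` identical pipelines pull events off a shared queue and
// run the serial stage chain concurrently with one another.
fn run_pipeline_copies(events: Vec<String>, copies: usize) -> usize {
    let (tx, rx) = mpsc::channel::<String>();
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..copies)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut processed: usize = 0;
                loop {
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(event) => {
                            let _out = stage_two(stage_one(event)); // serial within a copy
                            processed += 1;
                        }
                        Err(_) => break, // channel closed and drained
                    }
                }
                processed
            })
        })
        .collect();

    for event in events {
        tx.send(event).unwrap();
    }
    drop(tx); // close the channel so the workers exit once the queue drains

    workers.into_iter().map(|w| w.join().unwrap()).sum()
}

fn main() {
    let events: Vec<String> = (0..100).map(|i| format!("event {}", i)).collect();
    println!("processed {} events", run_pipeline_copies(events, 4));
}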
@jszwedko
Member

Closing since this transform was removed.
