
Investigate http -> pipelines -> blackhole #10144

Closed

blt opened this issue Nov 23, 2021 · 15 comments
Assignees
Labels
domain: performance - Anything related to Vector's performance
domain: vrl - Anything related to the Vector Remap Language
transform: pipelines - Anything `pipelines` transform related

Comments

@blt
Contributor

blt commented Nov 23, 2021

This issue is a record of, and parent to, the investigation of Vector's performance when configured as in this gist.

Inbound http_gen -- built from lading at SHA 0da91906d56acc899b829cea971d79f13e712e21 -- is configured like so:

worker_threads: 1
prometheus_addr: "0.0.0.0:9001"

targets:
  vector:
    headers: {}
    target_uri: "http://localhost:8282/"
    bytes_per_second: "500 Mb"
    parallel_connections: 10
    block_sizes: ["1Kb", "2Kb"]
    method:
      post:
        maximum_prebuild_cache_size_bytes: "256 Mb"
        variant:
          static:
            static_path: "/home/blt/projects/com/github/timberio/vector/investigation/input.log"

The referenced input.log can be found here.

@blt blt added the domain: performance, domain: vrl, and domain: pipelines labels Nov 23, 2021
@blt
Contributor Author

blt commented Nov 23, 2021

An attempt has been made to get the referenced vector.toml et al into a soak. This work is ongoing in #10142 but is blocked pending a minikube bug.

@blt
Contributor Author

blt commented Nov 23, 2021

Absent a CI soak we use the older methods from #8512 et al for measuring throughput. Until further notice vector is run like so:

> firejail --noprofile --rlimit-as=8g --cpu=0,1,2,3,4,5,6,7 ./target/release/vector --config ~/projects/com/github/timberio/vector/investigation/vector.toml

and http_gen is run like so:

> firejail --noprofile --cpu=8 ./target/release/http_gen --config-path ~/projects/com/github/timberio/vector/investigation/http_gen.yaml

Under these conditions we find:

SHA       http_gen generation (KiB/s)   http bytes processed (KiB/s)   blackhole receipt (KiB/s)
127f05b   27                            30                             163

Please note that these values are in kibibytes. Also be aware that throughput varied wildly; these are median values taken over a five-minute sample.

@blt
Contributor Author

blt commented Nov 23, 2021

Captured perf command:

> perf record --output=/tmp/vector-perf.data --compression-level=10 --call-graph=dwarf ./target/release/vector --config ~/projects/com/github/timberio/vector/investigation/vector.toml

Vector built like so:

> CARGO_PROFILE_RELEASE_DEBUG=2 cargo build --release

Unfortunately, even compressed, the data file is 1.5G and cannot be compressed further.

flamegraph
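For reference, the flamegraph above would typically be produced from the captured perf data with Brendan Gregg's FlameGraph scripts; the exact invocation used here isn't recorded in the thread, but it is along these lines:

> perf script -i /tmp/vector-perf.data | stackcollapse-perf.pl | flamegraph.pl > vector-flamegraph.svg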

Items of concern:

  • regex::exec::ExecBuilder::build appears as a hot spot, implying we are building regexes in a loop (see the sketch below).
  • Major cost centers show up in VRL; work is ongoing in VRL Performance #9811.
  • MatchDatadogQueryFn in VRL's stdlib is a major cost center and has been introduced recently.
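To illustrate the first point, here is a minimal, hypothetical sketch (not Vector's code) of the difference between compiling a regex per event and hoisting the compilation out of the hot path:

use regex::Regex;

// Hypothetical anti-pattern the flamegraph suggests: every call pays for regex
// compilation (ExecBuilder::build) because the pattern is rebuilt per event.
fn matches_per_event(line: &str, pattern: &str) -> bool {
    Regex::new(pattern).expect("invalid regex").is_match(line)
}

// Compiling once and reusing the compiled Regex keeps compilation out of the
// per-event loop entirely.
fn matches_precompiled(lines: &[&str], pattern: &str) -> usize {
    let re = Regex::new(pattern).expect("invalid regex");
    lines.iter().filter(|line| re.is_match(line)).count()
}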

@blt
Contributor Author

blt commented Nov 23, 2021

/cc @vladimir-dd

@vladimir-dd
Contributor

vladimir-dd commented Nov 23, 2021

Right, thanks @blt for running this test.
Yes, I'd expect MatchDatadogQueryFn to be one of the performance bottlenecks - it's called once per pipeline for each event (plus for the category processors), and I assume we are calling regex::exec::ExecBuilder::build from this function - it seems we build regexes at runtime, not during VRL compilation.

@blt
Contributor Author

blt commented Nov 23, 2021

Building on the last comment, @leebenson noted in Slack that match_datadog_query is caching regexes, per:

#[cached(
    type = "SizedCache<String, Regex>",
    create = "{ SizedCache::with_size(10) }",
    convert = r#"{ to_match.to_owned() }"#
)]
/// Returns compiled word boundary regex. Cached to avoid recompilation in hot paths.
fn word_regex(to_match: &str) -> Regex {
    Regex::new(&format!(
        r#"\b{}\b"#,
        regex::escape(to_match).replace("\\*", ".*")
    ))
    .expect("invalid wildcard regex")
}

#[cached(
    type = "SizedCache<String, Regex>",
    create = "{ SizedCache::with_size(10) }",
    convert = r#"{ to_match.to_owned() }"#
)]
/// Returns compiled wildcard regex. Cached to avoid recompilation in hot paths.
fn wildcard_regex(to_match: &str) -> Regex {
    Regex::new(&format!(
        "^{}$",
        regex::escape(to_match).replace("\\*", ".*")
    ))
    .expect("invalid wildcard regex")
}
It's unknown how many instances we have at runtime now, but a cache size of 10 does seem small for the posted config.
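As a toy illustration of why a size-10 cache can still thrash (this is a simplified model, not the `cached` crate's actual SizedCache): if the posted config rotates through more than ten distinct match strings per event, every lookup misses and the regex is rebuilt each time.

use std::collections::VecDeque;

// Toy model of a ten-entry cache keyed by match string, evicting oldest-first.
struct TinyCache {
    keys: VecDeque<String>,
    capacity: usize,
    misses: usize,
}

impl TinyCache {
    fn get_or_build(&mut self, pattern: &str) {
        if self.keys.iter().any(|k| k.as_str() == pattern) {
            return; // hit: reuse the compiled regex
        }
        self.misses += 1; // miss: this is where the regex would be recompiled
        if self.keys.len() == self.capacity {
            self.keys.pop_front();
        }
        self.keys.push_back(pattern.to_owned());
    }
}

fn main() {
    let mut cache = TinyCache { keys: VecDeque::new(), capacity: 10, misses: 0 };
    // Eleven distinct patterns cycled over 1,000 events: every single lookup misses.
    let patterns: Vec<String> = (0..11).map(|i| format!("service:app-{}", i)).collect();
    for _ in 0..1_000 {
        for p in &patterns {
            cache.get_or_build(p);
        }
    }
    println!("misses: {}", cache.misses); // 11000
}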

@blt
Contributor Author

blt commented Nov 23, 2021

From @JeanMertz in Slack:

I don't know the ins and outs here, but AOT compilation can happen in the compile step for functions:

fn compile(
    &self,
    _state: &state::Compiler,
    _ctx: &FunctionCompileContext,
    mut arguments: ArgumentList,
) -> Compiled {
    let value = arguments.required("value");
    let query_value = arguments.required_literal("query")?.to_value();

    // Query should always be a string.
    let query = query_value
        .try_bytes_utf8_lossy()
        .expect("datadog search query not bytes");

    // Compile the Datadog search query to AST.
    let node = parse(&query).map_err(|e| {
        Box::new(ExpressionError::from(e.to_string())) as Box<dyn DiagnosticError>
    })?;

    Ok(Box::new(MatchDatadogQueryFn { value, node }))
}

Take a look at the parse_grok function, which does AOT compilation of the grok pattern:
fn compile(
    &self,
    _state: &state::Compiler,
    _ctx: &FunctionCompileContext,
    mut arguments: ArgumentList,
) -> Compiled {
    let value = arguments.required("value");

    let pattern = arguments
        .required_literal("pattern")?
        .to_value()
        .try_bytes_utf8_lossy()
        .expect("grok pattern not bytes")
        .into_owned();
    let mut grok = grok::Grok::with_patterns();
    let pattern = Arc::new(
        grok.compile(&pattern, true)
            .map_err(|e| Box::new(Error::InvalidGrokPattern(e)) as Box<dyn DiagnosticError>)?,
    );

    let remove_empty = arguments
        .optional("remove_empty")
        .unwrap_or_else(|| expr!(false));

    Ok(Box::new(ParseGrokFn {
        value,
        pattern,
        remove_empty,
    }))
}

This only works if the argument you want to use for AOT compilation only takes literals, which does seem to be the case for match_datadog_query:
let query_value = arguments.required_literal("query")?.to_value();

So it seems all the ingredients are there to do as much as possible at the compile step, instead of at runtime for each event (assuming caching isn't working).
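A rough sketch of what that could look like (hypothetical names; this is not the actual match_datadog_query implementation): everything derivable from the literal query argument is built once at compile time, so the per-event path only matches.

use regex::Regex;

// Hypothetical compiled state: the wildcard regex is built once, from the
// literal "query" argument, during VRL compilation.
struct PrecompiledQuery {
    wildcard: Regex,
}

fn compile_query(to_match: &str) -> PrecompiledQuery {
    let wildcard = Regex::new(&format!(
        "^{}$",
        regex::escape(to_match).replace("\\*", ".*")
    ))
    .expect("invalid wildcard regex");
    PrecompiledQuery { wildcard }
}

impl PrecompiledQuery {
    // Per-event path: no Regex::new, just matching.
    fn matches(&self, candidate: &str) -> bool {
        self.wildcard.is_match(candidate)
    }
}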

@blt
Contributor Author

blt commented Dec 1, 2021

REF #10202

@blt
Contributor Author

blt commented Dec 1, 2021

REF #10189

@vladimir-dd
Contributor

REF #10200

@binarylogic binarylogic added the transform: pipelines label and removed the domain: pipelines label Dec 7, 2021
@blt blt self-assigned this Jan 31, 2022
@blt
Contributor Author

blt commented Jan 31, 2022

We are now investigating with the following configs run as above. Lading's version is e81610664b059422b774c770e741569c6ef8cf08.

We see from the soaks that both http_pipelines_blackhole and http_pipelines_no_grok_blackhole suffer similarly poor performance. In #11086 we investigated removing onig to get a clearer sense of allocation patterns in these configurations and got no real leads beyond what we already knew: our cloning of Event and vrl::Value is an allocation churn point.

For vector at 0736db5 I have the two offwaketime flamegraphs. The first is for http_pipelines_blackhole and the second for the no-grok variant:

vector-hpb
vector-hpb-no_grok

The differences between these two graphs are non-trivial, but considering how similar their runtimes are and that the no-grok variant is a chopped-down version of the grok variant, I suspect that addressing issues in no-grok is a fruitful path forward. PollStrategy::poll_streams, Fanout::start_send, and EventsSend::emit_metrics stand out as major culprits forcing vector off-CPU.

blt added a commit that referenced this issue Feb 1, 2022
This commit slims out unused dependencies from the crate, makes sure we don't use lazy_static, and makes other small changes. While investigating #10144 we noticed that off-cpu time was concentrated on BTreeMap clones but could not investigate their exact source as onig's small allocations -- a dependency of this crate -- drowned out the heaptrack results. Unfortunately onig performs far better than `fancy_regex` for our use case and swapping that crate out temporarily did not materially improve off-cpu time. We did learn that clones of `Value` and `Event` are the primary BTreeMap culprits, as suspected, but do not have any way to deal with that yet. An effort to insert `Cow` around these trees failed, as inserting the manual lifetime annotations is too big a task without some mechanical help.

This commit, should we choose to pursue this, sets us up to remove onig in the future. The `grok` interface as discussed [here](#11086 (comment)) is very thin and we might profitably upstream an aliasing mechanism into `fancy_regex`, itself a skin on top of `regex`.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
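For a sense of why those BTreeMap clones hurt, a generic illustration (using plain std types, not Vector's Event/Value): a deep clone reallocates every node and every string, while an Arc handle is a refcount bump, at the cost of the ownership rework described above.

use std::collections::BTreeMap;
use std::sync::Arc;

fn main() {
    let mut fields: BTreeMap<String, String> = BTreeMap::new();
    for i in 0..1_000 {
        fields.insert(format!("field_{}", i), "value".repeat(16));
    }

    // Deep clone: re-allocates every node and every key/value string, once per event.
    let _per_event_copy = fields.clone();

    // Shared handle: cloning the Arc is a single atomic increment, but the map is
    // now immutable unless something like copy-on-write is layered on top.
    let shared = Arc::new(fields);
    let _cheap_copy = Arc::clone(&shared);
}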
Signed-off-by: Vladimir Zhuk <vladimir.zhuk@datadoghq.com>
blt added a commit that referenced this issue Feb 1, 2022
One of the oddities of `Fanout` was the use of an `i` to index sinks. This was,
partially, preserved across polls but was not in general use when looping. It is
my understanding that `Fanout.i` was ultimately vestigial and any actual
indexing was reset each poll. I think, as a result, we would also repoll the
same sink multiple times when removals happened, which should be rare in
practice but was possible. I have extracted the vector and index munging into a
`Store` type. We should now no longer poll underlying sinks multiple times and
calling code does not have to munge indexes, although it is required to manually
advance/reset a 'cursor' because we're changing the shape of an iterator while
iterating it.

The primary difference here is the use of `swap_remove` instead of
`remove`. This saves a shift.

I expect no performance change here. I think, ultimately, this is a stepping
stone to getting the logic here very explicit so we can start to do broadcasting
in a way that is not impeded by slow receivers downstream.

REF #10144
REF #10912

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
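On the swap_remove point, the shift it saves looks like this in plain Vec terms (a generic illustration, not the Fanout code itself):

fn main() {
    let sinks = vec!["a", "b", "c", "d", "e"];

    // Vec::remove(1) is O(n): "c", "d", and "e" all shift left one slot.
    let mut shifted = sinks.clone();
    shifted.remove(1);
    assert_eq!(shifted, ["a", "c", "d", "e"]);

    // Vec::swap_remove(1) is O(1): the last element fills the hole, order is lost.
    let mut swapped = sinks.clone();
    swapped.swap_remove(1);
    assert_eq!(swapped, ["a", "e", "c", "d"]);
}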
blt added a commit that referenced this issue Feb 3, 2022
With regard to #10144, one thing we see very prominently in CPU time is the recursive call to determine the size of Event instances. This is something we could potentially keep updated but it's not clear how valuable that would be in practice. This commit -- by removing one prominent example -- is intended to figure that out.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue Feb 3, 2022
The problem with fanout is that it suffers from the slowest-receiver problem. We wait for every receiver to be available, then slam input into them. It's possible that if we buffer up some events we can reduce the gap time between sends. While this logic should ultimately be done in Fanout, for now, to prove out the idea, I'm adding a Buffer around the relevant fanouts.

REF #10144

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
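A rough sketch of the buffering idea using plain channels (purely illustrative; Vector's Fanout and Sink machinery works differently): accumulate a batch before handing events to the downstream senders, so the producer makes progress for a whole batch between stalls instead of potentially stalling on every single event.

use std::sync::mpsc::SyncSender;

struct BufferedFanout<T: Clone> {
    buffer: Vec<T>,
    capacity: usize,
    senders: Vec<SyncSender<T>>,
}

impl<T: Clone> BufferedFanout<T> {
    fn push(&mut self, event: T) {
        self.buffer.push(event);
        if self.buffer.len() >= self.capacity {
            self.flush();
        }
    }

    fn flush(&mut self) {
        for tx in &self.senders {
            for event in &self.buffer {
                // A slow receiver blocks here, but only once the batch is full,
                // not in the middle of accepting every upstream event.
                let _ = tx.send(event.clone());
            }
        }
        self.buffer.clear();
    }
}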
blt added a commit that referenced this issue Feb 4, 2022
I'm starting to experiment with the removal of `Sink` implementation for
`Fanout` for #10144. My in-flight work is starting to sprawl uncomfortably so
this is a small patch to remove reliance on one of the related `Sink` traits.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue Mar 5, 2022
Introduce a new route transform benchmark

This commit introduces a new benchmark for the route transform with the
understanding that we are, in some cases, bottlenecked through this
transform. See #11688 as an
example. Possible contributor to #10144.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
@blt
Contributor Author

blt commented Mar 9, 2022

Now that #11633 is correct we have an interesting lead on what's happening here. Consider this configuration versus this configuration. Both implement the same basic idea but the no_pipelines variant is the "unrolled" version of the former, that is, it does not use the pipelines construct. The no_pipelines variant is almost twice as fast as the pipelines variant -- hat tip to @lukesteensen for the idea to try this out -- and we get some interesting data on why the unrolled version is faster.

Consider the following graphs. I have run the unrolled variant first, followed by a pause, and then the pipelines version. Here we see throughput, which, aside from an initial high spike, shows that the unrolled version has roughly twice the throughput:

image (1)

Throughput is measured from lading, that is, the load-generating tool. When we inspect Vector's self-instrumentation we find an interesting pattern. On the left-hand side the total number of processed events in Transform instances is roughly the same as on the right, except that the right-hand side shows much higher internal Event traffic in the filter and noop components.

image (2)

Utilization bears this out further. On the left-hand side Vector's runtime is dominated by the transforms present in the configuration, especially those with complicated datadog_search conditions, presenting a valuable optimization target. On the right-hand side runtime is dominated by truthy/falsy components, a quirk of how filtering is achieved in pipelines:

image (3)

To explain this, the pipeline "filter" is not a single Transform instance but is actually two instances, the "truthy" side and the "falsy" side, with terminals on the negations of each boolean. This means every condition is run twice, Event instances are cloned twice, and so on. The noop is used as a coalescing terminal in the pipeline and shows up heavily in utilization as well, something that the "unrolled" variant is able to side-step entirely.

The pipeline mechanism in Vector as of this writing works by "expanding" sub-transforms into a sub-topology of the main topology and guides data flow through this sub-topology with implicit filter/expand/routing constructs of non-zero cost. Consider that if the pipelines filter were a single runtime component rather than two, the pipeline configuration would pay the cost of only a single condition check per filter per Event, rather than two. The expansion mechanism also awkwardly thwarts other optimizations that the rest of the Vector topology has picked up. Consider that PipelinesConfig does not turn into a Pipelines type implementing Transform. If it, say, implemented FunctionTransform then we could run multiple copies of it in parallel, improving CPU saturation. See #10265 for the kind of improvement this optimization brings.

As a next step I believe we should contrive to have PipelineFilterConfig implement TransformConfig::build and avoid the falsy/truthy penalty. More generally, PipelinesConfig should implement TransformConfig::build, and the resulting Transform instance will then be able to pick up other topology optimizations. I have a hunch that the expansion approach extant in the code base today should be gradually walked back and that we should not have TransformConfig instances that don't implement ::build. Ideally we can contrive to lose Noop entirely. As it is, the expansion mechanism used by pipelines means we lose optimizations applied to the broader topology, and optimizing the pipelines ends up being a specialized project.
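To make the falsy/truthy penalty concrete, a schematic sketch (not the actual pipeline code; matches_condition stands in for an arbitrary, possibly expensive, datadog_search check): the expanded form evaluates the condition in two components, each holding its own clone of the event, while a single filter evaluates once and routes the event to exactly one output.

fn matches_condition(event: &str) -> bool {
    event.contains("source:nginx")
}

// Pipelines today, conceptually: the event is cloned into a "truthy" component
// and a "falsy" component, and the condition is evaluated in each.
fn expanded_truthy_falsy(event: &str) -> (Option<String>, Option<String>) {
    let for_truthy = event.to_owned(); // clone handed to the truthy side
    let for_falsy = event.to_owned(); // clone handed to the falsy side
    let truthy = if matches_condition(&for_truthy) { Some(for_truthy) } else { None };
    let falsy = if !matches_condition(&for_falsy) { Some(for_falsy) } else { None };
    (truthy, falsy)
}

// A single filter component: one evaluation, one owned event, routed to exactly
// one of the two outputs.
fn single_filter(event: String) -> (Option<String>, Option<String>) {
    if matches_condition(&event) {
        (Some(event), None)
    } else {
        (None, Some(event))
    }
}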

blt added a commit that referenced this issue Mar 9, 2022
This commit is a mostly mechanical shuffle of our pipeline code, done while investigating #10144. I found the previous organization challenging to comprehend and this reads more clearly to me. I've tried to follow the section commenting style common to the rest of the transform code.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue Apr 13, 2022
chore: Add microbenchmarks to lib/datadog/grok  (#12172)

With reference to #10144 and in light of #11849 we now have an understanding
that http -> pipelines -> blackhole is significantly bottlenecked in
datadog-grok. Unfortunately most of our data indicates we're looking at regex
being the prime pain point. This commit does two things: introduces
micro-benchmarks for `datadog_grok::filters::keyvalue::apply_filter` --
unfortunately exposing `datadog_grok::filters` from the crate so we can
benchmark it -- and improves the performance of said function by +40% in the
micro when there is a field delimiter in place. Specifically, we remove the need
for nom-regex and avoid cloning a `regex::Regex` instance for each key and each
value in a field.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
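The shape of that fix, roughly (illustrative only, not the actual apply_filter code): build the delimiter regex once per field and borrow it in the per-key/per-value loop, rather than cloning a compiled Regex for every item.

use regex::Regex;

// The expensive shape: a compiled Regex is cloned for every key/value pair.
fn split_pairs_cloning(pairs: &[&str], delimiter: &Regex) -> usize {
    pairs
        .iter()
        .map(|kv| {
            let re = delimiter.clone(); // per-item clone of a compiled Regex
            re.split(kv).count()
        })
        .sum()
}

// The cheap shape: compile once, borrow everywhere.
fn split_pairs_borrowing(pairs: &[&str], delimiter: &Regex) -> usize {
    pairs.iter().map(|kv| delimiter.split(kv).count()).sum()
}

fn main() {
    let delimiter = Regex::new(r"\s*=\s*").expect("valid regex");
    let pairs = ["key = value", "other=thing"];
    assert_eq!(
        split_pairs_cloning(&pairs, &delimiter),
        split_pairs_borrowing(&pairs, &delimiter)
    );
}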
@blt
Contributor Author

blt commented Apr 19, 2022

As of 06e05be the main bottleneck per #12267 is the use of "enriched" mode in VRL's parse_user_agent. Resolving this -- by switching to "fast" mode -- suggests that the next areas of work are:

@JeanMertz
Contributor

Runtime regex compilation accounts for 9% of execution time, likely this. Call sites for these functions are in VRL's stdlib and src/conditions/datadog_search.rs.

To expand on this, the only function in VRL's stdlib using word_regex and wildcard_regex is match_datadog_query.

blt added a commit that referenced this issue Apr 20, 2022
Related to #10144 we're curious what the pipelines soak looks like if the remap
steps remain but are as cheap as possible. This soak is related to the no_grok
variant in that it's basically the flip of that.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue May 3, 2022
One of the major difficulties underlying #10144 is an inability for Vector to
saturate its cores. Exactly why is not totally clear, but we _are_ bottlenecking
right at the top in our soak tests by parsing json in VRL rather than using the
encoding feature of the source.

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt added a commit that referenced this issue May 16, 2022
* Combine pipeline stages

This commit adjusts pipeline expansion so that they are combined, rather than,
well, expanded. This means that the sub-transforms of a pipeline run in serial
but that each pipeline as a whole can run multiple copies of itself at
once. This also cleans up many low-priority tasks.

Resolves #11787
Resolves #11784
REF #10144

Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* Tidy up errors, fix bug in vector.toml pipeline config

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* test dings

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* temporarily disable http_datadog_filter_blackhole

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* PR feedback

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* PR feedback with regard to outputs

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* PR feedback

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* try 64 wide interior buffer

Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

Co-authored-by: Brian L. Troutwine <brian@troutwine.us>
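As a toy mental model of the combined approach (a sketch, not the topology code): the stages inside one pipeline run serially for a given event, while several identical copies of that pipeline drain a shared queue concurrently.

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Stand-ins for a pipeline's sub-transforms; within one pipeline copy they
// always run in order.
fn stage_one(event: String) -> String {
    event.to_uppercase()
}

fn stage_two(event: String) -> String {
    format!("{} [processed]", event)
}

// Toy model: `copies` identical pipelines pull events off a shared queue and
// run the serial stage chain concurrently with one another.
fn run_pipeline_copies(events: Vec<String>, copies: usize) -> usize {
    let (tx, rx) = mpsc::channel::<String>();
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..copies)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut processed: usize = 0;
                loop {
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(event) => {
                            let _out = stage_two(stage_one(event)); // serial within a copy
                            processed += 1;
                        }
                        Err(_) => break, // channel closed and drained
                    }
                }
                processed
            })
        })
        .collect();

    for event in events {
        tx.send(event).unwrap();
    }
    drop(tx); // close the channel so the workers exit once the queue drains

    workers.into_iter().map(|w| w.join().unwrap()).sum()
}

fn main() {
    let events: Vec<String> = (0..100).map(|i| format!("event {}", i)).collect();
    println!("processed {} events", run_pipeline_copies(events, 4));
}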
@jszwedko
Member

Closing since this transform was removed.
