Investigate http -> pipelines -> blackhole
#10144
Absent a CI soak, we use the older methods from #8512 et al. for measuring throughput. Until further notice Vector is run like so:
and
Under these conditions we find:
Please note that these values are in kibibytes. Please also be aware that throughput varied wildly; these are median values taken over a five-minute sample.
Captured perf command:
Vector built like so:
Unfortunately the data file is 1.5G even compressed and cannot be made any smaller. Items of concern:
/cc @vladimir-dd
Right, thanks @blt for running this test.
Building on the last comment, @leebenson noted in Slack that `match_datadog_query` is caching regexes, per vector/lib/vrl/stdlib/src/match_datadog_query.rs lines 284 to 310 at d8ed1b5.
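The caching pattern mentioned above can be sketched as follows. This is a minimal illustration, not the code in `match_datadog_query.rs`: `CompiledPattern`, `compile`, and `cached_compile` are hypothetical stand-ins for compiling a `regex::Regex`, showing how each distinct pattern is built once and then shared cheaply.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Stand-in for an expensive-to-build matcher such as `regex::Regex`
// (hypothetical type, for illustration only).
#[derive(Debug)]
struct CompiledPattern {
    source: String,
}

fn compile(pattern: &str) -> CompiledPattern {
    // Imagine this is costly: parsing, NFA construction, etc.
    CompiledPattern { source: pattern.to_string() }
}

// Process-wide cache, initialized on first use.
fn cache() -> &'static Mutex<HashMap<String, Arc<CompiledPattern>>> {
    static CACHE: OnceLock<Mutex<HashMap<String, Arc<CompiledPattern>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

// Compile each distinct pattern once; later calls return a cheap `Arc` clone.
fn cached_compile(pattern: &str) -> Arc<CompiledPattern> {
    let mut map = cache().lock().unwrap();
    map.entry(pattern.to_string())
        .or_insert_with(|| Arc::new(compile(pattern)))
        .clone()
}

fn main() {
    let a = cached_compile(r"\d+");
    let b = cached_compile(r"\d+");
    // Both handles point at the same compiled pattern; only one entry exists.
    assert!(Arc::ptr_eq(&a, &b));
    assert_eq!(cache().lock().unwrap().len(), 1);
    assert_eq!(a.source, r"\d+");
}
```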
From @JeanMertz in Slack:
REF #10202
REF #10189
REF #10200
We are now investigating with the following configs, run as above. Lading's version is e81610664b059422b774c770e741569c6ef8cf08. We see from the soaks that both

For Vector at 0736db5 I have the two offwaketime flamegraphs. The first is for

The difference between these two graphs is non-trivial, but considering how similar their runtimes are, and that the no-grok variant is a chopped version of the grok variant, I suspect that addressing issues in no-grok is a fruitful path forward.
This commit slims unused dependencies out of the crate, makes sure we don't use lazy_static, and makes other small changes. While investigating #10144 we noticed that off-cpu time was concentrated on BTreeMap clones but could not investigate their exact source, as the small allocations of onig -- a dependency of this crate -- drowned out the heaptrack results. Unfortunately onig performs far better than `fancy_regex` for our use case, and swapping that crate in temporarily did not materially improve off-cpu time. We did learn that clones of `Value` and `Event` are the primary BTreeMap culprits, as suspected, but do not have any ability to deal with that. An effort to insert `Cow` around these trees failed, as inserting manual lifetime annotations is too big a task without some mechanical help. This commit, should we choose to pursue this, sets us up to remove onig in the future. The `grok` interface as discussed [here](#11086 (comment)) is very thin and we might profitably upstream an aliasing mechanism into `fancy_regex`, itself a skin on top of `regex`. Signed-off-by: Brian L. Troutwine <brian@troutwine.us> Signed-off-by: Vladimir Zhuk <vladimir.zhuk@datadoghq.com>
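The cost being described can be sketched in miniature. The `Value` enum below is a hypothetical, simplified stand-in for Vector's event value type: a plain clone deep-copies the whole BTreeMap, while sharing behind `Arc` makes clones a refcount bump at the price of copy-on-write plumbing at mutation sites, roughly the trade-off the abandoned `Cow` experiment ran into.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

// Simplified stand-in for Vector's `Value` (hypothetical shape,
// for illustration only).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Bytes(Vec<u8>),
    Map(BTreeMap<String, Value>),
}

fn main() {
    let mut inner = BTreeMap::new();
    inner.insert("message".to_string(), Value::Bytes(vec![b'a'; 1024]));
    let event = Value::Map(inner);

    // A plain clone deep-copies every node of the tree -- this is the
    // BTreeMap clone cost that showed up in the off-cpu profiles.
    let deep = event.clone();
    assert_eq!(deep, event);

    // Sharing behind `Arc` turns clone into a refcount bump; the cost
    // moves to every mutation site, which would need copy-on-write
    // handling (lifetime or Arc::make_mut machinery) threaded through.
    let shared = Arc::new(event);
    let cheap = Arc::clone(&shared);
    assert!(Arc::ptr_eq(&shared, &cheap));
}
```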
One of the oddities of `Fanout` was the use of an `i` to index sinks. This was, partially, preserved across polls but was not in general use when looping. It is my understanding that `Fanout.i` was ultimately vestigial and any actual indexing was reset each poll. I think, as a result, we would also repoll the same sink multiple times when removals happened, which should be rare in practice but was possible. I have extracted the vector and index munging into a `Store` type. We should now no longer poll underlying sinks multiple times and calling code does not have to munge indexes, although it is required to manually advance/reset a 'cursor' because we're changing the shape of an iterator while iterating it. The primary difference here is the use of `swap_remove` instead of `remove`. This saves a shift. I expect no performance change here. I think, ultimately, this is a stepping stone to getting the logic here very explicit so we can start to do broadcasting in a way that is not impeded by slow receivers downstream. REF #10144 REF #10912 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
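The `swap_remove` trade-off and the manual-cursor requirement mentioned in the commit can be shown with a plain `Vec` (this is a generic illustration, not the `Store` type itself):

```rust
fn main() {
    let sinks = vec!["a", "b", "c", "d"];

    // `remove` shifts every later element left: O(n) per removal.
    let mut shifted = sinks.clone();
    shifted.remove(1);
    assert_eq!(shifted, vec!["a", "c", "d"]);

    // `swap_remove` moves the LAST element into the hole: O(1), no shift.
    // The catch is that iteration order changes: a cursor walking the vec
    // must not advance after a removal, because the swapped-in element now
    // sits at the current index and still needs to be polled.
    let mut swapped = sinks.clone();
    swapped.swap_remove(1);
    assert_eq!(swapped, vec!["a", "d", "c"]);
}
```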
With regard to #10144, one thing we see very prominently in CPU time is the recursive call to determine the size of Event instances. This is something we could potentially keep updated incrementally, but it's not clear how valuable that would be in practice. This commit -- by removing one prominent example -- is intended to figure that out. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
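The shape of that recursive size call can be sketched like this. `Value` and `estimated_size` are hypothetical, simplified stand-ins: the point is that every invocation walks the entire tree, so computing it per event is linear in event size.

```rust
use std::collections::BTreeMap;
use std::mem;

// Simplified stand-in for Vector's event value type (hypothetical shape,
// for illustration only).
#[derive(Debug)]
enum Value {
    Bytes(Vec<u8>),
    Map(BTreeMap<String, Value>),
}

// Recursively estimate the in-memory size of a value. Every call walks
// the whole tree -- the per-event CPU cost the commit message refers to.
fn estimated_size(value: &Value) -> usize {
    mem::size_of::<Value>()
        + match value {
            Value::Bytes(bytes) => bytes.len(),
            Value::Map(map) => map
                .iter()
                .map(|(key, child)| key.len() + estimated_size(child))
                .sum(),
        }
}

fn main() {
    let mut fields = BTreeMap::new();
    fields.insert("message".to_string(), Value::Bytes(vec![0u8; 100]));
    let event = Value::Map(fields);

    let base = mem::size_of::<Value>();
    // Outer enum + key "message" (7 bytes) + inner enum + 100 payload bytes.
    assert_eq!(estimated_size(&event), base * 2 + 7 + 100);
}
```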
The problem with fanout is that it suffers from the slowest-receiver problem: we wait for every receiver to be available, then slam input into them. It's possible that if we buffer up some items we can reduce the gap time between sends. While this logic should ultimately be done in Fanout, for now, to prove out the idea, I'm adding a Buffer around the relevant fanouts. REF #10144 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
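A toy model of the slowest-receiver problem described above (the `Sink`/`fanout_send` names are illustrative, not Vector's real types): when a send requires every sink to be ready, one sink with tiny capacity gates the entire fanout.

```rust
use std::collections::VecDeque;

// Illustrative sink with bounded capacity (hypothetical type).
struct Sink {
    capacity: usize,
    queue: VecDeque<u64>,
}

impl Sink {
    fn ready(&self) -> bool {
        self.queue.len() < self.capacity
    }
}

// Unbuffered fanout: a send happens only if EVERY sink is ready, so
// total throughput is gated by the slowest receiver.
fn fanout_send(sinks: &mut [Sink], item: u64) -> bool {
    if sinks.iter().all(Sink::ready) {
        for sink in sinks.iter_mut() {
            sink.queue.push_back(item);
        }
        true
    } else {
        false
    }
}

fn main() {
    let mut sinks = vec![
        Sink { capacity: 4, queue: VecDeque::new() },
        Sink { capacity: 1, queue: VecDeque::new() }, // the slow one
    ];
    let mut sent = 0;
    for item in 0..10 {
        if fanout_send(&mut sinks, item) {
            sent += 1;
        }
    }
    // With no draining, only one item lands before the slow sink's
    // single slot fills -- a buffer in front would absorb the burst.
    assert_eq!(sent, 1);
}
```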
I'm starting to experiment with the removal of `Sink` implementation for `Fanout` for #10144. My in-flight work is starting to sprawl uncomfortably so this is a small patch to remove reliance on one of the related `Sink` traits. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Introduce a new route transform benchmark This commit introduces a new benchmark for the route transform with the understanding that we are, in some cases, bottlenecked through this transform. See #11688 as an example. Possible contributor to #10144. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Now that #11633 is correct we have an interesting lead on what's happening here. Consider this configuration versus this configuration. Both implement the same basic idea but the

Consider the following graphs. I have run the unrolled variant first, followed by a pause, and then the pipelines version. Here we see throughput which, aside from an initial high spike, shows that the unrolled version has maybe twice the throughput:

Throughput is measured from lading, that is, the load-generating tool. When we inspect Vector's self-instrumentation we find an interesting pattern. On the left-hand side the total number of processed events in

Utilization bears this out further. On the left-hand side Vector's runtime is dominated by the transforms present in the configuration, especially those with complicated

To explain this, the pipeline "filter" is not a single

The pipeline mechanism in Vector, as of this writing, works by "expanding" sub-transforms into a sub-topology of the main topology, and it guides data flow through this sub-topology with implicit filter/expand/routing constructs of non-zero cost. Consider that if the pipelines filter were a single rather than a double runtime component, the pipeline configuration would pay the cost of only a single condition check per filter per

As a next step I believe we should contrive to have
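The single-versus-double runtime-component cost described above can be put into a toy cost model (the component counts are illustrative, not measured from Vector): with the expanded-pipelines topology each logical filter costs a condition check plus an expand/route step, while an unrolled config pays a single check per filter.

```rust
use std::cell::Cell;

// Count one unit of work per poll/condition evaluation an event incurs
// while passing through `filters` logical filters, each realized as
// `components_per_filter` runtime components.
fn route_event(filters: usize, components_per_filter: usize, work: &Cell<usize>) {
    for _ in 0..filters {
        for _ in 0..components_per_filter {
            work.set(work.get() + 1);
        }
    }
}

fn main() {
    let pipelines = Cell::new(0);
    let unrolled = Cell::new(0);
    route_event(8, 2, &pipelines); // expanded pipelines: double component
    route_event(8, 1, &unrolled);  // unrolled config: single condition check
    // Per-event overhead doubles under the expansion scheme.
    assert_eq!(pipelines.get(), 16);
    assert_eq!(unrolled.get(), 8);
}
```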
This commit is a mostly mechanical shuffle of our pipeline code, done while investigating #10144. I found the previous organization challenging to comprehend and this reads more clearly to me. I've tried to follow the section-commenting style common to the rest of the transform code. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
chore: Add microbenchmarks to lib/datadog/grok (#12172) With reference to #10144, and in light of #11849, we now understand that http -> pipelines -> blackhole is significantly bottlenecked in datadog-grok. Unfortunately most of our data indicates that regex is the prime pain point. This commit does two things: it introduces micro-benchmarks for `datadog_grok::filters::keyvalue::apply_filter` -- unfortunately exposing `datadog_grok::filters` from the crate so we can benchmark it -- and it improves the performance of said function by +40% in the micro-benchmark when a field delimiter is in place. Specifically, we remove the need for nom-regex and avoid cloning a `regex::Regex` instance for each key and each value in a field. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
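The clone-per-key/value fix has this general shape. `Matcher` is a hypothetical stand-in for a compiled `regex::Regex` (kept std-only here); the change is simply borrowing the one compiled matcher instead of cloning it for each key and each value.

```rust
// Stand-in for a compiled `regex::Regex` (hypothetical type, for
// illustration only).
#[derive(Clone)]
struct Matcher {
    needle: String,
}

impl Matcher {
    fn is_match(&self, haystack: &str) -> bool {
        haystack.contains(&self.needle)
    }
}

// Before: one matcher clone per key and one per value in each field.
fn count_cloning(m: &Matcher, fields: &[(&str, &str)]) -> usize {
    fields
        .iter()
        .filter(|(key, value)| {
            let for_key = m.clone(); // allocation per key...
            let for_value = m.clone(); // ...and per value
            for_key.is_match(key) || for_value.is_match(value)
        })
        .count()
}

// After: borrow the single compiled matcher everywhere. Same answer,
// no per-item allocations.
fn count_borrowing(m: &Matcher, fields: &[(&str, &str)]) -> usize {
    fields
        .iter()
        .filter(|(key, value)| m.is_match(key) || m.is_match(value))
        .count()
}

fn main() {
    let m = Matcher { needle: "id".to_string() };
    let fields = [("id", "42"), ("name", "vector"), ("uid", "7")];
    assert_eq!(count_cloning(&m, &fields), count_borrowing(&m, &fields));
    assert_eq!(count_borrowing(&m, &fields), 2);
}
```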
As of 06e05be the main bottleneck per #12267 is the use of "enriched" mode in VRL's
To expand on this, the only function in VRL's stdlib using
Related to #10144 we're curious what the pipelines soak looks like if the remap steps remain but are as cheap as possible. This soak is related to the no_grok variant in that it's basically the flip of that. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
One of the major difficulties underlying #10144 is an inability for Vector to saturate its cores. Exactly why is not totally clear, but we _are_ bottlenecking right at the top in our soak tests by parsing JSON in VRL rather than using the encoding feature of the source. Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
This commit adjusts pipeline expansion so that they are combined, rather than, well, expanded. This means that the sub-transforms of a pipeline run in serial but that each pipeline as a whole can run multiple copies of itself at once. This also cleans up many low-priority tasks. Resolves #11787 Resolves #11784 REF #10144 Signed-off-by: Luke Steensen <luke.steensen@gmail.com> Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
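The "combined, rather than expanded" idea can be sketched as follows. This is a toy model, not Vector's topology code: the two sub-transforms become one serial function, and the combined pipeline runs as multiple concurrent copies over independent shards of input.

```rust
use std::thread;

fn main() {
    // Two sub-transforms combined into one serial stage: an event passes
    // through both in a single task, with no inter-stage channel hops.
    let stage = |x: u64| -> u64 {
        let x = x + 1; // sub-transform 1
        x * 2          // sub-transform 2
    };

    // Each copy of the combined pipeline processes its own shard
    // concurrently, which is where parallelism is recovered.
    let handles: Vec<_> = (0..4)
        .map(|shard| thread::spawn(move || stage(shard)))
        .collect();

    let mut out: Vec<u64> = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .collect();
    out.sort();
    assert_eq!(out, vec![2, 4, 6, 8]);
}
```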
* Combine pipeline stages

This commit adjusts pipeline expansion so that stages are combined, rather than, well, expanded. This means that the sub-transforms of a pipeline run in serial but that each pipeline as a whole can run multiple copies of itself at once. This also cleans up many low-priority tasks.

Resolves #11787
Resolves #11784
REF #10144

* Tidy up errors, fix bug in vector.toml pipeline config
* test dings
* temporarily disable http_datadog_filter_blackhole
* PR feedback
* PR feedback with regard to outputs
* PR feedback
* try 64 wide interior buffer

Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
Co-authored-by: Brian L. Troutwine <brian@troutwine.us>
Closing since this transform was removed.
This issue is a record of and parent to the investigation of vector's performance when configured as in this gist.
Inbound http_gen -- built from lading at SHA 0da91906d56acc899b829cea971d79f13e712e21 -- is configured like so:
Referenced input.log can be found here.