cds: add a benchmark #14167
Conversation
Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
clang-tidy failure is:
which is, uh, the point?
We've added
/assign @jmarantz
Signed-off-by: Phil Genera <pgenera@google.com>
This looks great to me, though I'm not an expert in the cluster-init stuff, so I may have missed something.
Just a few questions which could be resolved by adding comments or small changes.
```cpp
Envoy::Upstream::CdsSpeedTest speed_test(state, false);
uint32_t clusters = skipExpensiveBenchmarks() ? 1 : state.range(0);

speed_test.clusterHelper(true, clusters);
```
Can you comment a bit on a few choices here:
- Why call `speed_test.clusterHelper(true, clusters);` twice?
- Why not construct `CdsSpeedTest` outside the loop?
- Related: should we put the `Envoy::Logger::Context` into the class to avoid repeating that code for each test?

I'm not pushing back against those decisions; I'm just trying to understand them.
- That's the difference between the two test cases; I added a comment (hopefully) explaining.
- I assume the loop here is modifying state in a meaningful way, which is then passed to the newly constructed speed_test. This `auto _ : state` pattern is directly from the example in https://github.com/google/benchmark.
- Y'know, I don't know what the Logger Context was doing here, and removing it has no ill effect...

This was mostly copy & pasted from eds_speed_test (I don't understand GitHub's copy detection and why it didn't notice this); I've backported the changes to eds_speed_test as well.
RE 1 and 3: cool, all set, thanks.
RE 2: the `auto _ : state` pattern is fine, but I think state.range(x) won't change within a call, and it would be safe to construct the CdsSpeedTest object outside the loop. I think that will result in a more focused perf benchmark. WDYT?
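For illustration, a minimal sketch of the restructuring being suggested, assuming the `CdsSpeedTest` fixture and `skipExpensiveBenchmarks()` helper from this test file and google/benchmark's standard registration (the benchmark name here is made up):

```cpp
#include <cstdint>

#include "benchmark/benchmark.h"

// Sketch only: CdsSpeedTest and skipExpensiveBenchmarks() are the ones
// defined in this test file. The fixture is built once, outside the timed
// loop, since state.range(0) is fixed for the duration of a benchmark call.
static void bmAddClusters(benchmark::State& state) {
  Envoy::Upstream::CdsSpeedTest speed_test(state, false);
  const uint32_t clusters = skipExpensiveBenchmarks() ? 1 : state.range(0);
  for (auto _ : state) { // only the body of this loop is timed
    speed_test.clusterHelper(true, clusters);
  }
}
BENCHMARK(bmAddClusters)->RangeMultiplier(10)->Range(100, 100000);
```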
Testing bears out your hypothesis; I see what I missed there.
```cpp
Envoy::Upstream::CdsSpeedTest speed_test(state, state.range(0));
// if we've been instructed to skip tests, only run once no matter the argument:
uint32_t clusters = skipExpensiveBenchmarks() ? 1 : state.range(2);
speed_test.clusterHelper(state.range(1), clusters);
```
Wouldn't this also include the time it took to build the response proto in the benchmark?
I don't believe so; the calls to `state_.PauseTiming()` and `ResumeTiming()` in clusterHelper should exclude everything but the gRPC callback?
I think it would be clearer what we're benchmarking if we put the PauseTiming call at the start of the loop here (line 162) and Resume at the end, and then put Resume/Pause calls in the helper functions surrounding the code we really want to test.
I think this is what you mean.
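For readers following along, a self-contained sketch of that timing pattern with google/benchmark (the workload is a stand-in; only the structure matters): pause at the start of the loop, resume at the end, and bracket the measured section inside the helper with Resume/Pause.

```cpp
#include <vector>

#include "benchmark/benchmark.h"

// Hypothetical helper standing in for clusterHelper: the Resume/Pause pair
// brackets only the code we actually want to measure.
static void timedWork(benchmark::State& state, const std::vector<int>& input) {
  state.ResumeTiming(); // start the clock around the section under test
  long sum = 0;
  for (int v : input) {
    sum += v;
  }
  benchmark::DoNotOptimize(sum);
  state.PauseTiming(); // stop the clock before returning to setup code
}

static void bmPausePattern(benchmark::State& state) {
  for (auto _ : state) {
    state.PauseTiming(); // exclude setup (stand-in for building the response proto)
    std::vector<int> input(state.range(0), 1);
    timedWork(state, input);
    state.ResumeTiming(); // timing must be running when the iteration ends
  }
}
BENCHMARK(bmPausePattern)->Arg(1024);
BENCHMARK_MAIN();
```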
Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
I (somewhat unrelatedly) made the floor iteration counts match between cds & eds_speed_test in the last commit; below 100 or so iterations the constant factor dominates.
nice! lgtm; just a couple of nits.
Signed-off-by: Phil Genera <pgenera@google.com>
thanks phil!
Probably it makes sense for some @envoyproxy/senior-maintainers to take a look -- there may be some subtlety in the interaction with the CDS system. Phil -- did you get some interesting results with the test?
No smoking gun, unfortunately, just things we know: v3 is faster than v2, ignoring unknown fields is faster than validating them. Unlike EDS, there's no obvious n^2 or worse behavior.
Thanks for putting this together! Do you have any perf profiles and flamegraphs from running some of the longer benchmarks?
/wait
resetCluster(R"EOF( | ||
name: name | ||
connect_timeout: 0.25s | ||
type: EDS |
In terms of `Cluster` configuration, I think there are ultimately many dimensions to consider, but the ones I'd start with are the number of clusters, the number of endpoints per cluster, and the choice of load balancing algorithm (everything from WRR to Maglev etc.). This will provide a pretty interesting microbenchmark.
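As a hedged illustration of sweeping such a parameter space, recent google/benchmark versions support `ArgsProduct`; the dimension values and the integer encoding of LB algorithms below are made up:

```cpp
#include <cstdint>

#include "benchmark/benchmark.h"

// Illustrative only: sweep clusters x endpoints-per-cluster x LB algorithm.
static void bmClusterDimensions(benchmark::State& state) {
  const int64_t num_clusters = state.range(0);
  const int64_t endpoints_per_cluster = state.range(1);
  const int64_t lb_algorithm = state.range(2); // e.g. 0=ROUND_ROBIN, 1=MAGLEV
  for (auto _ : state) {
    benchmark::DoNotOptimize(num_clusters * endpoints_per_cluster + lb_algorithm);
  }
}
BENCHMARK(bmClusterDimensions)->ArgsProduct({{10, 1000}, {1, 100}, {0, 1}});
BENCHMARK_MAIN();
```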
Right now this exercises the number of clusters for both the v2 & v3 APIs; I propose taking out the v2 part and adding LB algorithm & endpoints per cluster, all for `STATIC` clusters. Does that sound like a good next revision?
In the interim I've separated out the v2 & v3 tests to make the output more readable and give the Complexity Computing Magic a better chance at working.
I haven't generated interesting output, perf or otherwise, as I just got this working again. I'm happy to do so once we know I'm testing the right things.
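For reference, a sketch of google/benchmark's asymptotic-complexity fitting, presumably the "Complexity Computing Magic" referred to above (the benchmark body is a placeholder):

```cpp
#include "benchmark/benchmark.h"

static void bmComplexityExample(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(state.range(0));
  }
  // Tell the library what N was, so it can fit time-vs-N to a curve
  // (reported as the Big-O row in the output).
  state.SetComplexityN(state.range(0));
}
BENCHMARK(bmComplexityExample)->Range(100, 100000)->Complexity(benchmark::oAuto);
BENCHMARK_MAIN();
```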
Yes, let's drop v2 as it's on the way out. Agree on the changes proposed.
FWIW, I think we could also explore LB algorithm and endpoints in the EDS test if it was able to fully simulate what is going on in eds_speed_test.cc, so maybe we can make the CDS test even simpler and exercise some other cluster attributes in the parameter space. For now just dropping v2 sounds good.
I've dropped v2; maybe adding a larger set of test parameters to this or cds should go in a followup?
Let's do the additions in followup PRs.
Signed-off-by: Phil Genera <pgenera@google.com>
Here's the current output, with warnings about v2 API deprecation elided:
Looks good, waiting for v2 removal.
Signed-off-by: Phil Genera <pgenera@google.com>
LGTM, thanks. I think the CDS/EDS tests could ultimately be refactored into a framework for more general xDS benchmarks, but we can live by the Rule of Three in this PR.
@pgenera the ASAN failure looks legit.
I've managed to get a flame graph out (an SVG cannot be attached here, so have a gist) via the following:
Unfortunately it's dominated by making protos for the test, rather than the portion that we're actively measuring. I'll take a look at the ASAN failure shortly.
Thanks @pgenera. Almost all the time is in building the protos for the test.
I was going to explore filtering the flame graph. Separately, I wonder if these cluster creation helpers should create protos directly instead of parsing YAML.
Yeah, the flame graph you shared is dominated by the protobuf parsing library, which is not consistent with what we see in captures from live servers. Probably the actual protocol marshall/unmarshall is much faster than YAML parsing.
This would be optimal. I'd just port the `buildStaticCluster` helper.
That sounds good. Would it be sufficient, though, to just turn off the profile collection during the YAML parsing, assuming that's possible?
I ported buildStaticCluster into the benchmark and got a more interesting flame graph out; I haven't dug deep or done any filtering yet, however. I agree that we're on the cusp of wanting to build something generic for xDS benchmarking.
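As an illustration of that direction, a hedged sketch of constructing the cluster config as a proto rather than parsing YAML; field names follow the v3 `Cluster` proto, and the helper name is made up (not necessarily the PR's `buildStaticCluster`):

```cpp
#include <string>

#include "envoy/config/cluster/v3/cluster.pb.h"

// Sketch: build the config proto directly, so the benchmark's profile is
// not dominated by YAML parsing. Analogous to the resetCluster YAML
// earlier, but as a STATIC cluster.
envoy::config::cluster::v3::Cluster makeClusterConfig(const std::string& name) {
  envoy::config::cluster::v3::Cluster cluster;
  cluster.set_name(name);
  cluster.mutable_connect_timeout()->set_nanos(250000000); // 0.25s
  cluster.set_type(envoy::config::cluster::v3::Cluster::STATIC);
  return cluster;
}
```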
Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: Phil Genera <pgenera@google.com>
Here's the most recent flame graph, now using a more concrete stats implementation; however, I still don't see the stats overhead I'd expect. I don't have any leads on the ASAN failure, other than having reproduced it:
Having now found #13973, it builds with
Quick sanity check: did you use `-c opt` for the build you used to collect the flame graph?
Can confirm:

```console
$ history | grep -B 4 "perf record"
 1920  [2020-12-11 20:06:26 +0000] bazel build -c opt //test/common/upstream:cds_speed_test
 1921  [2020-12-11 20:08:55 +0000] perf record -g ./git/envoy/bazel-bin/test/common/upstream/cds_speed_test -- --benchmark_filter="addClusters/1/100000" --benchmark_repetitions=100
 1922  [2020-12-11 20:08:57 +0000] cd
 1923  [2020-12-11 20:08:59 +0000] perf record -g ./git/envoy/bazel-bin/test/common/upstream/cds_speed_test -- --benchmark_filter="addClusters/1/100000" --benchmark_repetitions=100
```
I don't think it's an issue of stats or not; we're not spending any significant time in https://github.com/envoyproxy/envoy/blob/046b0e315e04f645aa0d62119a056bc560497ed7/source/common/upstream/cds_api_impl.cc#L84. This is what dominates in the workload flamegraphs that we want to model. I'm suspecting some artifact of the benchmark, e.g. mocks or maybe the actual config proto, is to blame. I suggest doing a single-iteration run with -l trace logging and trying to figure out what is going on in terms of code execution, i.e. is a single cluster being built as expected.
I'm also wondering if it might be useful to start swapping out the mocks for real implementations. One alternative approach is to use the MainCommon class, which is a linkable way to set up all the production classes; no mocks.
Sounds good, I'll start tearing the mocks out.
/wait
Commit Message: Add a benchmark test for Cluster Discovery Service.
Additional Description: This copies eds_speed_test and modifies it to exercise CDS instead, with some small changes in benchmark options.
Risk Level: Low - test only change.
Testing: Test runs & passes in both _benchmark_test & bazel run cds_speed_test variants.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Fixes #14005