testing: fix multiple race conditions in simulated time tests #12527

Merged: 14 commits into master from fix_sim_time on Aug 14, 2020

mattklein123
Member

@mattklein123 mattklein123 commented Aug 7, 2020

This PR fixes multiple race conditions in tests. In summary:

  1. All waitFor() operations are now fully synchronized.
  2. waitFor() no longer advances simulated time; it performs real sleeps in
    all time systems. Network operations are therefore "instantaneous" with
    respect to simulated time, and every time advance needed to fire an
    alarm is now explicit. This required fixes in a few tests but should
    make simulated time much easier to reason about.
  3. All timeouts for network operations are measured in real time.
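
To illustrate point 2, here is a minimal sketch of a clock whose alarms fire only when the test advances time explicitly. `ToySimulatedClock`, `enableAlarm`, `advanceTime`, and `runToyClockDemo` are hypothetical names invented for this example; this is not Envoy's SimulatedTimeSystem, only a toy model of the "explicit advance" idea.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <utility>

// Hypothetical toy clock (an assumption for illustration, not Envoy code).
class ToySimulatedClock {
public:
  using Duration = std::chrono::milliseconds;

  // Registers a callback that fires once simulated time reaches now + delay.
  void enableAlarm(Duration delay, std::function<void()> cb) {
    alarms_.emplace(now_ + delay, std::move(cb));
  }

  // The only way simulated time moves: an explicit call from the test.
  void advanceTime(Duration d) {
    now_ += d;
    // Fire every alarm whose deadline has been reached, earliest first.
    while (!alarms_.empty() && alarms_.begin()->first <= now_) {
      auto cb = std::move(alarms_.begin()->second);
      alarms_.erase(alarms_.begin());
      cb();
    }
  }

  Duration now() const { return now_; }

private:
  Duration now_{0};
  std::multimap<Duration, std::function<void()>> alarms_;
};

// Returns the alarm fire count before and after the deadline is reached.
inline std::pair<int, int> runToyClockDemo() {
  ToySimulatedClock clock;
  int fired = 0;
  clock.enableAlarm(std::chrono::milliseconds(100), [&fired] { ++fired; });

  clock.advanceTime(std::chrono::milliseconds(99)); // 1 ms short of the deadline
  const int before = fired;

  clock.advanceTime(std::chrono::milliseconds(1)); // deadline reached
  assert(clock.now() == std::chrono::milliseconds(100));
  return {before, fired};
}
```

In this model nothing other than `advanceTime()` can cause an alarm to fire, which is the property that makes a test's timing easy to follow.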

Fixes #12480
Fixes #10568

Risk Level: None for prod code, high for tests
Testing: Existing and fixed tests
Docs Changes: N/A
Release Notes: N/A

@mattklein123
Member Author

mattklein123 commented Aug 7, 2020

cc @jmarantz @wrowe @sunjayBhatia

This is not done and I'm still working through various issues, but I wanted to let you see my current progress. I think the idea here is sound; however, see my comment about waitFor() being inherently racy. I think we can fix this and clean up a lot of callers by switching it to take a Mutex and a Condition and then using Await. I will work on this more later today or tomorrow.

Ref: https://github.com/envoyproxy/envoy/pull/12527/files#diff-f2c85459672519c620a47b880b9b0d20R317-R319
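
As a rough sketch of the approach described in this comment, a wait that evaluates its condition only while holding the mutex cannot miss a notification that arrives between the check and the sleep. The example uses the standard library (`std::mutex`, `std::condition_variable`) rather than the abseil Mutex/Condition/Await mentioned above, and `waitForCondition` and `runWaitDemo` are hypothetical names, not functions from the PR.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper: blocks until ready() returns true or `timeout` of real
// time has elapsed. The predicate is only evaluated with `mutex` held.
template <class Predicate>
bool waitForCondition(std::mutex& mutex, std::condition_variable& cv, Predicate ready,
                      std::chrono::milliseconds timeout) {
  std::unique_lock<std::mutex> lock(mutex);
  // wait_for re-checks the predicate after every wakeup and returns its final
  // value, so spurious wakeups and early notifications are both handled.
  return cv.wait_for(lock, timeout, ready);
}

// Returns true if the waiter observed the state change made by the producer.
inline bool runWaitDemo() {
  std::mutex mutex;
  std::condition_variable cv;
  bool data_received = false; // guarded by mutex

  std::thread producer([&] {
    {
      std::lock_guard<std::mutex> lock(mutex);
      data_received = true;
    }
    cv.notify_all();
  });

  const bool ok = waitForCondition(mutex, cv, [&] { return data_received; },
                                   std::chrono::seconds(5));
  producer.join();
  return ok;
}
```

The timeout here is a real-time bound, which matches point 3 of the PR summary; it is only reached if the condition never becomes true.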

Contributor

@jmarantz jmarantz left a comment

cross-referencing #12539 which tries to switch integration tests to using await() rather than condvar, for more robust operation.

@mattklein123 mattklein123 force-pushed the fix_sim_time branch 4 times, most recently from b8e85f7 to 3b90839, August 8, 2020 00:15
@mattklein123
Member Author

@jmarantz I'm about to quit for today, but this passes except for 2 tests that don't compile, which should not be difficult to fix. From a test perspective this is a pretty scary change, but overall I think it makes everything much simpler to reason about and cleans up a bunch of stuff. Feel free to start reviewing and helping me fix things.

@mattklein123
Member Author

Running TSAN and seeing some errors. None of them look too bad to fix so will work on that next.

@mattklein123
Member Author

No good deed goes unpunished. The TSAN issues are internal to abseil. I think they were recently fixed with:

9fc78436565eb3b204d4aa425ee3773354392f45 by Derek Mauro <dmauro@google.com>:

Use auto-detected sanitizer attributes for ASAN, MSAN, and TSAN builds

But now when I pull current abseil there are TSAN errors without any other changes: see abseil/abseil-cpp#760

@mattklein123
Member Author

@jmarantz this passes all tests for me locally now under fastbuild and tsan. There are some flakes that I have hit. It's unclear if they are new or if they are pre-existing and exacerbated by the alternate TSAN lock implementation we are now using. It will be better to merge the other PR with the abseil bump first and see how that goes.

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123 mattklein123 marked this pull request as ready for review August 13, 2020 19:36
@mattklein123 mattklein123 changed the title from "[WIP] testing: fix multiple race conditions in simulated time tests" to "testing: fix multiple race conditions in simulated time tests" Aug 13, 2020
@mattklein123
Member Author

@jmarantz this is passing all tests on fastbuild and I think is ready for real review. I'm going to start looking for flakes.

Contributor

@jmarantz jmarantz left a comment

flushing comments; mostly nits

test/common/router/header_parser_fuzz_test.cc (outdated, resolved)
test/common/formatter/substitution_formatter_fuzz_test.cc (outdated, resolved)
test/integration/fake_upstream.cc (outdated, resolved)
test/integration/fake_upstream.cc (outdated, resolved)
test/integration/fake_upstream.cc (outdated, resolved)
jmarantz previously approved these changes Aug 13, 2020
Contributor

@jmarantz jmarantz left a comment

looks great; just a few nits, mostly about clarity and comments.

test/integration/fake_upstream.h (outdated, resolved)
test/integration/fake_upstream.h (outdated, resolved)
test/test_common/simulated_time_system_test.cc (outdated, resolved)
@mattklein123
Member Author

@jmarantz updated. Great suggestion about the time bounds class. Much cleaner!

@mattklein123
Member Author

The ARM flake is a known, unrelated issue: #12638

Contributor

@jmarantz jmarantz left a comment

Awesome, thank you for finally cleaning this up!

Up to you if you want to apply the syntactic tweaks or just leave that for next time.

    auto thread = Thread::threadFactoryForTest().createThread([this, &mutex, &done]() {
      for (;;) {
        {
          absl::MutexLock lock(&mutex);
Contributor

taste test, as this syntax works now (for you golang fans):

    for (;;) {
      if (absl::MutexLock lock(&mutex); done) {
        return;
      }
      base_scheduler_.run(Dispatcher::RunType::Block);
    }

Looking at this code is the first time it occurred to me to use it in C++.
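
For readers who want to try the suggested form in isolation, here is a self-contained sketch of the same loop shape using the C++17 `if` statement with an initializer. It swaps in `std::mutex`, `std::lock_guard`, and `std::thread` for the abseil and Envoy types in the review snippet, and `runWorkerUntilDone` is a hypothetical name made up for the example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Runs a worker loop until `done` is set; returns how many iterations it ran.
inline int runWorkerUntilDone() {
  std::mutex mutex;
  bool done = false;  // guarded by mutex
  int iterations = 0; // written only by the worker, read after join()

  std::thread worker([&] {
    for (;;) {
      // The lock is declared in the if-initializer, so it is held while
      // `done` is tested and released as soon as the if statement ends.
      if (std::lock_guard<std::mutex> lock(mutex); done) {
        return;
      }
      // Stand-in for the real work; the mutex is not held here.
      ++iterations;
      std::this_thread::yield();
    }
  });

  {
    std::lock_guard<std::mutex> lock(mutex);
    done = true;
  }
  worker.join();
  assert(iterations >= 0);
  return iterations;
}
```

Compared with an explicit inner block, the initializer form keeps the lock's scope tied to the one statement that needs it, which is the readability point being made in the review.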

Member Author

Oh yeah that's good. I will fix that in a follow up. I want to get this merged so we can see how we are doing with flakes.

    auto thread = Thread::threadFactoryForTest().createThread([this, &mutex, &done]() {
      for (;;) {
        {
          absl::MutexLock lock(&mutex);
Contributor

golang syntax here if you like.

    public:
      template <class D>
      RealTimeBound(const D& duration)
          : end_time_(std::chrono::steady_clock::now() + duration) // NO_CHECK_FORMAT(real_time)
Contributor

I'm guessing you were swayed to use this style by the convenience of not having to pass timeSystem() into the ctor here.

Regardless, this turned out really well. Thanks!
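
Only the constructor of the time bounds class is visible in this thread, so the following is a guess at how such a helper might be rounded out, not the class from the PR. The name `RealTimeBoundSketch`, the methods `withinBound()` and `timeLeft()`, and the function `countPollsWithin` are assumptions made for illustration. The sketch reads `std::chrono::steady_clock` directly, which is why no time system needs to be passed in.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical sketch of a real-time deadline helper (assumed API).
class RealTimeBoundSketch {
public:
  using Clock = std::chrono::steady_clock;

  // Records the real-time deadline once, at construction.
  template <class D>
  explicit RealTimeBoundSketch(const D& duration)
      : end_time_(std::chrono::time_point_cast<Clock::duration>(Clock::now() + duration)) {}

  // True while the deadline has not yet passed.
  bool withinBound() const { return Clock::now() < end_time_; }

  // Remaining real time, clamped at zero once the deadline has passed.
  std::chrono::milliseconds timeLeft() const {
    const auto left = end_time_ - Clock::now();
    if (left <= Clock::duration::zero()) {
      return std::chrono::milliseconds(0);
    }
    return std::chrono::duration_cast<std::chrono::milliseconds>(left);
  }

private:
  const Clock::time_point end_time_;
};

// Polls in a tight loop until the real-time budget is spent.
inline int countPollsWithin(std::chrono::milliseconds budget) {
  RealTimeBoundSketch bound(budget);
  int polls = 0;
  while (bound.withinBound()) {
    ++polls;
  }
  assert(bound.timeLeft() == std::chrono::milliseconds(0));
  return polls;
}
```

A helper like this lets a sequence of waits share one overall deadline: each wait can be given `timeLeft()` as its timeout instead of a fresh full-length timeout.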

@mattklein123 mattklein123 merged commit a42a677 into master Aug 14, 2020
@mattklein123 mattklein123 deleted the fix_sim_time branch August 14, 2020 02:45
mpuncel added a commit to mpuncel/envoy that referenced this pull request Aug 14, 2020
* master: (67 commits)
  logger: support log control in admin interface and command line option for Fancy Logger (envoyproxy#12369)
  test: fix http_timeout_integration_test flake (envoyproxy#12654)
  [fuzz]added an input check in writefilter fuzzer and added test cases (envoyproxy#12628)
  add 'explicit' restriction. (envoyproxy#12643)
  scoped_rds_integration_test migrate from api v2 to api v3. (envoyproxy#12633)
  fuzz: added fuzz test for listener filter tls_inspector (envoyproxy#12617)
  testing: fix multiple race conditions in simulated time tests (envoyproxy#12527)
  [tls] Move handshaking behavior into SslSocketInfo. (envoyproxy#12571)
  header: getting rid of exception-throwing behaviors in header files [the rest] (envoyproxy#12611)
  router: add new ratelimited retry backoff strategy (envoyproxy#12202)
  [redis_proxy] added a constraint for route.prefix().size() (envoyproxy#12637)
  network: add tcp listener backlog config (envoyproxy#12625)
  runtime: debug log that condition is always true when fractionalPercent numerator > denominator (envoyproxy#12068)
  WatchDog Extension hook (envoyproxy#12416)
  router: add dynamic metadata header formatter (envoyproxy#11858)
  statsd: revert visibility to public (envoyproxy#12621)
  Fix regression of /build_* in gitignore (envoyproxy#12630)
  Added a missing extension point to documentation. (envoyproxy#12620)
  Reverts proxy protocol test on windows (envoyproxy#12619)
  caching: Improved the tests and coverage of the CacheFilter tree (envoyproxy#12544)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
3 participants