-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: hds_integration_test failures with tsan #12184
Comments
Repro instructions:
|
From a very quick look at this test, it seems like there is a race condition where after calling |
Hmm nevermind I see |
FWIW I find this fails 98% of the time if I switch to simulated time, via:
I think this might be because this test uses gRPC which does not (afaik) have injectable time. But maybe this change offers a way to get more consistent failures, which might be easier to repro and debug. |
Yeah I'm having trouble reproing. I will try again tomorrow. |
In looking at the traces, I think the problem has to be: envoy/test/integration/hds_integration_test.cc Lines 298 to 305 in 796a674
Envoy can close the connection while the connection is being popped off of the connection queue. There must be some race condition in the framework here. I can look more tomorrow unless @alyssawilk has any quick ideas. It's possible that recent event loop changes from @antoniovicente have made this race more likely though IIRC the events in question here aren't getting delayed to the next event loop iteration but I don't remember the details. |
(Most relevant tsan trace: https://gist.github.com/jmarantz/edc6a4349344c7620f98b00180e69ed7) |
I think that the Envoy::Event::FileEventImpl::assignEvents in the call stack indicates that we are dealing with real events, not fake events. I think the sequence here is something along the lines of:
There are no synchronization primitives protecting the fake http upstream and its IO handle. TSAN detects a cross thread read/write interaction between steps 2 and 4 above. The cross thread interactions in FakeUpstream are complex. We end up doing a fair number of operations in the main test thread while other operations happen in the backend thread. IIRC mutex use in FakeUpstream is very minimal. |
Yeah agreed this is complicated. From looking at the TSAN trace it actually seems like to fix this we would need to block the fake upstream event loop from running or we have to do more operations on the fake upstream thread. I don't think there is an easy fix here. I will think about it more tomorrow. |
(The best fix would be to switch the timeout tests to simulated time @jmarantz :) ) |
I think most of the tests I saw fail in real time involved direct calls to do timed waits on condvars from grpc, which lacks injectable time, afaict. |
FWIW many tsan failures are just because an assertion fails, and the test doesn't handle things if the unexpected thing occurs. Josh since you can repro, can you dump trace logs of a failure or two? |
@alyssawilk see #12184 (comment) and #12184 (comment). It's a race condition around connection timeout/close. I think I have an idea on how to fix. I will try to do it today. |
Part of #12184 Signed-off-by: Matt Klein <mklein@lyft.com>
CC @drewsortega |
Part of #12184 Signed-off-by: Matt Klein <mklein@lyft.com>
@drewsortega I think you are being volunteered to work on this further. :) I'm able to repro more issues under TSAN that I haven't had a chance to fix yet:
This one is getting stuck in the wait loop in the test. I haven't debugged it.
This test is fundamentally broken. It relies on real time guarantees which are not satisfied in CI and under the sanitizers. We either need to switch to SimTime, drop the test, or test it some other way. I also suspect
cc @htuch |
FWIW, I think we discussed in at least one PR the desire for simulated time, there was some 10x slowdown that prevented this from being used at the time, see #5977. |
Sure, I can work on this one further. :) |
Part of envoyproxy#12184 Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Part of envoyproxy#12184 Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: scheler <santosh.cheler@appdynamics.com>
Drew tries many times, but was not able to reproduce the flakiness even if run 1000 times with tsan. There was a commit on 8/13: testing: fix multiple race conditions in simulated time tests (#12527) which improved the code of fake_upstream and probably has fixed the flakiness. |
FWIW I've seen failures on main branch since the above fix, on ARM release, which tends to run very fast. I will take a look and see if I can repro. |
Part of envoyproxy#12184 Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Antonio Vicente <avd@google.com>
* hds: fix integration test flakes (#12214) Part of #12184 Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Antonio Vicente <avd@google.com> * Switch to a tsan-instrumented libc++ for tsan tests (#12134) This fixes #9784 and re-enables vhds_integration_test Risk Level: Low, but will most likely increase memory usage Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com> Signed-off-by: Antonio Vicente <avd@google.com> * test: shard hds_integration_test (#12482) This should avoid TSAN timeout flakes. Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Antonio Vicente <avd@google.com> * test: shard http2_integration_test (#11939) This should mitigate TSAN timeout. Signed-off-by: Lizan Zhou <lizan@tetrate.io> Signed-off-by: Antonio Vicente <avd@google.com> * test: fix http2_integration_test flake (#12450) Fixes #12442 Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Antonio Vicente <avd@google.com> * Kick CI Signed-off-by: Antonio Vicente <avd@google.com> Co-authored-by: Matt Klein <mklein@lyft.com> Co-authored-by: Dmitri Dolguikh <ddolguik@redhat.com> Co-authored-by: Lizan Zhou <lizan@tetrate.io>
* docs: kick-off 1.15.1 release. (envoyproxy#12166) Signed-off-by: Piotr Sikora <piotrsikora@google.com> * tls: update BoringSSL-FIPS to 20190808. (envoyproxy#12170) Signed-off-by: Piotr Sikora <piotrsikora@google.com> * test: Exclude wasm_vm_test from CI by making it a "manual" test. (#207) The wee v8 build times out in CI under --config=asan because the machine the job is scheduled on is too small. Signed-off-by: Antonio Vicente <avd@google.com> * [v1.15] http: header map security fixes for duplicate headers (#197) (#200) Previously header matching did not match on all headers for non-inline headers. This patch changes the default behavior to always logically match on all headers. Multiple individual headers will be logically concatenated with ',' similar to what is done with inline headers. This makes the behavior effectively consistent. This behavior can be temporary reverted by setting the runtime value "envoy.reloadable_features.header_match_on_all_headers" to "false". Targeted fixes have been additionally performed on the following extensions which make them consider all duplicate headers by default as a comma concatenated list: 1) Any extension using CEL matching on headers. 2) The header to metadata filter. 3) The JWT filter. 4) The Lua filter. Like primary header matching used in routing, RBAC, etc. this behavior can be disabled by setting the runtime value "envoy.reloadable_features.header_match_on_all_headers" to false. Finally, the setCopy() header map API previously only set the first header in the case of duplicate non-inline headers. setCopy() now behaves similiarly to the other set*() APIs and replaces all found headers with a single value. This may have had security implications in the extauth filter which uses this API. This behavior can be disabled by setting the runtime value "envoy.reloadable_features.http_set_copy_replace_all_headers" to false. Fixes https://github.com/envoyproxy/envoy-setec/issues/188 Signed-off-by: Matt Klein <mklein@lyft.com> * backport to v1.15: Fix Kafka Repository Location (#223) Update mirror used to fetch kafka dependency to a valid, working mirror. Based on envoyproxy#13025 Resolves envoyproxy#13011 Signed-off-by: Antonio Vicente <avd@google.com> * release: cutting 1.15.1 (#217) Signed-off-by: Antonio Vicente <avd@google.com> * docs: Fix release notes for v1.15.1 release. (envoyproxy#13318) Signed-off-by: Antonio Vicente <avd@google.com> * Backport flaky test and tsan fixes to releases/v1.15 branch (envoyproxy#13337) * hds: fix integration test flakes (envoyproxy#12214) Part of envoyproxy#12184 Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Antonio Vicente <avd@google.com> * Switch to a tsan-instrumented libc++ for tsan tests (envoyproxy#12134) This fixes envoyproxy#9784 and re-enables vhds_integration_test Risk Level: Low, but will most likely increase memory usage Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com> Signed-off-by: Antonio Vicente <avd@google.com> * test: shard hds_integration_test (envoyproxy#12482) This should avoid TSAN timeout flakes. Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Antonio Vicente <avd@google.com> * test: shard http2_integration_test (envoyproxy#11939) This should mitigate TSAN timeout. Signed-off-by: Lizan Zhou <lizan@tetrate.io> Signed-off-by: Antonio Vicente <avd@google.com> * test: fix http2_integration_test flake (envoyproxy#12450) Fixes envoyproxy#12442 Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Antonio Vicente <avd@google.com> * Kick CI Signed-off-by: Antonio Vicente <avd@google.com> Co-authored-by: Matt Klein <mklein@lyft.com> Co-authored-by: Dmitri Dolguikh <ddolguik@redhat.com> Co-authored-by: Lizan Zhou <lizan@tetrate.io> * docs: kick off v1.15.3-dev (envoyproxy#13695) Signed-off-by: Christoph Pakulski <christoph@tetrate.io> * 1.15: CI fixes backport (envoyproxy#13697) Backport following commits to 1.15: 748b2ab (mac ci: try ignoring update failure (envoyproxy#13658), 2020-10-20) f95f539 (ci: various improvements (envoyproxy#13660), 2020-10-20) 73d78f8 (ci: use multiple stage (envoyproxy#13557), 2020-10-15) b7a4756 (ci: use azp for api and go-control-plane sync (envoyproxy#13550), 2020-10-14) 876a6bb (ci use azp to sync filter example (envoyproxy#13501), 2020-10-12) a0f31ee (ci: use azp to generate docs (envoyproxy#13481), 2020-10-12) Signed-off-by: Lizan Zhou <lizan@tetrate.io> Co-authored-by: asraa <asraa@google.com> * 1.15: fix CI script (envoyproxy#13724) Signed-off-by: Lizan Zhou <lizan@tetrate.io> * Prevent SEGFAULT when disabling listener (envoyproxy#13515) (envoyproxy#13903) This prevents the stop_listening overload action from causing segmentation faults that can occur if the action is enabled after the listener has already shut down. Signed-off-by: Alex Konradi <akonradi@google.com> Signed-off-by: Christoph Pakulski <christoph@tetrate.io> * proxy protocol: set downstreamRemoteAddress on StreamInfo (envoyproxy#14131) (envoyproxy#14169) This fixes a regression which resulted in the downstreamRemoteAddress on the StreamInfo for a connection not having the address supplied by the proxy protocol filter, but instead having the address of the directly connected peer. This issue does not affect HTTP filters. Fixes envoyproxy#14087 Signed-off-by: Greg Greenway <ggreenway@apple.com> Signed-off-by: Christoph Pakulski <christoph@tetrate.io> * ci: temproray disable vhds_integration_test in TSAN (envoyproxy#12067) (envoyproxy#14217) Signed-off-by: Lizan Zhou <lizan@tetrate.io> * tcmalloc changed and the data coming out of tcmalloc::MallocExtension::GetNumericProperty("generic.current_allocated_bytes") (envoyproxy#14165) Commit Message: tcmalloc changed and the data coming out of tcmalloc::MallocExtension::GetNumericProperty("generic.current_allocated_bytes") no longer appears to be deterministic, even in unthreaded tests. So disable exact mem checks till we sort that out Additional Description: Risk Level: low Testing: just thread_local_store_test Docs Changes: n/a Release Notes: n/a no longer appears to be deterministic, even in unthreaded tests. So disable exact mem checks till we sort that out Signed-off-by: Joshua Marantz <jmarantz@google.com> Signed-off-by: Christoph Pakulski <christoph@tetrate.io> Co-authored-by: Joshua Marantz <jmarantz@google.com> * backport to v1.15: connection: Remember transport socket read resumption requests and replay them when re-enabling read. (envoyproxy#13772) (envoyproxy#14173) * connection: Remember transport socket read resumption requests and replay them when re-enabling read. (envoyproxy#13772) Fixes SslSocket read resumption after readDisable when processing the SSL record that contains the last bytes of the HTTP message Signed-off-by: Antonio Vicente <avd@google.com> * backport to 1.15: udp: properly handle truncated/dropped datagrams (envoyproxy#14122) (envoyproxy#14166) Signed-off-by: Matt Klein <mklein@lyft.com> Signed-off-by: Christoph Pakulski <christoph@tetrate.io> Co-authored-by: Matt Klein <mklein@lyft.com> Co-authored-by: Christoph Pakulski <christoph@tetrate.io> * backport to 1.15: vrp: allow supervisord to open its log file (envoyproxy#14066) (envoyproxy#14280) Commit Message: Allow supervisord to open its log file Additional Description: Change the default location of the log file and give supervisord permissions to write to it. Risk Level: low Testing: built image locally Docs Changes: n/a Release Notes: n/a Platform Specific Features: n/a Signed-off-by: Alex Konradi <akonradi@google.com> Signed-off-by: Christoph Pakulski <christoph@tetrate.io> * rel 1.15: close release 1.15.3 (envoyproxy#14303) Signed-off-by: Christoph Pakulski <christoph@tetrate.io> * Kick off rel 1.15.4. (envoyproxy#14323) Signed-off-by: Christoph Pakulski <christoph@tetrate.io> * backport to 1.15: http: fix datadog and squash handling of responses without body (envoyproxy#13328) (envoyproxy#14458) Commit Message: Fixing bugs in datadog and sqaush where unexpected bodyless responses would crash Envoy Risk Level: low Testing: new unit tests, updated certs Docs Changes: n/a Release Notes: inline Signed-off-by: Christoph Pakulski <christoph@tetrate.io> Co-authored-by: alyssawilk <alyssar@chromium.org> * backport 1.15: http: fixing a bug with IPv6 hosts (envoyproxy#14273) Fixing a bug where HTTP parser offsets for IPv6 hosts did not include [] and Envoy assumed it did. This results in mis-parsing addresses for IPv6 CONNECT requests and IPv6 hosts in fully URLs over HTTP/1.1 Risk Level: low Testing: new unit, integration tests Docs Changes: n/a Release Notes: inline Signed-off-by: Shikugawa <rei@tetrate.io> Co-authored-by: alyssawilk <alyssar@chromium.org> * backport to 1.15: tls: fix detection of the upstream connection close event. (envoyproxy#13858) (envoyproxy#14568) Fixes envoyproxy#13856. Signed-off-by: Piotr Sikora <piotrsikora@google.com> Signed-off-by: Christoph Pakulski <christoph@tetrate.io> Co-authored-by: Piotr Sikora <piotrsikora@google.com> Co-authored-by: antonio <avd@google.com> Co-authored-by: Matt Klein <mklein@lyft.com> Co-authored-by: Dmitri Dolguikh <ddolguik@redhat.com> Co-authored-by: Lizan Zhou <lizan@tetrate.io> Co-authored-by: Christoph Pakulski <christoph@tetrate.io> Co-authored-by: asraa <asraa@google.com> Co-authored-by: Joshua Marantz <jmarantz@google.com> Co-authored-by: Rei Shimizu <Shikugawa@gmail.com> Co-authored-by: alyssawilk <alyssar@chromium.org>
On master as of today, about 20% failure rate. They may have a common root cause but there are different classes of symptoms. E.g. here are 4 different ones: a short assert-failure and some very lengthy tsan reports that are in gist.
tsan failures:
The text was updated successfully, but these errors were encountered: