Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

health_checker: Make health check loop wait for any required SDS secrets to be loaded before starting. #16236

Closed
wants to merge 7 commits into from

Conversation

mpuncel
Copy link
Contributor

@mpuncel mpuncel commented Apr 29, 2021

Commit Message: Make health check loop wait for any required SDS secrets to be loaded before starting.
Additional Description: This should avoid an issue where Envoy might take over a minute to warm clusters with default settings when SDS and active health checking are used. The issue was caused by health checks starting before secrets are ready and then waiting for no_traffic_interval (default 60s) before trying again. This change makes health checks wait for SDS secrets to be ready before starting.
Risk Level: Medium
Testing: Concurrency test covers the bug that caused this to be reverted initially, and was tested locally with --runs_per_test 100, confirming test failures when the bug is present and none when it was fixed.
Docs Changes: None
Release Notes: Included
Platform Specific Features: None
Fixes #12389, #15977

mpuncel added 2 commits April 29, 2021 11:48
This commit is a revival of
envoyproxy#13516 which aimed to solve
envoyproxy#12389. That PR was merged and
then reverted due to a deadlock bug, which is still present in this
commit.

The fix will be applied in a subsequent commit in order to make the
resolution clearer to PR reviewers.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
The bug was caused by not queuing callbacks in addReadyCb() while
holding the lock. This means that you could make the decision to NOT run
the callback, get a secret concurrently which would run all callbacks,
and then add it to the queue at which point it would never run.

This also changed the name of addReadyCb() to addHealthCheckingReadyCb()
on HostImpl, and made the logic add a callback for the health checking
transport socket. That's the socket that we actually want health checks
to wait on which may be different from the data plane socket.

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel
Copy link
Contributor Author

mpuncel commented Apr 29, 2021

For reviewers: the first commit is basically just a revert-of-a-revert of #13516 and then rebased onto main. It still contains the bug that caused it to be reverted.

The second commit has the fix, and then the ClientAddSecretsReadyCallbackParallel and ServerAddSecretsReadyCallbackParallel tests I added were able to reproduce the problem a small percentage of the time.

Run on my local machine demonstrating the bug:


vscode@3781f2aa0124:/workspaces/envoy$ bazel test //test/extensions/transport_sockets/tls:ssl_socket_test  --test_env=ENVOY_IP_TEST_VERSIONS=v4only  --test_filter=IpVersions/SslSocketTest.ServerAddSecretsReadyCallbackParallel/IPv4 --cache_test_results=no --runs_per_test=100
INFO: Build options --cache_test_results, --runs_per_test, and --test_filter have changed, discarding analysis cache.
DEBUG: /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/external/bazel_toolchains/rules/rbe_repo.bzl:491:10: Bazel 3.7.2 is used in rbe_ubuntu_clang.
INFO: Analyzed target //test/extensions/transport_sockets/tls:ssl_socket_test (0 packages loaded, 15851 targets configured).
INFO: Found 1 test target...
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 96 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_96_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 93 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_93_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 86 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_86_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 83 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_83_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 81 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_81_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 75 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_75_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 61 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_61_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 47 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_47_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 45 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_45_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 41 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_41_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 36 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_36_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 35 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_35_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 30 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_30_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 14 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_14_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 5 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_5_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 4 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_4_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 2 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_2_of_100/test.log)

FAILED: //test/extensions/transport_sockets/tls:ssl_socket_test (Summary)
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_96_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_93_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_86_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_83_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_81_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_75_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_61_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_47_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_45_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_41_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_36_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_35_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_30_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_14_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_5_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_4_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_2_of_100/test.log
Target //test/extensions/transport_sockets/tls:ssl_socket_test up-to-date:
  bazel-bin/test/extensions/transport_sockets/tls/ssl_socket_test
INFO: Elapsed time: 69.893s, Critical Path: 46.65s
INFO: 403 processes: 1 internal, 402 processwrapper-sandbox.
INFO: Build completed, 1 test FAILED, 403 total actions
//test/extensions/transport_sockets/tls:ssl_socket_test                  FAILED in 17 out of 400 in 0.9s
  Stats over 400 runs: max = 0.9s, min = 0.3s, avg = 0.4s, dev = 0.1s
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_96_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_93_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_86_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_83_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_81_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_75_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_61_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_47_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_45_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_41_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_36_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_35_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_30_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_14_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_5_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_4_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_2_of_100/test.log

Executed 1 out of 1 test: 1 fails locally.
INFO: Build completed, 1 test FAILED, 403 total actions


vscode@3781f2aa0124:/workspaces/envoy$ bazel test //test/extensions/transport_sockets/tls:ssl_socket_test  --test_env=ENVOY_IP_TEST_VERSIONS=v4only  --test_filter=IpVersions/SslSocketTest.ClientAddSecretsReadyCallbackParallel/IPv4 --cache_test_results=no --runs_per_test=100
INFO: Build option --test_filter has changed, discarding analysis cache.
DEBUG: /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/external/bazel_toolchains/rules/rbe_repo.bzl:491:10: Bazel 3.7.2 is used in rbe_ubuntu_clang.
INFO: Analyzed target //test/extensions/transport_sockets/tls:ssl_socket_test (0 packages loaded, 15850 targets configured).
INFO: Found 1 test target...
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 67 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_67_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 60 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_60_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 41 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_41_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 27 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_27_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 16 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_16_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 6 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_6_of_100/test.log)
FAIL: //test/extensions/transport_sockets/tls:ssl_socket_test (shard 1 of 4, run 2 of 100) (see /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_2_of_100/test.log)

FAILED: //test/extensions/transport_sockets/tls:ssl_socket_test (Summary)
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_67_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_60_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_41_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_27_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_16_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_6_of_100/test.log
      /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_2_of_100/test.log
Target //test/extensions/transport_sockets/tls:ssl_socket_test up-to-date:
  bazel-bin/test/extensions/transport_sockets/tls/ssl_socket_test
INFO: Elapsed time: 24.254s, Critical Path: 1.12s
INFO: 401 processes: 1 internal, 400 processwrapper-sandbox.
INFO: Build completed, 1 test FAILED, 401 total actions
//test/extensions/transport_sockets/tls:ssl_socket_test                  FAILED in 7 out of 400 in 0.9s
  Stats over 400 runs: max = 0.9s, min = 0.3s, avg = 0.4s, dev = 0.1s
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_67_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_60_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_41_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_27_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_16_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_6_of_100/test.log
  /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/execroot/envoy/bazel-out/k8-fastbuild/testlogs/test/extensions/transport_sockets/tls/ssl_socket_test/shard_1_of_4_run_2_of_100/test.log

Executed 1 out of 1 test: 1 fails locally.
INFO: Build completed, 1 test FAILED, 401 total actions

And after it was fixed:

vscode@3781f2aa0124:/workspaces/envoy$ bazel test //test/extensions/transport_sockets/tls:ssl_socket_test  --test_env=ENVOY_IP_TEST_VERSIONS=v4only  --test_filter=IpVersions/SslSocketTest.ClientAddSecretsReadyCallbackParallel/IPv4 --cache_test_results=no --runs_per_test=100
DEBUG: /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/external/bazel_toolchains/rules/rbe_repo.bzl:491:10: Bazel 3.7.2 is used in rbe_ubuntu_clang.
INFO: Analyzed target //test/extensions/transport_sockets/tls:ssl_socket_test (0 packages loaded, 0 targets configured).
INFO: Found 1 test target...
Target //test/extensions/transport_sockets/tls:ssl_socket_test up-to-date:
  bazel-bin/test/extensions/transport_sockets/tls/ssl_socket_test
INFO: Elapsed time: 41.045s, Critical Path: 15.09s
INFO: 403 processes: 1 internal, 402 processwrapper-sandbox.
INFO: Build completed successfully, 403 total actions
//test/extensions/transport_sockets/tls:ssl_socket_test                  PASSED in 0.8s
  Stats over 400 runs: max = 0.8s, min = 0.3s, avg = 0.4s, dev = 0.1s

Executed 1 out of 1 test: 1 test passes.
INFO: Build completed successfully, 403 total actions
vscode@3781f2aa0124:/workspaces/envoy$ bazel test //test/extensions/transport_sockets/tls:ssl_socket_test  --test_env=ENVOY_IP_TEST_VERSIONS=v4only  --test_filter=IpVersions/SslSocketTest.ServerAddSecretsReadyCallbackParallel/IPv4 --cache_test_results=no --runs_per_test=100
INFO: Build option --test_filter has changed, discarding analysis cache.
DEBUG: /build/.cache/bazel/_bazel_vscode/2d35de14639eaad1ac7060a4dd7e3351/external/bazel_toolchains/rules/rbe_repo.bzl:491:10: Bazel 3.7.2 is used in rbe_ubuntu_clang.
INFO: Analyzed target //test/extensions/transport_sockets/tls:ssl_socket_test (0 packages loaded, 15850 targets configured).
INFO: Found 1 test target...
Target //test/extensions/transport_sockets/tls:ssl_socket_test up-to-date:
  bazel-bin/test/extensions/transport_sockets/tls/ssl_socket_test
INFO: Elapsed time: 21.857s, Critical Path: 0.68s
INFO: 401 processes: 1 internal, 400 processwrapper-sandbox.
INFO: Build completed successfully, 401 total actions
//test/extensions/transport_sockets/tls:ssl_socket_test                  PASSED in 0.5s
  Stats over 400 runs: max = 0.5s, min = 0.3s, avg = 0.3s, dev = 0.0s

Executed 1 out of 1 test: 1 test passes.
INFO: Build completed successfully, 401 total actions

// catch a race condition introduced in
// https://github.com/envoyproxy/envoy/pull/13516 where a callback would never
// be called due to a concurrency bug.
TEST_P(SslSocketTest, ServerAddSecretsReadyCallbackParallel) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the next test are probably a bit odd since they start a thread to try to tickle a concurrency bug that has been fixed now, and only happened in ~5% of runs on my machine. This means if there is a regression for some reason we might not see it in the PR that introduces it.

Let me know if there is a better way of handling tests like this

mpuncel added 2 commits April 29, 2021 12:16
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
// 1 microsecond of sleep seems to be a sweet spot for giving the addReadyCb
// thread enough of a head start to reliably cause the test to fail with buggy
// code.
std::this_thread::sleep_for(std::chrono::microseconds(1));
Copy link
Contributor Author

@mpuncel mpuncel Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm failing the style check because of this, I'm looking into whether the issue can be reproduced with ThreadSynchronizer instead (thanks Greg for suggesting that on Slack).

It's a bit weird because with the buggy code I certainly can reproduce it 100% of the time with that, but then when it is fixed there is nowhere I can put the syncPoint() that would let me deadlock it (because the syncpoint would need to be put in a place where the mutex is held).

All of this makes me wonder if this test isn't worth checking in at all: maybe the value was just in demonstrating the bug and that it was fixed, but it's not so useful as a regression test since it's testing a very specific concurrency bug?

mpuncel added 2 commits April 30, 2021 11:08
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@mpuncel mpuncel requested a review from cpakulski as a code owner April 30, 2021 17:48
Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@lizan
Copy link
Member

lizan commented May 4, 2021

/assign-from @envoyproxy/first-pass-reviewers

@repokitteh-read-only
Copy link

@envoyproxy/first-pass-reviewers assignee is @antoniovicente

🐱

Caused by: a #16236 (comment) was created by @lizan.

see: more, trace.

@antoniovicente antoniovicente changed the title Mpuncel/sds hc sequence healthcheck: Make health check loop wait for any required SDS secrets to be loaded before starting. May 6, 2021
@antoniovicente antoniovicente changed the title healthcheck: Make health check loop wait for any required SDS secrets to be loaded before starting. health_checker: Make health check loop wait for any required SDS secrets to be loaded before starting. May 6, 2021
Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes to SDS handling and health checks.

I'ld like to better understand threading involved in SDS updates vs health checkers. Please see comments below.

if (ssl_ctx_) {
immediately_run_callback = true;
} else {
secrets_ready_callbacks_.push_back(callback);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What mutex guards secrets_ready_callbacks_ ?

You only have a reader lock on ssl_ctx_mu_ here, I think you need a write lock.

@@ -151,6 +156,7 @@ class ServerSslSocketFactory : public Network::TransportSocketFactory,
const std::vector<std::string> server_names_;
mutable absl::Mutex ssl_ctx_mu_;
Envoy::Ssl::ServerContextSharedPtr ssl_ctx_ ABSL_GUARDED_BY(ssl_ctx_mu_);
std::list<std::function<void()>> secrets_ready_callbacks_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add ABSL_GUARDED_BY

std::list<std::function<void()>> secrets_ready_callbacks_ ABSL_GUARDED_BY(ssl_ctx_mu_);

if (ssl_ctx_) {
immediately_run_callback = true;
} else {
secrets_ready_callbacks_.push_back(callback);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no mechanism to cancel pending callbacks on ActiveHealthCheckSession deletion. Is it possible for it to take a long time for secrets to become available, and for the ActiveHealthCheckSession to be deleted before this callback is executed?

@@ -121,6 +121,9 @@ class TsiSocketFactory : public Network::TransportSocketFactory {
Network::TransportSocketPtr
createTransportSocket(Network::TransportSocketOptionsSharedPtr options) const override;

// TODO(mpuncel) only invoke callback() once secrets are ready.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC TSI socket does not use SDS to propagate credentials. I think that the current implementation is good enough.

@@ -82,6 +82,7 @@ class ServerStartTlsSocketFactory : public Network::TransportSocketFactory,
createTransportSocket(Network::TransportSocketOptionsSharedPtr options) const override;
bool implementsSecureTransport() const override { return false; }
bool usesProxyProtocolOptions() const override { return false; }
void addReadyCb(std::function<void()> callback) override { callback(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// TODO(mpuncel) only invoke callback() once secrets are ready. may be appropriate in this case.

@@ -50,6 +50,7 @@ class UpstreamProxyProtocolSocketFactory : public Network::TransportSocketFactor
createTransportSocket(Network::TransportSocketOptionsSharedPtr options) const override;
bool implementsSecureTransport() const override;
bool usesProxyProtocolOptions() const override { return true; }
void addReadyCb(std::function<void()> callback) override { callback(); };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// TODO(mpuncel) only invoke callback() once secrets are ready. may be appropriate in this case.

}
if (should_run_callbacks) {
for (const auto& cb : secrets_ready_callbacks_) {
cb();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On what thread to these callbacks run?

I don't know what the threading model looks like for health checker vs SDS. I think there's potential for callbacks being invoked from the wrong thread.

}
}
if (should_run_callbacks) {
for (const auto& cb : secrets_ready_callbacks_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what threads are callbacks added to this list? Do adds happen from outside the thread where onAddOrUpdateSecret() is called?

@mpuncel
Copy link
Contributor Author

mpuncel commented May 10, 2021

thanks for the feedback, I'll hopefully be able to address it in a couple days

@github-actions
Copy link

github-actions bot commented Jun 9, 2021

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Jun 9, 2021
@github-actions
Copy link

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale stalebot believes this issue/PR has not been touched recently waiting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should HC activation be delayed until needed secrets are available?
3 participants