Bug: Active healthchecks, TLS, and DNS service discovery on a Virtual Node can delay Envoy initialization #227

Open
dastbe opened this issue Jul 11, 2020 · 3 comments
Labels
Blocked on an Envoy fix Waiting for more information or for a dependency Bug Something isn't working Envoy Docker Image

Comments

@dastbe
Contributor

dastbe commented Jul 11, 2020

Summary

When a source Virtual Node routes to a destination Virtual Node that uses DNS service discovery and has both TLS and active healthchecks configured, Envoy initialization on the source is delayed by 60 seconds or more.

Steps to Reproduce

  1. Configure a Virtual Node with TLS via ACM and active healthchecks
  2. Make this Virtual Node (indirectly) the provider for a Virtual Service
  3. Have another Virtual Node depend on the above Virtual Service
  4. Launch a new application in the above Virtual Node
  5. Observe that Envoy takes at least 60 seconds to initialize
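
As a rough illustration of step 1, the sketch below shows what a destination Virtual Node spec combining these settings might look like, plus a small helper that checks whether a spec matches the conditions described in this issue. The field names follow the App Mesh Virtual Node spec as I understand it; the hostname, port, path, certificate ARN, and the helper name `matches_issue_conditions` are all hypothetical placeholders, not values from this report.

```python
def matches_issue_conditions(virtual_node_spec: dict) -> bool:
    """Return True if the spec uses DNS discovery and has a listener
    with both TLS and an active healthcheck configured."""
    uses_dns = "dns" in virtual_node_spec.get("serviceDiscovery", {})
    return uses_dns and any(
        "tls" in listener and "healthCheck" in listener
        for listener in virtual_node_spec.get("listeners", [])
    )


# Hypothetical destination Virtual Node spec (all values are placeholders).
destination_spec = {
    "listeners": [
        {
            "portMapping": {"port": 8080, "protocol": "http"},
            "healthCheck": {
                "protocol": "http",
                "path": "/ping",
                "healthyThreshold": 2,
                "unhealthyThreshold": 2,
                "intervalMillis": 5000,
                "timeoutMillis": 2000,
            },
            "tls": {
                "mode": "STRICT",
                "certificate": {
                    "acm": {
                        "certificateArn": "arn:aws:acm:REGION:ACCOUNT:certificate/EXAMPLE"
                    }
                },
            },
        }
    ],
    "serviceDiscovery": {"dns": {"hostname": "destination.example.internal"}},
}

print(matches_issue_conditions(destination_spec))
```

Swapping the `dns` block for Cloud Map discovery, or dropping either `tls` or `healthCheck` from the listener, makes the helper return `False`, which mirrors the workarounds listed below.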

Are you currently working around this issue?

Yes. Disabling either healthchecks or TLS, or switching to Cloud Map based Service Discovery, avoids the delay.

Additional context

This occurs due to a race condition in Envoy in at least the 1.12.x series. Envoy initiates the first healthcheck on a cluster before the ACM-backed secret is retrieved, resulting in a healthcheck connection failure. For DNS-backed clusters specifically, Envoy does not count this failed attempt as a "round" of healthchecking, so it waits for another round to occur. The same failure occurs with Cloud Map based Service Discovery, but there Envoy does count it as a round of healthchecks and so continues initialization.

Because there is no traffic on the cluster, Envoy uses the no_traffic_interval, which defaults to 60 seconds, instead of the configured healthcheck interval. After this interval, Envoy initiates another round of healthchecks, which it then considers sufficient for continuing initialization.
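
The timing described above can be captured in a toy model. This is only a sketch of the behaviour as described in this issue, not Envoy's actual logic: the function name `estimated_init_delay`, the `"dns"` / `"cloud_map"` labels, and the assumption that the first healthcheck fires at time zero are all illustrative.

```python
def estimated_init_delay(discovery: str, secret_ready_at: float,
                         no_traffic_interval: float = 60.0) -> float:
    """Toy model: seconds until a healthcheck round lets initialization continue.

    discovery:           "dns" or "cloud_map" (illustrative labels)
    secret_ready_at:     seconds after startup at which the TLS secret is available
    no_traffic_interval: gap between healthchecks while the cluster has no traffic
    """
    if discovery not in ("dns", "cloud_map"):
        raise ValueError(f"unknown discovery type: {discovery}")

    check_time = 0.0  # assumption: the first healthcheck is attempted immediately
    while True:
        succeeded = check_time >= secret_ready_at
        # Per the description above, a failed first check still counts as a
        # completed round for Cloud Map clusters, but not for DNS clusters.
        if succeeded or discovery == "cloud_map":
            return check_time
        # No traffic yet, so the next attempt waits a full no_traffic_interval.
        check_time += no_traffic_interval


print(estimated_init_delay("dns", secret_ready_at=0.5))
print(estimated_init_delay("cloud_map", secret_ready_at=0.5))
```

In this model, a secret that arrives even slightly after the first attempt costs a DNS-backed cluster one full no_traffic_interval, while a Cloud Map cluster proceeds immediately.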

@dastbe dastbe added the Bug Something isn't working label Jul 11, 2020
rajal-amzn added a commit to aws/aws-app-mesh-examples that referenced this issue Jul 13, 2020
* Update Ingress gateway walkthrough to use appmesh prod
* Setting Healthcheck StartPeriod as 60s due to the Envoy initialization taking time as mentioned [here](aws/aws-app-mesh-roadmap#227)
@LancerRainier LancerRainier added Blocked on an Envoy fix Waiting for more information or for a dependency Priority: High labels Jul 15, 2020
@dastbe
Contributor Author

dastbe commented Aug 5, 2020

WIP issue against Envoy: envoyproxy/envoy#12389

We're gathering some more information to help root cause the issue in Envoy.

@lavignes

This should be fixed in Envoy 1.17

@herrhound herrhound removed the Blocked on an Envoy fix Waiting for more information or for a dependency label Apr 29, 2021
@Y0Username
Contributor

Y0Username commented Aug 4, 2021

Correction: The fix wasn't actually in Envoy 1.17
Now tracking in envoyproxy/envoy#17529

@rajal-amzn rajal-amzn assigned rajal-amzn and unassigned dastbe Sep 30, 2021
@shsahu shsahu added Blocked on an Envoy fix Waiting for more information or for a dependency and removed Priority: High labels Feb 3, 2022