Allow customizing initial_fetch_timeout in the envoy sidecar for Consul Service Mesh #17283

Closed
komapa opened this issue May 9, 2023 · 12 comments

@komapa
Contributor

komapa commented May 9, 2023

Please see istio/istio#31825. You can also see that AWS is doing the "right" thing: it defaults this to 0, with the option to modify it in the rare case that different behavior is desired: https://docs.aws.amazon.com/app-mesh/latest/userguide/envoy-config.html

Feature Description

We are running into a pretty unpleasant problem where the Envoy sidecar reaches the default 15s initial_fetch_timeout and then continues starting up, responding with LIVE on the /ready endpoint even though it has NOT loaded all upstreams for all clusters from Consul.

We believe Consul should default initial_fetch_timeout to 0, because starting the Envoy proxy sidecar with an incorrect configuration is much worse than not starting at all (which we can handle much more easily).
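
For readers unfamiliar with the setting: in Envoy's v3 API, `initial_fetch_timeout` is a field on a `ConfigSource` (for example `lds_config` and `cds_config` under `dynamic_resources`, or the EDS config of a cluster). It defaults to 15s, and 0 disables the timeout. The sketch below builds a minimal bootstrap fragment with the timeout set to 0. It is illustrative only: the helper name is hypothetical, the fragment omits the `ads_config` a real bootstrap would need, and it is not the configuration Consul actually generates.

```python
import json


def dynamic_resources(initial_fetch_timeout: str = "0s") -> dict:
    """Build an illustrative Envoy `dynamic_resources` fragment.

    Hypothetical helper, not part of Consul or Envoy. "0s" means
    "wait indefinitely for the first xDS response".
    """

    def ads_source() -> dict:
        # One ConfigSource that fetches over ADS with the given timeout.
        return {
            "ads": {},
            "resource_api_version": "V3",
            "initial_fetch_timeout": initial_fetch_timeout,
        }

    return {"lds_config": ads_source(), "cds_config": ads_source()}


if __name__ == "__main__":
    # Print the fragment as JSON, which Envoy accepts for bootstrap files.
    print(json.dumps({"dynamic_resources": dynamic_resources()}, indent=2))
```

Passing `"15s"` instead reproduces Envoy's default behavior described above, where startup proceeds once the timeout elapses.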

Use Case(s)

Not having a broken service mesh :)

@luckymike

To provide a little more color on why this is important: when Envoy starts in this state, it continuously returns 503s for the upstreams that failed to populate, and the only solution is to restart the sidecar container (or kill the instance entirely).
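
One way to detect this state from outside the proxy, as a rough sketch: Envoy's admin interface can report cluster membership as JSON via `/clusters?format=json`. Assuming that output has the `cluster_statuses` / `host_statuses` shape, the function below lists expected upstream clusters that currently have no hosts. The function name, the cluster names, and the sample document are hypothetical, and nobody in this thread describes running such a check.

```python
from __future__ import annotations


def clusters_without_hosts(clusters_doc: dict, expected: set[str]) -> list[str]:
    """Return expected cluster names that are missing or have zero hosts.

    `clusters_doc` is the parsed JSON from Envoy's `/clusters?format=json`
    admin endpoint (shape assumed; empty host lists are typically omitted).
    """
    hosts_by_cluster = {
        status.get("name"): len(status.get("host_statuses", []))
        for status in clusters_doc.get("cluster_statuses", [])
    }
    return sorted(name for name in expected if hosts_by_cluster.get(name, 0) == 0)


if __name__ == "__main__":
    # Hypothetical sample: "db" has one endpoint, "cache" never populated.
    sample = {
        "cluster_statuses": [
            {"name": "db", "host_statuses": [{"address": {}}]},
            {"name": "cache"},
        ]
    }
    print(clusters_without_hosts(sample, {"db", "cache"}))  # prints ['cache']
```

A non-empty result after startup would indicate the situation described above, in which requests to those upstreams fail until the sidecar is restarted.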

@david-yu
Contributor

Hi @komapa @luckymike, from reviewing the links you provided, it does seem like the best default is to set this to 0. Is there a chance, though, that initial_fetch_timeout would ever need to be configured dynamically to something other than 0?

@komapa
Contributor Author

komapa commented May 12, 2023

> Hi @komapa @luckymike from reviewing those links you provided it does seem like the best thing to do for default config is set this to 0. Is there a chance though that initial_fetch_timeout would ever need to be configured to something that is not 0 dynamically?

Thank you for picking this ticket up, @david-yu. I cannot think of a case in our setup where that would be needed, but we obviously do not represent all users :) If it is not terribly difficult to make it an option, I would advise doing so.

@david-yu
Contributor

Hi @komapa, we just merged a PR that sets initial_fetch_timeout to 0 by default, which should be released in 1.14.x and 1.15.x later this week. As for customizing that option, we'll wait for further feedback before adding flags for it to Consul and Consul K8s. We will leave this issue open, since we've only applied a more reasonable default and have not implemented setting arbitrary values for initial_fetch_timeout.

@david-yu
Contributor

david-yu commented Jun 1, 2023

Hi @komapa, unfortunately we'll need to roll this fix back on 1.14.x and 1.15.x in the interim: further testing showed that our implementation causes issues on Ingress, Terminating, and Mesh Gateways. We're hoping to re-release this feature in the future.

@komapa
Contributor Author

komapa commented Jun 16, 2023

> Hi @komapa Unfortunately we'll need to roll this fix back on 1.14.x and 1.15.x in the interim as we've discovered that our implementation causes issues on Ingress, Terminating and Mesh Gateways based on further testing. We're hoping to re-release this feature again in the future.

That is very unfortunate. Do you have any public details on what the issue is with the listed gateways? Also, instead of reverting, could we make it configurable, so that we can set it to zero just for the sidecars?

Thank you!

@david-yu
Contributor

Out of curiosity, @komapa, do you use any terminating or mesh gateways in your environment? We need to do more investigation to understand how to enable this. It's a lot trickier than we thought.

@komapa
Contributor Author

komapa commented Jun 20, 2023

We do not actively use terminating gateway functionality, and we have never used mesh gateways in our setup. We did upgrade our work-in-progress Kubernetes clusters, and there we see that the ingress gateways on 1.15.3 seem to be having problems, which I can take a closer look at if needed.

How can we help so you can help us? :)

@komapa
Contributor Author

komapa commented Jun 29, 2023

Bump

@DanStough
Contributor

Hi @komapa 👋. I'm now working on a permanent fix, which I am pretty confident will be in the next set of patch releases. Thanks for working with us while we get this sorted out.

The original changes should have been reverted for 1.15.3, so any problems you're having with ingress gateways might be unrelated. I'd be curious to know the issues, if you don't mind reporting them here or opening a new issue.

@komapa
Contributor Author

komapa commented Jul 20, 2023

Thank you for fixing this. Greatly appreciated! I will report the ingress gateway problem if it happens again.

@david-yu
Contributor

david-yu commented Jul 27, 2023

Will go ahead and close, as we do not currently plan on making this customizable. For folks who find this issue, please open a new issue if you are looking to customize the initial_fetch_timeout config for Envoy.
