-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support fail_traffic_on_panic without zone aware routing #17659
Comments
Can you link to the docs you're referencing? From my understanding, you don't need to have zone-aware routing enabled for this since it's a common LB config. That being said, I really think what you want here is admission control-- this is exactly the scenario it's mean for, but it's just missing the per-cluster state tracking. The idea is based on client-side throttling from that Google SRE book. Are you trying to accomplish this without any code changes? If so, I think you may be limited to what you mention in the description. Otherwise, you'll have a much better experience by adding this functionality to the admission control filter. |
@tonya11en Thanks for the reply!
To disable panic mode (set healthy_panic_threshold to 0), users don't need zone-aware routing. But that may cause the issue I described above (clients may overwhelm servers unless ALL servers are down).
I agree admission control is better for our use case if it supports per-cluster state tracking. We are trying to achieve the goal in a very short term (by the end of August), so for now contributing to the admission control filter is not an option for me given my knowledge to envoy codebase. But I may look into this later. |
Hi @csssuf, I see you added the fail_traffic_on_panic feature. Is my above understanding correct (users have to turn on zone-aware routing if they want to use fail_traffic_on_panic)? Does this feature request sounds reasonable to you? Thanks |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions. |
Can you reopen this issue? We use locality weighted lb instead of zone aware lb. Per CommonLbConfig doc, only one of zone_aware_lb_config, locality_weighted_lb_config may be set. We are out of luck to use Question is why zone aware lb is required for |
Hello folks,
I'm trying to implement client-side throttling using envoy. My use case is to prevent further overwhelming the upstream cluster when it's degraded/down. To do this, apart from the admission control (which is awesome but current doesn't support per cluster/vhost), I can see 2 potential solutions using outlier detection.
In our environment, we have to disable panic mode, because to enable fail_traffic_on_panic, it needs to enable zone aware routing. To enable zone aware routing, according to this doc, it needs users to define static local cluster in the bootstrap file, and it also needs change in the upstream service. These are hard to do in our environment.
But I think a potential concern with panic mode disabled approach is that envoy would return 503 only after every upstream host is down. Let's say 80% upstream cluster is down, if we disable panic mode, the traffic may further overwhelm the rest 20% of the traffic and bring the whole upstream service down which is the opposite of what we want to do here. Instead, enabled panic mode + fail_traffic_on_panic don't have this issue because it can start failing all requests when x% of traffic is down.
I don't know whether this is just a rare case in reality. I guess most likely in reality 100% service would go down, so panic mode disabled should work fine. But I feel this may happen when the upstream cluster does a partial rollout (say 80% on v2, 20% on v1)
Is my understanding correct? If so, is there a way to use fail_traffic_on_panic without zone aware routing?
Thanks!!
The text was updated successfully, but these errors were encountered: