RFC: Managed Ingestion of Envoy Telemetry #262

Open
dastbe opened this issue Sep 23, 2020 · 0 comments
Labels: App Mesh Envoy, App Mesh, RFC, Roadmap: Awaiting Customer Feedback (We need to get more information in order to understand how we will implement this feature.)

Comments

dastbe (Contributor) commented Sep 23, 2020

Today, the App Mesh control plane has very little insight into how healthy a customer's mesh actually is. We can see sudden changes in the number of Envoys connected for a given Virtual Node, and we know when configuration is unusable by Envoy, but we have no automatic feedback mechanism that tells us how that configuration is working out for customers in practice.

What we're proposing is an optional integration that allows customers to send operational data about their Envoys to App Mesh. This data would initially cover operational metrics such as request success/failure rates and latency, at the granularity of the Envoy cluster (Virtual Node) and endpoint. In combination with endpoint information

At a high level, some features we could build off this information are:

  • Direct integration with CloudWatch Metrics, including synthesis of metrics like number of healthy Envoys per Virtual Node.
  • Direct integration with services like CloudWatch Alarms and Personal Health Dashboard to provide automated
  • Proactive recommendations for configuration changes
  • Dynamic or "autoscaled" configuration for things like circuit breakers, retries, and timeouts where App Mesh will directly tune your network configuration.
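As a rough illustration of the metric synthesis mentioned in the first bullet, the sketch below rolls per-Envoy request counts up to a per-Virtual-Node success rate and a count of healthy Envoys. Everything here is hypothetical: `EnvoySample`, `summarize`, and the 99% health threshold are assumptions made for the example, not part of any App Mesh or Envoy API, and the RFC does not define what "healthy" means.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvoySample:
    # Hypothetical report from one Envoy; not an App Mesh or Envoy API type.
    virtual_node: str
    envoy_id: str
    requests_total: int
    requests_failed: int


def summarize(samples, healthy_threshold=0.99):
    """Roll per-Envoy samples up to per-Virtual-Node metrics."""
    totals = defaultdict(lambda: {"requests": 0, "failed": 0, "healthy_envoys": 0})
    for s in samples:
        node = totals[s.virtual_node]
        node["requests"] += s.requests_total
        node["failed"] += s.requests_failed
        # Assumption: an Envoy is "healthy" when its own success rate meets
        # the threshold; an Envoy with no traffic is counted as healthy.
        ok = s.requests_total - s.requests_failed
        if s.requests_total == 0 or ok / s.requests_total >= healthy_threshold:
            node["healthy_envoys"] += 1

    summary = {}
    for name, node in totals.items():
        requests = node["requests"]
        summary[name] = {
            # None signals "no traffic observed" rather than a 0% or 100% rate.
            "success_rate": (requests - node["failed"]) / requests if requests else None,
            "healthy_envoys": node["healthy_envoys"],
        }
    return summary
```

A real integration would presumably aggregate over time windows and include latency as well; this sketch only shows the shape of the roll-up.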

At the same time, gathering this information would also help App Mesh improve the aggregate customer experience. As we've observed in #151 and #227, particular combinations of configuration and state change can lead to a surprising and poor customer experience. Even as we continue to improve our test coverage, a change can still impact customers and require them to reach out to us for support. By integrating customer telemetry into our release process, we can monitor behavior changes as new configuration is deployed and roll back in the event of customer impact. This would also allow us to analyze the trends and behavior customers are seeing and proactively improve configuration.
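One way to picture the release-process integration described above is a simple gate that compares error rates before and after a configuration change. This is only a sketch under stated assumptions: the function names, the `(failed, total)` input shape, the one-percentage-point tolerance, and the minimum request count are all hypothetical and do not describe how App Mesh actually decides to roll back.

```python
def error_rate(failed, total):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def should_roll_back(before, after, max_increase=0.01, min_requests=100):
    """Decide whether a deployment regressed enough to roll back.

    `before` and `after` are hypothetical (failed, total) request counts
    observed before and after the new configuration was deployed.
    """
    _, after_total = after
    # Assumption: with too little post-deploy traffic, do not act on noise.
    if after_total < min_requests:
        return False
    return error_rate(*after) - error_rate(*before) > max_increase


# Example: error rate goes from 1% to 5% after the change.
print(should_roll_back((10, 1000), (50, 1000)))
```

A fixed threshold keeps the example short; a production gate would more likely use a statistical comparison over matched time windows.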

@dastbe dastbe added App Mesh and removed App Mesh labels Sep 23, 2020
@jamsajones jamsajones added App Mesh, App Mesh Envoy, Question, Roadmap: Awaiting Customer Feedback (We need to get more information in order to understand how we will implement this feature.), RFC and removed Question labels Sep 23, 2020
@jamsajones jamsajones changed the title Feature Request: Managed Ingestion of Envoy Telemetry RFC: Managed Ingestion of Envoy Telemetry Sep 23, 2020