YamlFileConfigurationService fails to start health check monitors #624
Summary
This PR fixes a race condition in YamlFileConfigurationService that sometimes prevents it from starting health check monitoring services.
In addition, YamlFileConfigurationService now includes the backend service application name in the health check monitoring service name. This makes the application name visible in error messages should other bugs surface in the future.
User impact
This bug mainly affects deployments with several (more than one) YamlFileConfigurationService providers. When the health check monitoring service is not started, the relevant HostProxy objects for the origin are not added to the load balancing group, and therefore they are always unreachable.
The issue doesn't affect, or is highly unlikely to affect, deployments with only one YamlFileConfigurationService provider.
Root cause
The Styx object database is a lockless concurrent in-memory database for storing routing, provider, and other objects. Its compute method takes a lambda that provides a new Styx object to be stored in the database. The compute lambda must be idempotent because the database will call it again if it detects a concurrent modification to the database.
But the lambda in YamlFileConfigurationService was not idempotent: it attempted to start the health check monitoring service inside this callback. Therefore, a retry after a concurrent modification caused the health check monitoring service to be started again, resulting in an IllegalStateException.
Fixed this by caching the return value from the lambda callback. Consider this a workaround until I figure out something better.
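To illustrate the failure mode and the caching workaround, here is a minimal sketch. It is not the Styx code: ToyDatabase, Monitor, registerNaively, and registerWithCache are hypothetical stand-ins, the compute signature is simplified to take a Supplier, and the concurrent modification is simulated with a forced-conflict counter instead of a real second writer.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration only; none of these are Styx classes.
class ComputeRetrySketch {

    // A service that, like the health check monitor, may only be started once.
    static final class Monitor {
        private final AtomicInteger starts = new AtomicInteger();

        void start() {
            if (starts.incrementAndGet() > 1) {
                throw new IllegalStateException("monitor already started");
            }
        }

        int startCount() {
            return starts.get();
        }
    }

    // Toy lock-free store: compute() calls the supplier again whenever its write is rejected.
    static final class ToyDatabase {
        private final AtomicReference<Object> slot = new AtomicReference<>();
        private int forcedConflicts; // assumption: simulates writes lost to another writer

        ToyDatabase(int forcedConflicts) {
            this.forcedConflicts = forcedConflicts;
        }

        void compute(Supplier<Object> supplier) {
            while (true) {
                Object current = slot.get();
                Object next = supplier.get();
                if (forcedConflicts > 0) {
                    forcedConflicts--; // pretend a concurrent modification was detected
                    continue;          // retry, which invokes the supplier again
                }
                if (slot.compareAndSet(current, next)) {
                    return;
                }
            }
        }
    }

    // Non-idempotent callback: the side effect runs on every attempt.
    static void registerNaively(ToyDatabase db, Monitor monitor) {
        db.compute(() -> {
            monitor.start();
            return monitor;
        });
    }

    // Workaround: cache the first result so that retries reuse it without side effects.
    static void registerWithCache(ToyDatabase db, Monitor monitor) {
        AtomicReference<Object> cached = new AtomicReference<>();
        db.compute(() -> {
            if (cached.get() == null) {
                monitor.start();
                cached.set(monitor);
            }
            return cached.get();
        });
    }

    public static void main(String[] args) {
        try {
            registerNaively(new ToyDatabase(1), new Monitor());
        } catch (IllegalStateException e) {
            System.out.println("naive callback failed on retry: " + e.getMessage());
        }

        Monitor monitor = new Monitor();
        registerWithCache(new ToyDatabase(1), monitor);
        System.out.println("cached callback start count: " + monitor.startCount());
    }
}
```

In this sketch the cache lives outside the lambda, so a retried callback returns the object built on the first attempt and never calls start() a second time.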