Is your feature request related to a problem? Please describe.
We recently had a multi-hour outage on a Vault cluster using the DynamoDB storage backend, and the investigation was challenging due to the limited information available in logs and metrics. Vault was reporting a RATE(dynamodb_get_total_count) close to zero, while the AWS console was reporting ~25k qps of DynamoDB read requests. core_in_flight_requests spiked to 200k, and we saw a large number of goroutines and high memory usage. We suspect the DynamoDB backend was returning errors and the AWS client was retrying indefinitely (we did not have AWS_DYNAMODB_MAX_RETRIES set, but that has since been fixed). If that was the case, then most likely we hit the limit of the DynamoDB PermitPool, causing all other requests to wait on that lock. As far as we could tell, however, nothing in the logs or metrics indicated that the PermitPool limit had been reached.
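For reference, capping client-side retries is roughly what we expect AWS_DYNAMODB_MAX_RETRIES / max_retries to do. The sketch below uses aws-sdk-go v1's MaxRetries option; the region and retry count are placeholders and this is not the backend's actual wiring, just an illustration of bounding retries so a failing table surfaces errors instead of piling up in-flight requests:

```go
// Hypothetical sketch: build a DynamoDB client with a bounded retry count.
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

func newDynamoClient() *dynamodb.DynamoDB {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:     aws.String("us-east-1"), // placeholder region
		MaxRetries: aws.Int(5),              // bound retries instead of retrying indefinitely
	}))
	return dynamodb.New(sess)
}
```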
Describe the solution you'd like
Metrics covering the DynamoDB PermitPool would be very useful. If we had gauge metrics for Active Permits, Pool Size, and Permits Waiting, we could quickly determine that the DynamoDB layer is the cause of requests backing up.
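A minimal sketch of the kind of instrumentation we have in mind, using the go-metrics library Vault already emits telemetry through. The pool implementation and metric names here are illustrative assumptions, not Vault's actual PermitPool code:

```go
// Illustrative permit pool that emits gauges for pool size, active permits,
// and callers waiting on a permit. Metric names are placeholders.
package main

import (
	"sync/atomic"

	metrics "github.com/armon/go-metrics"
)

type InstrumentedPermitPool struct {
	sem     chan struct{}
	waiting int64
}

func NewInstrumentedPermitPool(size int) *InstrumentedPermitPool {
	metrics.SetGauge([]string{"dynamodb", "permit_pool", "size"}, float32(size))
	return &InstrumentedPermitPool{sem: make(chan struct{}, size)}
}

func (p *InstrumentedPermitPool) Acquire() {
	// Track how many callers are blocked waiting for a permit.
	metrics.SetGauge([]string{"dynamodb", "permit_pool", "waiting"},
		float32(atomic.AddInt64(&p.waiting, 1)))
	p.sem <- struct{}{} // blocks when the pool is exhausted
	metrics.SetGauge([]string{"dynamodb", "permit_pool", "waiting"},
		float32(atomic.AddInt64(&p.waiting, -1)))
	metrics.SetGauge([]string{"dynamodb", "permit_pool", "active"}, float32(len(p.sem)))
}

func (p *InstrumentedPermitPool) Release() {
	<-p.sem
	metrics.SetGauge([]string{"dynamodb", "permit_pool", "active"}, float32(len(p.sem)))
}
```

With something like this in place, a sustained "waiting" gauge at a high value while "active" sits at the pool size would have pointed us at the PermitPool immediately during the outage described above.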
Describe alternatives you've considered
Explain any additional use-cases
Additional context