Description
As the parents of a topology are marked down under load, 'HostStatus::getHostStatus' can cause excessive lock contention, resulting in high system time, reduced throughput and holes in the stats.
During failure testing, overloading the configured parents causes lock contention on the stats storage.
It was possible to consume almost all ET_NET thread time with a few failing parents and fewer than 5,000 RPS.
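To illustrate the kind of contention described above, here is a minimal sketch in which every status check from every worker thread takes one shared mutex. It is not the Traffic Server code: 'StatusTable', 'is_up' and 'count_down_checks' are hypothetical names, and the single-mutex layout is an assumption made for illustration, since this report does not show how 'HostStatus::getHostStatus' is implemented.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a shared host status table; NOT the ATS HostStatus class.
class StatusTable {
public:
  void set(const std::string &host, bool up) {
    std::lock_guard<std::mutex> guard(mutex_);
    table_[host] = up;
  }

  // Every lookup takes the same mutex, so concurrent callers serialize here.
  bool is_up(const std::string &host) {
    std::lock_guard<std::mutex> guard(mutex_);
    ++lookups_;
    auto it = table_.find(host);
    return it == table_.end() ? true : it->second; // unknown hosts count as up
  }

  std::uint64_t lookups() {
    std::lock_guard<std::mutex> guard(mutex_);
    return lookups_;
  }

private:
  std::mutex mutex_;
  std::unordered_map<std::string, bool> table_;
  std::uint64_t lookups_ = 0;
};

// Starts `threads` workers that each check `host` `per_thread` times and
// returns how many of those checks reported the host as down.
std::uint64_t count_down_checks(StatusTable &table, const std::string &host,
                                int threads, int per_thread) {
  std::atomic<std::uint64_t> down{0};
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&table, &host, &down, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        if (!table.is_up(host)) {
          down.fetch_add(1, std::memory_order_relaxed);
        }
      }
    });
  }
  for (auto &w : workers) {
    w.join();
  }
  return down.load();
}
```

In this sketch, each request-handling thread would pay for the mutex on every parent check, so the cost grows with both the request rate and the number of threads competing for the lock. Timing 'count_down_checks' with an increasing thread count is one way to observe that serialization on a test machine.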
Fault replication
Increase load through an edge -> parent configuration until the parents start to fail.
I used connection limits as the failure trigger because they fail predictably.
Observations
As parents fail there is an increase in 'HostStatus::getHostStatus' contention, especially when the last parent fails.
This reduces all 'good' work, including error responses to clients and serving content already in cache.
- perf traces and flame graphs show near 100% system CPU consumption on lock activity.
- traffic_server metrics stop updating.
- Response and data rates drop.
