roachtest: failover tests should track outage duration, not pMax
#133361
Labels: A-kv, A-testing, C-enhancement, T-kv
Currently, the failover test suite measures an outage by the pMax (maximum latency) of any request issued during the outage. As a result, it cannot tell apart an outage in which some requests hit timeouts for a short period from one in which requests keep hitting timeouts for a much longer period. This has forced us to make changes like #133214. It is also a non-standard way to define an outage, and it does not map to how our customers think about availability. We should switch the test to set a statement timeout and then measure the period during which requests hit that timeout.
Jira issue: CRDB-43556
Epic CRDB-42947