This repository has been archived by the owner on May 12, 2021. It is now read-only.

Count number of times partitioned tasks reenter the cluster as healthy #30

Open
DavidMcLaughlin opened this issue Aug 14, 2018 · 0 comments

Comments

@DavidMcLaughlin
Contributor

Currently, when a task goes PARTITIONED and then LOST, Aurora schedules a replacement. The original task can later report that it is healthy, at which point Aurora kills it. Receiving this signal is a strong indicator that extending timeouts would avoid unnecessary churn in the cluster.

Add a metric to monitor how often this happens.
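A rough sketch of the counting logic, in Python rather than Aurora's own code, just to pin down what would be measured. The metric name, the `ReentryTracker` class, and the use of `RUNNING` as the "healthy" signal are all assumptions made for illustration; only the `PARTITIONED` and `LOST` states come from the description above.

```python
from collections import Counter

# Hypothetical metric name; not an existing Aurora stat.
REENTRY_METRIC = "partitioned_tasks_reentered_healthy"


class ReentryTracker:
    """Counts tasks that report healthy after going PARTITIONED and then LOST."""

    def __init__(self):
        self.metrics = Counter()
        self._partitioned = set()           # tasks currently PARTITIONED
        self._lost_after_partition = set()  # tasks that went PARTITIONED -> LOST

    def on_state_change(self, task_id, new_state):
        if new_state == "PARTITIONED":
            self._partitioned.add(task_id)
        elif new_state == "LOST":
            # Only a LOST that follows a partition is interesting here.
            if task_id in self._partitioned:
                self._partitioned.discard(task_id)
                self._lost_after_partition.add(task_id)
        elif new_state == "RUNNING":  # assumed stand-in for "task reports healthy"
            if task_id in self._lost_after_partition:
                # The task came back after a replacement was already scheduled.
                self._lost_after_partition.discard(task_id)
                self.metrics[REENTRY_METRIC] += 1
            # A task that recovers before LOST is not counted.
            self._partitioned.discard(task_id)


tracker = ReentryTracker()
for state in ("RUNNING", "PARTITIONED", "LOST", "RUNNING"):
    tracker.on_state_change("task-1", state)
print(tracker.metrics[REENTRY_METRIC])
```

A task that recovers while still PARTITIONED, before the timeout marks it LOST, is deliberately not counted, since no replacement was scheduled for it and there is no churn to avoid.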
