Health check / liveness endpoint / IsAlive function does not support multiple concurrent callers #40
Comments
hello @james-johnston-thumbtack ,

To be honest, I just changed my AWS configuration to use the

Yeah, I agree and think adding some suffix as an ID could work to fix the existing implementation (an auto-incrementing ID would avoid collisions). I guess some things to think about might be whether `livenessChan` needs a larger buffer. Also, if the "timeout for isalive probe from liveness channel" error happens but the scheduler still does its thing, then there could be a memory leak if `livenessChan` is not drained. (That is, the timeout error abandons the event, which would sit in `livenessChan` if it were to show up there later.)

But I also think having good Prometheus metrics is as good a measure of health, if not a better one; users can set up alerts on those. I think that is actually my preference, and a simple response indicating "200 OK" is good enough.
If two clients concurrently call the `/liveness` route on the REST API, one of them will time out. This is easy to reproduce from the command line. Note that I use a `&` after the first `curl` command so that it runs asynchronously alongside the second `curl` command. (`localhost:17303` is the REST API for kafka scheduler for me.)

The first one completes successfully, as expected. But the second one times out. The server logs show a line like:
This presents a problem if multiple things in a distributed system are simultaneously checking the health. For example, EC2 target health checks documentation points out that "Health checks for a Network Load Balancer are distributed and use a consensus mechanism to determine target health. Therefore, targets receive more than the configured number of health checks."
As best I can tell, the issue is that the `IsAlive` function fakes a scheduled message from Kafka at kafka-message-scheduler/scheduler/scheduler.go, line 207 in 9b44c3c.

That message has a hard-coded ID of `||is-alive||` (kafka-message-scheduler/scheduler/helper.go, line 41 in 9b44c3c).
Two concurrent calls to `IsAlive` will result in two timers with the same ID being added. But the timer code deduplicates those using the ID (kafka-message-scheduler/internal/timers/timers.go, lines 61 to 62 in 9b44c3c). The second `IsAlive` call stomps over the first `IsAlive` call in timers, and thus only one timer event is ever returned in the `livenessChan`.