
Health check / liveness endpoint / IsAlive function does not support multiple concurrent callers #40

Open
james-johnston-thumbtack opened this issue Sep 15, 2022 · 2 comments


@james-johnston-thumbtack
Contributor

If two clients concurrently call the /liveness route on the REST API, one of them will time out. This is easy to reproduce from the command line. Note that I use a & after the first curl command so that it runs asynchronously alongside the second curl command. (localhost:17303 is the kafka scheduler REST API in my setup.)

% ( time curl -i http://localhost:17303/liveness & ; time curl -i http://localhost:17303/liveness )
HTTP/1.1 200 OK
Date: Thu, 15 Sep 2022 03:02:51 GMT
Content-Length: 0

curl -i http://localhost:17303/liveness  0.00s user 0.01s system 0% cpu 2.351 total
HTTP/1.1 500 Internal Server Error
Date: Thu, 15 Sep 2022 03:02:53 GMT
Content-Length: 0

curl -i http://localhost:17303/liveness  0.00s user 0.01s system 0% cpu 5.024 total

The first one completes successfully, as expected. But the second one times out. The server logs show a line like:

[00] ERRO[2022-09-15T03:02:53Z] timeout for isalive probe from liveness channel

This presents a problem if multiple components in a distributed system check health simultaneously. For example, the EC2 target health checks documentation points out that "Health checks for a Network Load Balancer are distributed and use a consensus mechanism to determine target health. Therefore, targets receive more than the configured number of health checks."

As best I can tell, the issue is that the IsAlive function fakes a scheduled message from Kafka at

case storeEvents <- isAliveSchedule(epoch):

which has a hard-coded ID of ||is-alive||:
const isAliveID string = "||is-alive||"

Two concurrent calls to IsAlive will result in two timers with the same ID being added. But the timer code deduplicates those using the ID:
// if found, we stop the existing timer
if t, ok := ts.items[s.ID()]; ok {
…and since the ID is hard-coded, the second IsAlive call stomps over the first one in timers, so only one timer event is ever delivered on the livenessChan.
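
To make the collision concrete, here is a minimal standalone sketch of that dedup-by-ID behavior (the type and field names are illustrative, not the scheduler's actual code):

package main

import "fmt"

// timerStore is a toy model of a timer store that deduplicates by schedule ID.
type timerStore struct {
	items map[string]struct{}
}

func (ts *timerStore) add(id string) {
	if _, ok := ts.items[id]; ok {
		// an existing timer with the same ID is stopped and replaced
		fmt.Println("replacing existing timer for", id)
	}
	ts.items[id] = struct{}{}
}

func main() {
	ts := &timerStore{items: map[string]struct{}{}}
	const isAliveID = "||is-alive||"

	// two concurrent IsAlive probes both register the same hard-coded ID
	ts.add(isAliveID)
	ts.add(isAliveID) // stomps over the first, so only one timer ever fires

	fmt.Println("timers registered:", len(ts.items)) // prints 1, not 2
}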

@fkarakas
Collaborator

Hello @james-johnston-thumbtack,
Thank you for the report.
Basically, isAlive was designed to make sure the "core" of the scheduler is working properly. I am not sure it is actually a good idea, and it is perhaps too complicated.
Most of the time people just add a "dummy" endpoint to the HTTP server, but that endpoint checks nothing... so that is not great either.
I think we can just add a random suffix to the ID, so we will still be able to identify these "probe" schedules.
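
A rough sketch of that idea, using hypothetical helper names rather than the scheduler's actual code, could look like:

package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strings"
)

const isAliveID = "||is-alive||"

// newProbeID appends a random suffix so concurrent probes get distinct IDs,
// while keeping the "||is-alive||" prefix so probe schedules stay recognizable.
func newProbeID() (string, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return isAliveID + hex.EncodeToString(buf), nil
}

// isProbeID reports whether a schedule ID belongs to a liveness probe.
func isProbeID(id string) bool {
	return strings.HasPrefix(id, isAliveID)
}

func main() {
	a, _ := newProbeID()
	b, _ := newProbeID()
	fmt.Println(a, b, isProbeID(a)) // two distinct IDs, both identifiable as probes
}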

@james-johnston-thumbtack
Contributor Author

To be honest, I just changed my AWS configuration to use the /info endpoint as a health check instead, to work around this problem and unblock myself.

Yeah, I agree; adding some suffix to the ID could work to fix the existing implementation (an auto-incrementing ID would avoid collisions). Some things to think about might be whether livenessChan needs a larger buffer. Also, if the "timeout for isalive probe from liveness channel" error happens but the scheduler still completes the probe later, there could be a memory leak if livenessChan is never drained. (That is, the timeout error abandons the event, which may later show up in livenessChan with nobody left to read it.)
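
One possible shape for that, sketched with a per-probe buffered channel rather than the scheduler's actual livenessChan (names and structure are illustrative only):

package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForProbe sketches one way to avoid the drain problem: each probe gets
// its own 1-slot buffered channel, so a reply that arrives after the timeout
// just sits in that buffer and is garbage-collected with it, instead of
// lingering in a shared channel.
func waitForProbe(reply <-chan struct{}, timeout time.Duration) error {
	select {
	case <-reply:
		return nil
	case <-time.After(timeout):
		return errors.New("timeout for isalive probe")
	}
}

func main() {
	reply := make(chan struct{}, 1) // buffered: a late send never blocks the sender

	// simulate the scheduler answering the probe after the caller has given up
	go func() {
		time.Sleep(2 * time.Second)
		reply <- struct{}{} // lands in the buffer; nothing blocks or leaks
	}()

	fmt.Println(waitForProbe(reply, time.Second)) // times out after 1s

	time.Sleep(2 * time.Second) // let the late reply land in the buffer, then exit
}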

But I also think good Prometheus metrics are as good a measure of health, if not better; users can set up alerts on those. That is actually my preference, and a simple response indicating "200 OK" is good enough.
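
For reference, a minimal sketch of that simpler approach (the port is reused from the reproduction above; this is not the project's actual server setup):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// plain 200 OK liveness endpoint: it only proves the HTTP server is up
	http.HandleFunc("/liveness", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// deeper health is observed through metrics and alerting instead
	http.Handle("/metrics", promhttp.Handler())

	http.ListenAndServe(":17303", nil)
}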
