fix(*): avoid creating multiple timers to run the same active check #156
Conversation
Tested within Kong Gateway 3.4.3.5 with limited running timer concurrency (32), and the fix succeeds in limiting the pending healthcheck jobs to a reasonable number.
LGTM
I have some concerns, but this change seems needed either way. Please note that I am writing this mostly from memory (which is not very good 😄). Does this change fix the issue in the long run? I mean, if we have enough problematic targets, the available timers would be exhausted either way, wouldn't they? Also, did you check where the execution is waiting forever/for a long time? I believe it's in the

What do you think?
@locao Thank you for your review during busy hours!
No, the number of "pending" timers will stay at the same level as the number of targets configured inside the healthchecker. That means the number of "pending" timers grows with the scale of targets, not multiplied by the number of elapsed intervals. Before the change, if an active healthcheck could not finish within the interval, we would still create new jobs when the next interval came, which resulted in indefinite growth of the pending active-check jobs. After the change it no longer grows indefinitely.
The execution of a single active healthcheck works as expected; it reaches the timeout as configured. The real problem is that the time to reach a timeout (or a failure/success, or whatever the result is) may be larger than the interval, and we are "producing" active-check jobs (by creating timers) every interval, so the "consuming" speed of the already-created jobs cannot catch up. That results in indefinite growth and the exhaustion of the timers. The final result of this "timer exhaustion" behaves differently depending on whether we use the native timer or timer-ng. With the native timer we may encounter "max running timers are not enough"; with timer-ng, if the concurrency limit is insufficient, we end up with a huge number of pending "jobs" inside timer-ng's data structure and excessive memory usage.
@windmgc thanks for the detailed reply!
So if we have more problematic targets than available timers, we would exhaust them either way. TBH, the scenario I mentioned is way, way more unlikely to happen than what we had before your fix, and maybe not even fixable without a big overhaul of the library. This is a big improvement, well done!
Nice work!
This is an alternative to #155.
The PR adds a shared dict key (a lock) indicating that an active check job (timer) has been created; the key is removed when the active check job finishes. This avoids multiple active-check jobs being created/running for the same active checker.
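The library itself is Lua/OpenResty; as a rough illustration of the pattern only, here is a Python sketch. `SharedDict.add` mimics the atomic set-if-absent semantics of OpenResty's `ngx.shared.DICT:add()` (which fails if the key already exists, making it usable as a lock); every other name here is invented and not from the PR:

```python
import threading

class SharedDict:
    """Toy stand-in for ngx.shared.DICT (add/delete only)."""
    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def add(self, key, value):
        # Atomic "set only if absent", like ngx.shared.DICT:add().
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        with self._mutex:
            self._data.pop(key, None)

shm = SharedDict()

def schedule_active_check(target, run_check):
    """Run one active check for `target`, unless one is already in flight."""
    key = "active_check_running:" + target
    if not shm.add(key, True):
        return False  # a check for this target is already pending/running
    try:
        run_check(target)
    finally:
        shm.delete(key)  # release the lock so the next interval can schedule
    return True
```

The key detail (mirrored from the PR description) is that the lock is released only when the check job finishes, so however long a slow check takes, at most one job per checker ever exists.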
Alleviates FTI-5847