fix(*): avoid creating multiple timers to run the same active check #156
Conversation
Tested within Kong Gateway 3.4.3.5 with limited running timer concurrency (32), and the fix succeeds in limiting the pending healthcheck jobs to a reasonable number.
LGTM
I have some concerns, but this change seems needed either way. Please note that I am writing this mostly from memory (which is not very good 😄). Does this change fix the issue in the long run? I mean, if we have enough problematic targets, the available timers would be exhausted either way, wouldn't they? Also, did you check where the execution is waiting forever/for a long time? I believe it's in the

What do you think?
@locao Thank you for your review during busy hours!
No, the number of "pending" timers will stay at the same level as the number of targets configured inside the healthchecker. That means the number of "pending" timers grows with the scale of targets, not multiplied by the number of elapsed intervals. Before the change, if an active healthcheck could not finish within the interval, we would still create new jobs when the next interval came, which resulted in indefinite growth of the pending active-check jobs. After the change it no longer grows indefinitely.
The execution of a single active healthcheck works as expected; it reaches the timeout as configured. The real problem is that the time to reach a timeout (or a failure/success, or whatever the result is) may be larger than the interval, and we are "producing" active-check jobs (by creating timers) every interval, so the "consuming" speed of the already-created jobs cannot catch up. That results in indefinite growth and the exhaustion of the timers. The final result of this "timer exhaustion" behaves differently depending on whether we use the native timer or timer-ng. With the native timer we may encounter "max running timers are not enough"; with timer-ng, if the concurrency limit is insufficient, we end up with a huge number of pending "jobs" inside timer-ng's data structure and excessive memory usage.
@windmgc thanks for the detailed reply!
So if we have more problematic targets than available timers, we would exhaust them either way. TBH, the scenario I mentioned is way, way more unlikely to happen than what we had before your fix, and maybe not even fixable without a big overhaul of the library. This is a big improvement, well done!
Nice work!
This is an alternative to #155.
The PR adds a shared dict key (a lock) indicating that an active check job (timer) has been created; the key is removed when the active check job finishes. This avoids multiple active-check jobs being created/running for the same active checker.
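The library itself is Lua/OpenResty; as a rough illustration of the pattern only, here is a Python sketch. `SharedDict.add` mimics the atomic set-if-absent semantics of OpenResty's `ngx.shared.DICT:add()` (which fails if the key already exists, making it usable as a lock); every other name here is invented and not from the PR:

```python
import threading

class SharedDict:
    """Toy stand-in for ngx.shared.DICT (add/delete only)."""
    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def add(self, key, value):
        # Atomic "set only if absent", like ngx.shared.DICT:add().
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        with self._mutex:
            self._data.pop(key, None)

shm = SharedDict()

def schedule_active_check(target, run_check):
    """Run one active check for `target`, unless one is already in flight."""
    key = "active_check_running:" + target
    if not shm.add(key, True):
        return False  # a check for this target is already pending/running
    try:
        run_check(target)
    finally:
        shm.delete(key)  # release the lock so the next interval can schedule
    return True
```

The key detail (mirrored from the PR description) is that the lock is released only when the check job finishes, so however long a slow check takes, at most one job per checker ever exists.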
Alleviates FTI-5847