feat(pdk): reimplement plugin queues #10172
Conversation
Force-pushed fd6c2d8 to 15e5352
I started adding changes on #10197 on top of @hanshuebner 's changes. I still need to address:
I have seen @ADD-SP's comments about long-running timers potentially causing memory leaks. I'd like to know more, i.e. what does "long running" really mean and why do such timers cause leaks? I'm asking because the retry logic completely lives within the handler function and it could be running for a long time, depending on how the queue is configured. One of my goals with the refactoring was to simplify the code so that instead of creating multiple timers to organize polling and retries, there would just be one control flow. Rather than splitting this up again, can we maybe figure out how to create a long-running background process that does not rely on timers and callbacks at all? It seems that underneath, we already have that mechanism, as both timer callbacks and worker functions can coexist and are scheduled onto the same system thread already.
OpenResty's timer has a memory pool. The pool is only destroyed when the timer is destroyed, so a long-running timer may cause a memory leak.
Thank you for the explanation. If I understand correctly, the problem is not really the run time, but the accumulation of allocations from the memory pool that can cause issues in the long run. Is it possible to determine the current size of the pool? Why is the memory pool tied to the timer callback in the first place?
We can't get the current size of the pool.
In OpenResty, a timer is also a request, but a fake request, in order to adapt to the Nginx's request processing model (state machine). For Nginx code style, we usually get memory from the pool of the current request, so maybe some external lib (such as FFI function) will do this thing. |
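The situation described above can be sketched as follows. This is a hedged illustration, not code from the PR: `do_work` is a hypothetical placeholder, and the point is that everything allocated from the fake request's pool lives until the handler returns, so a handler that loops for the lifetime of the worker keeps its pool growing.

```lua
-- Hypothetical sketch of a long-running timer handler. In OpenResty each
-- timer callback runs inside a fake request with its own memory pool;
-- allocations made from that pool (e.g. by Nginx-style C code or external
-- libs) are only released when the timer's fake request finishes.
local function handler(premature)
  while not premature and not ngx.worker.exiting() do
    do_work()      -- hypothetical; pool allocations accumulate across iterations
    ngx.sleep(1)
  end
  -- only here, when the handler returns, is the pool destroyed
end

local ok, err = ngx.timer.at(0, handler)
```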
Again, thank you for the very helpful explanation! Given that, switching from the long-running to
Force-pushed 1d45ce9 to 59a0af3
Force-pushed 00b6d95 to ef4f7fa
spec/01-unit/19-hybrid/03-compat_spec.lua leaks a workspace in ngx.ctx which causes the queue tests to fail.
Co-authored-by: Harry <harrybagdi@gmail.com>
Previously, 3xx would be interpreted to mean success, likely causing log entries to be lost.
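A minimal sketch of the stricter success check implied by that fix (hypothetical helper name; the actual code in the PR may differ): only 2xx responses count as success, so a 3xx no longer causes log entries to be silently dropped.

```lua
-- Hypothetical: treat only 2xx HTTP status codes as success.
-- Under the previous behavior, a 3xx redirect would have been
-- interpreted as success and the batch discarded.
local function is_success(status)
  return status >= 200 and status < 300
end
```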
Force-pushed e0be105 to 2fa9c8b
No further unresolved comments are present, but GitHub still indicates "change requested". Dismissing review.
LGTM
```lua
if queue then
  queue:log_debug("queue exists")
  -- We always use the latest configuration that we have seen for a queue and handler.
  queue.handler_conf = handler_conf
```
Should we flush the queue before updating the conf? Otherwise, a previously enqueued item will be handled with the new conf.
We want the configuration update to be in effect immediately and affect previously queued items as well as new ones.
Or we could simply bind the conf to the entity.
This does not seem to match our previous discussions, and it alters the behavior significantly from the current implementation. The risk is that the queue name must be chosen carefully under the assumption that the config might change between when an item is enqueued and when it is flushed, and such a decision should probably be considered more carefully at such a global level.
We could do that, but it would incur a memory cost and make it impossible to get rid of items that are not making progress due to a misconfiguration. Imagine that a customer has set a high max_retry_time and they need to change the log server address because the old address no longer works. They would have no way to get rid of already queued items in that case, other than waiting for the max_retry_time to expire and the entries being deleted.
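The two options under discussion can be contrasted with a short hypothetical sketch (the enqueue shape below is illustrative, not the PR's actual API):

```lua
-- Option A (what the PR does): one handler_conf per queue. A config update
-- takes effect immediately, for pending items as well as new ones.
queue.handler_conf = new_conf

-- Option B (binding the conf to each entry): every item carries the conf
-- that was active when it was enqueued. This costs memory per entry and
-- leaves no way to redirect stuck items after a reconfiguration.
queue:enqueue({ data = payload, conf = conf_at_enqueue_time })  -- hypothetical shape
```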
This does not seem to match our previous discussions, and it alters the behavior significantly from the current implementation.
@dndx Can you explain what you refer to by "This"?
Imagine that a customer has set a high max_retry_time and they need to change the log server address because the old address no longer works. They would have no way to get rid of already queued items in that case, other than waiting for the max_retry_time to expire and the entries being deleted.
It certainly deserves mentioning that because we don't have an explicit configure callback in plugins that would be invoked when a reconfiguration is done, updated plugin parameters would only be put into effect when a new item is queued.
It certainly deserves mentioning that because we don't have an explicit configure callback in plugins that would be invoked when a reconfiguration is done, updated plugin parameters would only be put into effect when a new item is queued.
rate-limiting advanced solves this problem by registering for plugin CRUD events. The configure handler idea would also benefit at least: 1. exit-transformer, 2. rate-limiting advanced
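The configure-callback idea could look something like this sketch. Everything here is hypothetical: the callback name, the shape of its argument, and the `Queue.update_conf` helper are assumptions for illustration, not an existing Kong API.

```lua
-- Hypothetical sketch of a plugin-level configure callback, invoked once per
-- reconfiguration so that queue parameters are refreshed even when no new
-- item is being enqueued.
local MyPlugin = {}

function MyPlugin:configure(configs)
  -- `configs` is assumed to carry all active configurations of this plugin
  for _, conf in ipairs(configs or {}) do
    Queue.update_conf(conf.queue_name, conf)  -- hypothetical helper
  end
end
```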
Are plugin CRUD events also issued in dbless and hybrid mode?
@hanshuebner In dbless mode, RAL registers for configuration updates.
```lua
-- We've got our first entry from the queue. Collect more entries until
-- max_coalescing_delay expires or we've collected max_batch_size entries to send
while entry_count < self.max_batch_size and (now() - data_started) < self.max_coalescing_delay and not ngx.worker.exiting() do
  ok, err = self.semaphore:wait(((data_started + self.max_coalescing_delay) - now()) / 1000)
```
I've tried the latest version of the queue and noticed that the queue is processed almost immediately after Kong proxies a request when using max_coalescing_delay=20. This unit is supposed to be seconds, but I believe it is used as milliseconds here. After setting max_coalescing_delay=20000, the queue is processed after 20s.
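The mismatch can be worked through with plain numbers (hypothetical values; ngx.semaphore's wait timeout is expressed in seconds):

```lua
-- Hypothetical walk-through of the wait-time computation quoted above.
-- Dividing the remaining delay by 1000 makes a configured
-- max_coalescing_delay of 20 (documented as seconds) behave like 20 ms.
local max_coalescing_delay = 20    -- documented unit: seconds
local data_started = 0             -- time the first entry arrived
local now = 0.001                  -- 1 ms later
local wait = ((data_started + max_coalescing_delay) - now) / 1000
-- wait is ~0.02, so the semaphore wait expires after about 20 ms,
-- which is why the queue appears to flush almost immediately.
```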
When testing the shutdown behavior in that case, Kong now stops forcefully. This wasn't the case previously.
At my last review, the queue had a named_every timer that handled shutdown by flushing the queue.
Line 157 in 2f188be
```lua
kong.timer:named_every(name, queue.poll_time, function(premature, q)
```
This seems to have been replaced with a single timer to decouple the queue processing.
Line 135 in 8189ce3
```lua
kong.timer:named_at("queue " .. name, 0, function(_, q)
```
However I don't see where the premature event is handled. The queue is stuck here waiting for the semaphore.
I'm not sure if this is the right place to post this since the PR is already merged.
@27ascii Thank you for testing things out! Now that the PR is merged, new issues would be preferred for problem reports going forward. I'll create internal tickets for the two problems that you've reported above, though, so no need to open issues unless you feel like it.
After setting the max_coalescing_delay=20000 the queue is processed after 20s.
This is a bug that we're going to fix soon.
When testing the shutdown behavior in that case Kong forcefully stops now. [...]
I don't see where the premature event is handled. The queue is stuck here waiting for the semaphore.
Good catch! I need to dig a little deeper here, but right now, the situation is as you describe: if the shutdown grace period is shorter than max_coalescing_delay, data is likely to be lost while waiting for the semaphore. Expect a fix for this as well.
Thank you again!
Hans
I don't see where the premature event is handled.
Just as a side note regarding this: We're now using an at instead of an every timer, which means that there is no need to distinguish whether the handler was invoked normally or in the unlikely edge case that it is invoked between when it was scheduled to run (immediately) and a shutdown event arriving before it runs. In both cases, the queue needs to be flushed, and that happens without further timer involvement. This is also the reason why graceful shutdown now works to some extent: the previous implementation needed to schedule new timers while flushing, which is not possible during shutdown.
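The two timer styles compared above can be sketched side by side. The `q:poll`, `q:process_once`, and `q:flush` methods are hypothetical placeholders; `kong.timer:named_every` and `kong.timer:named_at` are the calls quoted from the two revisions of the PR.

```lua
-- Recurring timer (previous implementation): the handler must check the
-- `premature` flag on every tick and drain the queue at shutdown, and
-- flushing may require scheduling further timers, which is not possible
-- during shutdown.
kong.timer:named_every(name, queue.poll_time, function(premature, q)
  if premature then
    q:flush()          -- hypothetical: last chance to drain before shutdown
    return
  end
  q:poll()             -- hypothetical: normal periodic processing
end, queue)

-- Single long-lived timer (current implementation): started once, the
-- handler owns its control flow and watches ngx.worker.exiting() itself,
-- so flushing needs no additional timers.
kong.timer:named_at("queue " .. name, 0, function(_, q)
  while not ngx.worker.exiting() do
    q:process_once()   -- hypothetical: collect and send one batch
  end
  q:flush()            -- hypothetical: drain remaining entries on shutdown
end, queue)
```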
Summary
This change improves the robustness of plugins that use queues (formerly "batch queues").
Design doc: https://docs.google.com/document/d/1F9unN4JOV8uA7cZWYSMhd4POG9O4JqJlwlls8rQJnto/edit?usp=sharing
Checklist
Issue reference
KAG-481 Reimplement plugin queues
KAG-503 http-log plugin doesn't update the config headers when sending logs to server