feat(pdk): reimplement plugin queues #10172

hanshuebner · 2023-01-26T18:06:48Z

Summary

This change improves the robustness of plugins that use queues (formerly "batch queues")

Better logging
Implement capacity limits (both # of entries and # of bytes)
Easier to understand configuration parameters
Ability to configure sharing queues between plugin instances
Unit test for queues added

Design doc: https://docs.google.com/document/d/1F9unN4JOV8uA7cZWYSMhd4POG9O4JqJlwlls8rQJnto/edit?usp=sharing

Checklist

The Pull Request has tests
There's an entry in the CHANGELOG
There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - #5088

Issue reference

KAG-481 Reimplement plugin queues
KAG-503 http-log plugin doesn't update the config headers when sending logs to server

kong/plugins/http-log/handler.lua

kong/plugins/http-log/schema.lua

kong/tools/queue.lua

spec/helpers.lua

spec/01-unit/27-queue_spec.lua

kong/tools/queue.lua

kong/plugins/datadog/schema.lua

kikito · 2023-02-01T00:38:09Z

I started adding changes on #10197 on top of @hanshuebner 's changes. I still need to address:

removal of the assertion
using timer.every
Apart from that I think it moves this PR closer to the finish line.

hanshuebner · 2023-02-01T06:47:59Z

using timer.every

I have seen @ADD-SP's comments about long-running timers potentially causing memory leaks. I'd like to know more, i.e. what does "long running" really mean and why do such timers cause leaks?

I'm asking because the retry logic completely lives within the handler function and it could be running for a long time, depending on how the queue is configured. One of my goals with the refactoring was to simplify the code so that instead of creating multiple timers to organize polling and retries, there would just be one control flow.

Rather than splitting this up again, can we maybe figure out how to create a long-running background process that does not rely on timers and callbacks at all? It seems that underneath, we already have that mechanism as both timer callbacks and worker functions can coexist and are scheduled onto the same system thread already.

ADD-SP · 2023-02-01T07:10:59Z

using timer.every

I have seen @ADD-SP's comments about long-running timers potentially causing memory leaks. I'd like to know more, i.e. what does "long running" really mean and why do such timers cause leaks?

I'm asking because the retry logic completely lives within the handler function and it could be running for a long time, depending on how the queue is configured. One of my goals with the refactoring was to simplify the code so that instead of creating multiple timers to organize polling and retries, there would just be one control flow.

Rather than splitting this up again, can we maybe figure out how to create a long-running background process that does not rely on timers and callbacks at all? It seems that underneath, we already have that mechanism as both timer callbacks and worker functions can coexist and are scheduled onto the same system thread already.

OpenResty's timer has a memory pool (ngx_pool_t), so if some C function (such as a FFI call) was called in the callback of timer, it might get memory from this pool.

But the ngx_pool_t doesn't really free (even if you call ngx_pfree() manually) a small memory block and also doesn't reuse this small memory block. These blocks will be free when this pool was destroyed.

The pool will be destroyed when the timer was destroyed. So a long-running timer may cause memory leak.

hanshuebner · 2023-02-01T07:35:07Z

The pool will be destroyed when the timer was destroyed. So a long-running timer may cause memory leak.

Thank you for the explanation - If I understand correctly, the problem is not really the run time, but the accumulation of allocations from the memory pool that can cause issues in the long run. Is it possible to determine the current size of the pool? Why is the memory pool tied to the timer callback in the first place?

ADD-SP · 2023-02-01T07:47:18Z

Is it possible to determine the current size of the pool?

We can't get the current size of the pool.

Why is the memory pool tied to the timer callback in the first place?

In OpenResty, a timer is also a request, but a fake request, in order to adapt to the Nginx's request processing model (state machine).

For Nginx code style, we usually get memory from the pool of the current request, so maybe some external lib (such as FFI function) will do this thing.

hanshuebner · 2023-02-01T07:54:57Z

In OpenResty, a timer is also a request, but a fake request, in order to adapt to the Nginx's request processing model (state machine).

Again, thank you for the very helpful explanation! Given that, switching from the long-running to timer.every makes a lot of sense, and we may also need implement a queue state machine so that we can run retries in separate invocations of the callback.

spec/01-unit/19-hybrid/03-compat_spec.lua leaks a workspace in ngx.ctx which causes the queue tests to fail.

Co-authored-by: Harry <harrybagdi@gmail.com>

Previously, 3xx would be interpreted to mean success, likely causing log entries to be lost.

No further unresolved are present, but GitHub still indicates "change requested". Dismissing review.

ADD-SP

LGTM

StarlightIbuki · 2023-04-13T06:38:49Z

kong/tools/queue.lua

+  if queue then
+    queue:log_debug("queue exists")
+    -- We always use the latest configuration that we have seen for a queue and handler.
+    queue.handler_conf = handler_conf


Should we flush the queue before updating the conf? Or a previously enqueued item will be handled with new conf.

We want the configuration update to be in effect immediately and affect previously queued items as well as new ones.

Or we could simply bind the conf to the entity.

This does not seems to match our previous discussions, and it alters the behavior significantly from the current implementation. The risk is that the queue name must be chosen carefully under the assumption that config might change from when item was enqueued to when item gets flushed, and such decision should probably be considered more carefully at such a global level.

We could do that, but it would incur a memory cost and make it impossible to get rid of items that are not making progress due to a misconfiguration. Imagine that a customer has set a high max_retry_time and they need to change the log server address because the old address no longer works. They would have no way to get rid of already queues items in that case, other than waiting for the max_retry_time to expire and the entries being deleted.

This does not seems to match our previous discussions, and it alters the behavior significantly from the current implementation.

@dndx Can you explain what you refer to by "This"?

Imagine that a customer has set a high max_retry_time and they need to change the log server address because the old address no longer works. They would have no way to get rid of already queues items in that case, other than waiting for the max_retry_time to expire and the entries being deleted.

It certainly deserves mentioning that because we don't have an explicit configure callback in plugins that would be invoked when a reconfiguration is done, updated plugin parameters would only be put into effect when a new item is queued.

It certainly deserves mentioning that because we don't have an explicit configure callback in plugins that would be invoked when a reconfiguration is done, updated plugin parameters would only be put into effect when a new item is queued.

rate-limiting advanced solves this problem by registering for plugin curd events. The configure handler idea would also benefit at least: 1. exit-transformer, 2. rate-limiting advanced

Are plugin crud events also issued in dbless and hybrid mode?

@hanshuebner in dbless RAL register for configure update

27ascii · 2023-04-14T06:47:05Z

kong/tools/queue.lua

+  -- We've got our first entry from the queue.  Collect more entries until max_coalescing_delay expires or we've collected
+  -- max_batch_size entries to send
+  while entry_count < self.max_batch_size and (now() - data_started) < self.max_coalescing_delay and not ngx.worker.exiting() do
+    ok, err = self.semaphore:wait(((data_started + self.max_coalescing_delay) - now()) / 1000)


@hanshuebner

I've tried the latest version of the queue and I came to note that the queue is processed almost immediately after kong proxied a request, when using max_coalescing_delay=20. This unit is supposed to be seconds but I believe it is used as ms here.
After setting the max_coalescing_delay=20000 the queue is processed after 20s.

When testing the shutdown behavior in that case Kong forcefully stops now. This wasn't the case previously.
At my last review the queue had a named_every timer that handled the shutdown with flushing the queue.

kong/kong/tools/queue.lua

Line 157 in 2f188be

kong.timer:named_every(name, queue.poll_time, function(premature, q)

This seems to be replaced with a single timer to decouple the queue process

kong/kong/tools/queue.lua

Line 135 in 8189ce3

kong.timer:named_at("queue " .. name, 0, function(_, q)

However I don't see where the premature event is handled. The queue is stuck here waiting for the semaphore.

I'm not sure if this is the right place to post this since the PR is already merged.

@27ascii Thank you for testing things out! Now that the PR is merged, new issues would be preferred for problem reports going forward. I'll create internal tickets for the two problems that you've reported above, though, so no need to open issues unless you feel like it.

After setting the max_coalescing_delay=20000 the queue is processed after 20s.

This is a bug that we're going to fix soon.

When testing the shutdown behavior in that case Kong forcefully stops now. [...]
I don't see where the premature event is handled. The queue is stuck here waiting for the semaphore.

Good catch! I need to dig a little deeper here, but right now, the situation is like you describe: If the shutdown grace period is shorter than max_coalescing_delay, data is likely to be lost while waiting for the semaphore. Expect a fix for this as well.

Thank you again!
Hans

I don't see where the premature event is handled.

Just as a side note regarding this: We're now using an at instead of an every timer, which means that there is no need to distinguish whether the handler was invoked normally or in the unlikely edge case that is invoked between when it was scheduled to run (immediately) and a shutdown event arriving before it is invoked. In both cases, the queue needs to be flushed and that happens without further timer involvement. This is also the reason why graceful shutdown now works to some extent - The previous implementation needed to schedule new timers while flushing, which is not possible during shutdown.

@27ascii Fixes in this PR, feel free to comment.

hanshuebner added the pr/do not merge label Jan 26, 2023

pull-request-size bot added the size/XXL label Jan 26, 2023

github-actions bot added core/docs plugins/datadog plugins/http-log plugins/opentelemetry plugins/statsd changelog labels Jan 26, 2023

hanshuebner removed the pr/do not merge label Jan 27, 2023

hanshuebner marked this pull request as ready for review January 27, 2023 08:04

hanshuebner force-pushed the feat/plugin-queue-rework branch from fd6c2d8 to 15e5352 Compare January 27, 2023 12:46

hanshuebner assigned kikito Jan 27, 2023

hanshuebner requested a review from bungle January 27, 2023 14:54

kikito added this to the 3.2 milestone Jan 30, 2023

dndx requested a review from ADD-SP January 30, 2023 11:00

kikito reviewed Jan 30, 2023

View reviewed changes

ADD-SP reviewed Jan 30, 2023

View reviewed changes

kong/tools/queue.lua Outdated Show resolved Hide resolved

kong/tools/queue.lua Outdated Show resolved Hide resolved

kong/tools/queue.lua Outdated Show resolved Hide resolved

kong/tools/queue.lua Outdated Show resolved Hide resolved

kikito reviewed Feb 1, 2023

View reviewed changes

kong/plugins/datadog/schema.lua Outdated Show resolved Hide resolved

oowl mentioned this pull request Feb 1, 2023

fix(plugin): fix batch queue closure function update bug #10090

Closed

github-actions bot added the core/db label Feb 1, 2023

hanshuebner mentioned this pull request Feb 1, 2023

feat(pdk): reimplement plugin queues #10197

Closed

3 tasks

hanshuebner force-pushed the feat/plugin-queue-rework branch from 1d45ce9 to 59a0af3 Compare February 1, 2023 14:56

github-actions bot removed the core/db label Feb 1, 2023

hanshuebner force-pushed the feat/plugin-queue-rework branch from 00b6d95 to ef4f7fa Compare February 1, 2023 15:56

hanshuebner and others added 17 commits April 13, 2023 06:46

Test robustness improvement

e8f36fc

prefix queue name with "queue" in timer name

15980ea

Adjust semaphore count when dropping entries

17f7b88

fix incorrect log message

166875f

reset workspace before running tests

a76f8d1

spec/01-unit/19-hybrid/03-compat_spec.lua leaks a workspace in ngx.ctx which causes the queue tests to fail.

fix ide warning about unused argument

9654208

Update CHANGELOG.md

84c252f

Co-authored-by: Harry <harrybagdi@gmail.com>

Update kong/tools/queue.lua

7b8b094

Co-authored-by: Harry <harrybagdi@gmail.com>

Re-apply queue changes to opentelemetry handler

6c3f92d

Check Queue.enqueue return values

8c6011d

improve message when unsupported parameters are used

7a0dfba

Update CHANGELOG

50e34f6

Adress review comments by hbagdi

36a8596

fix(http-log): interpret http status codes 3xx as failures

2a6c902

Previously, 3xx would be interpreted to mean success, likely causing log entries to be lost.

remove unused variable

b88f2ea

change expected message for after http-log 3xx fix

d1c3a0d

Fix assertion

2fa9c8b

hanshuebner force-pushed the feat/plugin-queue-rework branch from e0be105 to 2fa9c8b Compare April 13, 2023 04:46

Fix test for new initial_retry_delay parameter

583e71e

Fix another test to include new initial_retry_delay parameter

8189ce3

ADD-SP approved these changes Apr 13, 2023

View reviewed changes

hanshuebner merged commit 9df893f into master Apr 13, 2023

hanshuebner deleted the feat/plugin-queue-rework branch April 13, 2023 06:08

StarlightIbuki reviewed Apr 13, 2023

View reviewed changes

StarlightIbuki mentioned this pull request Apr 14, 2023

test(plugin): test for fix of otel reconfigure issue #KAG-1061 #10666

Merged

3 tasks

27ascii reviewed Apr 14, 2023

View reviewed changes

hanshuebner mentioned this pull request Apr 14, 2023

Queue bug fixes and improvements #10679

Merged

3 tasks

hanshuebner mentioned this pull request May 8, 2023

Reworked plugin queues Kong/docs.konghq.com#5088

Merged

2 tasks

Vonter mentioned this pull request May 17, 2023

docs(changelog): fix name of renamed queue module #10884

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(pdk): reimplement plugin queues #10172

feat(pdk): reimplement plugin queues #10172

hanshuebner commented Jan 26, 2023 •

edited

Loading

kikito commented Feb 1, 2023

hanshuebner commented Feb 1, 2023

ADD-SP commented Feb 1, 2023 •

edited

Loading

hanshuebner commented Feb 1, 2023

ADD-SP commented Feb 1, 2023 •

edited

Loading

hanshuebner commented Feb 1, 2023

ADD-SP left a comment

StarlightIbuki Apr 13, 2023

hanshuebner Apr 13, 2023

StarlightIbuki Apr 13, 2023

dndx Apr 13, 2023

hanshuebner Apr 13, 2023

hanshuebner Apr 13, 2023

hanshuebner Apr 13, 2023

StarlightIbuki Apr 14, 2023

hanshuebner Apr 14, 2023

StarlightIbuki Apr 19, 2023

27ascii Apr 14, 2023

hanshuebner Apr 14, 2023

hanshuebner Apr 14, 2023

hanshuebner Apr 14, 2023

feat(pdk): reimplement plugin queues #10172

feat(pdk): reimplement plugin queues #10172

Conversation

hanshuebner commented Jan 26, 2023 • edited Loading

Summary

Checklist

Issue reference

kikito commented Feb 1, 2023

hanshuebner commented Feb 1, 2023

ADD-SP commented Feb 1, 2023 • edited Loading

hanshuebner commented Feb 1, 2023

ADD-SP commented Feb 1, 2023 • edited Loading

hanshuebner commented Feb 1, 2023

ADD-SP left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hanshuebner commented Jan 26, 2023 •

edited

Loading

ADD-SP commented Feb 1, 2023 •

edited

Loading

ADD-SP commented Feb 1, 2023 •

edited

Loading