
Analyze 2.x config update frequency and fluctuation #1519

Closed
3 tasks done
rainest opened this issue Jul 8, 2021 · 8 comments · Fixed by #1612
rainest (Contributor) commented Jul 8, 2021

Follow up from #1465 (comment)

During memory testing, I observed that 2.x both posts config more often than expected (even without Kubernetes updates; 1.x did not re-post config when given the same set of Kubernetes resources) and exhibits considerable (multiple-kB) unexplained fluctuations in config size (unclear whether this also happened with 1.x, since it wasn't updating as frequently). The frequency is worth investigating, as frequent DB-less config posts result in proxy instability.

Gathering large config blobs over the network is difficult, so this will probably wait on #1308 or a partial debug implementation instead (no proper flag support, nothing beyond "write the blob to disk").
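
For reference, a minimal sketch of what that "write blob to disk" stopgap could look like; the environment variable, package name, and call site are hypothetical, not the actual #1308 design:

// debugdump is a hypothetical helper: write the generated config blob to disk
// when an (equally hypothetical) environment variable points at a directory.
package debugdump

import (
	"fmt"
	"os"
	"time"
)

// MaybeDumpConfig is a no-op unless CONTROLLER_DUMP_CONFIG_DIR is set, in which
// case it writes the serialized declarative config to a timestamped file.
func MaybeDumpConfig(blob []byte) error {
	dir := os.Getenv("CONTROLLER_DUMP_CONFIG_DIR")
	if dir == "" {
		return nil
	}
	path := fmt.Sprintf("%s/config-%d.json", dir, time.Now().UnixNano())
	return os.WriteFile(path, blob, 0o600)
}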

Acceptance criteria:

  • the root cause of the configurations varying in size has been identified
  • if any bugs are found, they are written up and marked as blockers for 2.0 GA
  • a regression test exists (or is planned in the issue created as part of the criterion above) for the bug if found
ccfishk (Contributor) commented Jul 8, 2021

For memory usage, the Prometheus tooling in #1520 provides metrics via the load balancer external IP and has a simple GUI. We also need to launch this when running the performance test cases.
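
(For completeness, a minimal sketch of how a Go process exposes metrics for that kind of scraping, using prometheus/client_golang; the port is an arbitrary assumption and this is not the controller's actual instrumentation:)

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// The default registry already includes Go runtime and process collectors,
	// which cover the memory figures we care about during performance tests.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}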

ccfishk (Contributor) commented Jul 14, 2021

Currently, our defined update frequency is every 3 seconds: https://github.com/Kong/kubernetes-ingress-controller/blob/next/railgun/internal/proxy/proxy.go#L26

mflendrich added the "bug" and "needs investigation" labels Jul 27, 2021
rainest (Contributor, Author) commented Jul 30, 2021

A new round of testing (using services.yaml, conplugin.yaml, and ingresses.yaml from the previous test) indicates that the difference in config blob size between v1 and v2 may have been a misread. Config dumps actually end up being identical for each, and eventually reach a state where they do not fluctuate.

configs.tar.gz is uninteresting because all the configs are identical, but the "a" configs are from v2 and the "b" configs are from v1. They have the full set of expected resources. Some upstreams are missing targets because minikube legitimately fell over and isn't populating Endpoints for everything properly, and/or httpbin died (note to self: set enableServiceLinks: false in the Pod spec for Pods when you create 10000 Services, otherwise Kubernetes creates 10000 env vars inside the container and apparently lots of things don't like that).
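
(As an aside on that note-to-self, the relevant field is PodSpec.enableServiceLinks; expressed here via the client-go types purely as an illustration, with a hypothetical httpbin Pod:)

package loadtest

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// httpbinPod builds a Pod with service links disabled, so creating 10000
// Services in the namespace doesn't inject 10000 env vars into the container.
func httpbinPod() *corev1.Pod {
	enableServiceLinks := false
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "httpbin"},
		Spec: corev1.PodSpec{
			EnableServiceLinks: &enableServiceLinks,
			Containers: []corev1.Container{
				{Name: "httpbin", Image: "kennethreitz/httpbin"},
			},
		},
	}
}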

The previously observed fluctuation appears to be related to slow reconciliation. The current proxy loop means that a large number of new resources (such as you have when starting with a large number of extant resources) results in a significant wait before config stabilizes. Concurrency hypotheticals are still a bit 🤔, but I believe the points below explain what happens with the channel handling here:

  • Cache changes and proxy updates are effectively single-threaded (see the sketch after this list). We can only process one or the other at a time, since the select will only choose one case at once, process it to completion, and then choose the next. Not 100% sure on this--the spec isn't clear on whether it can immediately go back and start handling another case once it chooses one and begins execution, but I'm guessing no.
  • Cache adds are internal and fast-ish (I don't have detailed profiling, so this is a guess), but not instant. If you already have more than DefaultObjectBufferSize resources, you'll back up several times and get repeated "too busy" messages. Unclear on the value of this--I think we put it in place originally with the idea that reconciles were synchronous (same as finalizers), but that's not really true; instead we're just generating config with whatever the current contents of the cache are when we hit the sync ticker.
  • The default proxy client timeout is way too short and exacerbates things, since many config attempts are just aborted. It needs to be longer than 3s, full stop. There appears to be another timer at play somewhere (edit: yes, we hardcode a 10-second limit that will fire before the client timeout if the latter is longer), since I'm getting timeouts more quickly than I'd expect when setting --proxy-timeout-seconds 500. The timeouts combined with frequent default ticks apparently exacerbate proxy instability.
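
A sketch of the loop shape described in the first bullet; the type, field, and function names are illustrative stand-ins, not the actual internal/proxy code:

package proxyloop

import (
	"context"
	"time"
)

// proxyLoop is a stand-in for the controller's proxy server. One goroutine
// owns the select, so cache changes and config syncs are serialized: each
// chosen case runs to completion before the next is picked.
type proxyLoop struct {
	updates      chan interface{}            // object add/update/delete events from reconcilers
	syncInterval time.Duration               // currently defaults to 3s
	applyToCache func(interface{})           // fast-ish in-memory cache mutation
	syncToKong   func(context.Context) error // slow for large configs (POST /config)
}

func (p *proxyLoop) run(ctx context.Context) {
	ticker := time.NewTicker(p.syncInterval)
	defer ticker.Stop()
	for {
		select {
		case obj := <-p.updates:
			p.applyToCache(obj) // while this runs, the updates channel backs up ("too busy")
		case <-ticker.C:
			_ = p.syncToKong(ctx) // while this runs, no cache updates are processed
		case <-ctx.Done():
			return
		}
	}
}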

IIRC 1.x just does ALL THE RESOURCES at once, so it doesn't exhibit this and gets to the full config size immediately. Unsure if there's a good reason to keep the 2.x behavior as-is--there's no reason to back off adding stuff to the cache, since those changes aren't directly tied to updates. There may be reason to back off actual syncs, but we should be able to continuously add resources to the cache and just use the full contents whenever we next sync.

Other observations:

  • 10k Ingresses/Services/Consumers/Plugins still appears to hit practical limitations of DB-less mode on my machine, without much clear explanation. The only obvious limitation is timer exhaustion due to many upstreams, which doesn't really have any resolution other than "use fewer upstreams". The other cases, where the process randomly terminates itself and the POST /config 500s with no tracebacks, remain a mystery.
  • The reconciling and updating proxy logs are spammy as all hell at startup. We should probably aggregate those over a 60s period or so and then display only the first 5, with an "and N more" footer if there were more than that (see the sketch after this list). If we want guaranteed individual messages, those should go down to debug.
  • Endpoints in particular appear to get a number of updates that we probably don't care about--it looks like these are used for some non-Service endpoints as well. Minikube at least has some sort of lease thing that updates every 2s. The upshot is that we basically mark an update before every sync tick, and that limiter never does anything.
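
A rough sketch of the aggregation idea from the second bullet (purely illustrative; no such aggregator exists in the controller today):

package logagg

import (
	"fmt"
	"time"
)

// window batches repetitive messages and flushes a summary every interval:
// the first few messages verbatim, then an "and N more" footer.
type window struct {
	interval time.Duration
	maxShown int
	buf      []string
}

func (w *window) run(in <-chan string, out func(string)) {
	ticker := time.NewTicker(w.interval)
	defer ticker.Stop()
	for {
		select {
		case msg, ok := <-in:
			if !ok {
				return
			}
			w.buf = append(w.buf, msg)
		case <-ticker.C:
			shown := w.buf
			if len(shown) > w.maxShown {
				shown = shown[:w.maxShown]
			}
			for _, m := range shown {
				out(m)
			}
			if extra := len(w.buf) - len(shown); extra > 0 {
				out(fmt.Sprintf("... and %d more", extra))
			}
			w.buf = w.buf[:0]
		}
	}
}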

shaneutt (Contributor) commented Jul 30, 2021

> Cache changes and proxy updates are effectively single-threaded

Internally, yes. The server thread is either:

  1. processing inbound update channels, or
  2. firing off updates to the Kong admin API.

> We can only process one or the other at a time, since the select will only choose one case at once, process it to completion, and then choose the next. Not 100% sure on this--the spec isn't clear on whether it can immediately go back and start handling another case once it chooses one and begins execution, but I'm guessing no.

Correct, and this was the intentional design (but if we're seeing problems with it, let's change it).

> Cache adds are internal and fast-ish (I don't have detailed profiling, so this is a guess), but not instant. If you already have more than DefaultObjectBufferSize resources, you'll back up several times and get repeated "too busy" messages.
> Unclear on the value of this--I think we put it in place originally with the idea that reconciles were synchronous (same as finalizers), but that's not really true; instead we're just generating config with whatever the current contents of the cache are when we hit the sync ticker.

Part of the way things work in this regard has to do with intentionally trying to make updates to the Kong admin API itself single-threaded, to avoid state loss from concurrent API updates in DB-less mode.
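
(A sketch of the guarantee being described: whatever else changes, the POST of declarative config should stay single-flight. Names are illustrative, not the actual sendconfig code:)

package updater

import (
	"context"
	"sync"
)

// serializedUpdater keeps the declarative config push single-flight even if
// callers become concurrent, which is the state-loss concern for DB-less mode.
type serializedUpdater struct {
	mu     sync.Mutex
	update func(ctx context.Context, config []byte) error
}

func (s *serializedUpdater) Apply(ctx context.Context, config []byte) error {
	s.mu.Lock() // only one POST /config at a time
	defer s.mu.Unlock()
	return s.update(ctx, config)
}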

> The default proxy client timeout is way too short and exacerbates things, since many config attempts are just aborted. It needs to be longer than 3s, full stop. There appears to be another timer at play somewhere (edit: yes, we hardcode a 10-second limit that will fire before the client timeout if the latter is longer), since I'm getting timeouts more quickly than I'd expect when setting --proxy-timeout-seconds 500. The timeouts combined with frequent default ticks apparently exacerbate proxy instability.

I've run into this while doing work on GKE; I have another PR where this is increased:

e060741

Ultimately I think the fact that it was set to 3 seconds was an accident: it was just re-using another unrelated variable (the proxy sync seconds) as its default.
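
In other words, the sync cadence and the admin client timeout want independent defaults; a sketch with made-up flag names and values, not the #1610 change itself:

package config

import (
	"flag"
	"time"
)

// Hypothetical flag wiring, only to make the point that the two knobs are unrelated.
const (
	defaultProxySyncInterval  = 3 * time.Second  // how often we consider pushing config
	defaultAdminClientTimeout = 30 * time.Second // how long a single POST /config may take
)

func registerFlags(fs *flag.FlagSet) (syncInterval, clientTimeout *time.Duration) {
	syncInterval = fs.Duration("proxy-sync-interval", defaultProxySyncInterval,
		"interval between proxy configuration syncs")
	clientTimeout = fs.Duration("proxy-timeout", defaultAdminClientTimeout,
		"timeout for requests to the Kong admin API")
	return syncInterval, clientTimeout
}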

Given your report and need for this change as well, I've cherry-picked it out into its own PR: #1610

> IIRC 1.x just does ALL THE RESOURCES at once, so it doesn't exhibit this and gets to the full config size immediately. Unsure if there's a good reason to keep the 2.x behavior as-is--there's no reason to back off adding stuff to the cache, since those changes aren't directly tied to updates. There may be reason to back off actual syncs, but we should be able to continuously add resources to the cache and just use the full contents whenever we next sync.

Given your findings it seems like we need to make some kind of change; the simplest one that fits your findings would be to remove the backpressure mechanism and make the proxy.UpdateObject and proxy.DeleteObject methods influence the cache directly. This change will mean that updates go faster and don't block, but by its nature it also means that reconciliation will effectively be back to having no idea when things are going wrong with the backend proxy. Let me know your thoughts.
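
A sketch of that shape, if it helps picture it; names and structure are illustrative, not the #1612 implementation:

package store

import "sync"

// cache is the "no backpressure" shape: UpdateObject and DeleteObject mutate a
// shared cache directly and never block, and the sync loop simply snapshots
// whatever is in the cache at each tick.
type cache struct {
	mu      sync.RWMutex
	objects map[string]interface{} // keyed by namespace/name, for example
}

func (c *cache) UpdateObject(key string, obj interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.objects[key] = obj
}

func (c *cache) DeleteObject(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.objects, key)
}

// Snapshot is what the periodic sync would read to build the full config.
func (c *cache) Snapshot() map[string]interface{} {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[string]interface{}, len(c.objects))
	for k, v := range c.objects {
		out[k] = v
	}
	return out
}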

I'm working on some alternative implementations, will update soon.

shaneutt (Contributor) commented
@rainest I've created #1612 to remove the backpressure system to improve performance.

Between #1610 and #1612 I expect these will solve the problems you found in your investigations and resolve this issue; let me know if that sounds right to you and what your testing results are.

shaneutt linked a pull request Jul 30, 2021 that will close this issue
rainest (Contributor, Author) commented Jul 30, 2021

Testing with #1612 clears the weirdness around update batching/blocking. All existing resources insert into the controller cache at start. Then, with some additional changes to bump a timeout (which we plan to change elsewhere) and to add some logging for clarity:

diff --git a/internal/sendconfig/common_workflows.go b/internal/sendconfig/common_workflows.go
index 8f57c295..dab4129a 100644
--- a/internal/sendconfig/common_workflows.go
+++ b/internal/sendconfig/common_workflows.go
@@ -66,13 +66,15 @@ func UpdateKongAdminSimple(ctx context.Context,
        }
 
        // apply the configuration update in Kong
-       timedCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
+       timedCtx, cancel := context.WithTimeout(ctx, 500*time.Second)
        defer cancel()
+       deprecatedLogger.Info("sending config update")
        configSHA, err := PerformUpdate(timedCtx,
                deprecatedLogger, &kongConfig,
                kongConfig.InMemory, enableReverseSync,
                targetConfig, kongConfig.FilterTags, nil, lastConfigSHA, false,
        )
+       deprecatedLogger.Info("update done or failed")
        if err != nil {
                if diagnostic != (util.ConfigDumpDiagnostic{}) {
                        select {

Initial apply and apply with a change to one route took ~5m to complete, with workers only using 20% of a core each throughout:

time="2021-07-30T18:17:27Z" level=info msg="sending config update" subsystem=proxy-cache-resolver
time="2021-07-30T18:22:00Z" level=info msg="successfully synced configuration to kong." subsystem=proxy-cache-resolver
time="2021-07-30T18:22:00Z" level=info msg="update done or failed" subsystem=proxy-cache-resolver

2021/07/30 18:17:32 [notice] 23#0: *5 [lua] init.lua:260: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1 HTTP/1.1", host: "127.0.0.1:8444"
2021/07/30 18:17:32 [notice] 23#0: *5 [lua] init.lua:260: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1 HTTP/1.1", host: "127.0.0.1:8444"
127.0.0.1 - - [30/Jul/2021:18:22:00 +0000] "POST /config?check_hash=1 HTTP/1.1" 201 294672 "-" "Go-http-client/1.1"

time="2021-07-30T18:25:18Z" level=info msg="sending config update" subsystem=proxy-cache-resolver
time="2021-07-30T18:30:47Z" level=info msg="successfully synced configuration to kong." subsystem=proxy-cache-resolver
time="2021-07-30T18:30:47Z" level=info msg="update done or failed" subsystem=proxy-cache-resolver

2021/07/30 18:25:23 [notice] 23#0: *323008 [lua] init.lua:260: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1 HTTP/1.1", host: "127.0.0.1:8444"
2021/07/30 18:25:23 [notice] 23#0: *323008 [lua] init.lua:260: purge(): [DB cache] purging (local) cache, client: 127.0.0.1, server: kong_admin, request: "POST /config?check_hash=1 HTTP/1.1", host: "127.0.0.1:8444"
2021/07/30 18:25:27 [error] 24#0: in lua semaphore gc wait queue is not empty while the semaphore 00007F6BB52E47E8 is being destroyed
127.0.0.1 - - [30/Jul/2021:18:30:47 +0000] "POST /config?check_hash=1 HTTP/1.1" 201 245520 "-" "Go-http-client/1.1"

So that needs attention upstream, but there's nothing more we can do about it on our side; the performance concerns here are addressed: our goal is to reach a stable config quickly so we don't send excess updates.

Removal of the backpressure mechanism moots the concern about endpoint updates interfering with the in-controller update checker. Logs are still noisy but that's not a performance problem.

rainest closed this as completed Jul 30, 2021
rainest (Contributor, Author) commented Aug 6, 2021

perf.tar.gz collects profiling data with no config (run 1), partway through adding config (runs 2 and 3), and after all config has been applied (run 4) in DB-backed mode, with 25000 each of Ingresses, Services, Consumers, and Plugins:

15:29:30-0700 yagody $ kubectl top po -n kong --containers
POD                             NAME                 CPU(cores)   MEMORY(bytes)   
ingress-kong-75b949d647-xvjxw   ingress-controller   1648m        1405Mi          
ingress-kong-75b949d647-xvjxw   proxy                1m           389Mi           
postgres-0                      postgres             2m           69Mi  

I'm not great at interpreting pprof data, though there doesn't seem to be too much of interest other than FillConsumersAndCredentials being particularly memory-inefficient.
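
(For anyone reproducing this, the standard way to collect these profiles, assuming a net/http/pprof endpoint is available somewhere; this is a generic sketch, not the controller's diagnostics wiring:)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose heap/CPU profiles on a side port; fetch them with, for example:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}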

ccfishk (Contributor) commented Aug 6, 2021

In DB-backed mode we have the slow path, which does several rounds of json.Marshal and json.Unmarshal (https://github.com/Kong/kubernetes-ingress-controller/blob/main/internal/sendconfig/sendconfig.go#L83), which is very suspicious.
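
For context, the round-trip pattern in question looks roughly like this (schematic only, not the actual sendconfig.go code); every pass re-allocates the entire config, which adds up with tens of thousands of entities:

package slowpath

import "encoding/json"

// roundTrip shows the shape of the suspected slow path: serialize the whole
// declarative config, then immediately deserialize it into a generic map
// (and, in the real code, possibly repeat).
func roundTrip(config interface{}) (map[string]interface{}, error) {
	raw, err := json.Marshal(config)
	if err != nil {
		return nil, err
	}
	var asMap map[string]interface{}
	if err := json.Unmarshal(raw, &asMap); err != nil {
		return nil, err
	}
	return asMap, nil
}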
