Verify the memory footprint of KIC 2.0 wrt 1.x #1465

Closed · 2 tasks
mflendrich opened this issue Jun 29, 2021 · 10 comments
Labels: area/perf (Performance Related Issues), priority/high

mflendrich (Contributor) commented Jun 29, 2021

As @hbagdi stated, there is a hypothesis that KIC 2.0 may have different memory usage characteristics, which could break upgrades for users whose container limits are tuned to 1.x.

Acceptance criteria:

  • compare the memory usage of KIC 1.3 alongside 2.0 in a setting with N Services, N Ingresses, and N KongConsumers (N ≈ 10000); a setup sketch follows this list
  • verify that the surprising difference in proxy performance isn't a mistake
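
A rough sketch of one way to run the comparison (the resource names, the dummy httpbin backend, and the ingress.class annotation value are placeholders; run this once against a 1.3 deployment and once against 2.0, then compare the kubectl top output):

$ N=10000
$ for i in $(seq 1 $N); do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: httpbin$i
spec:
  selector:
    app: httpbin
  ports:
  - port: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: httpbin$i
  annotations:
    kubernetes.io/ingress.class: kong
spec:
  rules:
  - http:
      paths:
      - path: /httpbin$i
        pathType: Prefix
        backend:
          service:
            name: httpbin$i
            port:
              number: 80
---
apiVersion: configuration.konghq.com/v1
kind: KongConsumer
metadata:
  name: consumer$i
  annotations:
    kubernetes.io/ingress.class: kong
username: consumer$i
EOF
done

# once the controller settles, record per-container usage:
$ kubectl top po -n kong --containers
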
rainest (Contributor) commented Jul 6, 2021

We use less. End results with 10k each of Ingresses, Services, and KongConsumers are below. Not really sure what's going on with the proxy CPU/RAM consumption; a bit more detail is in test.tar.gz.

2.x as of current next:

$ kubectl top po -n kong --containers
POD                             NAME                 CPU(cores)   MEMORY(bytes)   
ingress-kong-758c8b9f46-zd4lj   ingress-controller   54m          189Mi           
ingress-kong-758c8b9f46-zd4lj   proxy                1m           229Mi

1.3:

$ kubectl top po -n kong --containers
POD                             NAME                 CPU(cores)   MEMORY(bytes)   
ingress-kong-54858f8f54-zzv52   ingress-controller   131m         356Mi           
ingress-kong-54858f8f54-zzv52   proxy                945m         646Mi 
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:32:49Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

Minikube on my laptop, which has a 3.1GHz i5.

@rainest rainest self-assigned this Jul 6, 2021
shaneutt (Contributor) commented Jul 7, 2021

Thanks for digging into that @rainest 👍

I was expecting a significant improvement in CPU/MEM utilization with KIC 2.0, so no surprise there, and I'm happy to see the gains were so high. However, the enormous difference in the proxy caught me off guard: I feel we should try to account for that difference, since we only changed how we use upstream Kong, not upstream itself?

hbagdi (Member) commented Jul 7, 2021

  1. Can we add plugins to the mix and test again?
  2. Can you please verify whether the proxy is actually getting populated with the configuration? The difference between the two is very large, and we should have at least some explanation or hypothesis for why that is happening. A spot-check along the lines of the sketch below would settle it.
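
For example, something along these lines (the Deployment name, Admin API port, and endpoints are the Kong/KIC defaults; this assumes curl is available in the proxy image and jq on the workstation, and /services is paginated, so a full count means walking offset):

# does the proxy report config/memory stats at all?
$ kubectl exec -n kong deploy/ingress-kong -c proxy -- \
    curl -s localhost:8001/status

# first page of services; a non-zero count means config made it in
$ kubectl exec -n kong deploy/ingress-kong -c proxy -- \
    curl -s 'localhost:8001/services?size=1000' | jq '.data | length'
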

rainest (Contributor) commented Jul 8, 2021

Possibly the explanation is that there was no config in Kong when I initially checked :)

It looks like 2.x is stuck adding finalizers before generating config. It's been in this state for quite some time:

time="2021-07-07T15:31:29Z" level=info msg="reconciling resource" CoreV1Service="{\"Namespace\":\"default\",\"Name\":\"httpbin7243\"}" logger=controllers.Service name=httpbin7243 namespace=default
time="2021-07-07T15:31:29Z" level=info msg="reconciling resource" KongV1KongConsumer="{\"Namespace\":\"default\",\"Name\":\"consumer1567\"}" logger=controllers.KongConsumer name=consumer1567 namespace=default
time="2021-07-07T15:31:29Z" level=info msg="updating the proxy with new Service" CoreV1Service="{\"Namespace\":\"default\",\"Name\":\"httpbin7243\"}" logger=controllers.Service name=httpbin7243 namespace=default
10:56:41-0500 yagody $ kubectl get po -n kong
NAME                            READY   STATUS    RESTARTS   AGE
ingress-kong-758c8b9f46-qzqxx   1/2     Running   10         73m

It looks like it maybe finished shortly after (unclear--logs appeared to still have events, but it looks like it pushed a non-empty config) and then sent a config that overflowed Kong's cache size!

hbagdi (Member) commented Jul 8, 2021

> It looks like it maybe finished shortly after (unclear--logs appeared to still have events, but it looks like it pushed a non-empty config) and then sent a config that overflowed Kong's cache size!

Umm, can we fix the cache size and observe the footprint once things have settled?

This raises another question: after creating this many resources, how long does it take for 1.x vs 2.x to reach a steady state again?

rainest (Contributor) commented Jul 8, 2021

Added plugins (one per consumer), not much difference. Proxy usage remains larger on 1.3, but there's not a clear reason why.

Confirmed that we had at least one successful config POST. This runs up against the practical limits of DB-less mode (at least on my machine), and this much config is prone to issues with timer exhaustion and/or the NGINX process getting killed out of the blue by a kworker (I do not know why: I am not imposing limits, am not out of memory, and do not see obvious explanations in minikube service, kernel, Docker, or Pod event logs). The proxy container remains around long enough to receive a non-empty config and report status. I can get individual services to confirm the config isn't empty, but not all at once because of pagination. Trying to GET /config reliably results in NGINX dying.

Whatever is using memory isn't accounted for in status, which is only reporting 100s of MB versus the GBs in use. If we want to dig into that more, I can try to retrieve and analyze core files, but I don't know that it's worth the effort, especially since usage is less on 2.x.

> Umm, can we fix the cache size and observe the footprint once things have settled?

Well, apparently not. I have the cache size set to 4GB, and instability happens regardless. DB-less 🤷
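
(For reference, the knob here is Kong's mem_cache_size, passed to the proxy container as an environment variable; roughly, either directly in the Deployment or via the Helm chart's env map, if the chart is in use:)

# proxy container env in the Deployment:
        - name: KONG_MEM_CACHE_SIZE
          value: "4g"

# or with the Helm chart:
$ helm upgrade kong kong/kong --reuse-values --set env.mem_cache_size=4g
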

That all said, I'm not too concerned with this from the controller memory usage perspective. It shouldn't care about proxy memory usage or instability, as its memory usage should be largely:

  • Go memory structures for Kubernetes resources the controller ingests.
  • Go memory structures for Kong resources built from Kubernetes resources, rebuilt and discarded each sync loop (we don't cache these at all, correct?)
  • Config blobs (rebuilt and discarded each sync loop, also sent to the 2.x status updater)

The controller builds all of these regardless of whether it can send them to Kong. We don't expect memory usage to vary based on successful completion of the config POSTs, do we? I suppose the ephemeral structures may remain around for longer, but those are (a) apparently not the bulk of usage (monitoring is kinda limited without Prometheus, but a basic "top containers" loop, shown below, doesn't show much fluctuation during updates) and (b) shared between 1.x and 2.x (they both use the same parser and Kong config blob generation code).
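
(The "top containers" loop in question is nothing fancier than the following; crude, but enough to spot large swings without a Prometheus stack:)

$ while true; do date; kubectl top po -n kong --containers; sleep 10; done
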

Non-memory concerns appear more pressing: it looks like 2.x has a flaw in its config generation/hash comparison logic, as it's sending config updates without K8S resource changes. I've retrieved some pcaps to try and pull config details from them.

run_2.md
test2.tar.gz

rainest (Contributor) commented Jul 8, 2021

Alright, the conclusion is that trying to gather pcaps with this large a config is futile; things are getting cut off for some reason.

Smaller ones would probably demonstrate the same issue, but it's a pain to collect them, so moving that to #1519.

hbagdi (Member) commented Jul 9, 2021

To clarify, at this point I'm not concerned about the ingress-controller container footprints, which is the scope of the issue at hand. Feel free to close this one.

I want to understand why 1.3 and 2.0 result in such a large difference in Kong's footprint (229Mi vs 646Mi).
Also, I'm assuming that the Kong version is the same between the two.
An alternative here could be to reduce the amount of configuration so that it fits in Kong's cache and then observe the memory usage difference between 1.3 and 2.0. The reason I'm so focused on the proxy's memory usage is that I want to make sure this difference is not highlighting another (currently unknown) problem.

shaneutt (Contributor) commented Jul 9, 2021

@rainest: Tangentially related to this thread, fixes for the finalizer logic have been added in #1522. I would try testing 2.x again off that branch (or after it merges), as I expect it will speed up the finalizer processing you mentioned was slow.

rainest (Contributor) commented Jul 9, 2021

Agreed that the difference in memory could indicate some other issue (something incorrect in the configuration). Direct comparison of the JSON blobs for #1519 should indicate what those differences are and whether they're problematic. A rough approach is sketched below.
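
One rough way to do that comparison, assuming GET /config survives on a smaller resource count and curl is available in the proxy image (jq -S only normalizes key order for diffing; if the endpoint wraps the blob in a config field, extract it with jq -r .config first):

# against the 1.3 deployment:
$ kubectl exec -n kong deploy/ingress-kong -c proxy -- \
    curl -s localhost:8001/config | jq -S . > config-1.3.json
# against the 2.x deployment:
$ kubectl exec -n kong deploy/ingress-kong -c proxy -- \
    curl -s localhost:8001/config | jq -S . > config-2.x.json
$ diff config-1.3.json config-2.x.json | less
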

If there's nothing obvious there, we may need to engage the core team for better memory analysis tools, since the bulk of the memory appears to be allocated to a nether zone that standard tools do not report on, and it appears to fluctuate considerably (although 2.x appears to result in consistently less proxy usage, IIRC both 1.x and 2.x usage per Pod could vary by up to a GB versus other runs with the same controller version).

@rainest rainest closed this as completed Jul 9, 2021