Verify the memory footprint of KIC 2.0 wrt 1.x #1465

Closed · 2 tasks
mflendrich opened this issue Jun 29, 2021 · 10 comments
Labels: area/perf (Performance Related Issues), priority/high

mflendrich (Contributor) commented Jun 29, 2021

As @hbagdi stated, there is a hypothesis that KIC 2.0 may have different memory usage characteristics, which could break upgrades for users whose container limits are tuned to 1.x.

Acceptance criteria:

  • compare the memory usage of KIC 1.3 alongside 2.0 in a setting with N Services, N Ingresses, and N KongConsumers (N ≈ 10000); a setup sketch follows this list
  • verify that the surprising difference in proxy performance isn't a mistake
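
A rough sketch of one way to run the comparison (the resource names, the dummy httpbin backend, and the ingress.class annotation value are placeholders; run this once against a 1.3 deployment and once against 2.0, then compare the kubectl top output):

$ N=10000
$ for i in $(seq 1 $N); do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: httpbin$i
spec:
  selector:
    app: httpbin
  ports:
  - port: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: httpbin$i
  annotations:
    kubernetes.io/ingress.class: kong
spec:
  rules:
  - http:
      paths:
      - path: /httpbin$i
        pathType: Prefix
        backend:
          service:
            name: httpbin$i
            port:
              number: 80
---
apiVersion: configuration.konghq.com/v1
kind: KongConsumer
metadata:
  name: consumer$i
  annotations:
    kubernetes.io/ingress.class: kong
username: consumer$i
EOF
done

# once the controller settles, record per-container usage:
$ kubectl top po -n kong --containers
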
rainest (Contributor) commented Jul 6, 2021

We use less. End results with 10k each of Ingresses, Services, and KongConsumers are below. Not really sure what's going on with the proxy CPU/RAM consumption; a bit more detail is in test.tar.gz.

2.x as of current next:

$ kubectl top po -n kong --containers
POD                             NAME                 CPU(cores)   MEMORY(bytes)   
ingress-kong-758c8b9f46-zd4lj   ingress-controller   54m          189Mi           
ingress-kong-758c8b9f46-zd4lj   proxy                1m           229Mi

1.3:

$ kubectl top po -n kong --containers
POD                             NAME                 CPU(cores)   MEMORY(bytes)   
ingress-kong-54858f8f54-zzv52   ingress-controller   131m         356Mi           
ingress-kong-54858f8f54-zzv52   proxy                945m         646Mi 
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:32:49Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

Minikube on my laptop, which has a 3.1GHz i5.

@rainest rainest self-assigned this Jul 6, 2021
shaneutt (Contributor) commented Jul 7, 2021

Thanks for digging into that @rainest 👍

I was expecting a significant improvement in CPU/MEM utilization with KIC 2.0, so no surprise there, and I'm happy to see the gains were so high. However, the enormous difference in the proxy caught me off guard: I feel we should try to account for that difference, since we only changed how we use upstream Kong, not upstream itself?

hbagdi (Member) commented Jul 7, 2021

  1. Can we add plugins to the mix and test again?
  2. Can you please verify whether the proxy is actually getting populated with the configuration? The difference between the two is very large, and we should have at least some explanation or hypothesis for why that is happening. A spot-check along the lines of the sketch below would settle it.
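
For example, something along these lines (the Deployment name, Admin API port, and endpoints are the Kong/KIC defaults; this assumes curl is available in the proxy image and jq on the workstation, and /services is paginated, so a full count means walking offset):

# does the proxy report config/memory stats at all?
$ kubectl exec -n kong deploy/ingress-kong -c proxy -- \
    curl -s localhost:8001/status

# first page of services; a non-zero count means config made it in
$ kubectl exec -n kong deploy/ingress-kong -c proxy -- \
    curl -s 'localhost:8001/services?size=1000' | jq '.data | length'
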

rainest (Contributor) commented Jul 8, 2021

Possibly the explanation is that there was no config in Kong when I initially checked :)

It looks like 2.x is stuck adding finalizers before generating config. It's been in this state for quite some time:

time="2021-07-07T15:31:29Z" level=info msg="reconciling resource" CoreV1Service="{\"Namespace\":\"default\",\"Name\":\"httpbin7243\"}" logger=controllers.Service name=httpbin7243 namespace=default
time="2021-07-07T15:31:29Z" level=info msg="reconciling resource" KongV1KongConsumer="{\"Namespace\":\"default\",\"Name\":\"consumer1567\"}" logger=controllers.KongConsumer name=consumer1567 namespace=default
time="2021-07-07T15:31:29Z" level=info msg="updating the proxy with new Service" CoreV1Service="{\"Namespace\":\"default\",\"Name\":\"httpbin7243\"}" logger=controllers.Service name=httpbin7243 namespace=default
10:56:41-0500 yagody $ kubectl get po -n kong
NAME                            READY   STATUS    RESTARTS   AGE
ingress-kong-758c8b9f46-qzqxx   1/2     Running   10         73m

It looks like it maybe finished shortly after (unclear--logs appeared to still have events, but it looks like it pushed a non-empty config) and then sent a config that overflowed Kong's cache size!

hbagdi (Member) commented Jul 8, 2021

> It looks like it maybe finished shortly after (unclear--logs appeared to still have events, but it looks like it pushed a non-empty config) and then sent a config that overflowed Kong's cache size!

Umm, can we fix the cache size and observe the footprint once things have settled?

This raises another question: after creating this many resources, how long does it take for 1.x vs 2.x to reach a steady state again?

rainest (Contributor) commented Jul 8, 2021

Added plugins (one per consumer), not much difference. Proxy usage remains larger on 1.3, but there's not a clear reason why.

Confirmed that we had at least one successful config POST. This runs up against the practical limits of DB-less mode (at least on my machine), and this much config is prone to issues with timer exhaustion and/or the NGINX process getting killed out of the blue by a kworker (I do not know why: I am not imposing limits, am not out of memory, and do not see obvious explanations in minikube service, kernel, Docker, or Pod event logs). The proxy container remains around long enough to receive a non-empty config and report status. I can get individual services to confirm the config isn't empty, but not all at once because of pagination. Trying to GET /config reliably results in NGINX dying.

Whatever is using memory isn't accounted for in status, which is only reporting 100s of MB versus the GBs in use. If we want to dig into that more, I can try to retrieve and analyze core files, but I don't know that it's worth the effort, especially since usage is less on 2.x.

> Umm, can we fix the cache size and observe the footprint once things have settled?

Well, apparently not. I have the cache size set to 4GB, and instability happens regardless. DB-less 🤷
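
(For reference, the knob here is Kong's mem_cache_size, passed to the proxy container as an environment variable; roughly, either directly in the Deployment or via the Helm chart's env map, if the chart is in use:)

# proxy container env in the Deployment:
        - name: KONG_MEM_CACHE_SIZE
          value: "4g"

# or with the Helm chart:
$ helm upgrade kong kong/kong --reuse-values --set env.mem_cache_size=4g
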

That all said, I'm not too concerned with this from the controller memory usage perspective. It shouldn't care about proxy memory usage or instability, as its memory usage should be largely:

  • Go memory structures for Kubernetes resources the controller ingests.
  • Go memory structures for Kong resources built from Kubernetes resources, rebuilt and discarded each sync loop (we don't cache these at all, correct?)
  • Config blobs (rebuilt and discarded each sync loop, also sent to the 2.x status updater)

The controller builds all of these regardless of whether it can send them to Kong. We don't expect memory usage to vary based on successful completion of the config POSTs, do we? I suppose the ephemeral structures may remain around for longer, but those are (a) apparently not the bulk of usage (monitoring is kinda limited without Prometheus, but a basic "top containers" loop, shown below, doesn't show much fluctuation during updates) and (b) shared between 1.x and 2.x (they both use the same parser and Kong config blob generation code).
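
(The "top containers" loop in question is nothing fancier than the following; crude, but enough to spot large swings without a Prometheus stack:)

$ while true; do date; kubectl top po -n kong --containers; sleep 10; done
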

Non-memory concerns appear more pressing: it looks like 2.x has a flaw in its config generation/hash comparison logic, as it's sending config updates without K8S resource changes. I've retrieved some pcaps to try and pull config details from them.

run_2.md
test2.tar.gz

rainest (Contributor) commented Jul 8, 2021

Alright, the conclusion is that trying to gather pcaps with this large a config is futile; things are getting cut off for some reason.

Smaller ones would probably demonstrate the same issue, but it's a pain to collect them, so moving that to #1519.

hbagdi (Member) commented Jul 9, 2021

To clarify, at this point I'm not concerned about the ingress-controller container footprints, which is the scope of the issue at hand. Feel free to close this one.

I want to understand why 1.3 and 2.0 result in such a large difference in Kong's footprint (229Mi vs 646Mi).
Also, I'm assuming that the Kong version is the same between the two.
An alternative here could be to reduce the amount of configuration so that it fits in Kong's cache and then observe the memory usage difference between 1.3 and 2.0. The reason I'm so focused on the proxy's memory usage is that I want to make sure this difference is not highlighting another (currently unknown) problem.

shaneutt (Contributor) commented Jul 9, 2021

@rainest: Tangentially related to this thread, fixes for the finalizer logic have been added in #1522. I would try testing 2.x again off that branch (or after it merges), as I expect it will speed up the finalizer processing you mentioned was slow.

rainest (Contributor) commented Jul 9, 2021

Agreed that the difference in memory could indicate some other issue (something incorrect in the configuration). Direct comparison of the JSON blobs for #1519 should indicate what those differences are and whether they're problematic. A rough approach is sketched below.
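
One rough way to do that comparison, assuming GET /config survives on a smaller resource count and curl is available in the proxy image (jq -S only normalizes key order for diffing; if the endpoint wraps the blob in a config field, extract it with jq -r .config first):

# against the 1.3 deployment:
$ kubectl exec -n kong deploy/ingress-kong -c proxy -- \
    curl -s localhost:8001/config | jq -S . > config-1.3.json
# against the 2.x deployment:
$ kubectl exec -n kong deploy/ingress-kong -c proxy -- \
    curl -s localhost:8001/config | jq -S . > config-2.x.json
$ diff config-1.3.json config-2.x.json | less
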

If there's nothing obvious there, we may need to engage the core team for better memory analysis tools, since the bulk of the memory appears to be allocated to a nether zone that standard tools do not report on, and it appears to fluctuate considerably (although 2.x appears to result in consistently less proxy usage, IIRC both 1.x and 2.x usage per Pod could vary by up to a GB versus other runs with the same controller version).

@rainest rainest closed this as completed Jul 9, 2021