Analyze 2.x config update frequency and fluctuation #1519
For memory usage, the Prometheus tooling (#1520) provides metrics, accessed through the load balancer's external IP, and has a simple GUI. We also need to launch this when running performance test cases.
Currently, our defined update frequency is every 3 seconds: https://github.com/Kong/kubernetes-ingress-controller/blob/next/railgun/internal/proxy/proxy.go#L26
New round of testing (using services.yaml, conplugin.yaml, and ingresses.yaml from the previous test) indicates that the difference in config blob size between v1 and v2 may have been a misread. Config dumps actually end up being identical for each, and eventually reach a state where they do not fluctuate. configs.tar.gz is uninteresting because all the configs are identical, but the "a" configs are from v2 and the "b" configs are from v1. They have the full set of expected resources. Some upstreams are missing targets because minikube legit fell over and isn't populating Endpoints for everything properly and/or httpbin died (note to self: set …).

The previously observed fluctuation appears to be related to slow reconciliation. The current proxy loop means that a large number of new resources (such as you have when starting with a large number of extant resources) will result in a significant wait before config stabilizes. Concurrency hypotheticals are still a bit 🤔, but I believe the below explains what happens re the channel handling here:
IIRC 1.x just does ALL THE RESOURCES at once, so it doesn't exhibit this and gets to the full config size immediately. Unsure if there's a good reason to keep that around for 2.x as-is--there's no reason to back off adding stuff to the cache since those changes aren't directly tied to updates. There may be reason to back off actual syncs, but we should be able to continuously add resources to cache and just use the full contents whenever we next sync. Other observations:
Internally, yes. The server thread is either …
Correct, and this was the intentional design (but if we're seeing problems with it, let's change it).
Part of the way things work in this regard has to do with intentionally trying to make updates to the Kong Admin API itself single threaded, to avoid state loss from concurrent API updates in DB-less mode.
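For illustration, here's a minimal sketch of what keeping Admin API writes single threaded can look like (invented names, not the controller's actual types): one goroutine owns the client and applies snapshots one at a time, so concurrent reconcilers can't interleave partial DB-less config posts.

```go
package sketch

import "context"

// kongState stands in for a full declarative configuration snapshot.
type kongState struct{}

// runUpdater owns all writes to the Admin API: callers send snapshots on the
// returned channel, and exactly one apply runs at a time, so concurrent
// reconcile workers can never interleave partial updates.
func runUpdater(ctx context.Context, apply func(kongState) error) chan<- kongState {
	updates := make(chan kongState)
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case s := <-updates:
				_ = apply(s) // in DB-less mode this would be one POST /config
			}
		}
	}()
	return updates
}
```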
I've run into this while doing work on GKE; I have another PR where this is increased. Ultimately I think the fact that it was set to 3 seconds was an accident, because it was just re-using another unrelated variable (the proxy sync seconds) as its default. Given your report and need for this change as well, I've cherry-picked it out into its own PR: #1610
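As a hypothetical illustration of that kind of default mix-up (flag names invented, not the real ones), a timeout flag can silently inherit an unrelated interval's default:

```go
package main

import (
	"flag"
	"fmt"
)

// Hypothetical flag names: the timeout's default is taken from an unrelated
// setting (the proxy sync interval), so it silently ends up at 3 seconds.
var (
	proxySyncSeconds       = flag.Int("proxy-sync-seconds", 3, "how often to sync config to the proxy")
	adminAPITimeoutSeconds = flag.Int("admin-api-timeout-seconds", *proxySyncSeconds, "Admin API request timeout")
)

func main() {
	flag.Parse()
	fmt.Println("timeout defaults to", *adminAPITimeoutSeconds, "seconds")
}
```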
Given your findings it seems like we need to make some kind of change; the simplest one that seems to fit would be to remove the backpressure mechanism and make the …

I'm working on some alternative implementations, will update soon.
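A rough sketch of the no-backpressure direction discussed here, with invented types rather than the controller's real ones: cache writes always land immediately, and each periodic sync just serializes whatever the cache holds at that moment.

```go
package sketch

import (
	"context"
	"sync"
	"time"
)

// objectCache is a stand-in for the controller's object cache; writes never
// block on an in-flight sync.
type objectCache struct {
	mu      sync.RWMutex
	objects map[string]interface{}
}

func (c *objectCache) Set(key string, obj interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.objects == nil {
		c.objects = map[string]interface{}{}
	}
	c.objects[key] = obj
}

// snapshot copies the full current contents for the next sync.
func (c *objectCache) snapshot() map[string]interface{} {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[string]interface{}, len(c.objects))
	for k, v := range c.objects {
		out[k] = v
	}
	return out
}

// runPeriodicSync pushes whatever the cache holds on every tick; resources
// added between ticks are simply picked up by the next full snapshot.
func runPeriodicSync(ctx context.Context, c *objectCache, interval time.Duration, sync func(map[string]interface{}) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = sync(c.snapshot())
		}
	}
}
```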
Testing with #1612 clears the weirdness around update batching/blocking. All existing resources insert into the controller cache at start. Then, with some additional changes to bump a timeout we plan to change elsewhere and for logging clarity:
Initial apply and apply with a change to one route took ~5m to complete, with workers only using 20% of a core each throughout:
So that needs attention upstream, but the performance concerns on our side are addressed: there's nothing more we can do about that, and our goal is to reach stable config quickly so we don't send excess updates. Removal of the backpressure mechanism moots the concern about endpoint updates interfering with the in-controller update checker. Logs are still noisy, but that's not a performance problem.
perf.tar.gz collects profiling data in DB-backed mode, with 25000 each of Ingresses, Services, Consumers, and Plugins: with no config (run 1), partway through adding config (runs 2 and 3), and after all config has been applied (run 4).
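For reference, this isn't necessarily how perf.tar.gz was captured, but the standard way to pull heap/CPU profiles from a running Go process at each stage is to expose net/http/pprof and point go tool pprof at it:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// At each stage (no config, partway through, fully applied), grab a
	// profile with e.g.: go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```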
I'm not great at interpreting pprof data, though there doesn't seem to be too much of interest other than FillConsumersAndCredentials being particularly memory-inefficient.
In DB-backed mode we have the slow path, which does several rounds of json.Marshal and json.Unmarshal, which is very suspicious.
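To illustrate why that pattern raises eyebrows (this is not the controller's actual code): round-tripping a large object graph through JSON allocates the full encoded blob on every pass, so repeated Marshal/Unmarshal rounds over a 25000-resource config get expensive fast.

```go
package sketch

import "encoding/json"

// roundTripCopy converts or deep-copies a value by encoding it to JSON and
// decoding it back. The whole encoded blob (plus decoder state) is allocated
// on every call, which is why doing this several times per sync over a large
// config is a memory red flag.
func roundTripCopy(in, out interface{}) error {
	b, err := json.Marshal(in)
	if err != nil {
		return err
	}
	return json.Unmarshal(b, out)
}
```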
Follow-up from #1465 (comment)
During memory testing, I observed that 2.x both posts config more often than expected (without Kubernetes updates, where 1.x did not re-post config given the same set of Kubernetes resources) and exhibits considerable (multiple kB) unexplained fluctuations in config size (unclear if this also happened with 1.x, since it wasn't updating as frequently). The frequency is worth investigating, as frequent DB-less config posts result in proxy instability.
Gathering large config blobs over the network is difficult, so this will probably wait on #1308 or a partial debug implementation (none of the proper flag support or anything beyond a "write blob to disk") instead.
Acceptance criteria: