
very slow apply performance due to openapi schema reparsing #1682

Open
juliantaylor opened this issue Dec 4, 2024 · 17 comments
Labels
kind/support Categorizes issue or PR as a support question. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@juliantaylor

juliantaylor commented Dec 4, 2024

What happened:
kubectl apply is very slow at reading manifest files. kubectl appears to reread its configuration several times per manifest it applies, which gets extremely slow: several minutes when applying a large number of manifests.

How to reproduce it (as minimally and precisely as possible):

$ cat kubeconfig
apiVersion: v1
clusters:
- cluster:
    certificate-authority: /tmp/cluster.pem
    server: https://cluster.url
  name: cluster
contexts:
- context:
    cluster: cluster
    user: user
  name: cluster-user
current-context: cluster-user
kind: Config
preferences: {}
users:
- name: user
  user:
    token: ...

for i in {1..10}; do echo '{  "apiVersion": "v1",  "kind": "Namespace",   "metadata": { "name": "ns'$i'" }}' ; done > manifests
$ cat manifests
{  "apiVersion": "v1",  "kind": "Namespace",   "metadata": { "name": "ns1" }}
{  "apiVersion": "v1",  "kind": "Namespace",   "metadata": { "name": "ns2" }}
{  "apiVersion": "v1",  "kind": "Namespace",   "metadata": { "name": "ns3" }}
...

Now run the following dry-run kubectl command and observe how often it opens (and reads) the certificate-authority file cluster.pem:

$ kubectl version
Client Version: v1.31.0
Kustomize Version: v5.4.2

$ for i in {1..10}; do echo '{  "apiVersion": "v1",  "kind": "Namespace",   "metadata": { "name": "ns'$i'" }}' ; done > manifests
$ KUBECONFIG=kubeconfig GOMAXPROCS=2 strace -f -e openat kubectl apply --dry-run=client -f manifests |& grep cluster.pem -c
27
# runs in 1 second
$ for i in {1..100}; do echo '{  "apiVersion": "v1",  "kind": "Namespace",   "metadata": { "name": "ns'$i'" }}' ; done > manifests
$ KUBECONFIG=kubeconfig GOMAXPROCS=2 strace -f -e openat kubectl apply --dry-run=client -f manifests |& grep cluster.pem -c
207
# runs in 12 seconds

Increasing the number of objects in the manifests file increases the number of times kubectl reads the cluster certificate from its configuration, which does not seem necessary.
edit: see below, the real problem is repeated JSON decoding of the OpenAPI schema documents

When applying more manifests, the time before kubectl even contacts the server grows to several minutes, which can be very relevant for e.g. CI test runs.

The problem should not be the manifest parsing time itself, as e.g. Python's yaml module or the Go version of yq can parse these manifests in a fraction of the time kubectl takes.

@juliantaylor juliantaylor added the kind/bug Categorizes issue or PR as related to a bug. label Dec 4, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

SIG CLI takes a lead on issue triage for this repo, but any Kubernetes member can accept issues by applying the triage/accepted label.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 4, 2024
@ardaguclu
Member

What about passing --dry-run=server or using --server-side? Does the same slowdown occur with server-side apply as well?

@juliantaylor
Author

juliantaylor commented Dec 5, 2024

yes, the problem occurs before kubectl even starts talking to the server about the manifests (and after it has set up its resource caches from the server's api-resources)

@ardaguclu
Member

There might be latency in the discovery endpoint. Could you please rerun with increased log verbosity (-v=9) and check exactly which endpoint is slow?

@juliantaylor
Author

juliantaylor commented Dec 5, 2024

it is not an endpoint that is slow; it is spending all its time reading the certificate authority over and over again, at 100%-150% CPU usage

@juliantaylor
Author

here is the output with network log:

$ kubectl apply --dry-run=client -f manifests -v 6 |& ts 
Dez 05 13:31:33 I1205 13:31:33.126524  397199 loader.go:395] Config loaded from file:  config
Dez 05 13:31:33 I1205 13:31:33.159773  397199 round_trippers.go:553] GET server/openapi/v3?timeout=32s 200 OK in 30 milliseconds
Dez 05 13:31:33 I1205 13:31:33.191357  397199 round_trippers.go:553] GET server/openapi/v3/api/v1?hash=......&timeout=32s 200 OK in 25 milliseconds

<<< 11 seconds of reading cluster.pem >>>

Dez 05 13:31:44 I1205 13:31:44.831916  397199 round_trippers.go:553] GET .../api/v1/namespaces/ns1 404 Not Found in 6 milliseconds

@juliantaylor
Author

if you give me some pointers on how to create a CPU profile, I can pinpoint the exact cause for you; I'm not very familiar with Go tooling.

@ardaguclu
Member

Reading cluster.pem shouldn't take 11 seconds. I'd recommend checking the output at the highest verbosity (i.e. -v=99).

@juliantaylor
Author

juliantaylor commented Dec 5, 2024

please read my full report: reading the cluster PEM once does not take 11 seconds; reading it hundreds of times (it seems twice per input manifest), plus the other code associated with loading the kubeconfig, does.

adding -v99 does not change the output, as nothing is logged for that part of the code

Dez 05 13:40:14 I1205 13:40:14.767803  397762 cached_discovery.go:77] returning cached discovery info from .../crd.projectcalico.org/v1/serverresources.json


<<< 11 seconds of reading cluster.pem + whatever else kubectl does with the config >>>


Dez 05 13:40:24 I1205 13:40:24.913038  397762 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json" -H "User-Agent: kubectl/v1.30.6 (linux/amd64) kubernetes/00f20d4" 'servernamespaces/ns1'

@juliantaylor
Author

using --validate=false removes the performance problem

@ardaguclu
Member

Validation interacts with the API server, as shown here. I think there is an issue on your cluster. What is your cluster version?

@ardaguclu
Member

/remove-kind bug
/kind support

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Dec 5, 2024
@juliantaylor
Author

the cluster version is 1.30.6, but it is not a cluster problem: there is no network traffic at all during this part. Have you tried running my reproducer yourself?

I checked how to profile Go programs: parsing cluster.pem is not the main problem (although it is very odd that it happens). The time goes into decoding the 1.3 MiB OpenAPI schema, which appears to be done again for every single input manifest:

#6  0x00000000021caade in k8s.io/client-go/openapi3.(*root).GVSpec (r=0xc00061b9d0, gv=..., ~r0=0x0, ~r1=...)
    at /kubernetes/kubernetes/staging/src/k8s.io/client-go/openapi3/root.go:102
102		err = json.Unmarshal(openAPISchemaBytes, &parsedV3Schema)
(gdb) p openAPISchemaBytes
$2 = {
  array = 0xc00114a000 "{\"openapi\":\"3.0.0\",\"info\":{\"title\":\"Kubernetes\",\"version\":\"v1.30.6\"},\"paths\":{\"/api/v1/\":{\"get\":{\"tags\":[\"core_v1\"],\"description\":\"get available resources\",\"operationId\":\"getCoreV1APIResources\",\"respo"..., len = 1383492, cap = 1400832}

...
#7  0x00000000021caade in k8s.io/client-go/openapi3.(*root).GVSpec (r=0xc0002be620, gv=..., ~r0=0x0, ~r1=...)
    at /kubernetes/kubernetes/staging/src/k8s.io/client-go/openapi3/root.go:102
102             err = json.Unmarshal(openAPISchemaBytes, &parsedV3Schema)
(gdb) break
Breakpoint 1 at 0x21caade: file /kubernetes/kubernetes/staging/src/k8s.io/client-go/openapi3/root.go, line 102.
(gdb) ignore 1 100 
Will ignore next 100 crossings of breakpoint 1.
(gdb) c


...
(gdb) info break
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00000000021caade kubernetes/kubernetes/staging/src/k8s.io/client-go/openapi3/root.go:102
        breakpoint already hit 42 times




@ardaguclu
Member

yes, OpenAPI schema validation takes a significant amount of time per resource. You can disable validation as you did above (which is strongly discouraged). But I don't think there is anything we can do about this: validation should be done, and it unfortunately takes time.

@juliantaylor
Copy link
Author

juliantaylor commented Dec 5, 2024

why does the openapi schema need to be decoded for every single manifest?
I can load the schema with Python and validate all the manifests in a fraction of a second, not the multiple seconds to minutes kubectl takes here.

@juliantaylor juliantaylor changed the title very slow apply performance due to kubeconfig reparsing very slow apply performance due to openapi schema reparsing Dec 5, 2024
@juliantaylor
Author

proof-of-concept patch that just caches the parsed schemas:

--- a/staging/src/k8s.io/client-go/openapi3/root.go
+++ b/staging/src/k8s.io/client-go/openapi3/root.go
@@ -59,6 +59,7 @@ type Root interface {
 type root struct {
        // OpenAPI client to retrieve the OpenAPI V3 documents.
        client openapi.Client
+       schemas map[string]*spec3.OpenAPI
 }
 
 // Validate root implements the Root interface.
@@ -67,7 +68,7 @@ var _ Root = &root{}
 // NewRoot returns a structure implementing the Root interface,
 // created with the passed rest client.
 func NewRoot(client openapi.Client) Root {
-       return &root{client: client}
+       return &root{client: client, schemas: make(map[string]*spec3.OpenAPI)}
 }
 
 func (r *root) GroupVersions() ([]schema.GroupVersion, error) {
@@ -93,6 +94,12 @@ func (r *root) GroupVersions() ([]schema.GroupVersion, error) {
 }
 
 func (r *root) GVSpec(gv schema.GroupVersion) (*spec3.OpenAPI, error) {
+       apiPath := gvToAPIPath(gv)
+       schema, found := r.schemas[apiPath]
+       if found {
+               return schema, nil
+       }
+
        openAPISchemaBytes, err := r.retrieveGVBytes(gv)
        if err != nil {
                return nil, err
@@ -100,6 +107,7 @@ func (r *root) GVSpec(gv schema.GroupVersion) (*spec3.OpenAPI, error) {
        // Unmarshal the downloaded Group/Version bytes into the spec3.OpenAPI struct.
        var parsedV3Schema spec3.OpenAPI
        err = json.Unmarshal(openAPISchemaBytes, &parsedV3Schema)
+       r.schemas[apiPath] = &parsedV3Schema
        return &parsedV3Schema, err
 }

This makes it roughly 10 times faster:

 /tmp$ for i in {1..100}; do echo '{  "apiVersion": "v1",  "kind": "Namespace",   "metadata": { "name": "ns'$i'" }}' ; done > manifests
 /tmp$ time ./kubectl.patched --profile=cpu apply --dry-run=client -f manifests > /dev/null 

real    0m1.067s
user    0m0.546s
sys     0m0.103s
 /tmp$ time ./kubectl.orig --profile=cpu apply --dry-run=client -f manifests > /dev/null 

real    0m9.366s
user    0m11.569s
sys     0m0.395s

@juliantaylor
Author

I don't see an obvious way to make kubectl reuse the decoded schemas across multiple resource visits, though the client-go openapi client already caches the JSON document in byte form:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/openapi/cached/groupversion.go#L47

So a possible solution would be to adjust client-go to cache the decoded JSON instead of, or in addition to, the raw bytes; the kubectl issue would then be resolved automatically. I assume that would be an issue to file in the main kubernetes project?
