
Retrieve every k8s resource generated by a Carvel package installation #4062

Closed (12 commits)

Conversation


@antgamdia (Contributor) commented Jan 11, 2022

Description of the change

This PR removes the workaround we implemented for the previous release and adds a generic way to retrieve all the Kubernetes resources matching a given label. This way we can fetch every k8s resource created as a result of a Carvel PackageInstall.
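As a rough sketch of the core idea: every kapp-deployed resource carries the kapp.k14s.io/app label (visible in the throttling messages below), so a single label selector finds all of them. The helper name here is just for illustration:

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// listOptionsForApp builds the selector used to find every resource created
// by a kapp deployment: they all carry the kapp.k14s.io/app label.
// appID is whatever id kapp assigned to the PackageInstall.
func listOptionsForApp(appID string) metav1.ListOptions {
	return metav1.ListOptions{
		LabelSelector: "kapp.k14s.io/app=" + appID,
	}
}
```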

Benefits

The resources retrieved for a Carvel package will now include every resource the installation created, not just a fixed list of Pods, Services and Deployments.

Possible drawbacks

Querying the k8s APIs for every available resource type is time-consuming. I guess part of it is due to my resource-limited kind cluster, but I'm still worried about it.

  • Example: an app with 97 resources (in a cluster with 88 different resource types) takes around 15 seconds to retrieve the resource refs.
    • The logs show plenty of "Waited for <some time> ms due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/<api resource>?labelSelector=kapp.k14s.io%2Fapp%3D<my-id>" console messages.

Note that the kapp CLI is already able to fetch the resources it generated; however, using only a Kubernetes client (not the kapp binary), there is no CR holding this information.
See how the kapp inspect command works under the hood here: https://github.com/vmware-tanzu/carvel-kapp/blob/v0.44.0/pkg/kapp/cmd/app/inspect.go#L76

Edit: they take a similar approach, but with a caching layer on top: https://github.com/vmware-tanzu/carvel-kapp/blob/v0.44.0/pkg/kapp/resources/resource_types.go#L111

Applicable issues

Additional information

Some resources from a Harbor installation:
(screenshot omitted)

@antgamdia changed the title from "3854 get resources" to "Retrieve every k8s resource generated by a Carvel package installation", Jan 11, 2022
// have a look at the CachedDiscoveryClient in k8s
// https://github.com/kubernetes/client-go/blob/release-1.22/discovery/cached/disk/cached_discovery.go
discoveryClient := typedClient.Discovery()
if discoveryClient == nil {
antgamdia (author):

Adding the discoveryClient as part of the clientGetter so that we can swap it and mock it later for testing
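For reference, a minimal sketch of constructing the disk-backed CachedDiscoveryClient mentioned in the code comment above (the cache directories and TTL are arbitrary values, not necessarily what we'd want):

```go
import (
	"time"

	"k8s.io/client-go/discovery/cached/disk"
	"k8s.io/client-go/rest"
)

// newCachedDiscovery wraps discovery in client-go's disk-backed cache so
// repeated discovery calls don't hit the apiserver every time.
func newCachedDiscovery(config *rest.Config) (*disk.CachedDiscoveryClient, error) {
	return disk.NewCachedDiscoveryClientForConfig(
		config,
		"/tmp/kubeapps-discovery-cache", // discovery cache dir (illustrative path)
		"/tmp/kubeapps-http-cache",      // HTTP cache dir (illustrative path)
		10*time.Minute,                  // TTL (illustrative value)
	)
}
```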

if err != nil {
return nil, err
}
discoveryClient := discovery.NewDiscoveryClient(client)
antgamdia (author):

Here we still create a new discovery client but, as Michael did in the past, we only use the RESTMapper in production. For testing, we mock the entire kindToResource function.
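A sketch of what a RESTMapper-based kindToResource could look like (the exact signature in the PR may differ):

```go
import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/restmapper"
)

// kindToResource resolves a GroupVersionKind to its GroupVersionResource
// using the discovery information exposed by the cluster.
func kindToResource(dc discovery.DiscoveryInterface, gvk schema.GroupVersionKind) (schema.GroupVersionResource, error) {
	groupResources, err := restmapper.GetAPIGroupResources(dc)
	if err != nil {
		return schema.GroupVersionResource{}, err
	}
	mapper := restmapper.NewDiscoveryRESTMapper(groupResources)
	mapping, err := mapper.RESTMapping(gvk.GroupKind(), gvk.Version)
	if err != nil {
		return schema.GroupVersionResource{}, err
	}
	return mapping.Resource, nil
}
```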

}
wg.Add(1)
// spawn a goroutine to fetch the matching k8s resources
go func(resource metav1.APIResource, ch chan resourceAndGvk, wg *sync.WaitGroup) {
antgamdia (author):

I've tried to make it as parallel as possible, but the bottleneck seems to be in the k8s apiserver itself (in my local environment, haven't tested in a real cluster yet).

Contributor:

Yes, unless I'm missing something, this code is calling List on every listable resource in all namespaces on the cluster? I'm skeptical that we should be doing this?

antgamdia (author):

Mmm, yes, that's exactly what the code is doing :S
However, I can't think of another way of "getting an unknown number of resources matching a label across an unknown number of namespaces". Perhaps by excluding some resource types we know for sure can't be created by a package (like Nodes)?

Contributor:

Yeah, OK. If this is what the kapp code does as well, I assume they've thought about it more than us (and would prefer to just use their code, thanks for the second version).


// create a channel to collect the matching k8s resources, thus avoiding blocking the waitgroup
// using a buffered channel with size 100; assumption: 60% native k8s resources + 40 CRDs
ch := make(chan resourceAndGvk, 100)
antgamdia (author):

I'm not very familiar with channels yet; I made this assumption, but please feel free to tell me more about the buffer sizes here if it's wrong.

Contributor:

Yeah, it's not clear to me why you're using a buffered channel, especially of a size 100. It implies that you think your consumers of the channel data will get very behind the producers.

@antgamdia marked this pull request as ready for review, January 11, 2022 21:39
@absoludity (Contributor) left a review:

I've gone through the channel code that you mentioned you were uncertain about and left a few thoughts, as well as taken a look through the rest of the code. My main question or concern is that this appears to be getting all resources across all namespaces of the cluster to filter them by whatever the kapp selector is. If that's the case, then yep, I think your concern about the timing is a valid one. Is that the only option we've got?

I had a brief look at the kapp code that you pointed to; I didn't think it was doing the same, but rather something much more specific (it appears to list identified resources only, though I could be reading it incorrectly). Then I read the docs and tried the following on my dev cluster:

kapp inspect -a 'label:'
...
878 resources

Wow. So it's possible to use kapp to do that, but is that what it's doing by default when you inspect an app? (Haven't checked but assume you have).

Either way, what I'd hope we can do here is use kapp itself to get the resources for an app, so we're not implementing our own version. Why can we not create a NewInspectionCommand and run it, similar to how we run Helm commands? That way, we'd not be re-implementing kapp functionality (listing all resources for an app), in the same way we try to run the Helm CLI commands where we can?

}
}

// GetClients ensures a client getter is available and uses it to return both a typed and dynamic k8s client.
func (s *Server) GetClients(ctx context.Context, cluster string) (kubernetes.Interface, dynamic.Interface, error) {
func (s *Server) GetClients(ctx context.Context, cluster string) (kubernetes.Interface, dynamic.Interface, discovery.DiscoveryInterface, error) {
Contributor:

I'm not sure that it makes sense to include the discovery client in GetClients here (or the private getClients)? The reason that GetClients exists is because we need k8s clients that are dependent on the request context, as they use the credentials/token of the request. This isn't the case for the discovery client, I would think?

antgamdia (author):

Mmm, that's true... we could just use the typedClient and later call typedClient.Discovery(). However, I wasn't able to swap it with the fake.FakeDiscovery in the tests (that's why c7004f8).
If we go this way (i.e. not using the Carvel code), I'll give it another thought anyway.
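For the record, a sketch of how the fake discovery client can be obtained from the fake clientset in tests (the seeded resource list is just an example):

```go
import (
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	fakediscovery "k8s.io/client-go/discovery/fake"
	"k8s.io/client-go/kubernetes/fake"
)

func TestWithFakeDiscovery(t *testing.T) {
	clientset := fake.NewSimpleClientset()
	fakeDisc, ok := clientset.Discovery().(*fakediscovery.FakeDiscovery)
	if !ok {
		t.Fatalf("couldn't convert Discovery() to *fakediscovery.FakeDiscovery")
	}
	// Seed the fake with whatever API resources the test needs.
	fakeDisc.Resources = []*metav1.APIResourceList{
		{
			GroupVersion: "apps/v1",
			APIResources: []metav1.APIResource{
				{Name: "deployments", Kind: "Deployment", Namespaced: true, Verbs: []string{"list", "get"}},
			},
		},
	}
	// clientset.Discovery() can now be handed to the code under test.
}
```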

Comment on lines +412 to +416
// Apparently, App CRs that were not created by a "kapp deploy" command don't have the proper annotations.
// So, in order to retrieve the annotation value,
// we have to get the ConfigMap <AppName>-ctrl and then fetch the value of the key "labelValue" in data.spec.
// See https://kubernetes.slack.com/archives/CH8KCCKA5/p1637842398026700
// https://github.com/vmware-tanzu/carvel-kapp-controller/issues/430
Contributor:

So this sounds pretty temporary? Will the longer term solution be an upstream fix to the App CR?

antgamdia (author):

I don't really know. The issue is not very active and I haven't heard about it on the community Slack recently.
As of today (v0.44 was just released), it still needs the -ctrl suffix.
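A hedged sketch of that fallback; the "-ctrl" ConfigMap name suffix and the "spec"/"labelValue" keys are taken from the code comment above and should be treated as assumptions:

```go
import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// appLabelValue reads the kapp-controller ConfigMap "<appName>-ctrl" and
// extracts the label value from the JSON document stored under data["spec"].
// Key/field names follow the comment above and are assumptions.
func appLabelValue(ctx context.Context, client kubernetes.Interface, namespace, appName string) (string, error) {
	cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, appName+"-ctrl", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	var spec struct {
		LabelValue string `json:"labelValue"`
	}
	if err := json.Unmarshal([]byte(cm.Data["spec"]), &spec); err != nil {
		return "", err
	}
	if spec.LabelValue == "" {
		return "", fmt.Errorf("no labelValue found in ConfigMap %s-ctrl", appName)
	}
	return spec.LabelValue, nil
}
```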



continue
}
gv, err := schema.ParseGroupVersion(apiResourceList.GroupVersion)
if err != nil {
Contributor:

Not sure why we'd just ignore an error like this? (There could be a reason, but it should be documented here).

antgamdia (author) commented Jan 12, 2022:

Same as in another comment: I just used the k8s code directly (https://github.com/kubernetes/kubectl/blob/release-1.22/pkg/cmd/apiresources/apiresources.go#L172) :S
However, you're right, we should do something here (at least log a warning message).
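Something along these lines, assuming a klog-style logger imported as log (that import is an assumption about the surrounding file):

```go
gv, err := schema.ParseGroupVersion(apiResourceList.GroupVersion)
if err != nil {
	// Don't fail the whole request because one group version can't be parsed,
	// but leave a trace instead of silently skipping it.
	log.Warningf("ignoring unparsable group version %q: %v", apiResourceList.GroupVersion, err)
	continue
}
```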

continue
}
// filter to resources that support the specified "List" verb
if !sets.NewString(resource.Verbs...).HasAll([]string{"list"}...) {
Contributor:

Hmm, I would have thought !sets.NewString(resource.Verbs...).Has("list") would be simpler and clearer?

antgamdia (author):

Yep, I just used the same code as in the kubectl client, which is aimed at supporting any set of verbs. In our scenario, your suggestion is way simpler. Thanks for pointing it out.
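The simplified check would then be just:

```go
// Simplified form of the check: we only ever need the "list" verb here.
if !sets.NewString(resource.Verbs...).Has("list") {
	continue
}
```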

close(ch)
}()

// once closed, iterate over the closed channel and push the objects to the response
Contributor:

I don't think this comment is correct? That is, the channel will be closed when the wg.Wait() returns, but that's correctly happening inside a go routine. So the code below will start processing the data on the channel as it arrives (as it should) and will only complete (the range ch) when the channel is closed and all data has been drained.

Which still leaves me wondering why you need a large buffer on your channel? The work of the consumer here should be very quick compared to the producer (which is doing network fetches), so I'd not think you'd need a buffer at all.

antgamdia (author):

Ah, right! I mixed it up... initially I had an unbuffered channel, so I needed the goroutine encapsulating the wg.Wait() and close(ch). Then I moved to a buffered one but left this goroutine behind.

Yes, the consumer is just aggregating data, which should be really fast; there's no real need for a buffer here.
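To make the corrected pattern concrete, a minimal self-contained sketch: an unbuffered channel, one producer goroutine per resource type, close-after-Wait in a helper goroutine, and a consumer that only aggregates. The types and the producers' work are stand-ins, not the PR's real code:

```go
package main

import (
	"fmt"
	"sync"
)

// item stands in for the PR's resourceAndGvk type.
type item struct{ name string }

func main() {
	// Stand-in for the listable API resources discovered from the cluster.
	resources := []string{"deployments", "services", "configmaps"}

	ch := make(chan item) // unbuffered: the consumer only aggregates, so no buffer is needed
	var wg sync.WaitGroup

	for _, r := range resources {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			// The real code would List the resources matching the kapp label here.
			ch <- item{name: r}
		}(r)
	}

	// Close the channel only once every producer is done, so the range below terminates.
	go func() {
		wg.Wait()
		close(ch)
	}()

	var collected []item
	for it := range ch {
		collected = append(collected, it)
	}
	fmt.Printf("collected %d items\n", len(collected))
}
```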


}

// ignore the namespace as the resources may appear in different ns from which the PackageInstall is created
resources, err := dynamicClient.Resource(gvr).List(ctx, listOptions)
Contributor:

Interesting, I'd thought the above syntax was for non-namespaced (i.e. cluster-level) resources rather than resources across all namespaces, but I see that the dynamic client apparently supports both (NamespaceableResourceInterface is implemented to do both).

antgamdia (author):

Yes indeed! I initially thought the same; I even had an if for both cases. After I hit the issue I mentioned (resources appearing in namespaces other than the one the PackageInstall is in), I noticed the dynamic client actually supports both.
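A sketch of both forms supported by the dynamic client's NamespaceableResourceInterface (the helper name is just for illustration):

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// listByLabel lists resources of the given GroupVersionResource matching a
// label selector. An empty namespace means "all namespaces" (and also covers
// cluster-scoped resources); a concrete namespace scopes the query.
func listByLabel(ctx context.Context, dyn dynamic.Interface, gvr schema.GroupVersionResource, namespace, selector string) (*unstructured.UnstructuredList, error) {
	opts := metav1.ListOptions{LabelSelector: selector}
	if namespace == "" {
		return dyn.Resource(gvr).List(ctx, opts)
	}
	return dyn.Resource(gvr).Namespace(namespace).List(ctx, opts)
}
```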

antgamdia (author) commented Jan 12, 2022:

> Either way, what I'd hope we can do here is use kapp itself to get the resources for an app, so we're not implementing our own version. Why can we not create a NewInspectionCommand and run it, similar to how we run Helm commands? That way, we'd not be re-implementing kapp functionality (listing all resources for an app), in the same way we try to run the Helm CLI commands where we can?

Quick notes, I'm in a rush; will try to explain myself better later:

TLDR: I don't see any performance increase when using Kapp's code directly (I'd say it's slower)

View alternative version using code from Kapp at https://github.com/kubeapps/kubeapps/pull/4068/files

  • I have successfully used the Kapp code as in the Helm scenario.
  • There is no way to configure it with a custom rest.Config for generating the actions. You have to copy some parts (creating the clients: NewResourceTypesImpl, NewResourcesImpl, NewIdentifiedResources, NewApps)
  • Then I used the same code as in the cmd inspect action.

Perhaps it's a matter of our Kubernetes client's default config (QPS and other params); I'll have a look later.
Carvel's defaults: https://github.com/vmware-tanzu/carvel-kapp/blob/develop/pkg/kapp/cmd/core/kube_api_flags.go

Edit: Increasing QPS/Burst does the trick... we were throttling our calls to 5 req/s, meaning ~100 resource-type requests at 5 req/s ≈ 20 s, which is consistent with my results. Carvel's QPS is set to 1000; that's why my approach seemed slow 🤦‍♂️
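A sketch of what bumping the limits looks like on a rest.Config (the exact values we end up with are still to be decided; 1000 mirrors what kapp uses):

```go
import "k8s.io/client-go/rest"

// withHigherRateLimits raises the client-side throttling limits; the
// client-go defaults (QPS 5, Burst 10) are what caused the ~20s retrieval
// time described above.
func withHigherRateLimits(config *rest.Config) *rest.Config {
	config.QPS = 1000
	config.Burst = 1000
	return config
}
```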

(video attachment: carvelResourcesMine.mp4)

My plan is:

  • Create a PR allowing customizing the QPS/Burst for the whole kubeappsapis server
  • Evaluate both approaches with the same configuration
    • Mainly for maintainability reasons, we'll probably go the Kapp way

@absoludity (Contributor):

Excellent, nice work Antonio! I'm +1 whichever way you decide, though I personally lean the same way (use Kapp and get upstream fixed so it can be configured, so we don't maintain more code unnecessarily).

@absoludity (Contributor):


Looks like you haven't yet pushed your changes here (no rush, I just wanted to leave a +1 after seeing those, but will do so tonight / tomorrow)

antgamdia (author):

> Looks like you haven't yet pushed your changes here (no rush, I just wanted to leave a +1 after seeing those, but will do so tonight / tomorrow)

Not yet, I wanted to discuss both alternatives; that's why I pushed the changes in a separate PR (#4068).

The current situation is: my approach is fully working (with the QPS fix), but the alternative is not yet (it's still not getting resources from other namespaces).

My plan: work today on getting the Carvel approach fully working and tested, and then benchmark both alternatives.

@antgamdia marked this pull request as draft, January 13, 2022 17:43
antgamdia (author):

Closing this PR in favor of #4068.
Thanks for your comments and feedback on this version!

@antgamdia closed this on Jan 13, 2022
@antgamdia deleted the 3854-getResources branch on May 31, 2022 11:44

Successfully merging this pull request may close these issues.

Add a proper implementation for GetInstalledPackageResourceRefs