Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Enhance Resource Health Monitoring within App CR #717

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

varshaprasad96
Copy link

Addresses the issue: carvel-dev/kapp-controller#1412

Addresses the issue: carvel-dev/kapp-controller#1412

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Copy link

netlify bot commented Jan 23, 2024

Deploy Preview for carvel ready!

Name Link
🔨 Latest commit 5ae1a1f
🔍 Latest deploy log https://app.netlify.com/sites/carvel/deploys/65b0187fc1ec37000878d7b2
😎 Deploy Preview https://deploy-preview-717--carvel.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.


We intend to extend the existing App API by adding a new status condition to expose the system's health. To do so, the following needs to be implemented:

1. The controller reconciling the App CR needs to dynamically set up watches for the resources being deployed by the package.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to know more and discuss about how we can enable watches in the App reconciler. Looks like currently, the kapp command is called, and based on its output the App status is popoulated:
https://github.com/carvel-dev/kapp-controller/blob/df87efdcf0c0c140ff644c8286257cd38a74fd42/pkg/app/app_deploy.go#L25

If we could return the list of resources which are being created (which currently is present in the cmd output) and dynamically set up watches, it would make it easier.

This is how we do in Rukpak for reference: https://github.com/operator-framework/rukpak/blob/6a8a84c9aff05efaba7b05992704ad38462a7ee8/internal/controllers/bundledeployment/bundledeployment.go#L389-L402.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be ideal, but you can find all the resources by taking the involved GroupKinds from the ConfigMap associated with the kapp app, then listing/watching/informing using the appropriate app label selector.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possible approach would be to have kapp write some information about specific resources to a file upon reconciliation and have this information copied over to the status.
This is how we have information about used group versions and namespaces to the App status, using output from the --app-metadata-file-output flag.

This however means this information will be reported whenever the App syncs.


#### Use Case: Monitoring the state of resources

Kapp currently has the `inspect` command which lists the resources deployed and their current statuses. The output of the command is also printed out as a part of App's status if enabled through `rawOptions` while creating the CR.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To confirm - the output of inspect command in App's status is populated during deploy. After which it is not dynamically updated when the health of any resources change? Am I missing anything here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct, it would only report the health of resources when a reconciliation occurs.

One correction (though it is not very relevant to the context) would be that inspect is not a part of rawOptions, enabling it would look more like:

#.....other spec
deploy:
  - kapp:
      inspect: {}
#....

## Open Questions:

1. Can using informers to watch resources increase cache size, potentially impacting the performance?
2. Can the output in the `inspect` status field be combined with that of proposed `healthy` condition?
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is what needs to happen eventually. It doesn't make sense to have two status fields serve similar purpose. The healthy condition can just list all the resources (instead of just the failed/unhealthy) ones or vice-versa.

Before refactoring the proposal for the same, I would like to confirm the use case of inspect and make sure of the direction we would like to go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inspect section initially just aggregated the statuses of all resources after a finished reconciliation. It was essentially the output of the kapp inspect command.
We disabled it by default in favour of reducing the number of API calls we make.

Since it is a separate feature altogether, I think we can work towards having a separate section for the additional information we want to surface.

Copy link
Author

@varshaprasad96 varshaprasad96 Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is a separate feature altogether, I think we can work towards having a separate section for the additional information we want to surface.

If we are watching all the resources in the cluster anyway, and triggering a reconcile - wondering if calling an inspect command on top of it is necessary. If so, this may also end up loading the API server - since I can expect more no. of reconciliations due to dependent resources.

Additionally, the second aspect is inspect and healthy showing conflicting information at any point in time to the user (I haven't looked into the codebase of inspect yet, but assuming controller client and the one used with kapp are different?).

Given:
Inspect - would show a superset health status of all the resources.
Healthy - would show only the unhealthy resources.

If we decide to support both of them, then probably we should make them exclusive?

Copy link
Contributor

@100mik 100mik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for putting this together!
I aggregated some of my thoughts into comments.

So far the two goals that stand out for me are:

  1. Immediate reporting of failures for certain resources
  2. Structured per-resource reporting in case of failure

I would be curious to know if I am missing something else we are looking for too 🙏🏼
Let's take the discussion forward in the community meeting 🚀

We intend to extend the existing App API by adding a new status condition to expose the system's health. To do so, the following needs to be implemented:

1. The controller reconciling the App CR needs to dynamically set up watches for the resources being deployed by the package.
2. Introduce a `Healthy` condition in App CR's `status` [field][app_cr_status].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believed the original suggestion was to introduce conditions per resource as well, is that not required for your use case?
To illustrate something like

- type: HealthCheck/someapi/someversion/somenamespace/resource
  status: False
  message: "Failed to meet condition: "some more information""

Which could live in a separate in a separate field such as status.resourceConditions.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced we need to report a condition for every resource. Imagine the results of an App being hundreds of resources created on cluster. The primary need is to signal to a user "this app is degraded". Including a subset of the unhealthy resources would be useful. I'd like to ensure we don't get anywhere close to the 1.5MB size limit for data in etcd, and whenever there's an unbounded data set (# of resources in this case), I start to get concerned.

From there, the user can go investigate further.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what Andy mentioned. A single condition, which consists of a consolidated list of unhealthy objects is sufficient. Something like this:

- lastTransitionTime: "2023-08-02T04:24:27Z"
  Message: "unhealthy resources: ["apiextensions.k8s.io/v1/CustomResourceDefinition/my.new.crd":"InvalidVersion", "deployments/test-ns/my-deploy":"MinimumReplicasUnavailable", "pods/test-ns/standalone-pod":"ImagePullBackoff"]


7. All other unspecified resources will be considered healthy.

If any of the watched resource is unhealthy, the `Message` field of the healthy condition will have the statuses of the unhealthy resources ordered lexicographically.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful to not cases where the ReconcileFailed condition is not present, but the Healthy condition is false.
If this is not a possibility, is the problem we are trying to solve: having more structured information about failed resources surfaced?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my previous comment re not wanting to list every unhealthy resource


If any of the watched resource is unhealthy, the `Message` field of the healthy condition will have the statuses of the unhealthy resources ordered lexicographically.

Since the resources deployed by the App reconciler have informers created for them, any change in the resource state will trigger a reconcile that in turn will re-evaluate the health of all resources.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is value to be able to treat some resources as more critical! (carvel-dev/kapp-controller#1279 comes to mind)

Today, in case of failure we have a mechanism which leads to immediate reconciliation on failure. However, in case of repeated failure, the reconciler exponentially backs off. Meaning that it would take longer to reconcile the app again if it has already failed >3 times (for example).
Worth noting the longest the app waits will always be equal to it's syncPeriod.

This prevents an app or a set of apps that is doomed to fail from hogging the reconciliation queue. Would we want something similar here as well?


#### Use Case: Monitoring the state of resources

Kapp currently has the `inspect` command which lists the resources deployed and their current statuses. The output of the command is also printed out as a part of App's status if enabled through `rawOptions` while creating the CR.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct, it would only report the health of resources when a reconciliation occurs.

One correction (though it is not very relevant to the context) would be that inspect is not a part of rawOptions, enabling it would look more like:

#.....other spec
deploy:
  - kapp:
      inspect: {}
#....

## Open Questions:

1. Can using informers to watch resources increase cache size, potentially impacting the performance?
2. Can the output in the `inspect` status field be combined with that of proposed `healthy` condition?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inspect section initially just aggregated the statuses of all resources after a finished reconciliation. It was essentially the output of the kapp inspect command.
We disabled it by default in favour of reducing the number of API calls we make.

Since it is a separate feature altogether, I think we can work towards having a separate section for the additional information we want to surface.


Kapp currently has the `inspect` command which lists the resources deployed and their current statuses. The output of the command is also printed out as a part of App's status if enabled through `rawOptions` while creating the CR.

Though this command provides information about the resources created by the respective App CR, it does so by sending API requests during the reconciliation. Instead, using informers provides additional advantages of having real-time updates, efficient resource utilization and reduced load on API server.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we work with informers it might be interesting to see what's an optimal number of resources to be watched we would recommend keeping resource utilisation in mind.
(Just a note, not something this proposal should address)

Comment on lines +50 to +56
5. An APIService resource will be healthy if/when:
- `Available` type condition in status is true.

6. A CustomResourceDefinition resource will be healthy if/when:
- `StoredVersions` has the expected API version for the CRD.

7. All other unspecified resources will be considered healthy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case where 5,6,7 would not lead to a deployment failure? if so do we really need to report health in these resources types?


## Open Questions:

1. Can using informers to watch resources increase cache size, potentially impacting the performance?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be the case since the same kapp-controller can be in charge of hundreds of apps, and if we do this for all the apps, we might end up getting informers for every resource in the cluster.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have this as an optional feature to start, similar to how inspect is right now?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this proposal is implemented, health would ultimately be determined by evaluating only the following kinds:

  • Pod
  • ReplicationController
  • ReplicaSet
  • Deployment
  • StatefulSet
  • APIService
  • CustomResourceDefinition

Which would mean the additional overhead is at most 6 more informers with label selectors limiting the cache contents to just what kapp-controller is managing. We don't have to have informers for the App resources that do not contribute to the health condition, right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it'd only need to be a subset of all APIs

Copy link
Author

@varshaprasad96 varshaprasad96 Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the discussion in the community meeting today -

The two use-cases for setting up informers to watch resources are:

  1. Health monitoring and aggregating status.
  2. Triggering a reconcile if any resource is unhealthy.
  • From OLM's end, the use case we want to fulfil is (1).
  • (2) is something that can cause performance issues in terms of continuously reconciling for any unhealthy resources even if we have informers set up for limited number of GVK's (especially on clusters where Kapp-ctrl is managing large no of App CRs).

If (2) is not to be addressed, to maintain modularity in terms of kapp controller's functionality, @joaopapereira suggested we explore having a separate controller to monitor the health status of individual resources which can be optionally enabled.


1. Can using informers to watch resources increase cache size, potentially impacting the performance?
2. Can the output in the `inspect` status field be combined with that of proposed `healthy` condition?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another open question is if kapp-controller starts reacting to all changes in the cluster, what will happen to performance in general? At this point in time kapp-controller can become the major consumer of CPU of the full cluster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

5 participants