feat: Report runtime health checks into Integration readiness condition #2719
Conversation
While that PR proposes to use the Camel readiness checks as a channel / interface to peek into the runtime status, I've identified the following open points / gaps:
@lburgazzoli @nicolaferraro @jamesnetherton @davsclaus would you be so kind as to share your expertise / opinion on this?
Yeah, it'd be nice to implement that. Should be a simple enough enhancement.
This is a long-standing issue, and I think we need to start adding health checks at the component level.
This needs to be fixed too. Mind opening an issue on Camel?
Yes, whether a route can start up vs. whether the consumer is connected is component specific. Some components have built-in recovery: they start the route and then automatically self-heal / fail over, such as JMS, SQL, etc. Other components do not, and fail to start the route in that case. As Luca says, the best solution is very likely to add component-level health checks, so we can add the logic needed per component (we can have a default readiness check based on whether the consumer is started).
There are already some tickets for component-level health checks, and also for some way of inspecting the error caused in the consumer, i.e. whether it's a connectivity error or a business error.
@astefanutti for the 1st bullet we need a JIRA ticket about this. Then maybe @jamesnetherton can take a look; it seems like we can copy over the message/error to MP. There are also some other details such as
And then some general information for counters, e.g. the number of checks and failures in a row, etc. See the base class source code.
@davsclaus, thanks a lot, these tickets capture exactly what would be needed 👍🏼. I've created CAMEL-17138 to track the propagation of the Camel health check result details into the MP Health responses. |
I'm thinking of a multi-tenant cluster with strict network policies that disallow cross-namespace connections, and a Camel K operator deployed globally. It seems from the code that this would result in a kind of "health unavailable" state (if the connection error is caught). Maybe tunneling the request, e.g. via the API server proxy, could make it work in any configuration. Wdyt @astefanutti?
Ah right, that's a very good point. I took the shortest path and mimicked what the kubelet does, but the operator Pod is indeed subject to network policies. Let me rework it based on the API server proxy. Ultimately, it would be possible to have the reason reported into the Pod readiness condition directly, similar to the termination message, but in the interim, relying on the API server proxy should do it.
Okay, I have a prototype of a Camel health check in camel-telegram that reports it as DOWN or UP. You can set the threshold in the standard way today, either globally with a * (pattern) via camel.health.config[*].failure-threshold = 10, or by route id via camel.health.config[myRoute].failure-threshold = 10. There is no threshold by default, so we may consider something special for this, or let Camel K auto-assign a default value or something.
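The threshold configuration quoted above can be sketched as application properties (the camel.health.config keys are as quoted in the comment; the route id myRoute is illustrative):

```properties
# Global failure threshold, applied to all health checks via the "*" pattern
camel.health.config[*].failure-threshold = 10

# Per-route failure threshold, keyed by route id ("myRoute" is illustrative)
camel.health.config[myRoute].failure-threshold = 10
```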
Wow that was fast! There is also @Croway that has a PoC for propagating the error details in camel-microprofile-health.
Yes, I think we could have a health trait that would provide users the ability to configure these, and auto-assign sensible defaults there. I think encapsulating the health configuration into a dedicated trait would also help disentangling the container trait. Also it could be useful to have a
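A health trait as proposed might be configured on an Integration along these lines. This is a sketch only: the trait name `health` and the `failure-threshold` / `success-threshold` properties are hypothetical here, not an API this PR defines.

```yaml
apiVersion: camel.apache.org/v1
kind: Integration
metadata:
  name: example
spec:
  traits:
    health:                      # hypothetical trait encapsulating health configuration
      configuration:
        enabled: true
        failure-threshold: 10    # illustrative default auto-assigned by the trait
        success-threshold: 1     # illustrative, see the success-threshold discussion below
```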
Good idea about a trait for health checks. This can then make configuring this easier for end users. The success threshold is a nice touch, as today a route/consumer is reported UP again after a single successful attempt. We can add a similar threshold to the one we have for failures.
I've updated the logic to call the health probe via the API server proxy. |
Another ticket to allow Camel to be configured to auto-stop unhealthy routes.
I think it's ready. We'll be able to iterate as soon as we upgrade to Camel 3.13+, to leverage the new features developed by @davsclaus, add more test cases, and fix bugs 😇. |
This PR probes the readiness checks exposed by the Camel runtime (Camel Quarkus / MicroProfile Health / SmallRye Health), to reconcile the Integration phase and readiness condition. It aims at surfacing the response from the Camel readiness checks, and exposing useful information, like error messages, so that the Integration status can serve as a single interface for higher level controllers.
As details from the readiness probe responses are not accessible from the Pod(s) status, the readiness probes are called directly by the operator (via the API server proxy), each time the Integration Pod(s) readiness condition changes.
This follows up #2682, so that readiness / error status reconciliation now covers the entire lifecycle of an Integration.
TODO:
Release Note