-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Timestamp for Health Status (#16972) #18660
feat: Timestamp for Health Status (#16972) #18660
Conversation
11944ca
to
6d65cda
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I have a couple of initial things, PTAL :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mkieweg for the PR. I have left a suggestion. I think below section also needs an update.
argo-cd/controller/appcontroller.go
Lines 1777 to 1791 in 029b5ac
if orig.Status.Health.Status != newStatus.Health.Status { | |
message := fmt.Sprintf("Updated health status: %s -> %s", orig.Status.Health.Status, newStatus.Health.Status) | |
ctrl.logAppEvent(orig, argo.EventInfo{Reason: argo.EventReasonResourceUpdated, Type: v1.EventTypeNormal}, message, context.TODO()) | |
} | |
var newAnnotations map[string]string | |
if orig.GetAnnotations() != nil { | |
newAnnotations = make(map[string]string) | |
for k, v := range orig.GetAnnotations() { | |
newAnnotations[k] = v | |
} | |
delete(newAnnotations, appv1.AnnotationKeyRefresh) | |
} | |
patch, modified, err := diff.CreateTwoWayMergePatch( | |
&appv1.Application{ObjectMeta: metav1.ObjectMeta{Annotations: orig.GetAnnotations()}, Status: orig.Status}, | |
&appv1.Application{ObjectMeta: metav1.ObjectMeta{Annotations: newAnnotations}, Status: *newStatus}, appv1.Application{}) |
Looks like Update: not required.CreateTwoWayMergePatch
will always detect a drift here due to the new timestamp field and try to patch the status when there is no actual change in app health status. We might need to explicitly set it to old/orig value if there is not change in app health status.
Thanks for pointing that out, I'll give it a look |
Hi @mkieweg, I think you can ignore my previous suggestion regarding changes due to argo-cd/controller/appcontroller.go Line 1504 in 029b5ac
and will differ only when explicitly updated, which happens at 2 places. We need to ensure timestamp is updated at these places (looks like 2. is already handled in your change).
|
Hi @svghadi, I've added setting the timestamp in case we set the health status to unknown. The second one is indeed already handled by |
We actually used to have a timestamp, but the updating of the timestamp as part of every app reconciliation would cause a ton of K8s API server pressure. I need to find the PR that removed this, but this could actually cause a performance problem. |
I found the old issue. It is here: #1340 Basically, we used to update |
I took a closer look at this implementation, and noticed that we don't needlessly update timestamps (as we did with the problematic // If the status didn't change, we don't want to update the timestamp
if healthStatus.Status == statuses[i].Health.Status {
now = lastTransitionTime
} So my performance concerns may be alleviated. |
In the E2E test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran some tests and found some corner cases that need to be handled:
- Timestamp for individual resources is updated to Application's LastTransitionTime even when there is no change in the resource status.
- The timestamp for Application's health is not updated when the status changes from Non Healthy to Healthy state.
controller/health.go
Outdated
// If the status didn't change, we don't want to update the timestamp | ||
if healthStatus.Status == statuses[i].Health.Status { | ||
now = lastTransitionTime | ||
} | ||
|
||
resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section sets health of individual resources of the application. If there is no status change, I feel it should be set to lastTransitionTime of that particular resource instead of setting it to application's lastTransitionTime, as the values may not necessarily be the same.
// If the status didn't change, we don't want to update the timestamp | |
if healthStatus.Status == statuses[i].Health.Status { | |
now = lastTransitionTime | |
} | |
resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now} | |
// If the status didn't change, we don't want to update the timestamp | |
if healthStatus.Status == statuses[i].Health.Status { | |
now = statuses[i].Health.LastTransitionTime | |
} | |
resHealth := appv1.HealthStatus{Status: healthStatus.Status, Message: healthStatus.Message, LastTransitionTime: now} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me have a look and get back to you on this. I vaguely remember, that the current implementation only persists the timestamp on an application level. So while setting it on the resource level here might make sense, but I'm not sure if it has any benefit if the rest of the implementation remains unchanged. Which I would advocate for at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the code. The resource-level timestamp is indeed difficult to achieve due to the current implementation.
Since the resource timestamp will contain the app health timestamp, I feel it may not be that useful. How about we skip it at the resource level? Just have it at the application health level, i.e. .status.health
. If you change the timestamp to a pointer type, I think we can ignore it at resource level and it won't show up in .status.resources
.
controller/health.go
Outdated
if persistResourceHealth { | ||
appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime | ||
} else { | ||
appHealth.LastTransitionTime = now | ||
} | ||
} else if healthStatus.Status == health.HealthStatusHealthy { | ||
appHealth.LastTransitionTime = lastTransitionTime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to handle case where status transition happens from Non Healthy -> Healthy. Since we have access to old status values, would it make sense to set the time value outside the for
loop after the new status of application is calculated? Something like
diff --git a/controller/health.go b/controller/health.go
index b1c457776..e1a2c5d19 100644
--- a/controller/health.go
+++ b/controller/health.go
@@ -86,15 +86,11 @@ func setApplicationHealth(resources []managedResource, statuses []appv1.Resource
if health.IsWorse(appHealth.Status, healthStatus.Status) {
appHealth.Status = healthStatus.Status
- if persistResourceHealth {
- appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime
- } else {
- appHealth.LastTransitionTime = now
- }
- } else if healthStatus.Status == health.HealthStatusHealthy {
- appHealth.LastTransitionTime = lastTransitionTime
}
}
+ if persistResourceHealth {
+ if app.Status.Health.Status == appHealth.Status {
+ appHealth.LastTransitionTime = lastTransitionTime
+ } else {
+ appHealth.LastTransitionTime = metav1.Now()
+ }
+ }
if persistResourceHealth {
app.Status.ResourceHealthSource = appv1.ResourceHealthLocationInline
} else {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've managed to repro the issue. Unfortunately it persists with your suggestion. I'll look into it deeper tomorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added another test to appcontroller_test.go
where the issue you're describing doesn't happen. I think the code works as expected, but maybe you can have another look :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran a manual test by creating a pod with delayed start. The time is updated for Missing->Progressing
state change but it doesn't get updated when state moves from Progressing->Healthy
.
timestamp-no-update.mov
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting. Let me have a closer look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some more extensive testing. Currently they break because I couldn't find a way to make them work with the way we're using mocks. Any suggestions on how to make TestUpdateHealthStatusProgression
work are welcome
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've managed to repro the issue. Unfortunately it persists with your suggestion.
Hmm, I haven't been able to repro this after applying @svghadi's suggestion.
kubectl get app guestbook -o jsonpath='{.status.health}'
{"lastTransitionTime":"2024-06-27T11:13:06Z","status":"Progressing"}
kubectl get app guestbook -o jsonpath='{.status.health}'
{"lastTransitionTime":"2024-06-27T11:13:31Z","status":"Healthy"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkieweg I added some changes which should fix your tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hiddeco wrote in #16972 that this doesn't quite account for all edge cases.
While I am super supportive of this being added, I have an edge case that I think falls into this category but is not covered by just this addition.
Within Argo CD, the sync operations are completely separated from the application health. However, in certain scenarios, you do want to know if the current state of the health check is the outcome of a previous sync operation.
With the proposal, this would become possible by comparing the .finishedAt from the .status.operationState against the .lastUpdateTime. This, however, only works when you sync workload resources, which means that it will not help in scenarios where you only have non-workload resources, which will always end up Healthy without causing a transition.
...
I wonder if it would be an idea to introduce a "heartbeat" kind of timestamp that updates at most every 30 seconds (to not cause any API pressure) to indicate the "freshness" of the observation even without it transitioning.
By modifying @svghadi's suggestion slightly I think that would be achievable by doing this (haven't tested this so caveat emptor):
diff --git a/controller/health.go b/controller/health.go
index b1c457776..e1a2c5d19 100644
--- a/controller/health.go
+++ b/controller/health.go
@@ -86,15 +86,11 @@ func setApplicationHealth(resources []managedResource, statuses []appv1.Resource
if health.IsWorse(appHealth.Status, healthStatus.Status) {
appHealth.Status = healthStatus.Status
- if persistResourceHealth {
- appHealth.LastTransitionTime = statuses[i].Health.LastTransitionTime
- } else {
- appHealth.LastTransitionTime = now
- }
- } else if healthStatus.Status == health.HealthStatusHealthy {
- appHealth.LastTransitionTime = lastTransitionTime
}
}
if persistResourceHealth {
+ if app.Status.Health.Status == appHealth.Status && time.Since(lastTransitionTime.Time) < 30*time.Second {
+ appHealth.LastTransitionTime = lastTransitionTime
+ } else {
+ appHealth.LastTransitionTime = metav1.Now()
+ }
app.Status.ResourceHealthSource = appv1.ResourceHealthLocationInline
} else {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion however would be to have the heartbeat timestamp as a separate field (as e.g. "last observed" or "heartbeat"), as otherwise it could create a false impression that a transition did take place.
8d4030b
to
1be8574
Compare
Hi @mkieweg, did you get a chance to work on this again? If you're busy with other tasks, I'm happy to jump in and continue with it. |
Hi @svghadi, sorry I had to pause working on this due to shifting priorities. I was planning on picking this back up this week. I'm not sure how much time I can actually commit, so if you're happy to take over that would be okay with me |
@svghadi I can also help out if needed |
@mkieweg - Sure. Could I get collaborator access to your argo-cd fork, or should I cherry-pick the changes and continue in a new PR? @blakepettersson - Thank you. Should we address @hiddeco’s use case in this implementation, or would it be better to keep the scope limited and handle it in a separate PR? |
@svghadi I've added you as a contributor to my fork. Can do the same for @blakepettersson if needed |
Ideally I'd like to see his use case addressed in this PR as well, since we have a particular use case in mind for it (it's related to Kargo). |
Sounds good to me, thanks! |
Hi @blakepettersson , I was thinking about how to incorporate @hiddeco 's use case into this PR. It seems a bit challenging without introducing a new field, as @mkieweg's use case requires a timestamp that doesn't update periodically.
In contrast, @hiddeco 's use case needs a periodically updating status timestamp.
Here are my thoughts:
The health status would look something like this: status:
health:
lastTransitionTime: "2024-09-03T08:51:32Z"
observedAt: "2024-09-03T08:51:32Z" # configurable
status: Progressing Thoughts/suggestions? |
Hi @svghadi, I agree that we'll need two separate fields, one for the health status and one to indicate "freshness".
👍
I think it's fine for now to keep it at max 30 secs - if there is a need to configure this at a later stage we can address that in another PR. |
e62a31f
to
1036899
Compare
b4fd9fa
to
eb80af0
Compare
Hi @blakepettersson, @crenshaw-dev, I have removed the concerning |
hs.status = "Degraded" | ||
end | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to override statuses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you can use a modernized way of specifying healthchecks (not that it matters here), e.g. https://argo-cd.readthedocs.io/en/latest/operator-manual/health/#argocd-app
for _, tc := range testCases { | ||
t.Run(tc.name, func(t *testing.T) { | ||
deployment.SetLabels(map[string]string{"status": tc.initialStatus}) | ||
ctrl.processAppRefreshQueueItem() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have to trigger it manually instead of requesting app refresh?
eb80af0
to
f2e069b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR looks great overall! I ran some tests on my end, and everything seems to be working smoothly. I have left some minor nits.
controller/appcontroller.go
Outdated
@@ -1631,8 +1631,10 @@ func (ctrl *ApplicationController) processAppRefreshQueueItem() (processNext boo | |||
project, hasErrors := ctrl.refreshAppConditions(app) | |||
ts.AddCheckpoint("refresh_app_conditions_ms") | |||
if hasErrors { | |||
now := metav1.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We seem to be defining now := metav1.Now()
at multiple places in the function. I do not think there would be a huge time difference in both the timestamps. Can we define at just 1 place before this if call or after ts.AddCheckpoint("refresh_app_conditions_ms")
statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved it out of if
@@ -1617,6 +1617,8 @@ type HealthStatus struct { | |||
Status health.HealthStatusCode `json:"status,omitempty" protobuf:"bytes,1,opt,name=status"` | |||
// Message is a human-readable informational message describing the health status | |||
Message string `json:"message,omitempty" protobuf:"bytes,2,opt,name=message"` | |||
// LastTransitionTime is the time the HealthStatus was set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// LastTransitionTime is the time the HealthStatus was set | |
// LastTransitionTime is the time the HealthStatus was set or updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
The PR needs rebase and |
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Co-authored-by: Blake Pettersson <blake.pettersson@gmail.com> Signed-off-by: Manuel Kieweg <2939765+mkieweg@users.noreply.github.com>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Due to implementation limitations, setting LastTransitionTime at the resource level is challenging. Converting it to a pointer type allows it to be skipped at the resource level and prevents it from appearing in .status.resources of the Application CR. Additionally, it doesn’t provide much value or have a known use case right now. Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>
Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>
Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>
f2e069b
to
5b48033
Compare
Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>
Great job @svghadi!! Do you mind creating a PR for the |
Thanks. |
Hi @blakepettersson , the new PR (#21120) is up for |
This PR contains an implementation of the enhancement proposal #16972.
Closes #16972
Checklist: