fix: metrics don't get emitted properly during retry. Fixes #8207 #10463 #10489
Signed-off-by: Jiacheng Xu <xjcmaxwellcjx@gmail.com>
@@ -57,6 +60,23 @@ func (s *MetricsSuite) TestMetricsEndpoint() {
})
}

func (s *MetricsSuite) TestRetryMetrics() {
For now, the e2e test is added to the `MetricsSuite` (within the `api` test group), since we don't really have a suite for testing metrics functionality.
// Runtime parameters (e.g., `status`, `resourceDuration`) in the output will be used to emit metrics.
if lastChildNode != nil {
retryParentNode.Outputs = lastChildNode.Outputs.DeepCopy()
woc.wf.Status.Nodes[node.ID] = *retryParentNode
Does this need locking with a mutex to prevent a race?
I think this probably doesn't require locking, because the assignment `woc.wf.Status.Nodes[node.ID] = *retryParentNode` only happens once `retryParentNode.Fulfilled()` is true, which means that either all retries have failed or one retry has succeeded. In both cases, we set `woc.wf.Status.Nodes[node.ID]` only once, using the `lastChildNode`.
Also, the code was actually moved from https://github.com/argoproj/argo-workflows/pull/10489/files#diff-f321d4af83fcf8311dc80c0d50c59ac4c240f761206e7bb652709870eb9feb43L1925-L1928 to its current place because the `Outputs` are used for emitting metrics. Was there already a race condition before this PR?
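To make the argument above concrete, here is a minimal, self-contained sketch of the "write once, only when fulfilled" pattern. It is not the controller's code: `Node`, `Outputs`, the phase strings, and `propagateOutputs` are hypothetical, simplified stand-ins, and the sketch assumes a single goroutine owns the node map, which is an assumption the real controller would have to satisfy for the no-lock reasoning to hold.

```go
package main

import "fmt"

// Outputs is a hypothetical, simplified stand-in for a node's outputs.
type Outputs struct {
	Parameters map[string]string
}

// DeepCopy returns a copy that shares no map with the receiver.
func (o *Outputs) DeepCopy() *Outputs {
	if o == nil {
		return nil
	}
	out := &Outputs{Parameters: make(map[string]string, len(o.Parameters))}
	for k, v := range o.Parameters {
		out.Parameters[k] = v
	}
	return out
}

// Node is a hypothetical, simplified stand-in for a workflow node status.
type Node struct {
	ID      string
	Phase   string
	Outputs *Outputs
}

// Fulfilled reports whether the node has reached a terminal phase
// (assumed here to be "Succeeded" or "Failed").
func (n *Node) Fulfilled() bool {
	return n.Phase == "Succeeded" || n.Phase == "Failed"
}

// propagateOutputs copies the last child's outputs onto the retry parent and
// stores the parent in the node map, but only once the parent is fulfilled.
// It returns true if the map was written.
func propagateOutputs(nodes map[string]Node, parent, lastChild *Node) bool {
	if !parent.Fulfilled() || lastChild == nil {
		return false
	}
	parent.Outputs = lastChild.Outputs.DeepCopy()
	nodes[parent.ID] = *parent
	return true
}

func main() {
	nodes := map[string]Node{}
	parent := &Node{ID: "retry-parent", Phase: "Running"}
	child := &Node{
		ID:      "retry-parent(1)",
		Phase:   "Succeeded",
		Outputs: &Outputs{Parameters: map[string]string{"status": "Succeeded"}},
	}

	// While the parent is still running, nothing is written.
	fmt.Println(propagateOutputs(nodes, parent, child))

	// Once the parent is fulfilled, the outputs are copied and stored.
	parent.Phase = "Succeeded"
	fmt.Println(propagateOutputs(nodes, parent, child))
	fmt.Println(nodes["retry-parent"].Outputs.Parameters["status"])
}
```

Because the copy is deep, later changes to the child's outputs do not leak into the parent entry. Whether a mutex is needed still depends on whether anything else can touch the map concurrently, which this sketch does not model.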
@alexec Want to take another look since you left some comments?
…8207 argoproj#10463 (argoproj#10489) Signed-off-by: Jiacheng Xu <xjcmaxwellcjx@gmail.com> Co-authored-by: Saravanan Balasubramanian <33908564+sarabala1979@users.noreply.github.com>
Signed-off-by: Jiacheng Xu xjcmaxwellcjx@gmail.com
Fixes #8207
Fixes #10463
This PR fixes the issue that metrics don't get emitted correctly, and lets the controller also emit metrics during every retry.
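As a rough illustration of what "emit metrics during every retry" means, the sketch below counts one observation per attempt rather than a single observation for the final result. The `attempt` type, the `counter` map, and `emitPerAttempt` are hypothetical names used only for this example; they are not the controller's metrics API.

```go
package main

import "fmt"

// attempt is a hypothetical record of one retry attempt.
type attempt struct {
	Status string
}

// counter is a hypothetical in-memory stand-in for a metrics backend,
// keyed by the attempt status.
type counter map[string]int

// emitPerAttempt increments the counter once for every attempt,
// not only for the final outcome of the retried step.
func emitPerAttempt(c counter, attempts []attempt) {
	for _, a := range attempts {
		c[a.Status]++
	}
}

func main() {
	c := counter{}
	emitPerAttempt(c, []attempt{{"Failed"}, {"Failed"}, {"Succeeded"}})
	fmt.Println(c["Failed"], c["Succeeded"]) // prints "2 1"
}
```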
Please do not open a pull request until you have checked ALL of these:
`make pre-commit -B` to fix codegen and lint problems.
If changes were requested, and you've made them, dismiss the review to get it reviewed again.