-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Record the error when the creation pod fails #3112
Conversation
Signed-off-by: utam0k <k0ma@utam0k.jp>
Hey @utam0k, Thank you for your contributions! Can you please help me understand why would you like to update statuses per failure? To me, it seems like the EphemeralRunner resource should act in a similar way as the job. It should keep the state until something is either successful or not. It shouldn't be inspected after each failure but try its best to actually spin up the pod and only when it assumes it won't be able to start it (with the limit of 5 retries), it should set its state to signal the failure. We also don't do back-off periods so it would be harder to inspect it anyway. But I would love to hear your opinion |
Sure!
I agree with you. However, errors in pod creation are different from this. In this case, the object(Pod) is not created in the first place because creation requests are denied by kube-api. Therefore, EphemeralRunner statuses will remain empty unless an error is made in Failure at the time of pod creation. If you actually run the test with the following patch, you will see that the status is not updated, and the test will be timed out. $ git diff controllers/actions.github.com/ephemeralrunner_controller.go
diff --git a/controllers/actions.github.com/ephemeralrunner_controller.go b/controllers/actions.github.com/ephemeralrunner_controller.go
index 1c010af..920316e 100644
--- a/controllers/actions.github.com/ephemeralrunner_controller.go
+++ b/controllers/actions.github.com/ephemeralrunner_controller.go
@@ -201,15 +201,16 @@ func (r *EphemeralRunnerReconciler) Reconcile(ctx context.Context, req ctrl.Requ
default:
// Pod was not found. Create if the pod has never been created
log.Info("Creating new EphemeralRunner pod.")
- result, err := r.createPod(ctx, ephemeralRunner, secret, log)
- if err != nil {
- errMessage := fmt.Sprintf("Pod has failed to create: %v", err)
- if err := r.markAsFailed(ctx, ephemeralRunner, errMessage, "CreationPodFailure", log); err != nil {
- log.Error(err, "Failed to set ephemeral runner to phase Failed")
- return ctrl.Result{}, err
- }
- }
- return result, nil
+ return r.createPod(ctx, ephemeralRunner, secret, log)
+ // result, err := r.createPod(ctx, ephemeralRunner, secret, log)
+ // if err != nil {
+ // errMessage := fmt.Sprintf("Pod has failed to create: %v", err)
+ // if err := r.markAsFailed(ctx, ephemeralRunner, errMessage, "CreationPodFailure", log); err != nil {
+ // log.Error(err, "Failed to set ephemeral runner to phase Failed")
+ // return ctrl.Result{}, err
+ // }
+ // }
+ // return result, nil
}
} I agree about the 5 retries. That means more |
Signed-off-by: utam0k <k0ma@utam0k.jp>
Oh I misunderstood the original intention of the PR, sorry. I have another suggestion. I would suggest the following:
// EphemeralRunner pod creation failure reasons
const (
ReasonTooManyPodFailures = "TooManyPodFailures"
ReasonInvalidPodFailure = "InvalidPod"
)
result, err := r.createPod(ctx, ephemeralRunner, secret, log)
switch {
case err == nil:
return result, nil
case kerrors.IsInvalid(err):
// mark ephemeral runner as failed, apply ReasonInvalidPodFailure and add an appropriate message
default:
// log the error and return the result, err. So for example, if there is a temporary failure while communicating with the API, we shouldn't count that as a creation failure
} My reasoning is that we don't want to count on network failures for example. Also, this is a great change because we don't need to retry if the pod is invalid. It will never be successful so let's avoid unnecessary reconciliations. And thank you for including the test! Does this suggestion make sense? 😄 |
This reverts commit a74f533.
Signed-off-by: utam0k <k0ma@utam0k.jp>
@nikola-jokic Thanks for your feedback. They make sense 👍 I have already addressed them. Please take a look at this PR again🙏 |
Signed-off-by: utam0k <k0ma@utam0k.jp>
Signed-off-by: utam0k <k0ma@utam0k.jp>
…roller into record-creation-failure
Signed-off-by: utam0k <k0ma@utam0k.jp>
…roller into record-creation-failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you!
@nikola-jokic Thanks for your quick review 🙏 |
When a pod creation fails, EphemeralContainer will tell you why.
Follow up: #3074