Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record the error when the creation pod fails #3112

Merged
merged 10 commits into from
Dec 7, 2023

Conversation

utam0k
Copy link
Contributor

@utam0k utam0k commented Nov 29, 2023

When a pod creation fails, EphemeralContainer will tell you why.

Follow up: #3074

Signed-off-by: utam0k <k0ma@utam0k.jp>
@nikola-jokic
Copy link
Collaborator

Hey @utam0k,

Thank you for your contributions! Can you please help me understand why would you like to update statuses per failure? To me, it seems like the EphemeralRunner resource should act in a similar way as the job. It should keep the state until something is either successful or not. It shouldn't be inspected after each failure but try its best to actually spin up the pod and only when it assumes it won't be able to start it (with the limit of 5 retries), it should set its state to signal the failure.

We also don't do back-off periods so it would be harder to inspect it anyway. But I would love to hear your opinion ☺️

@utam0k
Copy link
Contributor Author

utam0k commented Nov 29, 2023

Thank you for your contributions! Can you please help me understand why would you like to update statuses per failure?

Sure!

To me, it seems like the EphemeralRunner resource should act in a similar way as the job. It should keep the state until something is either successful or not. It shouldn't be inspected after each failure but try its best to actually spin up the pod and only when it assumes it won't be able to start it (with the limit of 5 retries), it should set its state to signal the failure.

I agree with you. However, errors in pod creation are different from this. In this case, the object(Pod) is not created in the first place because creation requests are denied by kube-api. Therefore, EphemeralRunner statuses will remain empty unless an error is made in Failure at the time of pod creation. If you actually run the test with the following patch, you will see that the status is not updated, and the test will be timed out.

$ git diff controllers/actions.github.com/ephemeralrunner_controller.go
diff --git a/controllers/actions.github.com/ephemeralrunner_controller.go b/controllers/actions.github.com/ephemeralrunner_controller.go
index 1c010af..920316e 100644
--- a/controllers/actions.github.com/ephemeralrunner_controller.go
+++ b/controllers/actions.github.com/ephemeralrunner_controller.go
@@ -201,15 +201,16 @@ func (r *EphemeralRunnerReconciler) Reconcile(ctx context.Context, req ctrl.Requ
                default:
                        // Pod was not found. Create if the pod has never been created
                        log.Info("Creating new EphemeralRunner pod.")
-                       result, err := r.createPod(ctx, ephemeralRunner, secret, log)
-                       if err != nil {
-                               errMessage := fmt.Sprintf("Pod has failed to create: %v", err)
-                               if err := r.markAsFailed(ctx, ephemeralRunner, errMessage, "CreationPodFailure", log); err != nil {
-                                       log.Error(err, "Failed to set ephemeral runner to phase Failed")
-                                       return ctrl.Result{}, err
-                               }
-                       }
-                       return result, nil
+                       return r.createPod(ctx, ephemeralRunner, secret, log)
+                       // result, err := r.createPod(ctx, ephemeralRunner, secret, log)
+                       // if err != nil {
+                       //      errMessage := fmt.Sprintf("Pod has failed to create: %v", err)
+                       //      if err := r.markAsFailed(ctx, ephemeralRunner, errMessage, "CreationPodFailure", log); err != nil {
+                       //              log.Error(err, "Failed to set ephemeral runner to phase Failed")
+                       //              return ctrl.Result{}, err
+                       //      }
+                       // }
+                       // return result, nil
                }
        }

I agree about the 5 retries. That means more ephemeralRunner.Status.Failures) if the pod creation fails, right?

Signed-off-by: utam0k <k0ma@utam0k.jp>
@nikola-jokic
Copy link
Collaborator

Oh I misunderstood the original intention of the PR, sorry.

I have another suggestion. I would suggest the following:

  1. In constants.go, create two constants
// EphemeralRunner pod creation failure reasons
const (
  ReasonTooManyPodFailures = "TooManyPodFailures"
  ReasonInvalidPodFailure = "InvalidPod"
)
  1. Revert back to your original change.
  2. In your original change, on the caller side, inspect the error:
result, err := r.createPod(ctx, ephemeralRunner, secret, log)
switch {
  case err == nil:
    return result, nil
  case kerrors.IsInvalid(err):
    // mark ephemeral runner as failed, apply ReasonInvalidPodFailure and add an appropriate message
  default:
    // log the error and return the result, err. So for example, if there is a temporary failure while communicating with the API, we shouldn't count that as a creation failure
}

My reasoning is that we don't want to count on network failures for example. Also, this is a great change because we don't need to retry if the pod is invalid. It will never be successful so let's avoid unnecessary reconciliations.

And thank you for including the test! Does this suggestion make sense? 😄

@nikola-jokic nikola-jokic added the gha-runner-scale-set Related to the gha-runner-scale-set mode label Nov 30, 2023
@utam0k
Copy link
Contributor Author

utam0k commented Nov 30, 2023

@nikola-jokic Thanks for your feedback. They make sense 👍 I have already addressed them. Please take a look at this PR again🙏

Signed-off-by: utam0k <k0ma@utam0k.jp>
Signed-off-by: utam0k <k0ma@utam0k.jp>
@utam0k utam0k requested a review from nikola-jokic December 3, 2023 23:17
Signed-off-by: utam0k <k0ma@utam0k.jp>
@utam0k utam0k requested a review from nikola-jokic December 4, 2023 22:40
Copy link
Collaborator

@nikola-jokic nikola-jokic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you!

@nikola-jokic nikola-jokic merged commit b08d533 into actions:master Dec 7, 2023
14 checks passed
@utam0k
Copy link
Contributor Author

utam0k commented Dec 8, 2023

@nikola-jokic Thanks for your quick review 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gha-runner-scale-set Related to the gha-runner-scale-set mode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants