testgrid.k8s.io periodics show two entries per run, one still running #19082
Comments
/area testgrid /area prow
Looks like all those dead jobs are stuck in the Pending state. I am not sure if something aborted them? Maybe a merge update? @cjwagner told me about an issue that triggers that.
It also looks like a new run is being kicked off shortly after each one that is stuck in Pending.
Definitely looks like the problem has stopped, at least. I did some debugging and turned up the following: as Grant noted, the pending jobs are paired with runs that start a minute later. I can see from the prowjob.json that those runs are associated with the same ProwJob, which suggests that plank replaced the pod for some reason.

# Horologium (11:11:14): Triggered the ProwJob
{
component: "horologium"
file: "prow/cmd/horologium/main.go:149"
func: "main.sync"
job: "ci-kubernetes-e2e-gci-gce"
level: "info"
msg: "Triggering new run of interval periodic."
name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"
previous-found: true
should-trigger: true
type: "periodic"
}
# Plank (11:11:23): Transitioning states. triggered -> pending (pod should be created now)
{
component: "plank"
file: "prow/plank/controller.go:523"
from: "triggered"
func: "k8s.io/test-infra/prow/plank.(*Controller).syncTriggeredJob"
job: "ci-kubernetes-e2e-gci-gce"
level: "info"
msg: "Transitioning states."
name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"
to: "pending"
type: "periodic"
}
# Sinker (11:11:26): Deleted old completed pod (orphaned means we don't know of a PJ for the pod?!)
{
cluster: "k8s-infra-prow-build"
component: "sinker"
file: "prow/cmd/sinker/main.go:473"
func: "main.(*controller).deletePod"
level: "info"
msg: "Deleted old completed pod."
pod: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"
reason: "orphaned"
}
# Crier (11:11:30): Failed processing item, no more retries. (Failed to add finalizer which Alvaro fixed in #19048)
{
component: "crier"
error: "failed to add finalizer to pod: failed to patch pod: Pod "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"prow.x-k8s.io/gcsk8sreporter"}"
file: "prow/crier/controller.go:153"
func: "k8s.io/test-infra/prow/crier.(*Controller).retry"
jobName: "ci-kubernetes-e2e-gci-gce"
jobStatus: "pending"
key: "default/5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"
level: "error"
msg: "Failed processing item, no more retries"
prowjob: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"
reporter: "gcsk8sreporter"
}
# Plank (11:12:23): Pod is missing, starting a new pod.
{
component: "plank"
file: "prow/plank/controller.go:334"
func: "k8s.io/test-infra/prow/plank.(*Controller).syncPendingJob"
job: "ci-kubernetes-e2e-gci-gce"
level: "info"
msg: "Pod is missing, starting a new pod"
name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"
type: "periodic"
}

So it looks like sinker was deleting the pods shortly after the jobs started them, due to thinking they were orphaned. I'm not sure what would cause that behavior without looking deeper, but given that we saw an error related to finalizers, and given that the issue stopped after Alvaro's fix, that's where I would look for the root cause.

The UX here is bad because we usually treat a ProwJob as a single run of a pod, but in reality Prow may replace a pod if it is missing or otherwise encounters weird behavior from k8s (e.g. eviction). We assign a new buildID when we do this in order to define a new upload location, which is needed to avoid mixing results from the partially run pod with the results from the replacement pod.
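To make that replacement behavior concrete, here is a minimal, self-contained Go sketch of the "Pod is missing, starting a new pod" branch seen in the last plank log entry. This is not the actual plank code: the types and helpers (prowJob, podStore, newBuildID) are illustrative stand-ins. The point is only the final branch: one ProwJob ends up with two build IDs, and since the build ID determines the GCS upload location, testgrid renders two columns for a single run.

package main

import (
	"fmt"
	"time"
)

// ProwJobState mirrors the states that appear in the logs above (triggered, pending, ...).
type ProwJobState string

const (
	TriggeredState ProwJobState = "triggered"
	PendingState   ProwJobState = "pending"
)

// prowJob is a trimmed-down stand-in for the real ProwJob custom resource.
type prowJob struct {
	Name    string
	State   ProwJobState
	BuildID string // determines the GCS upload path, and therefore the testgrid column
}

// podStore is a stand-in for the build cluster: it answers whether a job's pod still exists
// and can create a replacement.
type podStore interface {
	PodExists(name string) bool
	CreatePod(name string) error
}

// syncPendingJob sketches the "Pod is missing, starting a new pod" branch from the plank log:
// if the pod was deleted out from under a pending job (e.g. by sinker classifying it as
// orphaned), a replacement pod is created under a *new* build ID so its results land in a
// fresh GCS prefix instead of mixing with the partial results of the deleted pod.
func syncPendingJob(pj *prowJob, pods podStore) error {
	if pj.State != PendingState {
		return nil
	}
	if pods.PodExists(pj.Name) {
		return nil // normal case: the pod is still running, nothing to do
	}
	pj.BuildID = newBuildID()
	fmt.Printf("Pod is missing, starting a new pod for %s (new build ID %s)\n", pj.Name, pj.BuildID)
	return pods.CreatePod(pj.Name)
}

// newBuildID is a placeholder; the real system gets build IDs from a central generator.
func newBuildID() string {
	return fmt.Sprintf("%d", time.Now().Unix())
}

// fakePods is an in-memory podStore used only to make the sketch runnable.
type fakePods map[string]bool

func (f fakePods) PodExists(name string) bool  { return f[name] }
func (f fakePods) CreatePod(name string) error { f[name] = true; return nil }

func main() {
	pj := &prowJob{Name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131", State: PendingState, BuildID: "100"}
	pods := fakePods{} // the original pod has already been deleted
	if err := syncPendingJob(pj, pods); err != nil {
		fmt.Println("sync error:", err)
	}
	// One ProwJob now maps to two build IDs (100 and the replacement), which is why
	// testgrid shows two columns for a single run.
}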
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Our Prow dashboard is displaying a few Periodic jobs in Pending state forever!
Below are the logs for the related prow components: @MushuEE @cjwagner, can you please give some pointers to debug this further? @mkumatag ^^
I have seen this behavior before. @cjwagner, I seem to remember you providing an explanation.
@Rajalakshmi-Girish That sounds like the pod was replaced like I described in #19082 (comment). I'd inspect the component logs and see if it seems like the same problem, then go from there. In particular, the logs I shared suggest that sinker deleted a pod due to incorrectly identifying it as orphaned.
I like the idea of updating the prowjob CR with enough info to understand:
I also wonder if we could have the testgrid updater remove columns whose corresponding results in GCS lack a finished.json after some timeout.
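As a rough illustration of that timeout idea (an assumption about how it could work, not an existing testgrid updater feature), the sketch below drops a column when its GCS prefix has an old enough started.json but no finished.json. The resultDir type, the shouldDropColumn helper, and the 24h timeout are all hypothetical.

package main

import (
	"fmt"
	"time"
)

// resultDir is a stand-in for one run's GCS prefix, i.e. one prospective testgrid column.
type resultDir struct {
	BuildID     string
	StartedAt   time.Time // taken from started.json
	HasFinished bool      // whether finished.json exists under the prefix
}

// shouldDropColumn sketches the suggestion above: if a run never produced a finished.json
// and enough time has passed since it started, treat the column as abandoned (e.g. the pod
// was replaced under a new build ID) and drop it instead of showing it as running forever.
func shouldDropColumn(r resultDir, now time.Time, timeout time.Duration) bool {
	if r.HasFinished {
		return false // completed runs always keep their column
	}
	return now.Sub(r.StartedAt) > timeout
}

func main() {
	now := time.Now()
	runs := []resultDir{
		{BuildID: "100", StartedAt: now.Add(-30 * time.Minute)},                  // still running: keep
		{BuildID: "101", StartedAt: now.Add(-48 * time.Hour)},                    // abandoned: drop
		{BuildID: "102", StartedAt: now.Add(-48 * time.Hour), HasFinished: true}, // finished: keep
	}
	for _, r := range runs {
		fmt.Printf("build %s: drop=%v\n", r.BuildID, shouldDropColumn(r, now, 24*time.Hour))
	}
}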
/remove-lifecycle rotten
/sig testing
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened:
e.g. https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&width=5
e.g. https://testgrid.k8s.io/sig-release-master-blocking#gce-device-plugin-gpu-master&width=5
e.g. https://testgrid.k8s.io/sig-release-1.19-blocking#verify-1.19&width=5
e.g. https://testgrid.k8s.io/sig-release-master-blocking#bazel-test-master&width=5
e.g. https://testgrid.k8s.io/sig-contribex-org#ci-peribolos
What you expected to happen:
One column per run
Please provide links to example occurrences, if any:
See above
Anything else we need to know?:
I think whatever this is was fixed by #19048.
I'd like to root cause it, or at least better understand what happened and why.