
testgrid.k8s.io periodics show two entries per run, one still running #19082

Closed
spiffxp opened this issue Sep 1, 2020 · 15 comments
Labels
area/prow, area/testgrid, kind/bug, lifecycle/rotten, sig/testing

Comments

spiffxp (Member) commented Sep 1, 2020

What happened:
e.g. https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&width=5
Screen Shot 2020-08-31 at 7 42 58 PM

e.g. https://testgrid.k8s.io/sig-release-master-blocking#gce-device-plugin-gpu-master&width=5
Screen Shot 2020-08-31 at 7 44 00 PM

e.g. https://testgrid.k8s.io/sig-release-1.19-blocking#verify-1.19&width=5
Screen Shot 2020-08-31 at 7 46 05 PM

e.g. https://testgrid.k8s.io/sig-release-master-blocking#bazel-test-master&width=5
Screen Shot 2020-08-31 at 7 47 33 PM

e.g. https://testgrid.k8s.io/sig-contribex-org#ci-peribolos
Screen Shot 2020-08-31 at 7 50 01 PM

What you expected to happen:

One column per run

Please provide links to example occurrences, if any:
See above

Anything else we need to know?:
I think that whatever this is, it was fixed by #19048.

I'd like to root-cause it, or at least better understand what happened and why.

@spiffxp spiffxp added the kind/bug Categorizes issue or PR as related to a bug. label Sep 1, 2020
spiffxp (Member, Author) commented Sep 1, 2020

/area testgrid
testgrid is almost certainly just displaying the symptom and not at fault here

/area prow
crier? plank? something else?

MushuEE (Contributor) commented Sep 1, 2020

Looks like all those dead jobs are stuck in the Pending state.

I am not sure if something aborted. Maybe a merge update? @cjwagner mentioned an issue to me that can trigger that.

MushuEE (Contributor) commented Sep 1, 2020

It also looks like a new run is kicked off shortly after each one that is stuck in Pending, so I assume something is killing the original pod without writing finished.json?

cjwagner (Member) commented Sep 2, 2020

Definitely looks like the problem has stopped at least. I did some debugging and turned up the following:

As Grant noted, the pending jobs are paired with runs that start a minute later. I can see from the prowjob.json that those runs are associated with the same ProwJob, which suggests that plank replaced the pod for some reason.
From the prow logs for one such ProwJob (in chronological order):

# Horologium (11:11:14): Triggered the ProwJob
{
  component: "horologium"   
  file: "prow/cmd/horologium/main.go:149"   
  func: "main.sync"   
  job: "ci-kubernetes-e2e-gci-gce"   
  level: "info"   
  msg: "Triggering new run of interval periodic."   
  name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  previous-found: true   
  should-trigger: true   
  type: "periodic"   
}
# Plank (11:11:23): Transitioning states. triggered -> pending (pod should be created now)
{
  component: "plank"   
  file: "prow/plank/controller.go:523"   
  from: "triggered"   
  func: "k8s.io/test-infra/prow/plank.(*Controller).syncTriggeredJob"   
  job: "ci-kubernetes-e2e-gci-gce"   
  level: "info"   
  msg: "Transitioning states."   
  name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  to: "pending"   
  type: "periodic"   
}
# Sinker (11:11:26): Deleted old completed pod (orphaned means we don't know of a PJ for the pod?!)
{
  cluster: "k8s-infra-prow-build"   
  component: "sinker"   
  file: "prow/cmd/sinker/main.go:473"   
  func: "main.(*controller).deletePod"   
  level: "info"   
  msg: "Deleted old completed pod."   
  pod: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  reason: "orphaned"   
}
# Crier (11:11:30): Failed processing item, no more retries. (Failed to add finalizer which Alvaro fixed in #19048)
{
  component: "crier"   
  error: "failed to add finalizer to pod: failed to patch pod: Pod "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"prow.x-k8s.io/gcsk8sreporter"}"   
  file: "prow/crier/controller.go:153"   
  func: "k8s.io/test-infra/prow/crier.(*Controller).retry"   
  jobName: "ci-kubernetes-e2e-gci-gce"   
  jobStatus: "pending"   
  key: "default/5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  level: "error"   
  msg: "Failed processing item, no more retries"   
  prowjob: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  reporter: "gcsk8sreporter"   
}
# Plank (11:12:23): Pod is missing, starting a new pod.
{
  component: "plank"   
  file: "prow/plank/controller.go:334"   
  func: "k8s.io/test-infra/prow/plank.(*Controller).syncPendingJob"   
  job: "ci-kubernetes-e2e-gci-gce"   
  level: "info"   
  msg: "Pod is missing, starting a new pod"   
  name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  type: "periodic"   
}

So it looks like sinker was deleting the pods shortly after the job started them because it thought they were orphaned. I'm not sure what would cause that behavior without looking deeper, but given that we saw an error related to finalizers, and given that the issue stopped after Alvaro's fix, that's where I would look for the root cause.
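
For reference, a heavily simplified sketch of an orphan check of this kind. This is not prow's actual sinker code (the real logic lives in prow/cmd/sinker/main.go), and the stale-list race shown in main is only one hypothesis for how a pod belonging to a live ProwJob could be misclassified:

```go
package main

import "fmt"

// Pod and ProwJob stand in for the real Kubernetes objects; in prow the pod
// name equals the ProwJob name, which is how sinker pairs them up.
type Pod struct{ Name string }
type ProwJob struct{ Name string }

// orphanedPods mimics the shape of the check: any pod in the build cluster
// whose name has no matching ProwJob is considered orphaned and becomes a
// deletion candidate.
func orphanedPods(pods []Pod, prowJobs []ProwJob) []Pod {
	known := map[string]bool{}
	for _, pj := range prowJobs {
		known[pj.Name] = true
	}
	var orphaned []Pod
	for _, p := range pods {
		if !known[p.Name] {
			orphaned = append(orphaned, p)
		}
	}
	return orphaned
}

func main() {
	// If the list of ProwJobs is stale (taken before plank created the pod
	// for a brand-new job), the new pod is misclassified as orphaned.
	stalePJs := []ProwJob{{Name: "old-run"}}
	pods := []Pod{{Name: "old-run"}, {Name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"}}
	fmt.Println(orphanedPods(pods, stalePJs)) // the brand-new pod shows up here
}
```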

The UX here is bad because we usually treat ProwJobs as a single run of a pod, but in reality Prow may replace a pod if it is missing or otherwise encounters weird behavior from k8s (e.g. eviction). We assign a new buildID when we do this in order to define a new upload location, which is needed to avoid mixing results from the partially run pod with the results from the replacement pod.
I'm not sure what the best way for us to improve this would be. One idea would be to annotate jobs with their previous buildIDs from partially run pods and, if they are present, have crier upload something to GCS that TestGrid could use as a signal to ignore the partial results. Alternatively, we could have crier delete partial results from GCS in that case, but that may not be a good idea?
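
To make the first idea concrete, one hypothetical shape for that GCS signal could be a small marker object uploaded next to the partial run's artifacts, naming the buildID that superseded it. Nothing like this exists in prow or TestGrid today; the superseded.json file name and the field names below are invented for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SupersededMarker is a hypothetical file (e.g. <job>/<old-build-id>/superseded.json)
// that crier could upload when plank replaces a pod, so TestGrid could drop the
// partial column instead of rendering it as "still running".
type SupersededMarker struct {
	SupersededBy string `json:"supersededBy"` // buildID of the replacement run
	Reason       string `json:"reason"`       // e.g. "pod missing", "evicted"
}

func main() {
	m := SupersededMarker{SupersededBy: "replacement-build-id", Reason: "pod missing"}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out))
}
```

The TestGrid updater could then skip any column whose GCS prefix contains such a marker.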

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2020
fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 31, 2020
Rajalakshmi-Girish (Contributor) commented Jan 21, 2021

Our Prow dashboard is displaying a few periodic jobs stuck in the Pending state forever!
Such Pending rows show up once in a while, and the next build comes up within about 40 minutes, well before the actual 3-hour interval.

  1. The successive build has the same ProwJob ID as the one that is stuck in the Pending state.
  2. The entry that is stuck in the Pending state has no pod mapped to it.
  3. The two successive entries have different build_ids and hence different log folders in GCS. Though the successive entry has the same job ID, it successfully creates a pod and completes the job, unlike the Pending one.
    Pending-state artifacts in GCS:
    https://console.cloud.google.com/storage/browser/ppc64le-kubernetes/logs/periodic-kubernetes-bazel-test-ppc64le/1351809092238184448?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&authuser=0&project=ibm-kubernetes&prefix=&forceOnObjectsSortingFiltering=false
    Completed job with the same ProwJob ID as above:
    https://console.cloud.google.com/storage/browser/ppc64le-kubernetes/logs/periodic-kubernetes-bazel-test-ppc64le/1351816554479620096?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&authuser=0&project=ibm-kubernetes&prefix=&forceOnObjectsSortingFiltering=false

Below are the logs for the related prow components:
crier.log
control.log

@MushuEE @cjwagner Can you please give some pointers to debug further?

@mkumatag ^^

MushuEE (Contributor) commented Jan 26, 2021

I have seen this behavior before. @cjwagner, I seem to remember you providing an explanation.

cjwagner (Member) commented

@Rajalakshmi-Girish That sounds like the pod was replaced like I described in #19082 (comment). I'd inspect the component logs and see if it seems like the same problem, then go from there. In particular, the logs I shared suggest that sinker deleted a pod due to incorrectly identifying it as orphaned.
Please also ensure your Prow instance is up to date to include any fixes we've already made.
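
For the GCS side, a minimal sketch of that check in Go, using the cloud.google.com/go/storage client: it reads prowjob.json from both build prefixes linked above and prints the ProwJob's metadata.name, assuming (as earlier in this thread) that prowjob.json is the serialized ProwJob object. The bucket, job path, and build IDs are copied from the previous comment, the prowJobName helper is made up for illustration, and read access to the bucket is assumed:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"log"

	"cloud.google.com/go/storage"
)

// prowJobName downloads <prefix>/prowjob.json from the bucket and returns
// metadata.name of the ProwJob stored there.
func prowJobName(ctx context.Context, client *storage.Client, bucket, prefix string) (string, error) {
	r, err := client.Bucket(bucket).Object(prefix + "/prowjob.json").NewReader(ctx)
	if err != nil {
		return "", err
	}
	defer r.Close()
	raw, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	var pj struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
	}
	if err := json.Unmarshal(raw, &pj); err != nil {
		return "", err
	}
	return pj.Metadata.Name, nil
}

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	bucket := "ppc64le-kubernetes"
	job := "logs/periodic-kubernetes-bazel-test-ppc64le"
	for _, buildID := range []string{"1351809092238184448", "1351816554479620096"} {
		name, err := prowJobName(ctx, client, bucket, job+"/"+buildID)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s -> ProwJob %s\n", buildID, name)
	}
}
```

If both prefixes print the same name, the two columns came from one ProwJob whose pod was replaced, rather than from two separately triggered runs.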

spiffxp (Member, Author) commented Feb 5, 2021

I'm not sure what the best way for us to improve this would be. One idea would be to annotate jobs with their previous buildIDs from partially run pods and, if they are present, have crier upload something to GCS that TestGrid could use as a signal to ignore the partial results. Alternatively we could have crier delete partial results from GCS in that case, but that may not be a good idea?

I like the idea of updating the prowjob CR with enough info to understand:

  • what pods / build-ids were previously used for this prowjob run
  • possibly? why they failed (evicted, deleted, timed out scheduling, etc)

I also wonder if we could have the testgrid updater remove columns if their corresponding results in GCS lack a finished.json after some timeout.
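
A rough sketch of that rule, with a made-up function name and threshold (this is not TestGrid's updater code):

```go
package main

import (
	"fmt"
	"time"
)

// shouldDropColumn sketches the proposed rule: hide a TestGrid column whose
// GCS results have a started.json but still no finished.json once the run is
// older than some cutoff. The 24h threshold below is arbitrary.
func shouldDropColumn(started time.Time, hasFinished bool, now time.Time, timeout time.Duration) bool {
	return !hasFinished && now.Sub(started) > timeout
}

func main() {
	started := time.Now().Add(-36 * time.Hour)
	// A run with no finished.json after 24h would be treated as abandoned and
	// dropped from the grid instead of rendering as "still running".
	fmt.Println(shouldDropColumn(started, false, time.Now(), 24*time.Hour)) // true
}
```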

spiffxp (Member, Author) commented Feb 5, 2021

/remove-lifecycle rotten
/lifecycle stale
to keep this open long enough to figure out whether it's worth breaking out the ideas in #19082 (comment)

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Feb 5, 2021
spiffxp (Member, Author) commented Feb 9, 2021

/sig testing

@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Feb 9, 2021
fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 11, 2021
fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
