
testgrid.k8s.io periodics show two entries per run, one still running #19082

Closed
spiffxp opened this issue Sep 1, 2020 · 15 comments
Labels
area/prow, area/testgrid, kind/bug, lifecycle/rotten, sig/testing

Comments

spiffxp (Member) commented Sep 1, 2020

What happened:
e.g. https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&width=5
Screen Shot 2020-08-31 at 7 42 58 PM

e.g. https://testgrid.k8s.io/sig-release-master-blocking#gce-device-plugin-gpu-master&width=5
Screen Shot 2020-08-31 at 7 44 00 PM

e.g. https://testgrid.k8s.io/sig-release-1.19-blocking#verify-1.19&width=5
Screen Shot 2020-08-31 at 7 46 05 PM

e.g. https://testgrid.k8s.io/sig-release-master-blocking#bazel-test-master&width=5
Screen Shot 2020-08-31 at 7 47 33 PM

e.g. https://testgrid.k8s.io/sig-contribex-org#ci-peribolos
Screen Shot 2020-08-31 at 7 50 01 PM

What you expected to happen:

One column per run

Please provide links to example occurrences, if any:
See above

Anything else we need to know?:
I think that whatever this is, it was fixed by #19048.

I'd like to root-cause it, or at least better understand what happened and why.

@spiffxp spiffxp added the kind/bug Categorizes issue or PR as related to a bug. label Sep 1, 2020
spiffxp (Member, Author) commented Sep 1, 2020

/area testgrid
testgrid is almost certainly just displaying the symptom and not at fault here

/area prow
crier? plank? something else?

MushuEE (Contributor) commented Sep 1, 2020

Looks like all those dead jobs are stuck in the Pending state.

I am not sure if something aborted. Maybe a merge update? @cjwagner mentioned an issue to me that can trigger that.

MushuEE (Contributor) commented Sep 1, 2020

It also looks like a new run is kicked off shortly after each one that is stuck in Pending, so I assume something is killing the original pod without writing finished.json?

cjwagner (Member) commented Sep 2, 2020

Definitely looks like the problem has stopped at least. I did some debugging and turned up the following:

As Grant noted, the pending jobs are paired with runs that start a minute later. I can see from the prowjob.json that those runs are associated with the same ProwJob, which suggests that plank replaced the pod for some reason.
From the prow logs for one such ProwJob (in chronological order):

# Horologium (11:11:14): Triggered the ProwJob
{
  component: "horologium"   
  file: "prow/cmd/horologium/main.go:149"   
  func: "main.sync"   
  job: "ci-kubernetes-e2e-gci-gce"   
  level: "info"   
  msg: "Triggering new run of interval periodic."   
  name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  previous-found: true   
  should-trigger: true   
  type: "periodic"   
}
# Plank (11:11:23): Transitioning states. triggered -> pending (pod should be created now)
{
  component: "plank"   
  file: "prow/plank/controller.go:523"   
  from: "triggered"   
  func: "k8s.io/test-infra/prow/plank.(*Controller).syncTriggeredJob"   
  job: "ci-kubernetes-e2e-gci-gce"   
  level: "info"   
  msg: "Transitioning states."   
  name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  to: "pending"   
  type: "periodic"   
}
# Sinker (11:11:26): Deleted old completed pod (orphaned means we don't know of a PJ for the pod?!)
{
  cluster: "k8s-infra-prow-build"   
  component: "sinker"   
  file: "prow/cmd/sinker/main.go:473"   
  func: "main.(*controller).deletePod"   
  level: "info"   
  msg: "Deleted old completed pod."   
  pod: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  reason: "orphaned"   
}
# Crier (11:11:30): Failed processing item, no more retries. (Failed to add finalizer which Alvaro fixed in #19048)
{
  component: "crier"   
  error: "failed to add finalizer to pod: failed to patch pod: Pod "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"prow.x-k8s.io/gcsk8sreporter"}"   
  file: "prow/crier/controller.go:153"   
  func: "k8s.io/test-infra/prow/crier.(*Controller).retry"   
  jobName: "ci-kubernetes-e2e-gci-gce"   
  jobStatus: "pending"   
  key: "default/5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  level: "error"   
  msg: "Failed processing item, no more retries"   
  prowjob: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  reporter: "gcsk8sreporter"   
}
# Plank (11:12:23): Pod is missing, starting a new pod.
{
  component: "plank"   
  file: "prow/plank/controller.go:334"   
  func: "k8s.io/test-infra/prow/plank.(*Controller).syncPendingJob"   
  job: "ci-kubernetes-e2e-gci-gce"   
  level: "info"   
  msg: "Pod is missing, starting a new pod"   
  name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"   
  type: "periodic"   
}

So it looks like sinker was deleting the pods shortly after the job started them because it thought they were orphaned. I'm not sure what would cause that behavior without looking deeper, but given that we saw an error related to finalizers, and given that the issue stopped after Alvaro's fix, that's where I would look for the root cause.
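
For reference, a heavily simplified sketch of an orphan check of this kind. This is not prow's actual sinker code (the real logic lives in prow/cmd/sinker/main.go), and the stale-list race shown in main is only one hypothesis for how a pod belonging to a live ProwJob could be misclassified:

```go
package main

import "fmt"

// Pod and ProwJob stand in for the real Kubernetes objects; in prow the pod
// name equals the ProwJob name, which is how sinker pairs them up.
type Pod struct{ Name string }
type ProwJob struct{ Name string }

// orphanedPods mimics the shape of the check: any pod in the build cluster
// whose name has no matching ProwJob is considered orphaned and becomes a
// deletion candidate.
func orphanedPods(pods []Pod, prowJobs []ProwJob) []Pod {
	known := map[string]bool{}
	for _, pj := range prowJobs {
		known[pj.Name] = true
	}
	var orphaned []Pod
	for _, p := range pods {
		if !known[p.Name] {
			orphaned = append(orphaned, p)
		}
	}
	return orphaned
}

func main() {
	// If the list of ProwJobs is stale (taken before plank created the pod
	// for a brand-new job), the new pod is misclassified as orphaned.
	stalePJs := []ProwJob{{Name: "old-run"}}
	pods := []Pod{{Name: "old-run"}, {Name: "5b4b2ae7-ebb5-11ea-b653-e644ef4dd131"}}
	fmt.Println(orphanedPods(pods, stalePJs)) // the brand-new pod shows up here
}
```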

The UX here is bad because we usually treat ProwJobs as a single run of a pod, but in reality Prow may replace a pod if it is missing or otherwise encounters weird behavior from k8s (e.g. eviction). We assign a new buildID when we do this in order to define a new upload location, which is needed to avoid mixing results from the partially run pod with the results from the replacement pod.
I'm not sure what the best way for us to improve this would be. One idea would be to annotate jobs with their previous buildIDs from partially run pods and, if they are present, have crier upload something to GCS that TestGrid could use as a signal to ignore the partial results. Alternatively, we could have crier delete partial results from GCS in that case, but that may not be a good idea?
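
To make the first idea concrete, one hypothetical shape for that GCS signal could be a small marker object uploaded next to the partial run's artifacts, naming the buildID that superseded it. Nothing like this exists in prow or TestGrid today; the superseded.json file name and the field names below are invented for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SupersededMarker is a hypothetical file (e.g. <job>/<old-build-id>/superseded.json)
// that crier could upload when plank replaces a pod, so TestGrid could drop the
// partial column instead of rendering it as "still running".
type SupersededMarker struct {
	SupersededBy string `json:"supersededBy"` // buildID of the replacement run
	Reason       string `json:"reason"`       // e.g. "pod missing", "evicted"
}

func main() {
	m := SupersededMarker{SupersededBy: "replacement-build-id", Reason: "pod missing"}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out))
}
```

The TestGrid updater could then skip any column whose GCS prefix contains such a marker.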

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 1, 2020
fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 31, 2020
Rajalakshmi-Girish (Contributor) commented Jan 21, 2021

Our Prow dashboard is displaying a few periodic jobs stuck in the Pending state forever!
Such Pending rows show up once in a while, and the next build comes up within about 40 minutes, well before the actual 3-hour interval.

  1. The successive build has the same ProwJob ID as the one that is stuck in the Pending state.
  2. The entry that is stuck in the Pending state has no pod mapped to it.
  3. The two successive entries have different build_ids and hence different log folders in GCS. Though the successive entry has the same job ID, it successfully creates a pod and completes the job, unlike the Pending one.
    Pending-state artifacts in GCS:
    https://console.cloud.google.com/storage/browser/ppc64le-kubernetes/logs/periodic-kubernetes-bazel-test-ppc64le/1351809092238184448?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&authuser=0&project=ibm-kubernetes&prefix=&forceOnObjectsSortingFiltering=false
    Completed job with the same ProwJob ID as above:
    https://console.cloud.google.com/storage/browser/ppc64le-kubernetes/logs/periodic-kubernetes-bazel-test-ppc64le/1351816554479620096?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&authuser=0&project=ibm-kubernetes&prefix=&forceOnObjectsSortingFiltering=false

Below are the logs for the related prow components:
crier.log
control.log

@MushuEE @cjwagner Can you please give some pointers to debug further?

@mkumatag ^^

MushuEE (Contributor) commented Jan 26, 2021

I have seen this behavior before. @cjwagner, I seem to remember you providing an explanation.

cjwagner (Member) commented

@Rajalakshmi-Girish That sounds like the pod was replaced like I described in #19082 (comment). I'd inspect the component logs and see if it seems like the same problem, then go from there. In particular, the logs I shared suggest that sinker deleted a pod due to incorrectly identifying it as orphaned.
Please also ensure your Prow instance is up to date to include any fixes we've already made.
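
For the GCS side, a minimal sketch of that check in Go, using the cloud.google.com/go/storage client: it reads prowjob.json from both build prefixes linked above and prints the ProwJob's metadata.name, assuming (as earlier in this thread) that prowjob.json is the serialized ProwJob object. The bucket, job path, and build IDs are copied from the previous comment, the prowJobName helper is made up for illustration, and read access to the bucket is assumed:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"log"

	"cloud.google.com/go/storage"
)

// prowJobName downloads <prefix>/prowjob.json from the bucket and returns
// metadata.name of the ProwJob stored there.
func prowJobName(ctx context.Context, client *storage.Client, bucket, prefix string) (string, error) {
	r, err := client.Bucket(bucket).Object(prefix + "/prowjob.json").NewReader(ctx)
	if err != nil {
		return "", err
	}
	defer r.Close()
	raw, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	var pj struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
	}
	if err := json.Unmarshal(raw, &pj); err != nil {
		return "", err
	}
	return pj.Metadata.Name, nil
}

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	bucket := "ppc64le-kubernetes"
	job := "logs/periodic-kubernetes-bazel-test-ppc64le"
	for _, buildID := range []string{"1351809092238184448", "1351816554479620096"} {
		name, err := prowJobName(ctx, client, bucket, job+"/"+buildID)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s -> ProwJob %s\n", buildID, name)
	}
}
```

If both prefixes print the same name, the two columns came from one ProwJob whose pod was replaced, rather than from two separately triggered runs.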

spiffxp (Member, Author) commented Feb 5, 2021

I'm not sure what the best way for us to improve this would be. One idea would be to annotate jobs with their previous buildIDs from partially run pods and, if they are present, have crier upload something to GCS that TestGrid could use as a signal to ignore the partial results. Alternatively we could have crier delete partial results from GCS in that case, but that may not be a good idea?

I like the idea of updating the prowjob CR with enough info to understand:

  • what pods / build-ids were previously used for this prowjob run
  • possibly? why they failed (evicted, deleted, timed out scheduling, etc)

I also wonder if we could have the testgrid updater remove columns if their corresponding results in GCS lack a finished.json after some timeout.
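
A rough sketch of that rule, with a made-up function name and threshold (this is not TestGrid's updater code):

```go
package main

import (
	"fmt"
	"time"
)

// shouldDropColumn sketches the proposed rule: hide a TestGrid column whose
// GCS results have a started.json but still no finished.json once the run is
// older than some cutoff. The 24h threshold below is arbitrary.
func shouldDropColumn(started time.Time, hasFinished bool, now time.Time, timeout time.Duration) bool {
	return !hasFinished && now.Sub(started) > timeout
}

func main() {
	started := time.Now().Add(-36 * time.Hour)
	// A run with no finished.json after 24h would be treated as abandoned and
	// dropped from the grid instead of rendering as "still running".
	fmt.Println(shouldDropColumn(started, false, time.Now(), 24*time.Hour)) // true
}
```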

spiffxp (Member, Author) commented Feb 5, 2021

/remove-lifecycle rotten
/lifecycle stale
to keep this open long enough to figure out whether it's worth breaking out the ideas in #19082 (comment)

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Feb 5, 2021
spiffxp (Member, Author) commented Feb 9, 2021

/sig testing

@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Feb 9, 2021
fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 11, 2021
fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

k8s-ci-robot (Contributor) commented

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
