
[Flaking Test][sig-node] Container Runtime blackbox test on terminated container should report termination message from file when pod succeeds and TerminationMessagePolicy FallbackToLogsOnError is set [NodeConformance] [Conformance] #129760


Open
stmcginnis opened this issue Jan 22, 2025 · 12 comments
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@stmcginnis
Contributor

stmcginnis commented Jan 22, 2025

Which jobs are flaking?

master-blocking

  • gce-ubuntu-master-containerd

Which tests are flaking?

E2eNode Suite.[It] [sig-node] Container Runtime blackbox test on terminated container should report termination message from file when pod succeeds and TerminationMessagePolicy FallbackToLogsOnError is set [NodeConformance] [Conformance]

Prow
Triage

Since when has it been flaking?

1/10/2025, 6:57:34 PM
1/21/2025, 5:54:07 AM

Testgrid link

https://testgrid.k8s.io/sig-release-master-blocking#ci-crio-cgroupv1-node-e2e-conformance

Reason for failure (if possible)

{ failed [FAILED] Timed out after 300.001s.
Expected
    <v1.PodPhase>: Failed
to equal
    <v1.PodPhase>: Succeeded
In [It] at: k8s.io/kubernetes/test/e2e/common/node/runtime.go:157 @ 01/21/25 12:18:04.935
}
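
The message above has the shape of a poll-until-equal assertion: the pod phase is read repeatedly and compared with `Succeeded` until a 300 s deadline. The sketch below is a hypothetical Python illustration of that pattern, not the code at `runtime.go:157`; `eventually_equals`, its parameters, and the short timeouts are assumptions made for the example. It shows why a pod stuck in a terminal `Failed` phase is reported as a timeout rather than as an immediate failure.

```python
import time
from typing import Callable, Tuple


def eventually_equals(
    get_phase: Callable[[], str],
    want: str,
    timeout_s: float,
    interval_s: float = 0.01,
) -> Tuple[bool, str]:
    """Poll get_phase() until it returns `want` or the deadline passes.

    Hypothetical helper for illustration only; the real e2e test is
    written in Go and is not reproduced here.
    """
    deadline = time.monotonic() + timeout_s
    last = get_phase()
    # Keep polling while the observed phase differs and time remains.
    while last != want and time.monotonic() < deadline:
        time.sleep(interval_s)
        last = get_phase()
    # Report both the verdict and the last observed phase, which is
    # the "Expected ... to equal ..." pair in the failure output.
    return last == want, last


if __name__ == "__main__":
    # A pod that already reached the terminal "Failed" phase never
    # changes, so the poll can only end by timing out.
    ok, last = eventually_equals(lambda: "Failed", "Succeeded", timeout_s=0.05)
    print(ok, last)
```

With this pattern the wait always runs to the full deadline once the pod has failed, so the timeout in the report reflects the pod's final phase and not a slow runtime.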

Anything else we need to know?

N/A

Relevant SIG(s)

/sig node
cc @kubernetes/release-team-release-signal

@stmcginnis stmcginnis added the kind/flake Categorizes issue or PR as related to a flaky test. label Jan 22, 2025
@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jan 22, 2025
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 22, 2025
@SergeyKanzhelev
Member

CI meeting notes:

  • Very rare flake.
  • We cannot find a containerd job for this test to compare against.

/cc @haircommander
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 22, 2025
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Issues - To do in SIG Node CI/Test Board Jan 22, 2025
@haircommander
Contributor

@bitoku would you PTAL?

@bitoku
Contributor

bitoku commented Jan 23, 2025

Certainly.

/assign

@bitoku
Contributor

bitoku commented Jan 23, 2025

It seems the container failed with exit code 255.
I'm not sure why, but it might have failed to parse the exit file.
The change in cri-o/cri-o#8937 might give more information when it happens again.

@wendy-ha18
Member

Hi @bitoku, thanks for the PR cri-o/cri-o#8937. I checked Triage again today and saw that, on the same day your PR was merged, this test failed again in Prow: 22/01/2025, 15:24:08
{ failed [FAILED] checking for ready nodes: Not ready nodes: ", ip-172-31-1-25.ec2.internal" In [DeferCleanup (Each)] at: k8s.io/kubernetes/test/e2e/framework/node/init/init.go:35 @ 01/22/25 05:31:28.337 }.

Triage link. I'm not sure if it helps your investigation, but I wanted to let you know!

@Rajalakshmi-Girish
Contributor

@bitoku Can you please tell us whether this issue will block the v1.33.0-alpha.1 cut, which is scheduled for Tuesday, 4th February UTC?

@bitoku
Contributor

bitoku commented Jan 30, 2025

@Rajalakshmi-Girish I'm not sure yet, but I saw some suspicious logs in cri-o, so I think it's a bug in cri-o.
Also, this is a rare flaky test, so I don't think it's a blocker for the new release.

@SergeyKanzhelev SergeyKanzhelev moved this from Issues - To do to Issues - In progress in SIG Node CI/Test Board Feb 12, 2025
@bitoku
Contributor

bitoku commented Feb 12, 2025

There is likely a bug on the cri-o side. We updated the CI to use a newer version of cri-o, which should output more logs when the issue happens.
Once we get another occurrence, I can continue the investigation.

@stmcginnis stmcginnis moved this from FLAKY to PASSING in CI Signal (SIG Release / Release Team) Feb 20, 2025
@stmcginnis
Contributor Author

This appears to be passing now. Will wait on confirmation to close.

@wendy-ha18 wendy-ha18 moved this from PASSING to FLAKY in CI Signal (SIG Release / Release Team) Feb 25, 2025
@SergeyKanzhelev
Member

This seems to be flaking now: https://storage.googleapis.com/k8s-triage/index.html?test=Container%20Runtime%20blackbox%20test%20on%20terminated%20container%20should%20report%20termination%20message%20from%20file%20when%20pod%20succeeds%20and%20TerminationMessagePolicy%20FallbackToLogsOnError%20is%20set

These are very rare flakes, so I am setting a lower priority. The test is also flaking on containerd, so it does not seem to be a cri-o issue.

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Mar 19, 2025
@esotsal
Contributor

esotsal commented Mar 19, 2025

/cc

@wendy-ha18
Member

Hi folks, thanks a lot for your support and attention on this issue.
The release cycle for v1.34 will start soon, and since this is still open, I will carry it over to the latest milestone.

/milestone v1.34

@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone May 12, 2025
Projects
Status: Issues - In progress

8 participants