Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn if podman stop timeout expires that sigkill was sent #11880

Merged
merged 1 commit into from
Oct 10, 2021

Conversation

rhatdan
Copy link
Member

@rhatdan rhatdan commented Oct 6, 2021

Note: the Warning message will not come to podman-remote.
It would be difficult to plumb, and not really worth the effort.

Fixes: #11854

Signed-off-by: Daniel J Walsh dwalsh@redhat.com

What this PR does / why we need it:

How to verify it

Which issue(s) this PR fixes:

Special notes for your reviewer:

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 6, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 6, 2021
@@ -82,7 +82,7 @@ Remote connections use local containers.conf for default.

#### **--log-level**=*level*

Log messages above specified level: debug, info, warn, error (default), fatal or panic (default: "error")
Log messages above specified level: debug, info, warn, error (default), fatal or panic (default: "warn")
Copy link
Member

@TomSweeneyRedHat TomSweeneyRedHat Oct 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is pre-existing, but should it be this?

Suggested change
Log messages above specified level: debug, info, warn, error (default), fatal or panic (default: "warn")
Log messages at the specified level: debug, info, warn, error (default), fatal, or panic (default: "warn")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default is no longer error, it's Warn

@TomSweeneyRedHat
Copy link
Member

couple small nits in a message, otherwise LGTM

@@ -441,7 +441,7 @@ func (r *ConmonOCIRuntime) StopContainer(ctr *Container, timeout uint, all bool)
}

if err := waitContainerStop(ctr, time.Duration(timeout)*time.Second); err != nil {
logrus.Infof("Timed out stopping container %s, resorting to SIGKILL: %v", ctr.ID(), err)
logrus.Warnf("Timed out stopping container %s, resorting to SIGKILL: %v", ctr.ID(), err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not be printing the error in the WARN-level logs. Make that a debug-level IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh? The hole discussion was to raise this to a warn level, not a Info.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need two logrus lines here: One to print "Timed out stopping container..." (warn level) and one to print "Error stopping container" (debug level).

@edsantiago
Copy link
Member

Sorry I missed your ping this afternoon; I was clerking at our local election. I don't see a bats test, so I don't know what went wrong with your tests, but I'll look at it tomorrow.

@rhatdan
Copy link
Member Author

rhatdan commented Oct 7, 2021

Here is the test I was trying to write.

@test "podman stop -t 1 Generate warning" {

    run_podman run --rm --name stopme -d $IMAGE sleep 100
    cid=$output
    $PODMAN stop -t 1 stopme >/dev/null
    is "$output" ".*Timed out stopping container ${cid}, resorting to SIGKILL: given PIDs did not die within timeout.*" "Status of container should print warning'"
}

@edsantiago
Copy link
Member

Ah - that explains it. Using $PODMAN calls podman directly, without the bats run helper. The easy solution is:

     run_podman stop -t 1 stopme
    is "$output" ".*Timed out ..."

@rhatdan
Copy link
Member Author

rhatdan commented Oct 7, 2021

Yikes, I must have accidently cut & Pasted that from where I stole the code. /dope slap


run_podman run --rm --name stopme -d $IMAGE sleep 100
cid=$output
run_podman stop -t 1 stopme >/dev/null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The >/dev/null is unnecessary (and misleading, because it actually gets ignored); but not worth repushing for. LGTM.

@test "podman stop -t 1 Generate warning" {

run_podman run --rm --name stopme -d $IMAGE sleep 100
cid=$output
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This too is unnecessary (you never use $cid) but again, not worth a re-push

@@ -166,4 +166,12 @@ load helpers
is "$output" "137" "Exit code of killed container"
}

@test "podman stop -t 1 Generate warning" {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like you need skip_if_remote; when you repush, can you take care of the two issues below? Thanks

@@ -441,7 +441,8 @@ func (r *ConmonOCIRuntime) StopContainer(ctr *Container, timeout uint, all bool)
}

if err := waitContainerStop(ctr, time.Duration(timeout)*time.Second); err != nil {
logrus.Infof("Timed out stopping container %s, resorting to SIGKILL: %v", ctr.ID(), err)
logrus.Debugf("Timed out stopping container %s with %s, resorting to SIGKILL: %v", ctr.ID(), unix.SignalName(syscall.Signal(stopSignal)), err)
logrus.Warnf("Failed to stopping container %s with %s, resorting to SIGKILL: %v", ctr.ID(), unix.SignalName(syscall.Signal(stopSignal)), err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Failed to stopping"??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I see this is in response to @mheon's comment:

We need two logrus lines here: One to print "Timed out stopping container..." (warn level) and one to print "Error stopping container" (debug level).

...but I don't understand the request. @mheon could you clarify, possibly with a code sample to reduce ambiguity?

session.WaitWithDefaultTimeout()
Expect(session).Should(Exit(0))
warning := session.ErrorToString()
Expect(warning).To(ContainSubstring("Timed out stopping container"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean StopSignal TERM failed to stop container here (the Timed out one is now just debug).

But I still don't understand @mheon's reason for requesting two messages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to print the error message every time, it's largely irrelevant to the situation and will reduce clarity (I expect it's going to be rather long - "failed to stop PID %d after %d seconds" or somesuch.

What we actually want is to print the "Timed out stopping container %s" message first as a logrus.Warnf, then print "Error stopping container %s using stop signal: %v" including the error message as a Debugf so the error isn't entirely lost.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR now prints.

 ./bin/podman stop -t 2 -l
WARN[0002] StopSignal SIGTERM failed to stop container hungry_bassi in 2 seconds, resorting to SIGKILL 
729d71716d7c024e1b274a6ebf7d41bf65faa43b4324d0017a72cf1375059df0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug message actually includes the error.

session.WaitWithDefaultTimeout()
Expect(session).Should(Exit(0))
warning := session.ErrorToString()
Expect(warning).To(ContainSubstring("StopSignal SIGTERM failed to stop container test4 in 1 seconds, resorting to SIGKILL"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/test4/test5/

Note: the Warning message will not come to podman-remote.
It would be difficult to plumb, and not really worth the effort.

Fixes: containers#11854

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
@mheon
Copy link
Member

mheon commented Oct 8, 2021

LGTM

@edsantiago
Copy link
Member

/lgtm
/hold

Quite a slog for something that seemed so simple! Thanks, @rhatdan!

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 8, 2021
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 8, 2021
@rhatdan
Copy link
Member Author

rhatdan commented Oct 10, 2021

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 10, 2021
@openshift-merge-robot openshift-merge-robot merged commit c47f714 into containers:main Oct 10, 2021
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Container sigkill after stop timeout not logged as error
5 participants