Add a net health recovery service to qemu machines #21262

Merged
merged 1 commit into from
Jan 17, 2024

Conversation

n1hility
Member

@n1hility n1hility commented Jan 16, 2024

There is a network stability issue in qemu + virtio, affecting some users after long periods of usage, which can lead to suspended queue delivery. Until the issue is resolved, add a temporary recovery service which restarts networking when host communication becomes inoperable. Only qemu-based machines on Mac activate this service, as the issue is understood to be qemu-specific.

Works around issue in #20639

How to verify:

export CONTAINERS_MACHINE_PROVIDER=qemu
podman machine rm
podman machine init
podman machine start
podman machine ssh
# wait at least 2 minutes until the service becomes active, then take down the network to simulate a failure
ifconfig enp0s1 down
# After a minute networking should resume and the next command prompt should appear

Does this PR introduce a user-facing change?

Add a net recovery service to detect and recover from an inoperable host networking issue experienced by some Mac qemu users when machines run for long periods of time

@openshift-ci openshift-ci bot added release-note approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 16, 2024
@n1hility
Member Author

FYI @benoitf

// cc @baude @ashley-cui

@n1hility n1hility changed the title Add a net health recovery service to Qemu machines Add a net health recovery service to qemu machines Jan 16, 2024
sleep 120 # allow time for network setup on initial boot
while true; do
sleep 30
curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
Member

Where does this IP Address come from?

Member Author

This is the gvproxy gateway address used by the guest. In addition to routing, gvproxy runs a built-in HTTP server on this address for managing port forwards:

https://github.com/containers/gvisor-tap-vsock/blob/8912b782e96b60da1455bf711eb620d893affa4a/cmd/gvproxy/main.go#L51

sleep 30
curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health
if [ "$?" != "0" ]; then
echo "bouncing nic due to loss of connectivity with host"
Contributor

Where is this line reported? I'd like to check whether it occurred in my podman machine.

Member Author

The unit file sends stdout and stderr to the system journal, so the message can be found using journalctl.

@ashley-cui
Member

ashley-cui commented Jan 16, 2024

LGTM

Probably needs a head nod from @baude


sleep 120 # allow time for network setup on initial boot
while true; do
sleep 30
Contributor

Have you considered using a systemd timer which runs every 30 seconds?

Member Author

Right, the only reason it's done this way is to avoid the log noise that timers generate, which @rhatdan was warning about. With a timer this frequent, it would be a lot of noise.

Contributor

Ah logging, makes sense!
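
For readers weighing the two approaches in this thread, here is a minimal sketch of the loop-based alternative to a timer: one long-running process that sleeps between checks and prints only when a check fails, so a healthy network produces no journal entries. The function name `poll` and its arguments are hypothetical and do not come from the PR's script.

```sh
# Hypothetical helper (not part of the PR).
# $1 = seconds to sleep before each check
# $2 = number of checks to run (0 = run forever)
# remaining arguments = the check command to execute each cycle
poll() {
    interval=$1
    limit=$2
    shift 2
    runs=0
    while [ "$limit" -eq 0 ] || [ "$runs" -lt "$limit" ]; do
        sleep "$interval"
        # Stay silent on success; emit one line only when the check fails.
        "$@" || echo "check failed"
        runs=$((runs + 1))
    done
}
```

With the values visible in the diff, a call would look roughly like `poll 30 0 curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health`. The iteration limit exists only so the sketch can be exercised without running forever.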

# is lost. This is a temporary workaround for a known rare qemu/virtio issue
# that affects some systems

sleep 120 # allow time for network setup on initial boot
Contributor

Is this sleep still needed with the systemd unit which has recoveryUnit.Add("Unit", "After", "sshd.socket sshd.service") ?

Member Author

sshd only has an after on network.target, which is quasi reliable:

"network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined [snip]"

Since we only seem to see the problem with long running vms, my thinking was it was better to just wait a bit longer in the script without disrupting / delaying boot (e.g. using something like network-online.target).

Contributor

yeah, I agree sleeping for 2 minutes (or even 5 minutes or 1 hour or ... :) is no big deal given when the bug happens. I just wondered.

func GetNetRecoveryUnitFile() *parser.UnitFile {
recoveryUnit := parser.NewUnitFile()
recoveryUnit.Add("Unit", "Description", "Verifies health of network and recovers if necessary")
recoveryUnit.Add("Unit", "After", "sshd.socket sshd.service")
Contributor

Fwiw, I'm not sure sshd.service is required here?

Member Author

sshd.socket and sshd.service are mutually exclusive alternates (our FCOS images are currently using sshd.service). Our other units are declared as after both (my assumption is that this keeps them compatible if the base images switch to the inet socket approach in the future), so I'm just mirroring that pattern here.

Contributor

Ah ok, I assumed the FCOS images used sshd.socket already.

@benoitf
Contributor

benoitf commented Jan 16, 2024

I've been using the patch since this morning

journalctl | grep bouncing
Jan 16 15:08:10 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:09:21 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:15:36 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host

I already hit the bug but my podman machine is still reachable

@cfergeau
Contributor

/lgtm

Contributor

openshift-ci bot commented Jan 16, 2024

@cfergeau: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@benoitf
Contributor

benoitf commented Jan 16, 2024

I think it's restarting the network interface if my computer goes to sleep mode and not only in the case of the bug

@cfergeau
Contributor

I already hit the bug but my podman machine is still reachable

You hit a condition when curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health failed (apparently 3 times in less than 10 minutes). I don't think we know if #20639 happens if and only if this condition is true, or if this condition can sometimes be true without #20639 happening.

Actually it would be (somewhat) interesting to try this change without ifconfig enp0s1 down; ifconfig enp0s1 up to see if false positives show up in the log (ie there's a log, but network is still up even if the workaround was removed from the script)
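
The experiment suggested here (keep the log line but leave the interface alone) could be sketched as a dry-run switch around the failure handling. This is only an illustration: `DRY_RUN`, `NIC`, and `on_check_failure` are hypothetical names, and the actual script in the PR has no such switch.

```sh
# Illustrative variables, not part of the PR's script.
NIC=${NIC:-enp0s1}
DRY_RUN=${DRY_RUN:-0}

# Called when the connectivity check fails.
on_check_failure() {
    if [ "$DRY_RUN" = "1" ]; then
        # Record the event only, so false positives show up in the journal
        # while the network is left as-is.
        echo "connectivity check failed (dry run, nic left untouched)"
        return 0
    fi
    echo "bouncing nic due to loss of connectivity with host"
    ifconfig "$NIC" down
    ifconfig "$NIC" up
}
```

If a dry-run message appears in the journal while the machine stays reachable, that event would have been a false positive.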

@mheon
Member

mheon commented Jan 16, 2024

@baude @rhatdan PTAL

@n1hility
Member Author

I already hit the bug but my podman machine is still reachable

You hit a condition when curl -s -o /dev/null --connect-timeout 10 http://192.168.127.1/health failed (apparently 3 times in less than 10 minutes). I don't think we know if #20639 happens if and only if this condition is true, or if this condition can sometimes be true without #20639 happening.

Actually it would be (somewhat) interesting to try this change without ifconfig enp0s1 down; ifconfig enp0s1 up to see if false positives show up in the log (ie there's a log, but network is still up even if the workaround was removed from the script)

Good idea. If we see lots of spurious events like this, I could modify this to use a retry to try to reduce the bounce events.
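
A retry of the kind mentioned here might look like the following sketch. It is a possible follow-up rather than what the PR implements, and `check_with_retries` together with its arguments is a hypothetical name chosen for illustration.

```sh
# Hypothetical helper: check_with_retries ATTEMPTS DELAY CMD...
# Runs CMD up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 as soon as one attempt succeeds, 1 if every attempt fails.
check_with_retries() {
    attempts=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$try" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        try=$((try + 1))
    done
    return 1
}
```

The interface would then be bounced only after several consecutive failures, which trades slower recovery for fewer spurious bounce events.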

@benoitf
Contributor

benoitf commented Jan 16, 2024

Do we need a special version of gvproxy?

I'm using the one in the installer of podman v4.8.3,
and /health always returns a 404 page

curl --connect-timeout 10 http://192.168.127.1/health
404 page not found

@n1hility
Member Author

Do we need a special version of gvproxy?

I'm using the one in the installer of podman v4.8.3, and /health always returns a 404 page

No special version is needed. The way curl is used here, the HTTP result code doesn't matter; it's just verifying that the request/reply happened. The URL suffix is just a placeholder so the requests can be identified in any logging on the gvproxy side.
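
To make the point about result codes concrete: without the `--fail` flag, curl exits with status 0 for any completed HTTP exchange, including a 404, and exits non-zero only for transport-level problems (for example 7 when it cannot connect, or 28 on timeout). A minimal sketch of the check as a function follows; `host_reachable` and `GATEWAY_URL` are hypothetical names, while the curl options and address are the ones visible in the diff.

```sh
# GATEWAY_URL mirrors the address used in the PR; the variable itself is illustrative.
GATEWAY_URL=${GATEWAY_URL:-http://192.168.127.1/health}

# Succeeds whenever a request/reply round trip with the gateway completes.
host_reachable() {
    # No --fail here on purpose: an HTTP 404 reply still yields exit status 0,
    # because only the transport needs to work.
    curl -s -o /dev/null --connect-timeout 10 "$GATEWAY_URL"
}
```

Adding `--fail` would change the behavior: curl would then exit with status 22 on the 404 reply, and the check would report a failure even though the network is fine.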

@benoitf
Contributor

benoitf commented Jan 16, 2024

I'll try tomorrow without the ifconfig down / ifconfig up

but it seems I had a lot of reports today

Jan 16 15:08:10 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:09:21 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:15:36 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:27:59 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:30:11 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 15:44:28 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 17:15:59 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 17:19:41 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 17:45:11 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 17:55:37 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 17:58:52 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
Jan 16 18:03:07 localhost.localdomain net-health-recovery.sh[1963]: bouncing nic due to loss of connectivity with host
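
To judge whether reports like these cluster in time, the journal output can be grouped by hour. The sketch below assumes the default journalctl line layout shown above (month, day, time as the first three fields); `bounces_per_hour` is a hypothetical helper name.

```sh
# Reads journal text on stdin and prints one line per hour that contains
# bounce events, in order of first appearance: "<month> <day> <HH>:00 <count>".
bounces_per_hour() {
    awk '/bouncing nic due to loss of connectivity with host/ {
             hour = $1 " " $2 " " substr($3, 1, 2) ":00"
             if (!(hour in count)) order[n++] = hour
             count[hour]++
         }
         END {
             for (i = 0; i < n; i++) printf "%s %d\n", order[i], count[order[i]]
         }'
}
```

Inside the guest this could be run as `journalctl --no-pager | bounces_per_hour`. Applied to the twelve entries listed above, it would group them as 6 events in the 15:00 hour, 5 in the 17:00 hour, and 1 in the 18:00 hour.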

@n1hility
Member Author

I'll try tomorrow without the ifconfig down / ifconfig up

but it seems I had a lot of reports today

@benoitf Interesting. I am curious what you see. I tried a bunch of scenarios on my system and was not able to get this to occur from sleeps. Although I also am not seeing the underlying qemu issue.

I just pushed up a replacement that ups the timeout. If your research shows false positives, can you try with the update?

@n1hility
Member Author

/hold

(waiting until we wrap up the testing / verification from @benoitf )

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2024
@benoitf
Contributor

benoitf commented Jan 16, 2024

I've updated my podman CLI with your updated code, I will run the changes overnight

@baude
Member

baude commented Jan 16, 2024

I will merge this ... DO NOT MERGE. @benoitf let us know Wednesday-ish and I can get it in.

There is a network stability issue in qemu + virtio, affecting
some users after long periods of usage, which can lead to
suspended queue delivery. Until the issue is resolved, add a
temporary recovery service which restarts networking when host
communication becomes inoperable.

[NO NEW TESTS NEEDED]

Signed-off-by: Jason T. Greene <jason.greene@redhat.com>
@n1hility
Member Author

n1hility commented Jan 16, 2024

Updated the PR to only apply to darwin qemu builds. (In discussing with @baude, even though the underlying qemu/virtio issue may not be Mac-specific, we decided it's probably better to keep this narrowed to Mac until we see reports elsewhere.)

@benoitf
Contributor

benoitf commented Jan 17, 2024

With the new patch, I didn't get any traces in the journal and my machine is still working, so I think I didn't get false positives, but I wasn't yet able to reach the 'blocking state' (so ifconfig down/up wasn't triggered either)

@gbraad
Member

gbraad commented Jan 17, 2024

The addition of /health was a request to make the use of this more obvious. If this happens a lot, there is not much we can do except wait for a fix from the Qemu+virtio teams to resolve the actual issue. This is what @benoitf referred to as the 'bug'.

In short:

I wasn't yet able to reach the 'blocking state'

Means it is 'resolved' for you, right?

@gbraad
Member

gbraad commented Jan 17, 2024

/lgtm
/approve

Contributor

openshift-ci bot commented Jan 17, 2024

@gbraad: changing LGTM is restricted to collaborators

In response to this:

/lgtm
/approve


Contributor

openshift-ci bot commented Jan 17, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gbraad, n1hility

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@benoitf
Contributor

benoitf commented Jan 17, 2024

In short:
I wasn't yet able to reach the 'blocking state'
Means it is 'resolved' for you, right?

Well, the problem is that since I haven't yet reproduced the blocking state, the script didn't do the ifconfig down/up (there is no bouncing log in journalctl).

So I wouldn't say it's 'resolved', just that I didn't see the 'potential false positives' I saw yesterday, which were occurring in bursts.

@benoitf
Contributor

benoitf commented Jan 17, 2024

bouncing trace occurred

Jan 17 13:30:36 localhost.localdomain net-health-recovery.sh[1976]: bouncing nic due to loss of connectivity with host
Jan 17 13:45:51 localhost.localdomain net-health-recovery.sh[1976]: bouncing nic due to loss of connectivity with host

I would say now it works better than yesterday's patch

I tried to do a lot of networking/heavy load in the VM
image

/lgtm

@baude OK to merge on my side

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 17, 2024
@mheon
Member

mheon commented Jan 17, 2024

/hold cancel
/lgtm

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 17, 2024
@mheon
Member

mheon commented Jan 17, 2024

/cherry-pick v4.9

@openshift-cherrypick-robot
Collaborator

@mheon: once the present PR merges, I will cherry-pick it on top of v4.9 in a new PR and assign it to you.

In response to this:

/cherry-pick v4.9


@openshift-merge-bot openshift-merge-bot bot merged commit e293ca8 into containers:main Jan 17, 2024
91 of 92 checks passed
@openshift-cherrypick-robot
Collaborator

@mheon: #21262 failed to apply on top of branch "v4.9":

Applying: Add a net health recovery service to Qemu machines
Using index info to reconstruct a base tree...
A	pkg/machine/ignition/ignition.go
M	pkg/machine/qemu/machine.go
M	pkg/machine/qemu/options_linux.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/machine/qemu/options_linux.go
Auto-merging pkg/machine/qemu/machine.go
CONFLICT (content): Merge conflict in pkg/machine/qemu/machine.go
Auto-merging pkg/machine/ignition.go
CONFLICT (content): Merge conflict in pkg/machine/ignition.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Add a net health recovery service to Qemu machines
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick v4.9


@benoitf
Contributor

benoitf commented Jan 17, 2024

At some point network was stuck, I ran a ping command and we see that after a while it's coming back as the patch is bouncing the network interface 👍

image

@n1hility
Member Author

@benoitf excellent thank you so much for the thorough testing on this one and last week!

@benoitf
Contributor

benoitf commented Jan 17, 2024

@mheon it looks like the automatic cherry-pick didn't apply cleanly to the 4.9 branch

Will it be in time for 4.9.0, or will it be part of 4.9.1?

@mheon
Member

mheon commented Jan 17, 2024

We're having vendoring issues with 4.9 right now that have delayed the release - so it ought to be part of 4.9.0. ETA on that is hopefully this afternoon, but really depends on how difficult those vendoring issues prove to be.

@n1hility
Member Author

I'll quickly backport this

@slemeur

slemeur commented Jan 17, 2024

Thanks for the fix!

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Apr 17, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 17, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. machine podman-desktop release-note

10 participants