qubes-gui-agent fails to start when /rw is full #8060

Closed
meithecatte opened this issue Feb 23, 2023 · 12 comments
Labels
affects-4.1, C: other, P: default, ux

Comments

@meithecatte

Qubes OS release

R4.1.1

Brief summary

Rebooting an AppVM with a filled-up /home renders it unresponsive to all normal interaction.

Steps to reproduce

  1. Create an AppVM for testing. Boot it up and open terminal.
  2. Run dd if=/dev/zero of=chonk.bin.
  3. After that fails with a "No space left on device" error, restart the AppVM.
  4. Attempt to open the terminal again.
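
For anyone who wants to script step 2, here is a minimal sketch of what the dd invocation does: write zeros to a file until the filesystem reports ENOSPC. The function name and the `max_bytes` safety cap are assumptions added for illustration (dd has no such cap here); leave the cap as None only in a disposable test qube.

```python
import errno


def fill_with_zeros(path, chunk_size=1 << 20, max_bytes=None):
    """Write zero bytes to `path` until the disk is full or `max_bytes` is hit.

    Returns (bytes_written, hit_enospc). `max_bytes` is a hypothetical
    safety cap for experimenting; None means "really fill the volume".
    """
    written = 0
    chunk = bytes(chunk_size)  # a buffer of zeros, like /dev/zero
    # buffering=0 gives a raw file, so write() reports what actually landed.
    with open(path, "wb", buffering=0) as f:
        while max_bytes is None or written < max_bytes:
            n = chunk_size
            if max_bytes is not None:
                n = min(chunk_size, max_bytes - written)
            try:
                written += f.write(chunk[:n])
            except OSError as exc:
                if exc.errno == errno.ENOSPC:
                    return written, True  # "No space left on device"
                raise
    return written, False
```

With `max_bytes=None` the call only returns once the volume is full, which is the state the remaining steps rely on.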

Expected behavior

The VM remains functional enough to remove the offending files by the usual means (through gnome-terminal or nautilus).

Actual behavior

The VM does not respond to any requests to open an application. Nothing indicates the failure; the user only realizes something has gone wrong once their patience runs out. The usual notification about disk space running out is also missing.

Opening a shell through "Open console in qube" to inspect the logs leads one to find various mentions of disk space running out, including the following for the qubes-gui-agent.service:

Feb 23 14:11:46 smol systemd[1]: Starting Qubes GUI Agent...
Feb 23 14:11:46 smol systemd[1]: Started Qubes GUI Agent.
Feb 23 14:11:46 smol qubes-gui[626]: Waiting on /var/run/xf86-qubes-socket socket...
Feb 23 14:11:49 smol qubes-gui[626]: Ok, somebody connected.
Feb 23 14:11:49 smol qubes-gui[626]: XOpenDisplay: Success
Feb 23 14:11:49 smol systemd[1]: qubes-gui-agent.service: Main process exited, code=exited, status=1/FAILURE
Feb 23 14:11:49 smol systemd[1]: qubes-gui-agent.service: Failed with result 'exit-code'.
@meithecatte added the "P: default" and "T: bug" labels Feb 23, 2023
@andrewdavidwong added the "C: other" and "needs diagnosis" labels Feb 23, 2023
@andrewdavidwong added this to the "Release 4.1 updates" milestone Feb 23, 2023
@meithecatte
Author

It seems that this isn't reproducible 100% of the time; sometimes rebooting the VM frees up a small amount of space.

@meithecatte
Author

It seems that the non-determinism is coming from the fact that during startup, the free space fluctuates, and whether a particular component sees any space when it starts up is pretty much random.

To mitigate this, you can shut down the test AppVM, mount its private volume in dom0, remove .local, .cache and .config, and then fill up all the space on the volume.

With this setup, qubes-gui-agent hangs at "Waiting on /var/run/xf86-qubes-socket socket...".

So far this seems to be reliable.

@meithecatte
Author

This seems to be the classic "Xorg can't start because there's no space for its log file". On our part we could detect this and report an error in dom0, as well as make sure that the full storage notification pops up in this case.

Unless we decide that working around this Xorg issue downstream is something we'd want because the default 2 GB volume size makes this issue much more likely to occur.
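
One way the "detect this" part could look, as a rough sketch: check how much space an unprivileged process can still use on the relevant filesystem before launching anything that needs to write there. The function names and the 1 MiB threshold are hypothetical, and the thread does not say which directory the Xorg log ends up in, so the path is left to the caller.

```python
import os


def free_bytes(path):
    """Bytes available to unprivileged processes on the filesystem holding `path`."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def has_room(path, min_free=1 << 20):
    """True if at least `min_free` bytes are free; the default is an arbitrary guess."""
    return free_bytes(path) >= min_free
```

A caller could run `has_room(...)` on the log directory at startup and report a clear error when it returns False, instead of letting the X server fail silently. Given the fluctuating free space described earlier, such a check can only be a hint, not a guarantee.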

@meithecatte
Author

In particular, it seems like qubes-gui-runuser would propagate the exitcode of Xorg, but it keeps running. With that fixed, qubes-gui could detect this and propagate the error to dom0 - either with a special packet in the GUI protocol, or by closing the connection.

Reporting "no GUI agent is running" when the user tries to start something seems like a reasonable idea?

@marmarek
Member

In particular, it seems like qubes-gui-runuser would propagate the exitcode of Xorg, but it keeps running.

Yes, fixing this makes sense.

With that fixed, qubes-gui could detect this and propagate the error to dom0 - either with a special packet in the GUI protocol, or by closing the connection.

I'd go for closing the connection.

Reporting "no GUI agent is running" when the user tries to start something seems like a reasonable idea?

Yes, sounds reasonable.

As for reporting full storage, I think the current monitoring has an issue: it reports used space for committed content, not for the currently running (snapshot) volume. This needs fixing; I think we have a separate issue about it somewhere. Or maybe not?

@meithecatte
Author

As for reporting full storage, I think the current monitoring has an issue: it reports used space for committed content, not for the currently running (snapshot) volume.

I don't think that's it? No notification showed up even after shutdown.

Unless there's some aspect to it that makes blocks filled with 0s behave differently.

@marmarek
Member

No, zero blocks are also written. If you click on the disk space widget (https://www.qubes-os.org/doc/getting-started/#user-interface), you should get the overall usage, but also a list of qubes with high disk usage. There is an option to disable notifications (per-qube); maybe you disabled it some time ago for this qube?

@meithecatte
Author

I created the qube specifically to reproduce #8016, so the likelihood that the notifications are disabled is quite low ;3

That part started working after a reboot, so either some process hung/crashed in dom0, or an update broke it temporarily.

@meithecatte
Author

Turns out that qubes-gui-runuser does indeed exit when its child does; it's just that qubes-gui doesn't reap it.
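
To illustrate what "reaping" means here: an exited child stays around as a zombie until its parent collects the status with waitpid(). The sketch below is not the qubes-gui code (that is C); it is a small Python illustration with a hypothetical helper name, and it needs Python 3.9+ for `os.waitstatus_to_exitcode`.

```python
import os
import time


def wait_and_reap(pid, poll_interval=0.05, timeout=5.0):
    """Poll child `pid` without blocking and reap it once it has exited.

    Returns the child's exit code, or None if it is still running when
    `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # WNOHANG: return (0, 0) immediately if the child has not exited yet.
        reaped_pid, status = os.waitpid(pid, os.WNOHANG)
        if reaped_pid == pid:
            return os.waitstatus_to_exitcode(status)
        time.sleep(poll_interval)
    return None
```

A parent that never makes this call keeps no record of the exit status, so it cannot act on the child having failed.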

@meithecatte
Author

How do we want to approach notifying the user that the VM has no GUI connection? I can see two general approaches:

  1. Pop up a notification when the user tries to do something that requires a GUI connection. I believe all cases go through the qubes.StartApp RPC. We could
    • hardcode a check for this particular RPC, or
    • make it configurable in a manner similar to the RPC policy mechanism, or
    • expose another knob at the RPC call, or
    • there is already a gui flag here, perhaps we could add functionality to that
  2. Display a warning icon, with a tooltip explaining the situation, in the domain tray (similar to "has updates available" or "needs to be restarted to apply updates").
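
As a rough illustration of option 1 with a configurable set of services, here is a tiny sketch. Everything in it is hypothetical (the function name, the `gui_connected` input, the message text); it only shows the shape of the check, not how dom0 would actually learn about the GUI connection state.

```python
# Services assumed to need a GUI connection; per the comment above,
# qubes.StartApp is believed to cover all cases.
GUI_SERVICES = frozenset({"qubes.StartApp"})


def gui_warning(service, gui_connected, gui_services=GUI_SERVICES):
    """Return a warning string if `service` needs a GUI and none is connected."""
    if service in gui_services and not gui_connected:
        return f"{service}: no GUI agent is running in the target qube"
    return None  # nothing to report
```

Passing a different `gui_services` set corresponds to the "make it configurable" variant; hardcoding the default corresponds to the first bullet.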

@andrewdavidwong added the "ux" label Mar 7, 2023
@andrewdavidwong
Member

How do we want to approach notifying the user that the VM has no GUI connection? I can see two general approaches: [...]

Adding the ux label, but this might be better suited for a separate issue.

@meithecatte
Author

You're right, closing in favor of #8084.

@andrewdavidwong removed the "needs diagnosis" label Mar 9, 2023
@andrewdavidwong added the "affects-4.1" label Aug 8, 2023
marmarek added a commit to marmarek/qubes-gui-agent-linux that referenced this issue Dec 2, 2024
Register proper signal handler for SIGCHLD, and collect the Xorg's
zombie in it.

This has two effects:
1. The main loop can explicitly exit on Xorg termination, not only via
   receiving EOF on the socket.
2. Due to not ignoring SIGCHLD anymore, accept() in mkghandles will also
   notice Xorg early exit and not wait indefinitely (it will fail with
   EINTR). For this case, improve error message.

There is still a small race on startup, if Xorg exits before reaching
accept() (or listen()) call. Handle this by checking just before
accept() call. It isn't perfect (there is still a few instructions
window where it wouldn't notice it in time), but it's good enough for
practical purposes.

QubesOS/qubes-issues#8060
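
The commit itself is C. As a rough Python analogue of the idea it describes (an illustration under assumptions, not the actual patch), the sketch below installs a SIGCHLD handler that reaps the child and aborts a blocking accept(), and also checks once just before accept() for the startup race. The names `wait_for_client` and `ChildExited` are hypothetical. In Python the handler has to raise an exception to interrupt accept(), since the interpreter otherwise retries the call after EINTR.

```python
import signal
import socket
import subprocess


class ChildExited(Exception):
    """Raised by the SIGCHLD handler to abandon a blocking accept()."""


def wait_for_client(child_argv, sock_path):
    """Return ("connected", None) or ("child-exited", exit_code)."""
    proc = None  # assigned once the child has been spawned

    def on_sigchld(signum, frame):
        # poll() calls waitpid(), so the zombie is collected here.
        if proc is not None and proc.poll() is not None:
            raise ChildExited(proc.returncode)

    old_handler = signal.signal(signal.SIGCHLD, on_sigchld)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(sock_path)
        server.listen(1)
        proc = subprocess.Popen(child_argv)
        # Startup race: the child may already be gone before we block.
        if proc.poll() is not None:
            return ("child-exited", proc.returncode)
        conn, _ = server.accept()  # interrupted if the handler raises
        conn.close()
        return ("connected", None)
    except ChildExited as exc:
        return ("child-exited", exc.args[0])
    finally:
        # signal.signal() returns None for handlers not installed from Python.
        restore = signal.SIG_DFL if old_handler is None else old_handler
        signal.signal(signal.SIGCHLD, restore)
        server.close()
        if proc is not None and proc.poll() is None:
            proc.kill()
            proc.wait()
```

As in the commit message, a very narrow window remains between the pre-accept check and the accept() call itself; closing it completely would need something like a self-pipe, which this sketch leaves out.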
marmarek added a commit to QubesOS/qubes-gui-agent-linux that referenced this issue Dec 27, 2024

(cherry picked from commit eaba72a)