Sway crashed after nixos-rebuild switch #74626

Closed
primeos opened this issue Nov 29, 2019 · 11 comments
Labels
 - 0.kind: bug (Something is broken)
 - 0.kind: regression (Something that worked before working no longer)
 - 1.severity: blocker (This is preventing another PR or issue from being completed)

Comments

@primeos
Member

primeos commented Nov 29, 2019

Describe the bug

After running nixos-rebuild switch my graphical session crashed. The Sway process was technically still running but I was back to the console and could see various error messages.

This probably applies to other Wayland compositors as well and maybe even X11.

To Reproduce

I tested the following and it is indeed reproducible:

  1. Switch to the old revision: e89b215
  2. Switch to the new revision: 0ee0489
  3. Now Sway should've "crashed".

Expected behavior

The system upgrades normally without Sway crashing.

Screenshots

TODO

Additional context

One of the following systemd units that were stopped should be responsible for this:

'dbus.service', 'dbus.socket', 'pcscd.service', 'pcscd.socket',
'systemd-coredump.service', 'systemd-coredump.socket',
'systemd-initctl.service', 'systemd-initctl.socket',
'systemd-journald-audit.service', 'systemd-journald-audit.socket',
'systemd-journald-dev-log.service', 'systemd-journald-dev-log.socket',
'systemd-journald.service', 'systemd-journald.socket', 'systemd-rfkill.service',
'systemd-rfkill.socket', 'systemd-udevd-control.service',
'systemd-udevd-kernel.service'

Probably due to D-Bus or udev.

Update: systemctl stop dbus crashes Sway, so that was probably the reason. I also don't think we ever restarted (stop + start) D-Bus in the past (we only reloaded it). AFAIK restarting D-Bus will kick all clients and is therefore a bad idea (but I haven't looked much into it).

Update2: Apparently we still have the following (which is very important):

      # Don't restart dbus-daemon. Bad things tend to happen if we do.
      reloadIfChanged = true;

Not sure why nixos-rebuild switch did restart dbus.service then (actually it did stop, start, and reload the unit...). Maybe we have a bigger regression in nixos-rebuild.

Update3: 89806e9 could be the problem, I'll investigate.

Metadata

 - system: `"x86_64-linux"`
 - host os: `Linux 5.4.0, NixOS, 20.03.git.ee376d4 (Markhor)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.3.1`
 - channels(root): ``
 - nixpkgs: `/var/nixpkgs`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute: sway
# a list of nixos modules affected by the problem
module: sway
@primeos primeos added 0.kind: bug Something is broken 0.kind: regression Something that worked before working no longer labels Nov 29, 2019
@primeos
Member Author

primeos commented Nov 29, 2019

Result

This crash is reproducible and the regression is caused by 89806e9.

When I switch from e89b215 to 0ee0489 both dbus.service and dbus.socket will be stopped:

[root@quorra:/var/nixpkgs]# nixos-rebuild dry-activate 2>&1 | grep dbus
would stop the following units: alsa-store.service, audit.service, avahi-daemon.service, avahi-daemon.socket, cups-browsed.service, cups.service, cups.socket, dbus.service, dbus.socket, gollum.service, jupyter.service, kmod-static-nodes.service, monetdb.service, mysql.service, network-local-commands.service, network-setup.service, nix-daemon.service, nix-daemon.socket, nscd.service, resolvconf.service, sks-db.service, systemd-binfmt.service, systemd-coredump.service, systemd-coredump.socket, systemd-initctl.service, systemd-initctl.socket, systemd-journald-audit.service, systemd-journald-audit.socket, systemd-journald-dev-log.service, systemd-journald-dev-log.socket, systemd-journald.service, systemd-journald.socket, systemd-modules-load.service, systemd-rfkill.service, systemd-rfkill.socket, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.service, systemd-udevd-control.socket, systemd-udevd-kernel.service, systemd-udevd-kernel.socket, systemd-udevd.service
would start the following units: alsa-store.service, audit.service, avahi-daemon.socket, cups-browsed.service, cups.socket, dbus.socket, gollum.service, jupyter.service, kmod-static-nodes.service, monetdb.service, mysql.service, network-local-commands.service, network-setup.service, nix-daemon.socket, nscd.service, resolvconf.service, sks-db.service, systemd-binfmt.service, systemd-coredump.socket, systemd-initctl.socket, systemd-journald-audit.socket, systemd-journald-dev-log.socket, systemd-journald.socket, systemd-modules-load.service, systemd-rfkill.socket, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket
would reload the following units: dbus.service, dev-hugepages.mount, dev-mqueue.mount, firewall.service, sys-kernel-debug.mount, tmp.mount

But without 89806e9 (cc #73871) this will not happen:

[root@quorra:/var/nixpkgs]# git revert 89806e95363f06869c9de18586e32c8ef65bd2fd
[master 2010034a6bc] Revert "nixos/switch-to-configuration: restart changed socket units"
 1 file changed, 1 insertion(+), 11 deletions(-)

[root@quorra:/var/nixpkgs]# nixos-rebuild dry-activate 2>&1 | grep dbus
  /nix/store/n5s71dizh5p2rfqxj6kxpk0iz5csnwzz-dbus-1.drv
  /nix/store/0kadlpbqay29h2h5mq17mdkwi7mj0c2w-unit-dbus.service.drv
  /nix/store/1bvw496bvvjya8pl2yxhr2kr3ia65fql-unit-dbus.service.drv
building '/nix/store/n5s71dizh5p2rfqxj6kxpk0iz5csnwzz-dbus-1.drv'...
building '/nix/store/0kadlpbqay29h2h5mq17mdkwi7mj0c2w-unit-dbus.service.drv'...
building '/nix/store/1bvw496bvvjya8pl2yxhr2kr3ia65fql-unit-dbus.service.drv'...
would reload the following units: dbus.service, dev-hugepages.mount, dev-mqueue.mount, firewall.service, sys-kernel-debug.mount, tmp.mount
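The difference between the two dry-activate runs above is easier to see when the unit lists are split per action. Below is a minimal sketch, assuming the log lines keep the "would stop/start/reload the following units:" wording shown above. The `units_for` helper is a hypothetical name, and `sample` is a shortened subset of the real output.

```shell
#!/bin/sh
# Print the units listed for one action, one unit per line.
# $1 is the action word used in the log line: stop, start, or reload.
units_for() {
    sed -n "s/^would $1 the following units: //p" | tr ',' '\n' | sed 's/^ *//'
}

# Shortened sample taken from the dry-activate output above.
sample='would stop the following units: dbus.service, dbus.socket, nscd.service
would start the following units: dbus.socket, nscd.service
would reload the following units: dbus.service, tmp.mount'

# A unit in the "stop" list is restarted, not merely reloaded.
if printf '%s\n' "$sample" | units_for stop | grep -qxF 'dbus.service'; then
    echo "dbus.service would be stopped"
fi
```

On a real system the same helper could read the live output, e.g. `nixos-rebuild dry-activate 2>&1 | units_for stop`.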

Impact

man systemd.unit:

Requires=
Configures requirement dependencies on other units. If this unit gets activated, the units listed here will be activated as well. If one of the other units fails to activate, and an ordering dependency After= on the failing unit is set, this unit will not be started. Besides, with or without specifying After=, this unit will be stopped if one of the other units is explicitly stopped.

man systemd.service:

Services with Type=dbus set automatically acquire dependencies of type Requires= and After= on dbus.socket.
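To illustrate why stopping dbus.socket is so disruptive, here is a rough sketch of the Requires= stop propagation quoted above. The dependency table is made up for illustration (`example-a.service` stands in for a Type=dbus service, `example-b.service` for something that requires it), and the loop is a simplification, not how systemd itself computes jobs.

```shell
#!/bin/sh
# Hypothetical edges, one "unit required-unit" pair per line.
edges='dbus.service dbus.socket
example-a.service dbus.socket
example-b.service example-a.service
unrelated.service example.target'

# Print every unit that ends up stopped when "$1" is explicitly stopped,
# following Requires= edges in reverse until nothing new is found.
stopped_with() {
    stopped=$1
    changed=1
    while [ "$changed" -eq 1 ]; do
        changed=0
        while read -r unit dep; do
            # Skip units that are already marked as stopped.
            case " $stopped " in *" $unit "*) continue ;; esac
            # Mark the unit if the unit it requires is stopped.
            case " $stopped " in *" $dep "*) stopped="$stopped $unit"; changed=1 ;; esac
        done <<EOF
$edges
EOF
    done
    # Word splitting is intended here: one unit per output line.
    printf '%s\n' $stopped | LC_ALL=C sort
}

stopped_with dbus.socket
```

With this table, everything except `unrelated.service` ends up in the stopped set. On a running system, `systemctl list-dependencies --reverse dbus.socket` shows the real reverse dependencies.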

@primeos primeos added the 1.severity: blocker This is preventing another PR or issue from being completed label Nov 29, 2019
@domenkozar
Member

How come that dbus.socket changed, can you provide full log of nixos-rebuild switch?

@primeos
Member Author

primeos commented Nov 29, 2019

> How come that dbus.socket changed, can you provide full log of nixos-rebuild switch?

@domenkozar I can investigate this later but this shouldn't happen regardless of that fact.

Update: Had a look and dbus.socket doesn't actually change - seems like another unit is causing this.

@domenkozar
Member

I agree, but the socket shouldn't change at all. So I'd like to understand what causes this mess.

@primeos
Member Author

primeos commented Nov 29, 2019

I've used the following patch to figure out what's going on:

diff --git a/nixos/modules/system/activation/switch-to-configuration.pl b/nixos/modules/system/activation/switch-to-configuration.pl
index 12a80a12d19..8c22cf7c0e1 100644
--- a/nixos/modules/system/activation/switch-to-configuration.pl
+++ b/nixos/modules/system/activation/switch-to-configuration.pl
@@ -220,6 +220,8 @@ while (my ($unit, $state) = each %{$activePrev}) {
                 # service unit has to be stopped before the socket can
                 # be restarted. The service will be started again on demand.
                 my $serviceUnit = $unitInfo->{'Unit'} // "$baseName.service";
+                print STDERR "Affected unit (socket):  $unit\n";
+                print STDERR "Affected unit (service): $serviceUnit\n";
                 $unitsToStop{$serviceUnit} = 1;
                 $unitsToStop{$unit} = 1;
                 $unitsToStart{$unit} = 1;

This gives the following output:

Affected unit (socket):  systemd-journald-dev-log.socket
Affected unit (service): systemd-journald-dev-log.service
Affected unit (socket):  avahi-daemon.socket
Affected unit (service): avahi-daemon.service
Affected unit (socket):  systemd-rfkill.socket
Affected unit (service): systemd-rfkill.service
Affected unit (socket):  systemd-udevd-kernel.socket
Affected unit (service): systemd-udevd-kernel.service
Affected unit (socket):  nix-daemon.socket
Affected unit (service): nix-daemon.service
Affected unit (socket):  systemd-udevd-control.socket
Affected unit (service): systemd-udevd-control.service
Affected unit (socket):  systemd-coredump.socket
Affected unit (service): systemd-coredump.service
Affected unit (socket):  cups.socket
Affected unit (service): cups.service
Affected unit (socket):  systemd-journald.socket
Affected unit (service): systemd-journald.service
Affected unit (socket):  dbus.socket
Affected unit (service): dbus.service
Affected unit (socket):  systemd-initctl.socket
Affected unit (service): systemd-initctl.service
Affected unit (socket):  systemd-journald-audit.socket
Affected unit (service): systemd-journald-audit.service

@domenkozar
Member

OK, so this patch doesn't check whether the socket has actually changed?

@primeos
Member Author

primeos commented Nov 29, 2019

dbus.socket didn't change, the following might be useful as well:

diff --git a/nixos/modules/system/activation/switch-to-configuration.pl b/nixos/modules/system/activation/switch-to-configuration.pl
index 12a80a12d19..943ee547617 100644
--- a/nixos/modules/system/activation/switch-to-configuration.pl
+++ b/nixos/modules/system/activation/switch-to-configuration.pl
@@ -220,6 +220,9 @@ while (my ($unit, $state) = each %{$activePrev}) {
                 # service unit has to be stopped before the socket can
                 # be restarted. The service will be started again on demand.
                 my $serviceUnit = $unitInfo->{'Unit'} // "$baseName.service";
+                print STDERR "Affected unit files:  $prevUnitFile -> $newUnitFile\n";
+                print STDERR "Affected unit (socket):  $unit\n";
+                print STDERR "Affected unit (service): $serviceUnit\n";
                 $unitsToStop{$serviceUnit} = 1;
                 $unitsToStop{$unit} = 1;
                 $unitsToStart{$unit} = 1;

Relevant output:

Affected unit files:  /etc/systemd/system/dbus.socket -> /nix/store/kai0jqnk56zcky086f1y4kr6cwyvvrga-nixos-system-quorra-20.03.git.0ee0489/etc/systemd/system/dbus.socket
Affected unit (socket):  dbus.socket
Affected unit (service): dbus.service
$ sha256sum /etc/systemd/system/dbus.socket /nix/store/kai0jqnk56zcky086f1y4kr6cwyvvrga-nixos-system-quorra-20.03.git.0ee0489/etc/systemd/system/dbus.socket
e05359bbdc083b8db2b49542b26429166b5e13367a63668a4e8ff8a1b496f7ae  /etc/systemd/system/dbus.socket
e05359bbdc083b8db2b49542b26429166b5e13367a63668a4e8ff8a1b496f7ae  /nix/store/kai0jqnk56zcky086f1y4kr6cwyvvrga-nixos-system-quorra-20.03.git.0ee0489/etc/systemd/system/dbus.socket

So the file contents are identical, yet fingerprintUnit($prevUnitFile) ne fingerprintUnit($newUnitFile) still evaluates to true, which looks like an additional problem.

@primeos
Member Author

primeos commented Nov 29, 2019

The following explains why the fingerprint still changes:

Affected unit fingerprints: /nix/store/q1bajnw02f40l41vpmm5cqprw6dxm7b2-dbus-1.12.16/etc/systemd/system/dbus.socket -> /nix/store/bzh47mk7hhz8pw1djxqkn5rpz93334bs-dbus-1.12.16/etc/systemd/system/dbus.socket
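Taken together with the sha256sum check, this suggests the fingerprint follows the resolved store path rather than the file content. That is an inference from the output, not a quote of fingerprintUnit. A minimal sketch of the effect, using throwaway directories (the `aaa-dbus` and `bbb-dbus` names and the file content are placeholders, and `readlink -f` and `sha256sum` from GNU coreutils are assumed):

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Two placeholder "store paths" holding byte-identical unit files.
mkdir -p "$tmp/store/aaa-dbus" "$tmp/store/bbb-dbus" "$tmp/old" "$tmp/new"
printf 'placeholder unit content\n' > "$tmp/store/aaa-dbus/dbus.socket"
cp "$tmp/store/aaa-dbus/dbus.socket" "$tmp/store/bbb-dbus/dbus.socket"

# The old and new system each link to a different store path.
ln -s "$tmp/store/aaa-dbus/dbus.socket" "$tmp/old/dbus.socket"
ln -s "$tmp/store/bbb-dbus/dbus.socket" "$tmp/new/dbus.socket"

content_hash()  { sha256sum "$1" | cut -d' ' -f1; }
resolved_path() { readlink -f "$1"; }

same_content=no
same_path=no
if [ "$(content_hash "$tmp/old/dbus.socket")" = "$(content_hash "$tmp/new/dbus.socket")" ]; then
    same_content=yes
fi
if [ "$(resolved_path "$tmp/old/dbus.socket")" = "$(resolved_path "$tmp/new/dbus.socket")" ]; then
    same_path=yes
fi

echo "same content:       $same_content"
echo "same resolved path: $same_path"

rm -rf "$tmp"
```

A path-based comparison reports a change whenever the dbus package is rebuilt, even if the socket unit is byte-for-byte the same, whereas a content-based comparison would not.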

But I'll have to go AFK now.

Edit: Btw, in the meantime it might not hurt to revert 89806e9 until we find a proper solution (I really don't like issues that cause my graphical session to crash).

@lovesegfault
Member

I've had this happen too.

@flokli
Contributor

flokli commented Dec 2, 2019

With 0f799bd, this can probably be closed.

@flokli flokli closed this as completed Dec 2, 2019
@domenkozar
Member

Opened #74899
