Make multi-user installer idempotent #7603

Open
wants to merge 3 commits into base: master

iFreilicht
Contributor

@iFreilicht iFreilicht commented Jan 15, 2023

@abathur
Member

abathur commented Jan 15, 2023

There are instructions at https://nixos.org/manual/nix/stable/contributing/hacking.html#installer-tests for setting up the installer jobs to run in your own fork.

@abathur
Member

abathur commented Jan 15, 2023

I guess you'd need to manually test the generated installer or maybe tweak the install test workflow to run the installation twice to test this specific fix.

(may also need to add some cleanup to keep it from blocking on some other non-idempotence)
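
Running the installation twice, as suggested above, could be wrapped in a small helper along these lines. This is only a sketch: `check_idempotent` is a hypothetical name, and `mkdir` stands in for the installer so the snippet runs anywhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the same command twice and report which run failed.
check_idempotent() {
  "$@" || { echo "first run failed: $*" >&2; return 1; }
  "$@" || { echo "second run failed (not idempotent): $*" >&2; return 2; }
  echo "both runs succeeded: $*"
}

# Stand-in commands instead of the real installer.
scratch=$(mktemp -d)

# `mkdir -p` succeeds when the directory already exists, so both runs pass.
check_idempotent mkdir -p "$scratch/a"
idempotent_status=$?

# Plain `mkdir` fails on the second run because the directory now exists.
check_idempotent mkdir "$scratch/b" 2>/dev/null
non_idempotent_status=$?

rm -rf "$scratch"
```

Inside a test VM the same helper could presumably be called as `check_idempotent ./nix/install --daemon --no-channel-add`, with any cleanup needed between runs added before the second call.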

@iFreilicht
Contributor Author

@abathur Ah, thanks for the pointer! I found that page but it didn't seem to be what I was looking for. I got that setup working, which is of course nice, but running the CI just takes forever. What I was really looking for was this:

$ nix-build -A hydraJobs.installerTests.ubuntu-22-04.x86_64-linux.install-force-daemon

This allows me to build and test locally automatically. Very convenient.

But yes, good point, I have to adapt the CI as well to avoid a regression on this in the future. However, the CI is using the cachix/install-nix action right now, which does a single-user install AND skips the installation if it finds the nix command, so that will be a little more involved.

I also saw that there's #7215 which I also ran into with the automated tests, but I think for the purposes of this PR I would just add all the manual steps to the automated test before the second install attempt.

@iFreilicht
Contributor Author

iFreilicht commented Jan 17, 2023

Alright, I made the multi-user installer fully idempotent now, and the test was successful!

$ nix-build -A hydraJobs.installerTests.ubuntu-22-04.x86_64-linux.install-force-daemon
these derivations will be built:
  /nix/store/dv4dmxf445xpqydbg8pp533ld3yajgaz-nix-2.13.0pre19700101_61734d2.drv
  /nix/store/h48ibf3hi96r4hwjjfimajmnrva94dsf-closure-info.drv
  /nix/store/ralvjga518dld7swlgdg8hpwwkm2lql0-nix-binary-tarball-2.13.0pre19700101_61734d2.drv
  /nix/store/85b3kqs3dhy78j0r9bfjlcr9lm9vpi7f-installer-test-ubuntu-22-04-install-force-daemon.drv
building '/nix/store/dv4dmxf445xpqydbg8pp533ld3yajgaz-nix-2.13.0pre19700101_61734d2.drv'...
unpacking sources
[...]
building '/nix/store/85b3kqs3dhy78j0r9bfjlcr9lm9vpi7f-installer-test-ubuntu-22-04-install-force-daemon.drv'...
Unpacking Vagrant box /nix/store/pkkfzasqwc8lay1m360xlc9nvixbaqk4-libvirt.box...
Vagrantfile
box.img
info.json
metadata.json
Formatting './disk.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=137438953472 backing_file=./box.img backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16
Starting qemu...
Waiting for SSH...
[ssh] Trying to connect...
SeaBIOS (version rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org)


iPXE (http://ipxe.org) 00:03.0 CA00 PCI2.10 PnP PMM+BFF911E0+BFEF11E0 CA00



Booting from Hard Disk...
[ssh] Trying to connect...
[ssh] Trying to connect...
[ssh] Trying to connect...
[ssh] Trying to connect...
[ssh] Trying to connect...
[ssh] Connected!
Copying installer...
Running installer...
+ tar -xf ./nix.tar.xz
+ mv ./nix-2.13.0pre19700101_61734d2-x86_64-linux nix
+ ./nix/install --daemon --no-channel-add
Note: a multi-user installation is possible. See https://nixos.org/manual/nix/stable/installation/installing-binary.html#multi-user-installation
Switching to the Multi-user Installer
Welcome to the Multi-User Nix Installation

This installation tool will set up your computer with the Nix package
manager. This will happen in a few stages:
[...]
---- Reminders -----------------------------------------------------------------
[ 1 ]
Nix won't work in active shell sessions until you restart them.

Testing Nix installation...
+ source /home/vagrant/.bash_profile
-bash: line 4: /home/vagrant/.bash_profile: No such file or directory
+ true
+ source /home/vagrant/.bash_login
-bash: line 5: /home/vagrant/.bash_login: No such file or directory
+ true
+ source /home/vagrant/.profile
++ '[' -n '5.1.16(1)-release' ']'
++ '[' -f /home/vagrant/.bashrc ']'
++ . /home/vagrant/.bashrc
+++ case $- in
+++ return
++ '[' -d /home/vagrant/bin ']'
++ '[' -d /home/vagrant/.local/bin ']'
+ source /etc/bashrc
++ '[' -e /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh ']'
++ . /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh
+++ '[' -n 1 ']'
+++ return
+ nix-env --version
nix-env (Nix) 2.13.0pre19700101_dirty
+ nix --extra-experimental-features nix-command store ping
Store URL: daemon
Version: 2.13.0pre19700101_dirty
++ nix-build --no-substitute -E 'derivation { name = "foo"; system = "x86_64-linux"; builder = "/bin/sh"; args = ["-c" "echo foobar > $out"]; }'
this derivation will be built:
  /nix/store/9qb0l9n1gsmcyynfmndnq3qpmlvq8rln-foo.drv
building '/nix/store/9qb0l9n1gsmcyynfmndnq3qpmlvq8rln-foo.drv'...
+ out=/nix/store/sivbjmsgqj95sxw35iqgvqd64grp5q91-foo
++ cat /nix/store/sivbjmsgqj95sxw35iqgvqd64grp5q91-foo
+ [[ foobar = foobar ]]
Running installer again to test for idempotency...
+ tar -xf ./nix.tar.xz
+ mv ./nix-2.13.0pre19700101_dirty-x86_64-linux nix
+ ./nix/install --daemon --no-channel-add
Note: a multi-user installation is possible. See https://nixos.org/manual/nix/stable/installation/installing-binary.html#multi-user-installation
Switching to the Multi-user Installer
Welcome to the Multi-User Nix Installation

This installation tool will set up your computer with the Nix package
manager. This will happen in a few stages:
[...]
---- Reminders -----------------------------------------------------------------
[ 1 ]
Nix won't work in active shell sessions until you restart them.

Testing Nix installation...
+ source /home/vagrant/.bash_profile
-bash: line 4: /home/vagrant/.bash_profile: No such file or directory
+ true
+ source /home/vagrant/.bash_login
-bash: line 5: /home/vagrant/.bash_login: No such file or directory
+ true
+ source /home/vagrant/.profile
++ '[' -n '5.1.16(1)-release' ']'
++ '[' -f /home/vagrant/.bashrc ']'
++ . /home/vagrant/.bashrc
+++ case $- in
+++ return
++ '[' -d /home/vagrant/bin ']'
++ '[' -d /home/vagrant/.local/bin ']'
+ source /etc/bashrc
++ '[' -e /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh ']'
++ . /nix/var/nix/profiles/default/etc/profile.d/nix-daemon.sh
+++ '[' -n 1 ']'
+++ return
+ nix-env --version
nix-env (Nix) 2.13.0pre19700101_dirty
+ nix --extra-experimental-features nix-command store ping
Store URL: daemon
Version: 2.13.0pre19700101_dirty
++ nix-build --no-substitute -E 'derivation { name = "foo"; system = "x86_64-linux"; builder = "/bin/sh"; args = ["-c" "echo foobar > $out"]; }'
+ out=/nix/store/sivbjmsgqj95sxw35iqgvqd64grp5q91-foo
++ cat /nix/store/sivbjmsgqj95sxw35iqgvqd64grp5q91-foo
+ [[ foobar = foobar ]]
Done!
qemu-kvm: terminating on signal 15 from pid 1 ()
/nix/store/i9ln2c3wqm23vh5i0k63gazik4kbckmv-installer-test-ubuntu-22-04-install-force-daemon

So that's looking good.

I saw that a lot of issues on here are related to this, so I'll make a proper list, squash the commits and add references to all of them so they get closed when this is merged. Seems like this isn't quite the case. Once this issue is fixed, people will most likely just run into the cp: cannot overwrite directory ... with non-directory errors, which need an additional fix.

I'm not sure whether it makes sense to have the GitHub CI pipeline test for this as well, as long as Hydra is doing it.

@iFreilicht iFreilicht changed the title from "Don't overwrite already-copied files on install" to "Make multi-user installer fully idempotent" Jan 17, 2023
@iFreilicht iFreilicht changed the title from "Make multi-user installer fully idempotent" to "Make multi-user installer idempotent" Jan 17, 2023
@iFreilicht
Contributor Author

I tested this again, and it does seem like 250c118 from this PR kind of fixes the cp: cannot overwrite directory with non-directory crashes. The issue here is that if the Nix store is genuinely broken (which is likely the original cause of that error, otherwise cp would overwrite the directory fine), a re-install wouldn't fix it.
So instead, I'll add an additional line that removes (with rm) all the directories that are about to be copied into the store. I'll also deliberately break the Nix store in the installer test to ensure this won't break again in the future.

Potentially, we could also fully delete /nix/* first (which might also resolve other issues), but I feel this is way too nuclear and has potential to break and/or delete other parts of the installation like per-user profiles and their generations, which ideally should be left alone during a re-install.
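
The narrower approach described above (removing only the store paths that are about to be copied, rather than all of /nix/*) might look roughly like the following. It is a sketch under assumptions, not the code from this PR: the directory and package names are made up, and it runs in a scratch directory, whereas real store paths are read-only and would need elevated privileges to remove.

```shell
#!/usr/bin/env bash
# Scratch stand-ins; all names here are hypothetical.
demo=$(mktemp -d)
payload="$demo/payload-store"   # stands in for ./store in the unpacked tarball
store="$demo/nix/store"         # stands in for $NIX_ROOT/store

# Payload ships sbin as a symlink; the "broken" store has it as a directory.
mkdir -p "$payload/aaa-pkg/bin" "$store/aaa-pkg/sbin" "$store/zzz-other"
ln -s bin "$payload/aaa-pkg/sbin"
echo keep > "$store/zzz-other/data"   # unrelated path that must survive

# Remove only the store paths that the payload is about to replace.
for path in "$payload"/*; do
  rm -rf "${store:?}/$(basename "$path")"
done

# The copy no longer collides with stale directories.
cp -RPp "$payload"/* "$store/"
copy_status=$?
```

Store paths that are not part of the payload (here `zzz-other`) are left untouched, which is the point of preferring this over deleting everything.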

@iFreilicht
Contributor Author

YES! I finally figured it out! In bb0c4b9, this change was made:

@@ -741,2 +741,2 @@
         _sudo "to copy the basic Nix files to the new store at $NIX_ROOT/store" \
-              cp -RLp ./store/* "$NIX_ROOT/store/"
+              cp -RPp ./store/* "$NIX_ROOT/store/"

Which made sure symlinks are copied verbatim and not followed. While technically correct, this created an incompatibility with installs made BEFORE this commit was released.

Before, symlinks to directories would be followed and deep-copied. For example, libkrb5 contains a symlink sbin -> bin. So for installations made with the pre-bb0c4b9 installer, it looked like this:

$ ls -lF /nix/store/5sxcmklgrgl7lsij8bp9a98iws4q8fw0-libkrb5-1.19.2
total 4
dr-xr-xr-x 1 root root   44 Jan  1  1970 bin/
dr-xr-xr-x 1 root root 1108 Jan  1  1970 lib/
lrwxrwxrwx 1 root root    3 Jan  1  1970 sbin/
dr-xr-xr-x 1 root root   10 Jan  1  1970 share/

But when running the post-bb0c4b9 installer, it tries to create this:

$ ls -lF /nix/store/5sxcmklgrgl7lsij8bp9a98iws4q8fw0-libkrb5-1.19.2
total 4
dr-xr-xr-x 1 root root   44 Jan  1  1970 bin/
dr-xr-xr-x 1 root root 1108 Jan  1  1970 lib/
lrwxrwxrwx 1 root root    3 Jan  1  1970 sbin -> bin/
dr-xr-xr-x 1 root root   10 Jan  1  1970 share/

Which is not possible and fails with the dreaded message:

cp: cannot overwrite directory /nix/store/h9z5lncphgm9if86wxrfqg7w7fv7khbh-libkrb5-1.19.3/sbin with non-directory ./store/h9z5lncphgm9if86wxrfqg7w7fv7khbh-libkrb5-1.19.3/sbin
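
The collision can be reproduced outside of Nix with a scratch directory. The sketch below assumes a cp that supports `-R`, `-L`, `-P` and `-p` (GNU and BSD cp both do); the `pkg` layout is a made-up stand-in for a store path such as libkrb5.

```shell
#!/usr/bin/env bash
# Scratch stand-ins for ./store and $NIX_ROOT/store; all names are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/src/pkg/bin" "$demo/store"
echo tool > "$demo/src/pkg/bin/tool"
ln -s bin "$demo/src/pkg/sbin"          # payload contains sbin -> bin

# Old behaviour (-L): symlinks are followed, so sbin lands as a real directory.
cp -RLp "$demo/src/"* "$demo/store/"

# New behaviour (-P): the symlink is copied as-is and collides with that directory.
cp -RPp "$demo/src/"* "$demo/store/" 2>"$demo/err"
second_status=$?

# Show whatever cp complained about.
cat "$demo/err"
```

After the second copy, `$demo/store/pkg/sbin` is still a plain directory and `second_status` is non-zero. Remove `$demo` when done.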

Now I can confidently say that this PR will fix all the cp: cannot overwrite directory ... with non-directory errors. I thus updated the list of issues this PR would fix.

I also found two related outdated issues that I believe were caused by 475fc10, cannot occur since bb0c4b9 and will not regress after this PR. They're basically the opposite problem, trying to overwrite a link with a directory:

I will rebase the PR and add a short version of and link to this explanation to the commit that fixes the issue. After that it is ready to merge from my side.

iFreilicht added a commit to iFreilicht/nix that referenced this pull request Jan 31, 2023
Fixes NixOS#6679 and all issues that contain
`cp: cannot overwrite directory ... with non-directory` errors.
These were caused by 475fc10 and
bb0c4b9. Or rather, installations after
475fc10 erroneously followed and deep-copied symlinks, which was fixed
in bb0c4b9. This meant installations installed with the installer
released between these commits had some paths in their nix store with
directories where symlinks should have been, causing the fixed installer
to try to overwrite them with symlinks.

The -n will not overwrite existing files, which is fine inside of the
nix-store as identical store paths will have identical content.

For additional details and examples, see
NixOS#7603 (comment)
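
What `-n` does to files that already exist can be checked in a scratch directory. This is a sketch with assumptions: the flag combination `-RPpn` is a guess at how `-n` is added to the existing `cp -RPp` call, and the file names are invented. The exit status and diagnostics for skipped entries may differ between cp implementations and versions, so the sketch ignores them and only inspects the resulting files.

```shell
#!/usr/bin/env bash
# Scratch stand-ins for ./store and $NIX_ROOT/store; all names are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/src/pkg" "$demo/store/pkg"
echo "from installer" > "$demo/src/pkg/file"
echo "new" > "$demo/src/pkg/extra"
echo "already installed" > "$demo/store/pkg/file"

# -n: do not overwrite existing files; status is ignored on purpose (see above).
cp -RPpn "$demo/src/"* "$demo/store/" 2>/dev/null || true

kept=$(cat "$demo/store/pkg/file")     # existed before the copy
added=$(cat "$demo/store/pkg/extra")   # was missing before the copy
echo "existing file: $kept"
echo "missing file:  $added"
```

The existing file keeps its old content while the missing one is filled in. In a real store this is harmless because, as the commit message notes, identical store paths have identical content.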
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-ready-for-review/3032/1778

@Ericson2314
Member

Triaged in the Nix team meeting:

  • @thufschmitt: really hard to test on CI due to
  • @tomberek: there have been multiple attempts at various changes and refactorings
    • we should probably treat the installer as a subsystem and deal with it differently
  • to discuss a strategy how to address the broader issue, and how that relates to the Determinate installer
    • @thufschmitt: will bring it up in the installer work group

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-03-03-nix-team-meeting-minutes-37/25998/1

iFreilicht added a commit to iFreilicht/nix that referenced this pull request Jun 20, 2023
@github-actions github-actions bot added the with-tests label Jun 20, 2023
@fricklerhandwerk
Contributor

@abathur could you take a look?

tests/installer/default.nix (outdated, resolved)
scripts/install-multi-user.sh (outdated, resolved)
@abathur
Member

abathur commented Jun 21, 2023

> @abathur Ah, thanks for the pointer! I found that page but it didn't seem to be what I was looking for. I got that setup working, which is of course nice, but running the CI just takes forever. What I was really looking for was this:
>
> $ nix-build -A hydraJobs.installerTests.ubuntu-22-04.x86_64-linux.install-force-daemon
>
> This allows me to build and test locally automatically. Very convenient.

Yes, the GA tests aren't great for velocity--but they're still a big step up from the status quo, and I think it's ideal to lock in whatever gains we can get here with both sets of tests.

> But yes, good point, I have to adapt the CI as well to avoid a regression on this in the future. However, the CI is using the cachix/install-nix action right now, which does a single-user install

I'm pretty sure it uses multiuser and has for a while as long as you're on macOS or have systemd.

> AND skips the installation if it finds the nix command

Ah. Yes. That's annoying (for our case...).

I spent some time trying to find a ~cheap way to force this to work and do have a candidate: abathur@100c3d5#diff-b803fcb7f17ed9235f1e5cb1fcd2f5d3b2838429d4368ae4c57ce4436577f03fR120-R129

The few lines above the highlight are a failed attempt to just overwrite GITHUB_PATH (which had no effect), but removing nix from the default profile did let the cachix action try to go ahead. (There might be a better way to do this?)

I added a set of jobs that try this with the current release installer, and a set that try it on top of your PR.

Here's the CI run https://github.com/abathur/nix/actions/runs/5329640965

The sanity-check jobs fail where I think we expect them to:

The install-test jobs both also fail, so I'm curious what you think about where they tip over (maybe these are out of scope?):

@iFreilicht
Contributor Author

@abathur Thanks for taking the time! Yes, the sanity-checks are exactly what we expect.

But actually, the failing installer-test on macOS is also expected. See the issues I linked in the initial post:

  • Will fix all cp: cannot overwrite directory ... with non-directory errors

This is what was fixed by adding the -n flag to cp.

As for the unit file on Linux, I don't know; I didn't test that. I assume it happens here, on line 101, but I can't really explain why. I also searched the issue tracker, and it seems not a single person has ever reported this issue before.

iFreilicht and others added 3 commits June 27, 2023 15:08
@abathur
Member

abathur commented Jun 27, 2023

> But actually, the failing installer-test on macOS is also expected. See the issues I linked in the initial post:
>
>   • Will fix all cp: cannot overwrite directory ... with non-directory errors
>
> This is what was fixed by adding the -n flag to cp.

I am not sure I understand why this failure is expected. The latter set of tests includes your code (https://github.com/abathur/nix/blame/100c3d5400f315eec04cfdbeb99a20a51b66932c/scripts/install-multi-user.sh#L833), so it doesn't appear that the new flag stops the error?

@iFreilicht
Contributor Author

> I am not sure I understand why this failure is expected. The latter set of tests includes your code (https://github.com/abathur/nix/blame/100c3d5400f315eec04cfdbeb99a20a51b66932c/scripts/install-multi-user.sh#L833), so it doesn't appear that the new flag stops the error?

Ah sorry, I misunderstood. Yeah I guess. I checked the installer the pipeline downloaded from your cachix, and it seems to indeed be the correct version, so I have no idea why it doesn't work in CI but works when running the hydra tests locally.

@abathur
Member

abathur commented Jul 11, 2023

I tested this installer on a spare MacBook with an existing nix 2.13.2 install and it too failed at the cp step with overwrite directory errors.

I'm not very familiar with the vm tests, so I don't have much of a mental model for why they might be differing. Are you sure the macOS variant ran?

Labels
bug, installer, macos, tests, with-tests
Projects
Status: ⚖ To discuss
Development

Successfully merging this pull request may close these issues.

Reinstall doesnt work
multi-user-installer: restore copy-to-store idempotence?
6 participants