Don't use port open check to determine if reboot completed. Fixes #856. #857

nh2 · 2018-01-27T18:56:28Z

The old approach, waiting for the machine to no longer have an open
port and then waiting for the port to be open again, was insufficient
because of a race condition: the machine could reboot so quickly
that the port was open again before nixops noticed
that it had gone down. I experienced this on a Hetzner cloud server.

The new approach waits for the output of `last reboot` on the remote
side to change, which is not racy.
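
The comparison described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `wait_for_reboot`, `get_boot_marker`, `trigger_reboot`, and the polling parameters are hypothetical names, and `get_boot_marker` stands in for running `last reboot` on the remote side.

```python
import time
from typing import Callable, Optional


def wait_for_reboot(
    get_boot_marker: Callable[[], Optional[str]],
    trigger_reboot: Callable[[], None],
    poll_interval: float = 1.0,
    max_polls: int = 600,
) -> bool:
    """Return True once the boot marker differs from its pre-reboot value.

    get_boot_marker is a hypothetical callable that returns a string
    identifying the current boot (for example the first line of
    `last reboot`), or None while the machine is unreachable.
    """
    before = get_boot_marker()
    if before is None:
        raise RuntimeError("machine must be reachable before rebooting")
    trigger_reboot()
    for _ in range(max_polls):
        after = get_boot_marker()
        # A changed marker proves that a new boot happened, even if the
        # machine came back so quickly that it was never observed as down.
        if after is not None and after != before:
            return True
        time.sleep(poll_interval)
    return False
```

Unlike a port check, this loop does not need to observe the "down" state at all, which is why a very fast reboot cannot be missed.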

nh2 · 2018-01-27T18:58:56Z

Changed output with this patch:

nixos-cachecache-hetzner> closures copied successfully
machine1> updating GRUB 2 menu...
machine1> rebooting...
packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe
machine1> waiting for reboot to complete...done.
machine1> activation finished successfully

nh2 · 2018-01-27T19:00:49Z

Alternative output that this patch can also produce:

nixos-cachecache-hetzner> closures copied successfully
machine1> updating GRUB 2 menu...
machine1> rebooting...
machine1> waiting for reboot to complete...packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe

machine1> mux_client_request_session: read from master failed: Broken pipe
machine1> done.
machine1> activation finished successfully

So the output is not as pretty as before due to the errors, but at least it works without the race condition.

nh2 · 2018-02-03T17:22:46Z

I pushed a small fix (def reboot() being overridden in subclasses).

domenkozar · 2018-02-04T19:05:51Z

nixops/backends/__init__.py

+ # and show an 'x' as progress indicator in that case.
+ self.log_continue("x")
+ if last_reboot_output is not None and last_reboot_output != pre_reboot_last_reboot_output:
+     break


Shouldn't this call ssh.reset() eventually?

nh2 · 2018-02-14

Based on my current understanding, this is not needed, because the only thing reset() does is call shutdown(), which exits the control master process, and I think that gets cleaned up automatically when the connection dies due to the machine rebooting.

aszlig · 2018-05-27

Hm, in theory you're right, but if the machine didn't manage to properly close the TCP socket (for example due to a hard reboot), the control master process is still alive.

nh2 · 2018-06-28

Great point, you're right. I've changed this back to always reset at reboot.

I've also re-tested that change with the EC2 and Hetzner backends (though only Hetzner can do --hard reboots, where this is relevant).
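
For readers unfamiliar with SSH control masters, one way to make sure a stale master process goes away is to ask it to exit through its control socket. The sketch below is only an illustration of that idea using the OpenSSH `-S` and `-O exit` options. The function names, the socket path, and the target are hypothetical, and this is not claimed to be what reset() or shutdown() do in nixops.

```python
import subprocess
from typing import List


def control_master_exit_argv(control_socket: str, target: str) -> List[str]:
    """Build the OpenSSH command that asks a control master to exit."""
    return ["ssh", "-S", control_socket, "-O", "exit", target]


def reset_control_master(control_socket: str, target: str) -> None:
    # Best effort: a non-zero exit status (for example because the master
    # is already gone) is ignored, since the goal is only that no stale
    # master process remains after the reboot.
    subprocess.call(
        control_master_exit_argv(control_socket, target),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

Because the request goes to the local master over the socket file, it does not depend on the remote side having closed the TCP connection cleanly.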

nh2 · 2018-05-08T21:03:08Z

All good? This PR is the required base of my new PR #948, so it would be cool if we could finish this one.

aszlig · 2018-05-15T00:39:08Z

nixops/backends/__init__.py

+ # manner, we compare the output of `last reboot` before and after
+ # the reboot. Once the output has changed, the reboot is done.
+ def get_last_reboot_output():
+     return self.run_command('last reboot --time-format iso | head -n1', capture_stdout=True).rstrip()


Both the Hetzner rescue system and the actual NixOS system are using systemd, so maybe it's a better idea to use systemd-analyze because it fails whenever bootup is not finished. At least that would avoid the current vs. last string comparison.

nh2 · 2018-05-26

I don't think that can work because when NixOS is waiting for nixops keys to be uploaded (so, right here), you'll get

# systemd-analyze
Bootup is not yet finished. Please try again later.

aszlig · 2018-05-27

Oh, right... you're correct. After all, we don't care about a truly finished reboot but only want to make sure the machine has rebooted, so last reboot should work.

nh2 · 2018-05-26T02:52:32Z

Updated the PR: I noticed that I have to explicitly catch not only nixops.ssh_util.SSHCommandFailed but also nixops.ssh_util.SSHConnectionFailed.
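
A minimal sketch of that idea, assuming both exceptions simply mean "the machine is not reachable yet" during the wait. The two exception classes below are local stand-ins named after the nixops ones so that the example is self-contained, and `output_or_none` is a hypothetical helper, not a function from this PR.

```python
from typing import Callable, Optional


# Stand-ins for nixops.ssh_util.SSHCommandFailed and
# nixops.ssh_util.SSHConnectionFailed, defined here only so that the
# sketch runs on its own.
class SSHCommandFailed(Exception):
    pass


class SSHConnectionFailed(Exception):
    pass


def output_or_none(run: Callable[[], str]) -> Optional[str]:
    """Run a remote command, treating both failure kinds as 'not up yet'."""
    try:
        return run().rstrip()
    except (SSHCommandFailed, SSHConnectionFailed):
        # Either the command failed or the connection could not be
        # established; the caller can simply poll again.
        return None
```

Any other exception type still propagates, so unrelated errors are not silently swallowed while waiting.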

nh2 · 2018-05-26T14:01:09Z

Made another small improvement so that nixops reboot --hard also works when the machine is off.

aszlig · 2018-05-27T04:06:18Z

@nh2: Approved, but see my last comment.

coretemp · 2018-06-01T18:23:28Z

nixops/backends/__init__.py

+ # command invocation changes.
+ # We use timeout=10 so that the user gets some sense
+ # of progress, as reboots can take a long time.
+ return self.run_command('last reboot --time-format iso | head -n1', capture_stdout=True, timeout=10).rstrip()


This pipe could be broken.

head -n1 -> head -n 1 http://pubs.opengroup.org/onlinepubs/9699919799/utilities/head.html

nh2 · 2018-06-28T14:55:51Z

OK, I've updated the commits to address the remaining comments.


grahamc · 2020-03-26T19:22:06Z

Hello!

Thank you for this PR.

In the past several months, some major changes have taken place in
NixOps:

- Backends have been removed, preferring a plugin-based architecture.
  Here are some of them:
- NixOps Core has been updated to be Python 3 only, and at the
  same time, MyPy type hints have been added and are now strictly
  required during CI.

This is all culminating in what I hope will be a NixOps 2.0
release. There is a tracking issue for that:
#1242. It is possible that
more core changes will be made to NixOps for this release, with a
focus on simplifying NixOps core and making it easier to use and work
on.

My hope is that by adding types and more thorough automated testing,
it will be easier for contributors to make improvements, and for
contributions like this one to merge in the future.

However, because of the major changes, it has become likely that this
PR cannot merge right now as it is. The backlog of now-unmergeable PRs
makes it hard to see which ones are being kept up to date.

If you would like to see this merge, please bring it up to date with
master and reopen it. If the or mypy type checking fails, please
correct any issues and then reopen it. I will be looking primarily at
open PRs whose tests are all green.

Thank you again for the work you've done here, I am sorry to be
closing it now.

Graham

nh2 force-pushed the issue-856-too-fast-reboot-race-condition branch from a3f6b8a to 6bc3e33 Compare February 3, 2018 17:21

nh2 force-pushed the issue-856-too-fast-reboot-race-condition branch from 6bc3e33 to 37bc5af Compare February 3, 2018 20:16

domenkozar reviewed Feb 4, 2018


nh2 mentioned this pull request May 8, 2018

Hetzner partitioning script #948

Closed

aszlig reviewed May 15, 2018


nh2 force-pushed the issue-856-too-fast-reboot-race-condition branch from 37bc5af to 3cbb613 Compare May 26, 2018 02:51

nh2 force-pushed the issue-856-too-fast-reboot-race-condition branch from 3cbb613 to 2fff961 Compare May 26, 2018 14:00

aszlig approved these changes May 27, 2018


coretemp reviewed Jun 1, 2018


nh2 force-pushed the issue-856-too-fast-reboot-race-condition branch from 2fff961 to f83e244 Compare June 28, 2018 14:21

nh2 force-pushed the issue-856-too-fast-reboot-race-condition branch from f83e244 to 8f94a85 Compare July 2, 2018 18:26

aszlig mentioned this pull request Dec 15, 2018

Allow specifying arbitrary SSH configuration for nodes #1024

Closed

grahamc closed this Mar 26, 2020
