Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't use port open check to determine if reboot completed. Fixes #856. #857

Closed

Conversation

nh2
Copy link
Contributor

@nh2 nh2 commented Jan 27, 2018

See #856

The old approach, waiting for the machine to not having an open
port, and then waiting for it to be open again, was insufficient,
because of the race condition that the machine rebooted so quickly
that the port was immediately open again without nixops noticing
that it went down. I experienced this on a Hetzner cloud server.

The new approach checks the last reboot on the remote side
to change, which is not racy.

@nh2
Copy link
Contributor Author

nh2 commented Jan 27, 2018

Changed output with this patch:

nixos-cachecache-hetzner> closures copied successfully
machine1> updating GRUB 2 menu...
machine1> rebooting...
packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe
machine1> waiting for reboot to complete...done.
machine1> activation finished successfully

@nh2
Copy link
Contributor Author

nh2 commented Jan 27, 2018

Alternative output that can be created by this:

nixos-cachecache-hetzner> closures copied successfully
machine1> updating GRUB 2 menu...
machine1> rebooting...
machine1> waiting for reboot to complete...packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe

machine1> mux_client_request_session: read from master failed: Broken pipe
machine1> done.
machine1> activation finished successfully

So the output is not as pretty as before due to the errors, but at least it works without race-condition.

@nh2 nh2 force-pushed the issue-856-too-fast-reboot-race-condition branch from a3f6b8a to 6bc3e33 Compare February 3, 2018 17:21
@nh2
Copy link
Contributor Author

nh2 commented Feb 3, 2018

I pushed a small fix (def reboot() being overridden in subclasses).

@nh2 nh2 force-pushed the issue-856-too-fast-reboot-race-condition branch from 6bc3e33 to 37bc5af Compare February 3, 2018 20:16
# and show an 'x' as progress indicator in that case.
self.log_continue("x")
if last_reboot_output is not None and last_reboot_output != pre_reboot_last_reboot_output:
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this call ssh.reset() eventually?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my current understanding, this is not needed, because the only thing reset() is call shutdown() which exits the control master process, and I think that gets cleaned up automatically when the connection dies due to the machine rebooting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, in theory you're right, but if the machine didn't manage to properly close the TCP socket (for example due to a hard reboot), the control master process is still alive.

Copy link
Contributor Author

@nh2 nh2 Jun 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point, you're right. I've changed this back to always reset at reboot.

I've also re-tested that change with the EC2 and Hetzner backends (though only Hetzner can do --hard reboots, where this is relevant).

@nh2 nh2 mentioned this pull request May 8, 2018
@nh2
Copy link
Contributor Author

nh2 commented May 8, 2018

All good? This PR is the required base of my new PR #948, so it would be cool if we could finish this one.

# manner, we compare the output of `last reboot` before and after
# the reboot. Once the output has changed, the reboot is done.
def get_last_reboot_output():
return self.run_command('last reboot --time-format iso | head -n1', capture_stdout=True).rstrip()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the Hetzner rescue system and the actual NixOS system are using systemd, so maybe it's a better idea to use systemd-analyze because it fails whenever bootup is not finished. At least that would avoid the current vs. last string comparison.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that can work because when NixOS is waiting for nixops keys to be uploaded (so, right here), you'll get

# systemd-analyze
Bootup is not yet finished. Please try again later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right... you're correct and after all we don't care about a truly finished reboot but only want to make sure the machine has rebooted after all, so last reboot should work.

@nh2 nh2 force-pushed the issue-856-too-fast-reboot-race-condition branch from 37bc5af to 3cbb613 Compare May 26, 2018 02:51
@nh2
Copy link
Contributor Author

nh2 commented May 26, 2018

Updated the PR, I noticed I have to not only catch nixops.ssh_util.SSHCommandFailed, but also nixops.ssh_util.SSHConnectionFailed explicitly.

@nh2 nh2 force-pushed the issue-856-too-fast-reboot-race-condition branch from 3cbb613 to 2fff961 Compare May 26, 2018 14:00
@nh2
Copy link
Contributor Author

nh2 commented May 26, 2018

Made another small improvement so that nixops reboot --hard also works when the machine is off.

@aszlig
Copy link
Member

aszlig commented May 27, 2018

@nh2: Approved, but see my last comment.

# command invocation changes.
# We use timeout=10 so that the user gets some sense
# of progress, as reboots can take a long time.
return self.run_command('last reboot --time-format iso | head -n1', capture_stdout=True, timeout=10).rstrip()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pipe could be broken.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@nh2 nh2 force-pushed the issue-856-too-fast-reboot-race-condition branch from 2fff961 to f83e244 Compare June 28, 2018 14:21
@nh2
Copy link
Contributor Author

nh2 commented Jun 28, 2018

OK, I've updated the commits to address the remaining comments.

…OS#856.

The old approach, waiting for the machine to not having an open
port, and then waiting for it to be open again, was insufficient,
because of the race condition that the machine rebooted so quickly
that the port was immediately open again without nixops noticing
that it went down. I experienced this on a Hetzner cloud server.

The new approach checks the `last reboot` on the remote side
to change, which is not racy.
@grahamc
Copy link
Member

grahamc commented Mar 26, 2020

Hello!

Thank you for this PR.

In the past several months, some major changes have taken place in
NixOps:

  1. Backends have been removed, preferring a plugin-based architecture.
    Here are some of them:

  2. NixOps Core has been updated to be Python 3 only, and at the
    same time, MyPy type hints have been added and are now strictly
    required during CI.

This is all accumulating in to what I hope will be a NixOps 2.0
release
. There is a tracking issue for that:
#1242 . It is possible that
more core changes will be made to NixOps for this release, with a
focus on simplifying NixOps core and making it easier to use and work
on.

My hope is that by adding types and more thorough automated testing,
it will be easier for contributors to make improvements, and for
contributions like this one to merge in the future.

However, because of the major changes, it has become likely that this
PR cannot merge right now as it is. The backlog of now-unmergable PRs
makes it hard to see which ones are being kept up to date.

If you would like to see this merge, please bring it up to date with
master and reopen it
. If the or mypy type checking fails, please
correct any issues and then reopen it. I will be looking primarily at
open PRs whose tests are all green.

Thank you again for the work you've done here, I am sorry to be
closing it now.

Graham

@grahamc grahamc closed this Mar 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants