
waiting for the machine to finish rebooting does not work when the machine reboots too quickly #856

Open
nh2 opened this issue Jan 27, 2018 · 1 comment

nh2 (Contributor) commented Jan 27, 2018

Using nixops deploy --force-reboot.

Good case:

nixos-cachecache-hetzner> closures copied successfully
machine1> updating GRUB 2 menu...
machine1> rebooting...
machine1> waiting for the machine to finish rebooting....[down]............................................................[up]
machine1> activation finished successfully
nixos-cachecache-hetzner> deployment finished successfully

Bad case:

machine1> setting custom nix.conf options
building all machine configurations...
machine1> copying closure...
nixos-cachecache-hetzner> closures copied successfully
machine1> updating GRUB 2 menu...
machine1> rebooting...
packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe
machine1> waiting for the machine to finish rebooting............................................................................................................................

Note the packet_write_wait error in the output above.

The machine is perfectly up and running in that case; running nixops ssh at this point works.

I suspect there is a race condition: if the machine shuts down so quickly that packet_write_wait appears before the waiting for the machine to finish rebooting message does, the reboot detection doesn't work.

nh2 (Contributor, Author) commented Jan 27, 2018

I can trigger this race condition reliably by inserting a sleep() in the right place:

    def reboot_sync(self, hard=False):
        """Reboot this machine and wait until it's up again."""
        self.reboot(hard=hard)

        # SLEEP INSERTED HERE: give the machine enough time to finish
        # rebooting before we start watching for the port to go down.
        import time
        time.sleep(20)

        self.log_start("waiting for the machine to finish rebooting...")
        # Waits for the SSH port to be closed, i.e. for the machine to go
        # down; hangs forever if the port is already open again.
        nixops.util.wait_for_tcp_port(
            self.get_ssh_name(), self.ssh_port, open=False,
            callback=lambda: self.log_continue("."))

So indeed, when the race condition triggers in my issue, the machine has already rebooted by the time we call wait_for_tcp_port(..., open=False), so this call hangs forever.
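
For reference, wait_for_tcp_port polls the given port until its reachability matches the open argument. A minimal sketch of that behavior (assumed; the real implementation in nixops.util may differ):

    import socket
    import time

    def wait_for_tcp_port(host, port, open=True, callback=None):
        """Poll until connecting to (host, port) succeeds (open=True)
        or fails (open=False). Sketch, not the real nixops code."""
        while True:
            try:
                socket.create_connection((host, port), timeout=1).close()
                reachable = True
            except OSError:
                reachable = False
            if reachable == open:
                return True
            if callback:
                callback()
            time.sleep(1)

With open=False against a machine that has already finished rebooting, reachable is True on every iteration, so the loop never terminates, which matches the hang shown above.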

nh2 added a commit to nh2/nixops that referenced this issue Jan 27, 2018
…OS#856.

The old approach, waiting for the machine's SSH port to close and then waiting for it to open again, was insufficient because of a race condition: the machine could reboot so quickly that the port was immediately open again, without nixops noticing that it had gone down. I experienced this on a Hetzner cloud server.

The new approach waits for the output of `last reboot` on the remote side to change, which is not racy.
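
A minimal sketch of that idea (hypothetical helper names, not the actual patch): record the output of `last reboot` before triggering the reboot, then poll until it changes. While the machine is down, the SSH command simply fails, which the loop treats as "not rebooted yet":

    import subprocess
    import time

    def reboot_and_wait(ssh_target, do_reboot, poll_interval=1):
        """Reboot via do_reboot() and block until `last reboot` on the
        remote machine reports a new boot record. Hypothetical sketch."""
        def last_reboot():
            try:
                return subprocess.check_output(
                    ["ssh", ssh_target, "last", "reboot"], timeout=10)
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
                return None  # machine down or SSH not up yet
        before = last_reboot()
        do_reboot()
        while True:
            after = last_reboot()
            if after is not None and after != before:
                return  # new boot record: the machine has rebooted
            time.sleep(poll_interval)

Because the comparison is against a boot record taken before the reboot, it cannot miss a reboot that happens arbitrarily fast.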
nh2 added a commit to nh2/nixops that referenced this issue Feb 3, 2018
nh2 added a commit to nh2/nixops that referenced this issue Feb 3, 2018
nh2 added a commit to nh2/nixops that referenced this issue Apr 17, 2018
nh2 added a commit to nh2/nixops that referenced this issue May 8, 2018
nh2 added a commit to nh2/nixops that referenced this issue May 26, 2018
nh2 added a commit to nh2/nixops that referenced this issue May 26, 2018
nh2 added a commit to nh2/nixops that referenced this issue May 26, 2018
nh2 added a commit to nh2/nixops that referenced this issue May 26, 2018
nh2 added a commit to nh2/nixops that referenced this issue Jun 28, 2018
nh2 added a commit to nh2/nixops that referenced this issue Jul 2, 2018
nh2 added a commit to nh2/nixops that referenced this issue Oct 28, 2018
tolbrino pushed a commit to tolbrino/nixops that referenced this issue Jun 25, 2020