
waiting for the machine to finish rebooting does not work when the machine reboots too quickly #856

Open
nh2 opened this issue Jan 27, 2018 · 1 comment

nh2 (Contributor) commented Jan 27, 2018

Using nixops deploy --force-reboot.

Good case:

nixos-cachecache-hetzner> closures copied successfully
machine1> updating GRUB 2 menu...
machine1> rebooting...
machine1> waiting for the machine to finish rebooting....[down]............................................................[up]
machine1> activation finished successfully
nixos-cachecache-hetzner> deployment finished successfully

Bad case:

machine1> setting custom nix.conf options
building all machine configurations...
machine1> copying closure...
nixos-cachecache-hetzner> closures copied successfully
machine1> updating GRUB 2 menu...
machine1> rebooting...
packet_write_wait: Connection to 1.2.3.4 port 22: Broken pipe
machine1> waiting for the machine to finish rebooting............................................................................................................................

Note the packet_write_wait error in the output above.

The machine is perfectly up and running in that case; running nixops ssh at this point works.

I suspect there is a race condition: if the machine shuts down so quickly that packet_write_wait appears before the waiting for the machine to finish rebooting message does, the reboot detection doesn't work.

nh2 (Contributor, Author) commented Jan 27, 2018

I can trigger this race condition reliably by inserting a sleep() in the right place:

    def reboot_sync(self, hard=False):
        """Reboot this machine and wait until it's up again."""
        self.reboot(hard=hard)

        # SLEEP INSERTED HERE: give the machine enough time to finish
        # rebooting before we start watching for the port to go down.
        import time
        time.sleep(20)

        self.log_start("waiting for the machine to finish rebooting...")
        # Waits for the SSH port to be closed, i.e. for the machine to go
        # down; hangs forever if the port is already open again.
        nixops.util.wait_for_tcp_port(
            self.get_ssh_name(), self.ssh_port, open=False,
            callback=lambda: self.log_continue("."))

So indeed, when the race condition triggers in my issue, the machine has already rebooted by the time we call wait_for_tcp_port(..., open=False), so this call hangs forever.
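
For reference, wait_for_tcp_port polls the given port until its reachability matches the open argument. A minimal sketch of that behavior (assumed; the real implementation in nixops.util may differ):

    import socket
    import time

    def wait_for_tcp_port(host, port, open=True, callback=None):
        """Poll until connecting to (host, port) succeeds (open=True)
        or fails (open=False). Sketch, not the real nixops code."""
        while True:
            try:
                socket.create_connection((host, port), timeout=1).close()
                reachable = True
            except OSError:
                reachable = False
            if reachable == open:
                return True
            if callback:
                callback()
            time.sleep(1)

With open=False against a machine that has already finished rebooting, reachable is True on every iteration, so the loop never terminates, which matches the hang shown above.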

nh2 added a commit to nh2/nixops that referenced this issue Jan 27, 2018
…OS#856.

The old approach, waiting for the machine's SSH port to close and then waiting for it to open again, was insufficient because of a race condition: the machine could reboot so quickly that the port was immediately open again, without nixops noticing that it had gone down. I experienced this on a Hetzner cloud server.

The new approach waits for the output of `last reboot` on the remote side to change, which is not racy.
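
A minimal sketch of that idea (hypothetical helper names, not the actual patch): record the output of `last reboot` before triggering the reboot, then poll until it changes. While the machine is down, the SSH command simply fails, which the loop treats as "not rebooted yet":

    import subprocess
    import time

    def reboot_and_wait(ssh_target, do_reboot, poll_interval=1):
        """Reboot via do_reboot() and block until `last reboot` on the
        remote machine reports a new boot record. Hypothetical sketch."""
        def last_reboot():
            try:
                return subprocess.check_output(
                    ["ssh", ssh_target, "last", "reboot"], timeout=10)
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
                return None  # machine down or SSH not up yet
        before = last_reboot()
        do_reboot()
        while True:
            after = last_reboot()
            if after is not None and after != before:
                return  # new boot record: the machine has rebooted
            time.sleep(poll_interval)

Because the comparison is against a boot record taken before the reboot, it cannot miss a reboot that happens arbitrarily fast.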
nh2 added a commit to nh2/nixops that referenced this issue Feb 3, 2018
nh2 added a commit to nh2/nixops that referenced this issue Feb 3, 2018
nh2 added a commit to nh2/nixops that referenced this issue Apr 17, 2018
nh2 added a commit to nh2/nixops that referenced this issue May 8, 2018
nh2 added a commit to nh2/nixops that referenced this issue May 26, 2018
nh2 added a commit to nh2/nixops that referenced this issue May 26, 2018
nh2 added a commit to nh2/nixops that referenced this issue May 26, 2018
nh2 added a commit to nh2/nixops that referenced this issue May 26, 2018
nh2 added a commit to nh2/nixops that referenced this issue Jun 28, 2018
nh2 added a commit to nh2/nixops that referenced this issue Jul 2, 2018
nh2 added a commit to nh2/nixops that referenced this issue Oct 28, 2018
tolbrino pushed a commit to tolbrino/nixops that referenced this issue Jun 25, 2020