[syncoid] CRITICAL ERROR: snapshots couldn't be listed #532

Closed
sasoiliev opened this issue Apr 21, 2020 · 12 comments · Fixed by #909

Comments

@sasoiliev

sasoiliev commented Apr 21, 2020

Hi,

I am trying to set up a set of periodic push replications from host A to host B with syncoid.

(The reason for not using the --recursive option is that there is no direct mapping between the dataset hierarchies on the two hosts, i.e. I can sync some of the datasets recursively, but others I still want to map manually.)

I have created systemd service and timer units for each dataset pair. The timer units are configured to trigger at the same time (every hour).

I am hitting an issue where at least one (and usually more) of the syncoid instances fails to list the snapshots on the remote host.

Here's the debug log of the latest run of one of the instances:

Apr 21 01:30:00 cube syncoid[7046]: DEBUG: SSHCMD: ssh    -i /root/.ssh/id_rsa
Apr 21 01:30:01 cube syncoid[7046]: ControlSocket /tmp/syncoid-root-root@offsite-1587425400 already exists, disabling multiplexing
Apr 21 01:30:02 cube syncoid[7046]: DEBUG: checking availability of lzop on source...
Apr 21 01:30:02 cube syncoid[7046]: DEBUG: checking availability of lzop on target...
Apr 21 01:30:02 cube syncoid[7046]: DEBUG: checking availability of lzop on local machine...
Apr 21 01:30:02 cube syncoid[7046]: DEBUG: checking availability of mbuffer on source...
Apr 21 01:30:02 cube syncoid[7046]: DEBUG: checking availability of mbuffer on target...
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: checking availability of pv on local machine...
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: checking availability of zfs resume feature on source...
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: checking availability of zfs resume feature on target...
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: syncing source persistent/content to target backup/content.
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: getting current value of syncoid:sync on persistent/content...
Apr 21 01:30:03 cube syncoid[7046]:   zfs get -H syncoid:sync 'persistent/content'
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: checking to see if backup/content on ssh    -i /root/.ssh/id_rsa -S /tmp/syncoid-root-root@offsite-1587425400 root@offsite is already in zfs receive using ssh    -i /root/.ssh/id_rsa -S /tmp/syncoid-root-root@offsite-1587425400>
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: checking to see if target filesystem exists using "ssh    -i /root/.ssh/id_rsa -S /tmp/syncoid-root-root@offsite-1587425400 root@offsite  zfs get -H name ''"'"'backup/content'"'"'' 2>&1 |"...
Apr 21 01:30:04 cube syncoid[7046]: DEBUG: getting current value of receive_resume_token on backup/content...
Apr 21 01:30:04 cube syncoid[7046]: ssh    -i /root/.ssh/id_rsa -S /tmp/syncoid-root-root@offsite-1587425400 root@offsite  zfs get -H receive_resume_token ''"'"'backup/content'"'"''
Apr 21 01:30:04 cube syncoid[7046]: DEBUG: no receive token found
Apr 21 01:30:04 cube syncoid[7046]: DEBUG: getting list of snapshots on persistent/content using   zfs get -Hpd 1 -t snapshot guid,creation 'persistent/content' |...
Apr 21 01:30:04 cube syncoid[7046]: DEBUG: getting list of snapshots on backup/content using ssh    -i /root/.ssh/id_rsa -S /tmp/syncoid-root-root@offsite-1587425400 root@offsite  zfs get -Hpd 1 -t snapshot guid,creation ''"'"'backup/content'"'"'' |...
Apr 21 01:30:05 cube syncoid[7046]: CRITICAL ERROR: snapshots couldn't be listed for backup/content (exit code 65280) at /nix/store/vp01hkbjr4345w381yzr39z9nh014hd8-sanoid-2.0.3/bin/syncoid.pl line 1462.

My working hypothesis is that a race condition exists, but I can't figure out what the shared resource is. I was thinking this might be due to the ControlMaster (-M) option used in the initial SSH connection, but I wasn't able to prove it.

Any help will be greatly appreciated. Thanks!

@phreaker0
Collaborator

@sasoiliev exit code 65280 needs to be shifted right by 8 bits to get the actual exit code of the ssh command used for listing the snapshots -> 255

The ssh documentation says:
ssh exits with the exit status of the remote command or with 255 if an error occurred.

So there is probably something wrong with your ssh connection. Can you monitor the log files of your remote server when this happens?
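
For reference, this is how the raw status decodes in Perl (a standalone illustration, not syncoid code):

# $? after a failed piped command holds the raw wait() status;
# the high byte is the command's exit code, the low bits the signal.
my $status   = 65280;          # the value from the error message
my $exitcode = $status >> 8;   # -> 255 (ssh's "an error occurred" code)
my $signal   = $status & 127;  # -> 0 (not killed by a signal)
print "exit=$exitcode signal=$signal\n";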

@sasoiliev
Author

@phreaker0 thanks for the pointer. I tried to bump the log level of the remote SSH server to DEBUG3, but this didn't help much.

Something I forgot to mention in my original post is that, when started manually, the systemd services succeed. It's only when the services are started by their timers that some of them fail. The success/failure distribution when triggered by the timer units seems random - it varies anywhere between 2 and 5 failed services, hence my suspicion of a race condition.

What I discovered is that making the SSH control socket name unique, either by adding a random delay to the services (via a sleep in ExecStartPre) or by using the Time::HiRes version of time(), works around the issue.

So the issue seems to stem from the fact that the SSH control socket is shared by multiple syncoid instances because of the name collision (ControlSocket /tmp/syncoid-root-root@offsite-1587425400 already exists, disabling multiplexing), but I don't understand yet why this is a problem.
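
For reference, the Time::HiRes variant of the workaround looks roughly like this (a sketch with placeholder values for $remoteuser and $rhost, not the actual syncoid code):

use Time::HiRes qw(time);   # makes time() return fractional seconds

my ($remoteuser, $rhost) = ('root', 'root@offsite');   # placeholders
my $socket = "/tmp/syncoid-$remoteuser-$rhost-" . time();
print "$socket\n";   # e.g. /tmp/syncoid-root-root@offsite-1587425400.73421

With fractional seconds in the name, two instances only collide if they start within the same microsecond or so.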

@phreaker0
Collaborator

@sasoiliev
I don't think it's a race condition, but I have a theory: the first syncoid creates the control socket, and the other runs will use it for their SSH connections. But when the first syncoid finishes, it also closes the master SSH session. If the other syncoid instances are still running, they won't be able to do any SSH work anymore.

@sasoiliev
Author

@phreaker0 right, this seems to be the case.

Would you consider a PR fixing this via File::Temp's tempdir? Basically this (tested only on Linux):

diff --git a/syncoid b/syncoid
index f891099..f323497 100755
--- a/syncoid
+++ b/syncoid
@@ -14,6 +14,7 @@ use Pod::Usage;
 use Time::Local;
 use Sys::Hostname;
 use Capture::Tiny ':all';
+use File::Temp qw(tempdir);
 
 my $mbuffer_size = "16M";
 
@@ -29,6 +30,10 @@ GetOptions(\%args, "no-command-checks", "monitor-version", "compress=s", "dumpsn
 
 my %compressargs = %{compressargset($args{'compress'} || 'default')}; # Can't be done with GetOptions arg, as default still needs to be set
 
+# Install an explicit signal handler to enable proper clean-up of temporary files/dirs.
+# See https://stackoverflow.com/questions/38711725/under-what-circumstances-are-end-blocks-skipped-in-perl
+$SIG{INT} = sub { };
+
 my @sendoptions = ();
 if (length $args{'sendoptions'}) {
        @sendoptions = parsespecialoptions($args{'sendoptions'});
@@ -1418,7 +1423,11 @@ sub getssh {
                 $remoteuser =~ s/\@.*$//;
                if ($remoteuser eq 'root' || $args{'no-privilege-elevation'}) { $isroot = 1; } else { $isroot = 0; }
                # now we need to establish a persistent master SSH connection
-               $socket = "/tmp/syncoid-$remoteuser-$rhost-" . time();
+               my $originalumask = umask 077;
+               my $socketdir = tempdir("syncoid-$remoteuser-$rhost.XXXXXXXX", CLEANUP => 1, TMPDIR => 1);
+               umask $originalumask;
+               $socket = "$socketdir/ssh.sock";
                open FH, "$sshcmd -M -S $socket -o ControlPersist=1m $args{'sshport'} $rhost exit |";
                close FH;
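
(Side note on the hunk above: the empty $SIG{INT} handler is what lets the CLEANUP => 1 part actually work on interruption. File::Temp removes the directory from an END block, and Perl's default SIGINT behavior kills the process without running END blocks; with a handler installed the process is no longer killed outright, so the cleanup still runs, per the Stack Overflow question linked in the comment.)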

@phreaker0
Collaborator

@sasoiliev sorry for taking so long to reply. I think the easiest solution would be to replace $remoteuser-$rhost with something like $$ ($$ = the PID of the current process). This would also fix #530.
@jimsalterjrs do you mind dropping $remoteuser-$rhost for the current pid?
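
A minimal sketch of that suggestion (illustrative, not the shipped code):

# $$ is the PID of the running perl process, unique among concurrently
# running instances, so same-second starts no longer collide
my $socket = "/tmp/syncoid-$$-" . time();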

@sasoiliev
Author

@phreaker0 no worries, thanks for getting back!

Indeed using the PID is much simpler.

Replacing time() with $$ (and leaving $remoteuser-$rhost) would also work, but it wouldn't address #530.

I guess if the intent of having the $remoteuser-$rhost part was to help in avoiding collisions then using /tmp/syncoid-$$ makes sense, but let's see what @jimsalterjrs thinks.

@ju1m

ju1m commented Sep 23, 2020

I've hit the same problem while implementing the same idea of using a systemd service and timer unit pair for each dataset pair, in NixOS/nixpkgs#98455.

Adding the PID to the socket path as suggested seems like a good fix. Another option would be to remove the timestamp but allow ControlMaster=auto, to get opportunistic multiplexing between all syncoid calls (though it would keep the service acting as master alive longer than required for its own transfer). However, if such an opportunistically shared master were to stall, it would stall all the other syncoid services, so maybe it's better to keep the master private to each syncoid call.
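
For illustration, the opportunistic variant would mean building the control options roughly like this instead of the per-run -M/-S pair (hypothetical, with placeholder values):

# whichever syncoid run connects first becomes the master; later runs
# transparently reuse its connection via the shared control path
my ($remoteuser, $rhost) = ('root', 'root@offsite');   # placeholders
my $controlopts = "-o ControlMaster=auto"
                . " -o ControlPath=/tmp/syncoid-$remoteuser-$rhost"
                . " -o ControlPersist=1m";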

@jimsalterjrs
Owner

> @sasoiliev sorry for taking so long to reply. I think the easiest solution would be to replace $remoteuser-$rhost with something like $$ ($$ = the PID of the current process). This would also fix #530.
> @jimsalterjrs do you mind dropping $remoteuser-$rhost for the current pid?

I can't believe I missed this. Hmmm, I think it would probably make more sense to just add a short pseudorandom hash. PIDs get recycled too; they shouldn't get recycled quickly enough to cause a problem... but then again, I wasn't prepared for simultaneous-to-the-second invocations of syncoid either, so if we're going to fix this, we should probably fix it in a more reliable and portable way.

Each call should definitely have its own ControlMaster, IMO. I don't want one depending on another.
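
One possible shape for such a suffix (an illustration only, not the committed fix from #909):

# PID plus time plus a random value: neither PID reuse nor same-second
# starts can then produce the same socket path
my $suffix = sprintf("%08x", int(rand(0xffffffff)));
my $socket = "/tmp/syncoid-$$-" . time() . "-$suffix";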

@darkpixel

I just hit this. It looks like it's because the socket name is $remoteuser-$rhost, but I have multiple boxes behind one hostname (e.g. router1.example.tld on ports 4001, 4002, and 4003). If I take backups manually instead of queuing them with tsp, I can serialize them and they work perfectly.

@gnordli

gnordli commented Dec 7, 2021

I have this happen all the time too. I fix it by doing:

$socket = "/tmp/syncoid-$remoteuser-$rhost-" . time() . rand(1000);

@sdettmer

sdettmer commented Apr 1, 2024

I guess the real cause of my issue in #902 could be that --identifier=EXTRA should be honored here as well: if the user assigns identifiers, maybe they are needed here for the same reason?

@sdettmer

sdettmer commented Apr 1, 2024

Also, I think support for $ENV{TMPDIR} could be considered (then I could use different TMPDIRs as a workaround).
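
A sketch of what honoring TMPDIR could look like (hypothetical, not current syncoid behavior):

# fall back to /tmp only when TMPDIR is unset, so concurrent runs can
# be isolated by pointing each at its own directory
my $tmpdir = $ENV{TMPDIR} // '/tmp';
my $socket = "$tmpdir/syncoid-$$-" . time();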
