[syncoid] CRITICAL ERROR: snapshots couldn't be listed #532

Closed
sasoiliev opened this issue Apr 21, 2020 · 12 comments · Fixed by #909

Comments

@sasoiliev

sasoiliev commented Apr 21, 2020

Hi,

I am trying to set up a set of periodic push replications from host A to host B with syncoid.

(The reason for not using the --recursive option is that there is no direct mapping between the dataset hierarchies on the two hosts, i.e. I can sync some of the datasets recursively, but others I still want to map manually.)

I have created systemd service and timer units for each dataset pair. The timer units are configured to trigger at the same time (every hour).

I am hitting an issue where at least one (and usually more) of the syncoid instances fails to list the snapshots on the remote host.

Here's the debug log of the latest run of one of the instances:

Apr 21 01:30:00 cube syncoid[7046]: DEBUG: SSHCMD: ssh    -i /root/.ssh/id_rsa
Apr 21 01:30:01 cube syncoid[7046]: ControlSocket /tmp/syncoid-root-root@offsite-1587425400 already exists, disabling multiplexing
Apr 21 01:30:02 cube syncoid[7046]: DEBUG: checking availability of lzop on source...
Apr 21 01:30:02 cube syncoid[7046]: DEBUG: checking availability of lzop on target...
Apr 21 01:30:02 cube syncoid[7046]: DEBUG: checking availability of lzop on local machine...
Apr 21 01:30:02 cube syncoid[7046]: DEBUG: checking availability of mbuffer on source...
Apr 21 01:30:02 cube syncoid[7046]: DEBUG: checking availability of mbuffer on target...
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: checking availability of pv on local machine...
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: checking availability of zfs resume feature on source...
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: checking availability of zfs resume feature on target...
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: syncing source persistent/content to target backup/content.
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: getting current value of syncoid:sync on persistent/content...
Apr 21 01:30:03 cube syncoid[7046]:   zfs get -H syncoid:sync 'persistent/content'
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: checking to see if backup/content on ssh    -i /root/.ssh/id_rsa -S /tmp/syncoid-root-root@offsite-1587425400 root@offsite is already in zfs receive using ssh    -i /root/.ssh/id_rsa -S /tmp/syncoid-root-root@offsite-1587425400>
Apr 21 01:30:03 cube syncoid[7046]: DEBUG: checking to see if target filesystem exists using "ssh    -i /root/.ssh/id_rsa -S /tmp/syncoid-root-root@offsite-1587425400 root@offsite  zfs get -H name ''"'"'backup/content'"'"'' 2>&1 |"...
Apr 21 01:30:04 cube syncoid[7046]: DEBUG: getting current value of receive_resume_token on backup/content...
Apr 21 01:30:04 cube syncoid[7046]: ssh    -i /root/.ssh/id_rsa -S /tmp/syncoid-root-root@offsite-1587425400 root@offsite  zfs get -H receive_resume_token ''"'"'backup/content'"'"''
Apr 21 01:30:04 cube syncoid[7046]: DEBUG: no receive token found
Apr 21 01:30:04 cube syncoid[7046]: DEBUG: getting list of snapshots on persistent/content using   zfs get -Hpd 1 -t snapshot guid,creation 'persistent/content' |...
Apr 21 01:30:04 cube syncoid[7046]: DEBUG: getting list of snapshots on backup/content using ssh    -i /root/.ssh/id_rsa -S /tmp/syncoid-root-root@offsite-1587425400 root@offsite  zfs get -Hpd 1 -t snapshot guid,creation ''"'"'backup/content'"'"'' |...
Apr 21 01:30:05 cube syncoid[7046]: CRITICAL ERROR: snapshots couldn't be listed for backup/content (exit code 65280) at /nix/store/vp01hkbjr4345w381yzr39z9nh014hd8-sanoid-2.0.3/bin/syncoid.pl line 1462.

My working hypothesis is that a race condition exists, but I can't figure out what the shared resource is. I was thinking this might be due to the ControlMaster (-M) option used in the initial SSH connection, but I wasn't able to prove it.

Any help will be greatly appreciated. Thanks!

@phreaker0
Collaborator

@sasoiliev exit code 65280 needs to be shifted right by 8 bits to get the actual exit code of the ssh command used for listing the snapshots -> 255

The ssh documentation says:
ssh exits with the exit status of the remote command or with 255 if an error occurred.

So there is probably something wrong with your ssh connection. Can you monitor the log files of your remote server when this happens?
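
For reference, this is how the raw status decodes in Perl (a standalone illustration, not syncoid code):

# $? after a failed piped command holds the raw wait() status;
# the high byte is the command's exit code, the low bits the signal.
my $status   = 65280;          # the value from the error message
my $exitcode = $status >> 8;   # -> 255 (ssh's "an error occurred" code)
my $signal   = $status & 127;  # -> 0 (not killed by a signal)
print "exit=$exitcode signal=$signal\n";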

@sasoiliev
Author

@phreaker0 thanks for the pointer. I tried to bump the log level of the remote SSH server to DEBUG3, but this didn't help much.

Something I forgot to mention in my original post is that, when started manually, the systemd services succeed. It's only when the services are started by their timers that some of them fail. The success/failure distribution when triggered by the timer units seems random - it varies anywhere between 2 and 5 failed services, hence my suspicion of a race condition.

What I discovered is that making the SSH control socket name unique, either by adding a random delay to the services (via a sleep in ExecStartPre) or by using the Time::HiRes version of time(), works around the issue.

So the issue seems to stem from the fact that the SSH control socket is shared by multiple syncoid instances because of the name collision (ControlSocket /tmp/syncoid-root-root@offsite-1587425400 already exists, disabling multiplexing), but I don't understand yet why this is a problem.
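
For reference, the Time::HiRes variant of the workaround looks roughly like this (a sketch with placeholder values for $remoteuser and $rhost, not the actual syncoid code):

use Time::HiRes qw(time);   # makes time() return fractional seconds

my ($remoteuser, $rhost) = ('root', 'root@offsite');   # placeholders
my $socket = "/tmp/syncoid-$remoteuser-$rhost-" . time();
print "$socket\n";   # e.g. /tmp/syncoid-root-root@offsite-1587425400.73421

With fractional seconds in the name, two instances only collide if they start within the same microsecond or so.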

@phreaker0
Collaborator

@sasoiliev
I don't think it's a race condition, but I have a theory: the first syncoid creates the control socket, and the other runs will use it for their SSH connections. But when the first syncoid finishes, it also closes the master SSH session. If the other syncoid instances are still running, they won't be able to do any SSH work anymore.

@sasoiliev
Author

@phreaker0 right, this seems to be the case.

Would you consider a PR fixing this via File::Temp's tempdir? Basically this (tested only on Linux):

diff --git a/syncoid b/syncoid
index f891099..f323497 100755
--- a/syncoid
+++ b/syncoid
@@ -14,6 +14,7 @@ use Pod::Usage;
 use Time::Local;
 use Sys::Hostname;
 use Capture::Tiny ':all';
+use File::Temp qw(tempdir);
 
 my $mbuffer_size = "16M";
 
@@ -29,6 +30,10 @@ GetOptions(\%args, "no-command-checks", "monitor-version", "compress=s", "dumpsn
 
 my %compressargs = %{compressargset($args{'compress'} || 'default')}; # Can't be done with GetOptions arg, as default still needs to be set
 
+# Install an explicit signal handler to enable proper clean-up of temporary files/dirs.
+# See https://stackoverflow.com/questions/38711725/under-what-circumstances-are-end-blocks-skipped-in-perl
+$SIG{INT} = sub { };
+
 my @sendoptions = ();
 if (length $args{'sendoptions'}) {
        @sendoptions = parsespecialoptions($args{'sendoptions'});
@@ -1418,7 +1423,11 @@ sub getssh {
                 $remoteuser =~ s/\@.*$//;
                if ($remoteuser eq 'root' || $args{'no-privilege-elevation'}) { $isroot = 1; } else { $isroot = 0; }
                # now we need to establish a persistent master SSH connection
-               $socket = "/tmp/syncoid-$remoteuser-$rhost-" . time();
+               my $originalumask = umask 077;
+               my $socketdir = tempdir("syncoid-$remoteuser-$rhost.XXXXXXXX", CLEANUP => 1, TMPDIR => 1);
+               umask $originalumask;
+               $socket = "$socketdir/ssh.sock";
                open FH, "$sshcmd -M -S $socket -o ControlPersist=1m $args{'sshport'} $rhost exit |";
                close FH;
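
(Side note on the hunk above: the empty $SIG{INT} handler is what lets the CLEANUP => 1 part actually work on interruption. File::Temp removes the directory from an END block, and Perl's default SIGINT behavior kills the process without running END blocks; with a handler installed the process is no longer killed outright, so the cleanup still runs, per the Stack Overflow question linked in the comment.)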

@phreaker0
Collaborator

@sasoiliev sorry for taking so long to reply. I think the easiest solution would be to replace $remoteuser-$rhost with something like $$ ($$ = the PID of the current process). This would also fix #530.
@jimsalterjrs do you mind dropping $remoteuser-$rhost for the current pid?
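
A minimal sketch of that suggestion (illustrative, not the shipped code):

# $$ is the PID of the running perl process, unique among concurrently
# running instances, so same-second starts no longer collide
my $socket = "/tmp/syncoid-$$-" . time();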

@sasoiliev
Author

@phreaker0 no worries, thanks for getting back!

Indeed using the PID is much simpler.

Replacing time() with $$ (and leaving $remoteuser-$rhost) would also work, but it wouldn't address #530.

I guess if the intent of having the $remoteuser-$rhost part was to help in avoiding collisions then using /tmp/syncoid-$$ makes sense, but let's see what @jimsalterjrs thinks.

@ju1m

ju1m commented Sep 23, 2020

I've hit the same problem while implementing the same idea of using a systemd service and timer unit pair for each dataset pair, in NixOS/nixpkgs#98455.

Adding the PID to the socket path as suggested seems like a good fix. Another option would be to remove the timestamp but allow ControlMaster=auto, to get opportunistic multiplexing between all syncoid calls (though it would keep the service acting as master alive longer than required for its own transfer). However, if such an opportunistically shared master were to stall, it would stall all the other syncoid services, so maybe it's better to keep the master private to each syncoid call.
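
For illustration, the opportunistic variant would mean building the control options roughly like this instead of the per-run -M/-S pair (hypothetical, with placeholder values):

# whichever syncoid run connects first becomes the master; later runs
# transparently reuse its connection via the shared control path
my ($remoteuser, $rhost) = ('root', 'root@offsite');   # placeholders
my $controlopts = "-o ControlMaster=auto"
                . " -o ControlPath=/tmp/syncoid-$remoteuser-$rhost"
                . " -o ControlPersist=1m";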

@jimsalterjrs
Owner

> @sasoiliev sorry for taking so long to reply. I think the easiest solution would be to replace $remoteuser-$rhost with something like $$ ($$ = the PID of the current process). This would also fix #530.
> @jimsalterjrs do you mind dropping $remoteuser-$rhost for the current pid?

I can't believe I missed this. Hmmm, I think it would probably make more sense to just add a short pseudorandom hash. PIDs get recycled too; they shouldn't get recycled quickly enough to cause a problem... but then again, I wasn't prepared for simultaneous-to-the-second invocations of syncoid either, so if we're going to fix this, we should probably fix it in a more reliable and portable way.

Each call should definitely have its own ControlMaster, IMO. I don't want one depending on another.
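
One possible shape for such a suffix (an illustration only, not the committed fix from #909):

# PID plus time plus a random value: neither PID reuse nor same-second
# starts can then produce the same socket path
my $suffix = sprintf("%08x", int(rand(0xffffffff)));
my $socket = "/tmp/syncoid-$$-" . time() . "-$suffix";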

@darkpixel

I just hit this. It looks like it's because the socket name is $remoteuser-$rhost, but I have multiple boxes behind one hostname (e.g. router1.example.tld on ports 4001, 4002, and 4003). If I take backups manually instead of queuing them with tsp, I can serialize them and they work perfectly.

@gnordli

gnordli commented Dec 7, 2021

I have this happen all the time too. I fix it by doing:

$socket = "/tmp/syncoid-$remoteuser-$rhost-" . time() . rand(1000);

@sdettmer

sdettmer commented Apr 1, 2024

I guess the real cause of my issue in #902 could be that --identifier=EXTRA should be honored here as well: if the user assigns identifiers, maybe they are needed here for the same reason?

@sdettmer

sdettmer commented Apr 1, 2024

Also, I think support for $ENV{TMPDIR} could be considered (then I could use different TMPDIRs as a workaround).
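
A sketch of what honoring TMPDIR could look like (hypothetical, not current syncoid behavior):

# fall back to /tmp only when TMPDIR is unset, so concurrent runs can
# be isolated by pointing each at its own directory
my $tmpdir = $ENV{TMPDIR} // '/tmp';
my $socket = "$tmpdir/syncoid-$$-" . time();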
