[syncoid] CRITICAL ERROR: snapshots couldn't be listed #532
Comments
@sasoiliev exit code 65280 needs to be shifted right by 8 bits to get the actual exit code of the ssh command used for listing the snapshots -> 255. The ssh documentation says that ssh exits with 255 when an ssh error occurred, so there is probably something wrong with your ssh connection. Can you monitor the log files of your remote server when this happens?
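For reference, a short sketch of that decoding (plain Perl wait-status handling, not taken from syncoid):

```perl
use strict;
use warnings;

my $status   = 65280;            # raw status as reported in the syncoid error
my $exitcode = $status >> 8;     # 255 -> ssh's "an error occurred" exit code
my $signal   = $status & 127;    # 0   -> the ssh process was not killed by a signal
printf "exit code %d, signal %d\n", $exitcode, $signal;
```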
@phreaker0 thanks for the pointer. I tried to bump the log level of the remote SSH server to …

Something I forgot to mention in my original post is that, when started manually, the systemd services succeed. It's only when the services are started by their timers that some of them fail. The success/failure distribution when triggered by the timer units is seemingly random - it varies anywhere between 2 and 5 failed services, hence my suspicion of a race condition.

What I discovered is that making the SSH control socket name unique - either by adding a random delay (via a …) or by … - makes the failures go away. So the issue seems to stem from the fact that the SSH control socket is shared by multiple syncoid instances because of the name collision (the socket path is derived from time(), so instances started in the same second end up with the same path).
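To illustrate the suspected collision, here is a minimal sketch (placeholder user/host values) using the same time()-based path template that syncoid builds, as shown in the diff further down:

```perl
use strict;
use warnings;

my ($remoteuser, $rhost) = ('root', 'backuphost');   # placeholder values

# Same template as syncoid's current socket path.
sub socket_path { return "/tmp/syncoid-$remoteuser-$rhost-" . time(); }

my $first  = socket_path();   # instance started by timer unit A
my $second = socket_path();   # instance started by timer unit B in the same second
print "collision: both instances would pass the same -S path to ssh\n"
    if $first eq $second;
```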
@sasoiliev …
@phreaker0 right, this seems to be the case. Would you consider a PR fixing this via File::Temp's tempdir? Something like the following:

diff --git a/syncoid b/syncoid
index f891099..f323497 100755
--- a/syncoid
+++ b/syncoid
@@ -14,6 +14,7 @@ use Pod::Usage;
 use Time::Local;
 use Sys::Hostname;
 use Capture::Tiny ':all';
+use File::Temp qw(tempdir);

 my $mbuffer_size = "16M";

@@ -29,6 +30,10 @@ GetOptions(\%args, "no-command-checks", "monitor-version", "compress=s", "dumpsn
 my %compressargs = %{compressargset($args{'compress'} || 'default')}; # Can't be done with GetOptions arg, as default still needs to be set

+# Install an explicit signal handler to enable proper clean-up of temporary files/dirs.
+# See https://stackoverflow.com/questions/38711725/under-what-circumstances-are-end-blocks-skipped-in-perl
+$SIG{INT} = sub { };
+
 my @sendoptions = ();
 if (length $args{'sendoptions'}) {
 	@sendoptions = parsespecialoptions($args{'sendoptions'});

@@ -1418,7 +1423,11 @@ sub getssh {
 	$remoteuser =~ s/\@.*$//;

 	if ($remoteuser eq 'root' || $args{'no-privilege-elevation'}) { $isroot = 1; } else { $isroot = 0; }

 	# now we need to establish a persistent master SSH connection
-	$socket = "/tmp/syncoid-$remoteuser-$rhost-" . time();
+	my $originalumask = umask 077;
+	my $socketdir = tempdir("syncoid-$remoteuser-$rhost.XXXXXXXX", CLEANUP => 1, TMPDIR => 1);
+	umask $originalumask;
+	$socket = "$socketdir/ssh.sock";

 	open FH, "$sshcmd -M -S $socket -o ControlPersist=1m $args{'sshport'} $rhost exit |";
 	close FH;
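For context, a small standalone sketch of the two pieces the diff relies on: tempdir() with CLEANUP => 1 removes the directory from an END block at exit, and END blocks are skipped when the process dies from an unhandled signal, which is what the empty $SIG{INT} handler is there to avoid (placeholder directory template, not the proposed syncoid code itself):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Without this, an unhandled SIGINT would terminate the process before the
# END-time cleanup registered by File::Temp gets a chance to run.
$SIG{INT} = sub { };

my $socketdir = tempdir("syncoid-demo.XXXXXXXX", CLEANUP => 1, TMPDIR => 1);
print "control socket would live at $socketdir/ssh.sock\n";

# When the script exits normally, File::Temp removes $socketdir and its contents.
```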
@sasoiliev sorry for taking so long to reply. I think the easiest solution would be to replace the time() call with the PID ($$).
@phreaker0 no worries, thanks for getting back! Indeed using the PID is much simpler. Replacing time() with the PID … I guess if the intent of having the …
I've hit the same problem while implementing the same idea of using "systemd service and timer units for each dataset pair" in NixOS/nixpkgs#98455.
I can't believe I missed this. Hmmm, I think it would probably make more sense to just add a short pseudorandom hash. PIDs get recycled also; they shouldn't get recycled quickly enough to cause a problem... but then again, I wasn't prepared for simultaneous-to-the-second invocations of syncoid. Each call should definitely have its own ControlMaster, IMO. I don't want one depending on another.
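A minimal sketch of that idea (placeholder user/host values, not the committed fix): combine the timestamp, the PID, and a short random component so that even same-second starts get distinct socket paths:

```perl
use strict;
use warnings;

my ($remoteuser, $rhost) = ('root', 'backuphost');   # placeholder values

# time() alone collides within a second; the PID alone can be recycled;
# together with a random component the chance of a collision is negligible.
my $suffix = sprintf("%x-%x-%04x", time(), $$, int(rand(0x10000)));
my $socket = "/tmp/syncoid-$remoteuser-$rhost-$suffix";
print "$socket\n";
```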
I just hit this. It looks like it's because the socket name is derived from time(), which only has one-second resolution.
I have this happen all the time too. I fix it by doing:

$socket = "/tmp/syncoid-$remoteuser-$rhost-" . time() . rand(1000);
I guess the real cause of my issue in #902 could be that --identifier=EXTRA should, at the very least, be honored here: if the user assigns identifiers, maybe they are needed in the socket name for the same reason?
Also, I think support for $ENV{TMPDIR} could be considered (then I could use different TMPDIRs as a workaround).
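A rough sketch of how both suggestions could fit together (hypothetical variable names, not current syncoid code): fold the user-supplied --identifier into the socket name and respect $ENV{TMPDIR}, falling back to /tmp:

```perl
use strict;
use warnings;

my %args = (identifier => 'nightly');                 # stand-in for parsed options
my ($remoteuser, $rhost) = ('root', 'backuphost');    # placeholder values

my $tmpdir = defined $ENV{TMPDIR} && length $ENV{TMPDIR} ? $ENV{TMPDIR} : '/tmp';
my $id     = defined $args{identifier} && length $args{identifier}
           ? "$args{identifier}-"
           : '';
my $socket = "$tmpdir/syncoid-$id$remoteuser-$rhost-$$";
print "$socket\n";
```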
Hi,
I am trying to set up a set of periodic push replications from host A to host B with syncoid.
(The reason for not using the --recursive option is that there is no direct mapping of the dataset hierarchy on the two hosts, i.e. I can sync some of the datasets with recursion, but others I still want to map manually.)

I have created systemd service and timer units for each dataset pair. The timer units are configured to trigger at the same time (every hour).
I am hitting an issue where at least one (but usually more) of the syncoid instances fail to list the snapshots on the remote host.
Here's the debug log of the latest run of one of the instances:
My working hypothesis is that a race condition exists, but I can't figure out which is the shared resource. I was thinking that this might be due to the ControlMaster (-M) option used in the initial SSH connection, but I wasn't able to prove this.

Any help will be greatly appreciated. Thanks!