Borg Deadlocked #813
Comments
Can you determine which process/host hogged the lock first and what happened there? In 1.0.0 there are some circumstances (mainly abnormal process termination) that usually result in stale locks. Some changes were made in master since then to avoid these; unless a borg process is sigkill'ed there shouldn't be any stale locks anymore. |
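As an aside on finding the lock holder: with a filesystem repository on borg 1.x, the lock metadata lives inside the repository directory, so a quick look there usually tells you which host/PID took the lock. A rough sketch, with /path/to/repo as a placeholder (the file names are my recollection of borg's layout, so verify against your own repo):

# inspect the repository's lock metadata to see who holds the lock
cat /path/to/repo/lock.roster          # roster of lock holders (JSON)
ls /path/to/repo/lock.exclusive/       # entry name encodes the host id and PID of the exclusive holder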
It seems like the remote host has crashed and the lock was not cleaned up properly. |
OK, in that case I'd say this is a feature, not a bug. If the backup host crashed while borg was active, you want to know that, and you want further backups to be stopped until you manually remove the lock AND run a borg check on the repo. |
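A minimal sketch of that manual recovery path, assuming a remote repo reachable over SSH (the repo URL is a placeholder); borg's break-lock command is intended for exactly this situation:

# run only after making sure no borg process is still using the repository
borg break-lock ssh://backup@backuphost/./repo    # remove the stale lock
borg check ssh://backup@backuphost/./repo         # then verify repository consistency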
The relevant code is at https://github.com/borgbackup/borg/blob/master/borg/remote.py#L75. Btw. maybe a repository.rollback() is more appropriate here? Or is this the regular clean-up path? Not sure... |
@enkore @rumpelsepp I understood "remote host has crashed" as "the machine running 'borg serve' has crashed". If I understood correctly, there is nothing to do here. |
Misunderstanding on my part then :) |
No. The only thing that I can find out is the error message with the traceback in the first cron mail. The machine did not crash; it was fine but refused to grant locks, so the other machines spammed me with Borg "Could not acquire lock..." mails. :) First email:
|
OK, so the borg serve machine did not crash, but there was a lock in the repo and we do not know why. I guess we can't do anything here if we do not find out why the lock was there, so I think I'll close this unless we get more information. |
Yeah. When this issue comes up again I will dig deeper into it and maybe enable debug logs. Currently everything is kept as silent as possible because of cron. ATM everything runs fine again. On March 30, 2016 7:20:37 PM GMT+02:00, TW notifications@github.com wrote:
Sent from my Android device with K-9 Mail. Please excuse my brevity. |
Having this issue very often here, backing up 10 servers or so onto a shared backup repository. As the original poster said, it is possible to mitigate a bit using a random sleep and a longer lock-wait, but every morning I have several instances of the backup that fail after a very long number of seconds, usually because of waiting too long for the lock. I am logging failed backups into my monitoring system, so I can tell exactly how many backups were successful and how many returned non-zero (and after how many seconds). |
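For anyone wanting to collect the same kind of numbers, a minimal cron-wrapper sketch that records exit status and duration; the repo URL, backed-up paths and log file location are made up:

#!/bin/sh
# hypothetical wrapper: run the backup, then log exit status and elapsed seconds for monitoring
start=$(date +%s)
borg create --lock-wait 1800 ssh://backup@backuphost/./repo::"$(hostname)-$(date +%Y-%m-%d_%H%M)" /etc /home
status=$?
echo "$(date) status=$status duration=$(( $(date +%s) - start ))s" >> /var/log/borg-backup.log
exit $status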
If the lock wait times out (after waiting for whatever time you specified), there are 2 possibilities: either another borg process really held the lock for that whole time (plain contention), or the lock is stale because the process that created it terminated without cleaning it up.
Maybe upgrade to borg 1.0.1, which has some fixes, including one related to lock cleanup. |
@ThomasWaldmann I have backported the fix for #773, but other than that I'm still using 1.0.0-4 (because that's the latest package available from the Ubuntu PPA as of today). |
Thanks, just asked in the "distro packages needed" issue for a borg 1.0.1 package, will try it as soon as it becomes available. |
I just got a stale lock and I'm not running any simultaneous backups. |
@lhupfeldt you can't determine from machine A whether a borg process on machine B is still running (where both are clients of a repo on machine C). So this is not a general solution; it only works for the easy case where there is only 1 client, always on the same machine. |
Since 1.0.0 there were some advancements in how process termination and locking are handled; in 1.0.0 most premature exits would have caused a stale lock (e.g. computer shutdown, ^C, connection loss, …), while this should be fixed for almost everything except SIGKILL / hard crashes by 1.0.3. Note that it's merely an inconvenience, not something causing corruption... |
@ThomasWaldmann I disagree. It is definitely more complex than determining if a process is running locally, but basically, if a human can determine it, then so can a program. The client borg will ask the server borg if it is running, and distinguishing should not be hard if each server process is started with a unique cmdline, so that it can recognize other instances of itself. I guess the check needs to ensure that only one process is accessing a specific repository, and that can be seen from the cmdline.

#!/bin/python3
# Copyright (c) 2012 Lars Hupfeldt Nielsen, Hupfeldt IT
# All rights reserved. This work is under a BSD license, see LICENSE.TXT.

import sys, os

import psutil


def singleton_script():
    proc_name = os.path.basename(__file__)
    my_proc = None
    for proc in psutil.process_iter():
        try:
            try:
                # Handle script called as 'python <script>'
                arg_name = os.path.basename(proc.cmdline()[1]) if len(proc.cmdline()) > 1 else None
            except psutil.ZombieProcess:
                continue
            except (PermissionError, psutil.AccessDenied, IndexError):
                arg_name = None
            if proc_name in (os.path.basename(proc.name()), arg_name):
                if my_proc:
                    print("Already running")
                    sys.exit(1)
                my_proc = proc
        except UnicodeDecodeError:
            # Workaround for broken psutil on non-English installations.
            # Singleton behaviour is still guaranteed if the script is installed
            # under a full path with an 'ascii' name.
            pass


if __name__ == "__main__":
    import time
    singleton_script()
    print("Going to sleep for 10 seconds. Run one more of me to test!")
    time.sleep(10)

Modifying this to check on the cmdline arguments would do the trick |
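To put the "look at the cmdline" idea into shell terms, a hedged sketch; it assumes the repository path actually shows up on the command line of the borg process on the repo host, which depends on how clients invoke borg serve:

# on the repo host: check whether any borg process for this repository is still alive
pgrep -af borg | grep -F '/path/to/repo' \
  && echo "a borg process is still using this repo" \
  || echo "no borg process found for this repo"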
"asking the server" and "cmdline arguments" require a interface change - we can only do that on bigger release steps and we want to do that rarely as it usually breaks (or just doesn't work) with older clients or servers. I suggest you just try latest 1.0.x and see if that solves most such problems. |
I have already upgraded to the latest version, so hopefully that will reduce the problem. I'm not necessarily talking about adding new cmdline arguments; couldn't the restrict-to-path be used for this? Even if a new parameter was added, it could probably be backwards compatible. I am able to resolve lock file problems, but I think all the family members I'm providing server space for would call me, without a clue as to why their backup failed, if they got this error. |
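For reference, restrict-to-path is a borg serve option that is typically pinned in the SSH forced command on the repo host. A sketch of such an authorized_keys entry (the key, repo path and extra ssh options are illustrative):

# ~/.ssh/authorized_keys on the repo host: force this key into borg serve, limited to one repo path
command="borg serve --restrict-to-path /srv/backups/client1",no-port-forwarding,no-X11-forwarding,no-pty ssh-ed25519 AAAA... client1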
Another approach that may be worth considering is using a separate lock management tool. I particularly like FLOM for this purpose: |
@level323 Does this require an external daemon? If it does, I would consider it unnecessary complexity and it may be overkill for this purpose. |
@lhupfeldt No, an external daemon is not required. But it does have that capability. FLOM supports many different use cases and locking mechanisms. FLOM compiles fairly easily with few dependencies. I'm not saying it's "the" solution to the needs of this particular use case. My suggestion was offered more out of a sense of pragmatism - that is, borg's lock handling may still have some rough edges (or may not be designed to suit certain use cases) in which case FLOM can likely resolve your immediate issue and may provide a useful solution for use cases that borg's locking system was not designed for. In the particular case mentioned by @jperville I believe FLOM can be used to completely eliminate the locking problems that were encountered. In this case FLOM can be used to ensure that only one instance of borg is ever running at any one time on the machine housing the central backup repo. Furthermore (if desired) FLOM can also be used to queue each remote source machine to backup to the central repo in turn (strictly one at a time). |
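A rough sketch of what that FLOM setup could look like on a client; the flags are my assumption from FLOM's documentation (-A for the flom daemon's unicast address, -r for the shared resource name), so check flom(1) before relying on them:

# wrap the backup so that only one client at a time gets the shared "borg-repo" lock
flom -A backuphost.example.com -r borg-repo -- /usr/local/bin/backup.sh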
This lock issue is still current for me, to the point that I gave up on the idea of using a shared borg repository for multiple clients. |
What I don't quite understand (yet), and you didn't really say anything about it... do you have these issues due to crashes, or is it just timing? In the latter case a different locking system just wouldn't make a difference... |
For me it is mostly timing: running several clients (on different hosts) at the same time that push to different archives in the same repository, the clients can take a really long time to acquire the lock (if they do at all). I can reproduce quite easily by running my backup command explicitly on each host, from a terminal, at close to the same time. Each backup should complete in around 5 minutes; however, when I start 3 at the same time, borg keeps fighting for the lock and no backup completes (or if it does, it takes way too long). |
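A low-tech alternative for the pure-contention case is to give every client a fixed slot instead of a random sleep, so the runs never overlap to begin with; hypothetical crontab entries, with times and script path made up:

# on client1's crontab:
0 2 * * *  /usr/local/bin/backup.sh
# on client2's crontab:
20 2 * * *  /usr/local/bin/backup.sh
# on client3's crontab:
40 2 * * *  /usr/local/bin/backup.sh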
This night all my backups failed due to some kind of deadlock. I have three servers: one contains the borg repository, and the other two servers push their backups into that (= the same) repository. I do hourly backups; the diff between the hourly backups is usually just about 9 MB or so, and normally they are really fast. The setup worked well for about 1 week.
To avoid races, I schedule my hourly cron with sleep $(jot -r 1 1 600) && backup.sh, which generates a random delay from 0 to 10 minutes. Additionally, I have set the --lock-wait flag to 1800. This night I got the following emails:

Server:
Client 1
Client 2
The server has too many borg processes left:
It seems that it deadlocked very badly; any ideas what went wrong? Is this a bug or a misconfiguration?