Traceback on a frequent backup #161
Let me expand a little bit on this... I am using the same (remote) repository for two different backup tasks, separated by different archive prefixes. One backup task is scheduled to run very frequently (every 5 minutes), while the other is scheduled to run only once per day. Both backup tasks include a 'prune' command after the 'create' command to remove older archives. The traceback in the first post happened only once, on a 'create' command for the frequent backup. The other backup, the daily one, which takes more time, always succeeds on the 'create' command but occasionally fails on the 'prune' command, with the following traceback:
So, I am thinking that there may be some kind of race condition in how cache transaction directories are handled when multiple processes try to access the same repository. @jborg could that be the case? Here is the output of 'attic list' for the two backup tasks just before the time that the 'prune' command fails: daily-archive_2014-12-27-02:15 Sat Dec 27 02:18:17 2014 As you can see, the 'create' commands of the two backup tasks finished very close in time. So, the 'prune' commands started just after 'Sat Dec 27 02:18:17 2014' and 'Sat Dec 27 02:18:18 2014' respectively.
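The interleaving suspected above can only happen if the two processes are not fully serialized around the shared state. As a minimal sketch of the intended behavior (the helper and lock path here are hypothetical, not Attic's actual locking code), each create/prune operation would hold one exclusive advisory lock for its whole duration:

```python
import fcntl
import os
import tempfile


def with_repo_lock(lock_path, action):
    """Run `action` while holding an exclusive advisory lock on lock_path.

    Hypothetical helper for illustration only: a second process calling
    this with the same lock_path blocks until the first releases the lock,
    so create and prune can never interleave on the shared cache.
    """
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return action()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


lock = os.path.join(tempfile.gettempdir(), "repo.lock")
print(with_repo_lock(lock, lambda: "create done"))  # prints: create done
```

If the tracebacks happen despite such a lock being held, the shared cache transaction directory is presumably being touched outside the locked region.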
I've never seen this before and it's not expected to ever happen. Is NFS, CIFS or some other kind of potentially unreliable filesystem used anywhere in this setup (on the local machine or the remote host)?
No, there isn't any kind of unreliable filesystem. EXT4 is used on both sides. As I have posted on issue #167, my very frequent backup task was taking quite a lot of time because it executed one 'list' command before the 'create' command and one 'prune' command after the 'create' command. Now I have improved this by not calling the list operation. Since that change, I have not seen errors of the kind described in this issue. That makes me think that these errors were more likely to happen when "overlapping" executions of 'list-create-prune' backup tasks were in action. Reducing the time needed for the backup task, by actually reducing the chain 'list-create-prune' to 'create-prune' (with 'create' being pretty fast), makes overlapping executions less likely. That said, I know that the operations acquire an exclusive lock on the repository, so overlapping normally should not be a problem.
To clarify (as it is not visible from the two tracebacks): the operation that fails in both cases is:
I am still getting these tracebacks in my 'create & prune' frequent backup task on the repository mentioned above. The traceback on the 'prune' operation is somewhat different:
On a side note, I wish the prune operation were done on the server side for a remote repository. If you have a big enough repository, it generates SO much traffic the way it is implemented right now, and it is SO slow, that a frequent backup task is practically not feasible. Next invocations are waiting for the previous ones to complete. I also suspect that the tracebacks above are somehow triggered by these create/prune processes piling up and "fighting" for access to the repo (although the locking mechanism should protect against this).
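The pile-up described above can be avoided at the scheduler level by skipping a run entirely when the previous one is still active, instead of queuing behind it. A minimal sketch using a non-blocking advisory lock (the helper and lock path are hypothetical, not part of Attic):

```python
import fcntl
import os
import tempfile


def try_acquire(lock_path):
    """Try to take an exclusive lock without blocking.

    Returns the open lock file on success, or None if another backup
    run still holds the lock. Illustrative sketch only.
    """
    f = open(lock_path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        # A previous invocation is still running; give up instead of piling up.
        f.close()
        return None


lock_path = os.path.join(tempfile.gettempdir(), "backup-task.lock")
holder = try_acquire(lock_path)
if holder is None:
    print("previous run still active, skipping")
else:
    print("running create + prune")  # the actual backup commands would go here
    holder.close()  # closing the file releases the lock
```

With a 5-minute schedule, a skipped run simply retries at the next interval, so slow prunes can no longer create a queue of waiting processes.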
@Ernest0x about too much traffic / server-side prune: please create a separate ticket about that.
I suspect this issue is caused by an internal inconsistency about whether a transaction is active or not. I also made a fix to the transaction state of a remote repo server process that could be relevant, see there:
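The kind of inconsistency suspected here can be modeled with a toy stub: if two code paths disagree about whether a transaction is active, one of them will eventually commit (or roll back) a transaction that was never begun, which surfaces as a traceback. This is an illustrative model only, not Attic's actual repository class:

```python
class TransactionStateError(Exception):
    """Raised when an operation disagrees with the recorded transaction state."""


class RepositoryStub:
    """Toy model of the suspected bug.

    All operations must agree on `transaction_active`; a mismatch
    anywhere (e.g. on the remote server process) produces an error
    like the tracebacks in this issue.
    """

    def __init__(self):
        self.transaction_active = False

    def begin(self):
        if self.transaction_active:
            raise TransactionStateError("transaction already active")
        self.transaction_active = True

    def commit(self):
        if not self.transaction_active:
            # The kind of mismatch that would surface as a traceback.
            raise TransactionStateError("commit without active transaction")
        self.transaction_active = False


repo = RepositoryStub()
repo.begin()
repo.commit()       # consistent: begin then commit
try:
    repo.commit()   # inconsistent: no transaction is active
except TransactionStateError as exc:
    print("caught:", exc)
```

A fix along these lines would make the client and the remote server process track the same state, so a stray commit fails loudly and early rather than corrupting the cache transaction directory.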
I got the following traceback on a frequent backup task (running every 5 minutes). The failure happened only once. Subsequent backup tasks succeeded normally, so not a big issue for me. I am just posting it, since maybe there is some corner case that can be handled better.