Traceback on a frequent backup #161
Let me expand a little bit on this... I am using the same (remote) repository for two different backup tasks, separated by different archive prefixes. One backup task is scheduled to run very frequently (every 5 minutes), while the other is scheduled to run only once per day. Both backup tasks include a 'prune' command after the 'create' command to remove older archives. The traceback in the first post happened only once, on a 'create' command for the frequent backup. The other backup, the daily one, which takes more time, always succeeds on the 'create' command but occasionally fails on the 'prune' command, with the following traceback:
So, I am thinking that there may be some kind of race condition in how cache transaction directories are handled when multiple processes try to access the same repository. @jborg could that be the case? Here is the output of 'attic list' for the two backup tasks just before the time that the 'prune' command fails: daily-archive_2014-12-27-02:15 Sat Dec 27 02:18:17 2014 As you can see, the 'create' commands of the two backup tasks finished very close in time. So, the 'prune' commands started just after 'Sat Dec 27 02:18:17 2014' and 'Sat Dec 27 02:18:18 2014' respectively.
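The interleaving suspected above can only happen if the two processes are not fully serialized around the shared state. As a minimal sketch of the intended behavior (the helper and lock path here are hypothetical, not Attic's actual locking code), each create/prune operation would hold one exclusive advisory lock for its whole duration:

```python
import fcntl
import os
import tempfile


def with_repo_lock(lock_path, action):
    """Run `action` while holding an exclusive advisory lock on lock_path.

    Hypothetical helper for illustration only: a second process calling
    this with the same lock_path blocks until the first releases the lock,
    so create and prune can never interleave on the shared cache.
    """
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return action()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


lock = os.path.join(tempfile.gettempdir(), "repo.lock")
print(with_repo_lock(lock, lambda: "create done"))  # prints: create done
```

If the tracebacks happen despite such a lock being held, the shared cache transaction directory is presumably being touched outside the locked region.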
I've never seen this before and it's not expected to ever happen. Is NFS, CIFS or some other kind of potentially unreliable filesystem used anywhere in this setup (on the local machine or the remote host)?
No, there isn't any kind of unreliable filesystem. EXT4 is used on both sides. As I have posted on issue #167, my very frequent backup task was taking quite a lot of time because it executed one 'list' command before the 'create' command and one 'prune' command after the 'create' command. Now I have improved this by not calling the list operation. Since that change, I have not seen errors of the kind described in this issue. That makes me think that these errors were more likely to happen when "overlapping" executions of 'list-create-prune' backup tasks were in action. Reducing the time needed for the backup task, by actually reducing the chain 'list-create-prune' to 'create-prune' (with 'create' being pretty fast), makes overlapping executions less likely. That said, I know that the operations acquire an exclusive lock on the repository, so overlapping normally should not be a problem.
To clarify (as it is not visible from the two tracebacks): the operation that fails in both cases is:
I am still getting these tracebacks in my 'create & prune' frequent backup task on the repository mentioned above. The traceback on the 'prune' operation is somewhat different:
On a side note, I wish the prune operation were done on the server side for a remote repository. If you have a big enough repository, it generates SO much traffic the way it is implemented right now, and it is SO slow, that a frequent backup task is practically not feasible. Next invocations are waiting for the previous ones to complete. I also suspect that the tracebacks above are somehow triggered by these create/prune processes piling up and "fighting" for access to the repo (although the locking mechanism should protect against this).
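The pile-up described above can be avoided at the scheduler level by skipping a run entirely when the previous one is still active, instead of queuing behind it. A minimal sketch using a non-blocking advisory lock (the helper and lock path are hypothetical, not part of Attic):

```python
import fcntl
import os
import tempfile


def try_acquire(lock_path):
    """Try to take an exclusive lock without blocking.

    Returns the open lock file on success, or None if another backup
    run still holds the lock. Illustrative sketch only.
    """
    f = open(lock_path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        # A previous invocation is still running; give up instead of piling up.
        f.close()
        return None


lock_path = os.path.join(tempfile.gettempdir(), "backup-task.lock")
holder = try_acquire(lock_path)
if holder is None:
    print("previous run still active, skipping")
else:
    print("running create + prune")  # the actual backup commands would go here
    holder.close()  # closing the file releases the lock
```

With a 5-minute schedule, a skipped run simply retries at the next interval, so slow prunes can no longer create a queue of waiting processes.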
@Ernest0x about too much traffic / server-side prune: please create a separate ticket about that.
I suspect this issue is caused by an internal inconsistency about whether a transaction is active or not. I also made a fix to the transaction state of a remote repo server process that could be relevant, see there:
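The kind of inconsistency suspected here can be modeled with a toy stub: if two code paths disagree about whether a transaction is active, one of them will eventually commit (or roll back) a transaction that was never begun, which surfaces as a traceback. This is an illustrative model only, not Attic's actual repository class:

```python
class TransactionStateError(Exception):
    """Raised when an operation disagrees with the recorded transaction state."""


class RepositoryStub:
    """Toy model of the suspected bug.

    All operations must agree on `transaction_active`; a mismatch
    anywhere (e.g. on the remote server process) produces an error
    like the tracebacks in this issue.
    """

    def __init__(self):
        self.transaction_active = False

    def begin(self):
        if self.transaction_active:
            raise TransactionStateError("transaction already active")
        self.transaction_active = True

    def commit(self):
        if not self.transaction_active:
            # The kind of mismatch that would surface as a traceback.
            raise TransactionStateError("commit without active transaction")
        self.transaction_active = False


repo = RepositoryStub()
repo.begin()
repo.commit()       # consistent: begin then commit
try:
    repo.commit()   # inconsistent: no transaction is active
except TransactionStateError as exc:
    print("caught:", exc)
```

A fix along these lines would make the client and the remote server process track the same state, so a stray commit fails loudly and early rather than corrupting the cache transaction directory.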
I got the following traceback on a frequent backup task (running every 5 minutes). The failure happened only once. Subsequent backup tasks succeeded normally, so not a big issue for me. I am just posting it, since maybe there is some corner case that can be handled better.