Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Traceback on a frequent backup #161

Open
Ernest0x opened this issue Dec 22, 2014 · 7 comments
Open

Traceback on a frequent backup #161

Ernest0x opened this issue Dec 22, 2014 · 7 comments

Comments

@Ernest0x
Copy link
Contributor

I got the following traceback on a frequent backup task (running every 5 minutes). The failure happened only once. Subsequent backup tasks succeeded normally, so not a big issue for me. I am just posting it, since maybe there is some corner case that can be handled better.

Traceback (most recent call last):
  File "/usr/local/bin/attic", line 5, in <module>
    pkg_resources.run_script('Attic==0.13', 'attic')
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 1236, in run_script
    exec(compile(open(script_filename).read(), script_filename, 'exec'), namespace, namespace)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/EGG-INFO/scripts/attic", line 3, in <module>
    main()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 715, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 705, in run
    return args.func(args)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 129, in do_create
    archive.save()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archive.py", line 198, in save
    self.cache.commit()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 118, in commit
    os.path.join(self.path, 'txn.tmp'))
OSError: [Errno 2] No such file or directory
@Ernest0x
Copy link
Contributor Author

Let me expand a little bit on this...

I am using the same (remote) repository for two different backup tasks, separated with a different archive prefix. One backup task is scheduled to run very frequently (every 5 minutes), while the other is scheduled to run only once per day. Both backup tasks include a 'prune' command after the 'create' command to remove older archives.

The traceback in the first post happened only once, on a 'create' command for the frequent backup. The other backup, the daily one, which takes more time, while it always succeeds on the 'create' command, occasionally it fails on the 'prune' command, with the following traceback:

Traceback (most recent call last):
  File "/usr/local/bin/attic", line 5, in <module>
    pkg_resources.run_script('Attic==0.13', 'attic')
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 1236, in run_script
    exec(compile(open(script_filename).read(), script_filename, 'exec'), namespace, namespace)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/EGG-INFO/scripts/attic", line 3, in <module>
    main()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 715, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 705, in run
    return args.func(args)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 350, in do_prune
    cache.commit()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 118, in commit
    os.path.join(self.path, 'txn.tmp'))
OSError: [Errno 39] Directory not empty

So, I am thinking that it may exist some kind of race condition related to how cache transaction directories are handled by multiple processes trying to access the same repository. @jborg could that be the case?

Here is the output of 'attic list' for the two backup tasks just before the time that 'prune' command fails:

daily-archive_2014-12-27-02:15 Sat Dec 27 02:18:17 2014
frequent-archive_2014-12-27-02:16 Sat Dec 27 02:18:18 2014

As you can see, the 'create' commands of the two backup tasks finished very close in time. So, 'prune' commands started just after 'Sat Dec 27 02:18:17 2014' and 'Sat Dec 27 02:18:18 2014' respectively.

@jborg
Copy link
Owner

jborg commented Jan 3, 2015

I've never seen this before and it's not expected to ever happen. Is NFS, CIFS or some other kind of potentially unreliable filesystem used anywhere in this setup (on the local machine or the remote host)?

@Ernest0x
Copy link
Contributor Author

Ernest0x commented Jan 4, 2015

I've never seen this before and it's not expected to ever happen. Is NFS, CIFS or some other kind of potentially unreliable filesystem used anywhere in this setup (on the local machine or the remote host)?

No, there isn't any kind of unreliable filesystem. EXT4 is used on both sides.

As I have posted on issue #167, my very frequent backup task was taking pretty much time because it executed one 'list' command before the 'create' command and one 'prune' command after the 'create' command. Now I have improved this by not calling the list operation. Since that change, I have not seen errors of the kind described on this issue. That makes me think that maybe these errors were more likely to happen when "overlapping" executions of 'list-create-prune' backup tasks were in action. Reducing the time needed for the backup task by actually reducing the chain 'list-create-prune' to 'create-prune' (with 'create' being pretty fast) makes it less possible to have overlapping executions. That said, I know that the operations acquire an exclusive lock to the repository, so that overlapping normally should not be a problem.

@ThomasWaldmann
Copy link
Contributor

To clarify (as it is not visible from the 2 tracebacks):

The operation that fails in both cases is:

    os.rename(os.path.join(self.path, 'txn.active'),
              os.path.join(self.path, 'txn.tmp'))

@Ernest0x
Copy link
Contributor Author

I am still getting these tracebacks in my 'create & prune' frequent backup task on the repository mentioned above. The traceback on the 'prune' operation is somewhat different:

Traceback (most recent call last):
  File "/usr/local/bin/attic", line 5, in <module>
    pkg_resources.run_script('Attic==0.13', 'attic')
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources.py", line 1236, in run_script
    exec(compile(open(script_filename).read(), script_filename, 'exec'), namespace, namespace)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/EGG-INFO/scripts/attic", line 3, in <module>
    main()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 715, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 705, in run
    return args.func(args)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archiver.py", line 346, in do_prune
    archive.delete(stats)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/archive.py", line 333, in delete
    self.cache.chunk_decref(items_id, stats)
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 198, in chunk_decref
    self.begin_txn()
  File "/usr/local/lib/python3.2/dist-packages/Attic-0.13-py3.2-linux-x86_64.egg/attic/cache.py", line 96, in begin_txn
    os.path.join(self.path, 'txn.active'))
OSError: [Errno 39] Directory not empty

On a side note, I wish the prune operation was done at the server side on a remote repository. If you have a big enough repository, it makes SO much traffic the way it is implemented right now and it is SO slow, that a frequent backup task is practically not feasible. Next invocations are waiting for the previous to complete. I also suspect that the traceback above are triggered somehow by these create/prune processes piling and "fighting" for getting access to the repo (although the locking mechanism should protect from this).

@ThomasWaldmann
Copy link
Contributor

@Ernest0x about too much traffic / serverside prune: rather create a separate ticket about that.

@ThomasWaldmann
Copy link
Contributor

I suspect this issue is caused by an internal inconsistency about whether a transaction is active or not.
The code in question changed after 0.13, so it might be that this error does not happen any more.

I also did a fix related to transaction state of a remote repo server process that could be related, see there:

https://github.com/borgbackup/borg/blob/0b9dcbf8e529d0548d99d75c980c1913594d68ac/borg/repository.py#L174

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants