
slow "Saving files cache" with many files #4869

Closed
DurvalMenezes opened this issue Dec 8, 2019 · 52 comments
Closed

slow "Saving files cache" with many files #4869

DurvalMenezes opened this issue Dec 8, 2019 · 52 comments

Comments

@DurvalMenezes commented Dec 8, 2019

On MacOSX, borg create slows down terribly during "Saving files cache", tracing the process shows tons of madvise() and very few write() syscalls

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

BUG/ISSUE

System information. For client/server mode post info for both machines.

Client only (no server).

Your borg version (borg -V).

borg 1.1.10

Operating system (distribution) and version.

Mac OSX High Sierra, output of sw_vers command:

ProductName:    Mac OS X
ProductVersion: 10.13.6
BuildVersion:   17G2208

Hardware / network configuration, and filesystems used.

Power Mac 2013 with Xeon 4c/8t CPU and 48GB RAM. Filesystem is OpenZFS On MacOSX (O3X).

How much data is handled by borg?

~35T bytes on ~64M files

Full borg commandline that lead to the problem (leave away excludes and passwords)

borg create --checkpoint-interval=10800 -v --show-rc --progress --stats ::20191204L210000

Describe the problem you're observing.

This is my first 'subsequent' archive, i.e., the first one after the initial backup, which took over 10 days to complete.

Initial estimates (based on the number of bytes (O) and files (N) displayed by borg) pointed to a 22-hour ETA, so I left it running and came back after 12 hours to have a look. The last line of output was "Saving files cache". I then waited for it to resume backing up files... and waited... and waited some more. ps auwx showed the borg process using CPU, but iostat showed negligible I/O; there was over 18GB of completely free RAM at the time (i.e., not even used by the OS/ZFS disk caches).
Only after about 4 more hours(!) did it finish saving the cache and get back to backing up files.

After 3 more hours (as indicated by --checkpoint-interval=10800), sure enough it displayed "Saving files cache" again... and again spent hours and hours there. Meanwhile, @textshell and I tried everything we could think of on IRC to diagnose the issue, arriving at no definite conclusion; dtruss showed the borg process making a LOT of madvise() syscalls (tens of thousands per minute) and very few (less than 2) write() syscalls, which led us to speculate it might have something to do with Python's garbage collection or memory management in general. We then tried some syscall profiling and dumping stack frames (both using dtrace and kill -USR2 on the borg process), but again reached no definite conclusion.

Please see the entire IRC conversation, with pastes etc, on http://rn0.me/show/fz07NsoKpH9eVb71H28u/

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes, I can reproduce it anytime; all I need to do is to rollback the ZFS dataset containing the borg repo and cache to the state I snapshotted before running the above command, and then run it again.

Include any warning/errors/backtraces from the system logs

No errors were produced or shown. We pried some backtraces out with both dtrace (OS/task level) and kill -USR2 (Python level); see the URL above.

I'm not including the CRUD benchmark results because I'm currently running a new borg create on the machine (after killing the previous one with kill -9 -- a simple ^C didn't terminate it, even after waiting over 5 minutes), this time with --checkpoint-interval=3600000 so as to try and avoid the problem by doing no checkpoints at all (the expected run time is much, much lower than the 1000-hour checkpoint period that option implies). But, as noted in the IRC log above, I tested a cp -rp of the entire borg cache directory to a new directory on the same filesystem, and it completed in less than 3m23s -- so there is definitely no I/O bottleneck here that could explain this issue.

@ThomasWaldmann ThomasWaldmann changed the title On MacOSX, borg create slows down terribly during "Saving files cache", tracing the process shows tons of madvise() and very few write() syscalls very slow "Saving files cache", many madvise(), few write() syscalls Dec 9, 2019
@ThomasWaldmann ThomasWaldmann changed the title very slow "Saving files cache", many madvise(), few write() syscalls slow "Saving files cache", many madvise(), few write() calls Dec 9, 2019
@ThomasWaldmann (Member)

where did you get your borg 1.1.10 and how did you install it?

@ThomasWaldmann (Member)

notes:

  • borg code does not have any madvise() calls, so this must be the Python interpreter or something else.

@ThomasWaldmann (Member)

This is the relevant code from 1.1.10 (NOTEs added):

        if self.files is not None:
            ttl = int(os.environ.get('BORG_FILES_CACHE_TTL', 20))
            # NOTE: use this env var to make your files cache as small as possible,
            # but not smaller. See docs/faq.
            pi.output('Saving files cache')
            with IntegrityCheckedFile(path=os.path.join(self.path, 'files'), write=True) as fd:
                for path_hash, item in self.files.items():
                    # NOTE: ^^^ this is dictiter_iternextitem in the trace
                    # Only keep files seen in this backup that are older than newest cmtime seen in this backup -
                    # this is to avoid issues with filesystem snapshots and cmtime granularity.
                    # Also keep files from older backups that have not reached BORG_FILES_CACHE_TTL yet.
                    entry = FileCacheEntry(*msgpack.unpackb(item))
                    if entry.age == 0 and bigint_to_int(entry.cmtime) < self._newest_cmtime or \
                       entry.age > 0 and entry.age < ttl:
                        msgpack.pack((path_hash, entry), fd)
                        # NOTE: ^^^ this causes I/O. so if you don't see I/O, if-condition was false.
                        # NOTE: in most cases, the if-condition gets false by entry.age >= ttl. 
            self.cache_config.integrity['files'] = fd.integrity_data
        pi.output('Saving chunks cache')
        with IntegrityCheckedFile(path=os.path.join(self.path, 'chunks'), write=True) as fd:
            self.chunks.write(fd)
        self.cache_config.integrity['chunks'] = fd.integrity_data
...

So, I'd say the time spent within "Saving files cache" but without I/O is borg NOT writing to disk the files cache items that are beyond the TTL. During that, it creates and garbage-collects Python objects all the time, in your case likely for ~64M items.
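For reference, the keep/discard test in that loop can be restated on its own like this (illustrative sketch only, not borg code; the real FileCacheEntry carries more fields than the two shown here):

from collections import namedtuple

# Simplified: the real entry also carries inode, size and chunk ids.
FileCacheEntry = namedtuple('FileCacheEntry', 'age cmtime')

def keep_entry(entry, newest_cmtime, ttl=20):
    # ttl comes from BORG_FILES_CACHE_TTL (default 20, see the code above)
    if entry.age == 0:
        # seen in this backup: keep only entries strictly older than the
        # newest cmtime seen, to avoid cmtime-granularity issues
        return entry.cmtime < newest_cmtime
    # carried over from older backups: keep until the TTL is reached
    return entry.age < ttl

On a first subsequent backup essentially every entry should have age == 0 and an old enough cmtime, so nearly every iteration should end in a write -- which is why a long no-I/O phase here is surprising.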

@ThomasWaldmann (Member)

Hmm, you said this is the first subsequent backup after the initial full backup.
So, all the files cache content should be < TTL and you should see disk I/O all the time.
At least as far as our code is concerned; it could be the Python GC jumping in and doing stuff.

@DurvalMenezes (Author) commented Dec 9, 2019

where did you get your borg 1.1.10 and how did you install it?

From the Homebrew package repository, and using its package manager:

$  brew search borgbackup
     ==> Casks
     borgbackup 
$  brew cask install borgbackup
  • borg code does not have any madvise() calls, so this must be the Python interpreter or something else.

Agreed. My guess (I'm far from being a MacOSX or Python specialist) is that the madvise() calls come from somewhere inside MacOSX's run-time libraries and are being indirectly triggered by the Python interpreter, which in turn is being indirectly 'called' and/or triggered by borg's operations during the "Saving files cache" stage.

So, I'd say the time spent within "Saving files cache" but without I/O is borg NOT writing to disk the files cache items that are beyond the TTL. During that, it creates and garbage-collects Python objects all the time, in your case likely for ~64M items.

Hmm, you said this is first subsequent backup after the initial full backup.

Exactly what I was going to say :-) Glad you agree.

@ThomasWaldmann (Member)

In that loop, it dynamically creates a lot of python objects (and then needs to garbage collect them again at some time).

I'll try what I can do about that.

The problem is that I can't test it for performance with as many files as you have, so for testing you would need to run borg from a git checkout (see our install docs). Can you do that testing?

For that source-based install, it would be interesting to first just try the 1.1.10 code with different Python versions; maybe some weirdness or bad performance in 3.5.x got fixed in 3.6.x or 3.7.x.

Especially interesting would be 3.6.9, which is the version I currently use for the master branch (and for future borg 1.2 alpha binary builds).
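One cheap experiment for that loop, independent of the Python-version question (a sketch of my own, not borg code; save_files_cache() is a hypothetical wrapper around the loop above): run the save once with the cyclic garbage collector switched off and compare wall-clock time and madvise() counts.

import gc
import time

def timed_save(save_files_cache):
    # Measurement aid only: disable the cyclic GC around the save loop,
    # then pay the collection cost once at the end.
    gc.disable()
    try:
        t0 = time.monotonic()
        save_files_cache()
        return time.monotonic() - t0
    finally:
        gc.enable()
        gc.collect()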

@ThomasWaldmann ThomasWaldmann self-assigned this Dec 9, 2019
@ThomasWaldmann ThomasWaldmann added this to the 1.1.11 milestone Dec 9, 2019
@ThomasWaldmann ThomasWaldmann changed the title slow "Saving files cache", many madvise(), few write() calls slow "Saving files cache" Dec 9, 2019
@ThomasWaldmann ThomasWaldmann changed the title slow "Saving files cache" slow "Saving files cache" with many files Dec 9, 2019
@ThomasWaldmann (Member)

BTW, thanks to @DurvalMenezes and @textshell for doing lots of analysis; the IRC log and pastebins were quite informative.

@DurvalMenezes (Author)

From the segment of code you pasted, I see no other place for that "indirect calling/triggering" to happen except inside the with/for loop you depicted above.

I'll try what I can do about that.

Ditto here, please count on me to help diagnose and fix that, as long as it can be done without fouling up the machine (as it is already in production as the main backup 'server' -- see below).

Problem is that I can't test it for performance with as many files as you have, so for testing you would need to run borg from a git checkout (see our install docs). Can you do that testing?

By "run borg from a git checkout", you mean this, right?

So, this is my main limitation: this machine is already in production as the main backup 'server' here. By 'server', I mean it's a dedicated machine zfs recv'ing daily ZFS snapshots (via zfs send from the main NAS server); this is working beautifully (sending a full daily 'delta' -- usually some tens of GBs and tens of thousands of files -- takes 10 minutes or less) and keeps them all available for point-in-time restores anytime in the future. That setup is fully tested and working, and we already had the opportunity to use it when a large set of files got clobbered by ransomware. I would not like to mess with this machine's install by replacing Python versions, etc.

The ideal would be to use some kind of 'container' technology to install everything needed to run borg from the git source separately and with minimal performance impact, like Docker (or better, LXC) on Linux, both of which use cgroup functionality to isolate containers from each other (and, more importantly for my case, the 'host' from the containers). I'm trying to determine whether such a thing exists for MacOSX, so far to no avail (I just found out that Docker on OSX uses VirtualBox and so would surely impact performance; for example, the 'container' would only be able to access the host's files via VB's virtual filesystem).

For that source-based install, it would be interesting to first just try 1.1.10 code with different python versions, maybe some weirdness or bad perf in 3.5.x got fixed in 3.6.x or 3.7.x.

Agreed: as mentioned in the IRC log, there was at least one bug in Python 2.6 (fixed in 2.7.0) that fit the description of the problem. Given we're long past 2.6, that is not the case here -- but I've stumbled on a lot of other bugs and issues in subsequent Python versions, some of them only fixed in later 3.y.x releases.

Especially interesting would be 3.6.9 which is the version I currently use for master branch (and future borg 1.2 alpha binary builds).

Agreed.

btw, thanks to @DurvalMenezes and @textshell for doing lots of analysis, the irc log and pastebins were quite informative.

You are more than welcome, @ThomasWaldmann, and thank you very much for all the time and effort you spent to make borg available for the rest of us.

Please give me a few days while I try and figure out how to go ahead with this.

Cheers,
-- Durval.

@ThomasWaldmann (Member)

https://borgbackup.readthedocs.io/en/stable/installation.html#from-source -- you need to start reading there and later skip forward to the git section.

virtualenv / venv is a lightweight "container", so you do not mess up your system Python's site-packages.
Recent Pythons can do this via python3.x -m venv ... without needing the separate virtualenv package.

Installing an additional Python should also do no harm, as long as you do not change your system python[3] (on Linux that would be /usr/bin/python[3] or anything on the PATH called python[3]).

I would rather not recommend using vboxsf; that could introduce other issues.

@DurvalMenezes (Author) commented Dec 9, 2019

virtualenv / venv is a lightweight "container", so you do not mess up your system python's site-packages. recent pythons can do this via python3.x -m venv ... without needing the separate virtualenv package

Thanks for the tip. But that would not isolate the system from the out-of-Python deps (like OpenSSL, libacl/libattr, etc), right?

i would rather not recommend using vboxsf, that could introduce other issues.

My thinking exactly.

I'm right now exploring, with our local MacOS specialist, the possibility of just doing a "bare-metal backup" of this entire machine, so we can go ahead full-steam with the tests, installing/replacing anything necessary, and then restore the machine entirely from this backup when we are done (or if our local ZFS send/recv backups are in any way impacted during the borg tests).

I will keep this issue posted on our progress on that.

@ThomasWaldmann (Member)

no, venv doesn't affect the non-python stuff.

@DurvalMenezes (Author)

no, venv doesn't affect the non-python stuff.

That's what I expected. I'm now moving ahead with that "bare metal backup" idea; I will let you know as soon as I get it implemented, update/install everything needed on the machine, and proceed to "run from source and with Python 3.6.9" as you indicated.

@DurvalMenezes (Author)

Folks, bad news: I just heard back from our Mac specialist that there is no way he knows of to make a "bare metal backup" of a MacOSX machine. Time Machine, which is the tool he uses most, doesn't make a bit-by-bit copy of the machine; it just copies applications, data and configurations, which are then restored on top of a new OS install.

Furthermore, he also strongly advised against changing the machine's installation, as backing up this data is critical for the company. And even without borg, we already have backups being made in a "local" capacity (i.e., from the NAS server to the disks attached to the backup machine via zfs send|recv, which is working perfectly), and changing the machine's installation could jeopardize that.

So we are moving ahead with borg as-is, working around the present issue by setting --checkpoint-interval=3600000 so as to avoid any checkpoint cache saving and therefore the associated performance loss (this should not be a problem, as this machine is very stable). In fact, our first subsequent backup (which is being run that way) is already over 90% done and should finish in a couple more hours.

When/if I get an unused MacOS machine with enough RAM to play with, I will try and get back to testing this (none is probable in the short term, unfortunately).

Again, thanks @ThomasWaldmann and everyone for the help.

@textshell (Member)

notes:

  • borg code does not have any madvise() calls, so this must be the Python interpreter or something else.

We know one source of madvise() from my fumbling around with dtrace:

              libsystem_kernel.dylib`madvise+0xa
              borg.algorithms.msgpack._packer.so`__pyx_tp_dealloc_4borg_10algorithms_7msgpack_7_packer_Packer(_object*)+0x4c
              Python`frame_dealloc+0x74
              Python`_PyEval_EvalCodeWithName+0xcba
              Python`fast_function+0x1a2
              Python`PyEval_EvalFrameEx+0x5cc5
              Python`fast_function+0x120
              Python`PyEval_EvalFrameEx+0x5cc5
              Python`_PyEval_EvalCodeWithName+0xc9b
              Python`fast_function+0x1a2
              Python`PyEval_EvalFrameEx+0x5cc5
              Python`fast_function+0x120
              Python`PyEval_EvalFrameEx+0x5cc5
              Python`fast_function+0x120
              Python`PyEval_EvalFrameEx+0x5cc5
              Python`_PyEval_EvalCodeWithName+0xc9b
              Python`fast_function+0x1a2
              Python`PyEval_EvalFrameEx+0x5cc5
              Python`_PyEval_EvalCodeWithName+0xc9b
              Python`fast_function+0x1a2
           629186

So that looks like cython.

If we want to make this easier to debug, it might be worth the effort to see if it reproduces with a small test program that just loads and saves the files cache. That would also likely make tests easier to run without doing actual backups.

@DurvalMenezes (Author) commented Dec 9, 2019

@textshell, good points. And if we have this test program to load and save the cache and it just needs a new Python interpreter and not all the libraries etc that the full borg depends on, I can run it on my machine with virtualenv/venv as @ThomasWaldmann suggested, as it very probably won't disturb its current installation.

@ThomasWaldmann (Member)

Good idea to use a test program.

It could even include generating a large files cache and then benchmarking the processing of it.
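A rough sketch of what such a stand-alone test program could look like (my sketch under simplifying assumptions: it only needs the msgpack package, uses a plain file instead of borg's IntegrityCheckedFile, and only approximates the entry layout rather than reusing borg's FileCacheEntry):

import os
import time
import msgpack

N = int(os.environ.get('N_ENTRIES', '1000000'))  # raise towards 64M on a big machine
TTL = 20

def build_cache(n):
    # Fake files cache: path_hash -> msgpack-packed (age, inode, size, cmtime, chunk_ids)
    files = {}
    for i in range(n):
        entry = (0, i, 4096, 1575000000 + i, [os.urandom(32)])
        files[os.urandom(16)] = msgpack.packb(entry)
    return files

def save(files, path, newest_cmtime):
    # Approximates the "Saving files cache" loop: unpack, test, repack to disk.
    kept = 0
    with open(path, 'wb') as fd:
        for path_hash, item in files.items():
            age, inode, size, cmtime, chunk_ids = msgpack.unpackb(item)
            if (age == 0 and cmtime < newest_cmtime) or (0 < age < TTL):
                msgpack.pack((path_hash, (age, inode, size, cmtime, chunk_ids)), fd)
                kept += 1
    return kept

if __name__ == '__main__':
    files = build_cache(N)
    t0 = time.monotonic()
    kept = save(files, 'files.test', newest_cmtime=1575000000 + N)
    print('saved %d of %d entries in %.1fs' % (kept, N, time.monotonic() - t0))

Running it on an otherwise loaded machine (e.g. with the ZFS ARC near its cap) versus an unloaded one might also show whether the slowdown needs memory pressure to appear.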

@DurvalMenezes (Author) commented Dec 10, 2019

Hello everyone,

My faith in working around the issue using --checkpoint-interval=3600000 was premature, for two different reasons:

  1. Obviously in retrospect, when borg create finishes it has to save the cache regardless of the checkpoint interval, and then the same slowdown manifests itself; as copied/pasted from the GNU screen screenlog file (the "-- time-stamp --" lines around each "block" of borg output are recorded by GNU screen using its "logtstamp" option):
  • Last "--progress" line (ie, referring to the last file backed up):
30.14 TB O 25.16 TB C 31.87 GB D 64629701 N REDACTED_DIR1/REDACTED_DIR2/REDACTED_DIR3/REDACTED_FILE
-- 0:sh -- time-stamp -- Dec/09/19 18:39:55 --
  • It's immediately followed by the usual cache-saving messages:
-- 0:sh -- time-stamp -- Dec/09/19 18:39:56 --
Compacting segments   0% 
[...]
Compacting segments  99%
[...]
-- 0:sh -- time-stamp -- Dec/09/19 18:41:27 --
Saving files cache
-- 0:sh -- time-stamp -- Dec/09/19 18:41:35 --
-- 0:sh -- time-stamp -- Dec/10/19  4:30:34 --

This is exactly what I observed in the window where the borg create process was running: it printed "Saving files cache" and then spent the next ~10 hours(!) again apparently doing nothing but using some CPU (I didn't check again, but it was almost certainly issuing lots of madvise() syscalls and very few write() syscalls).

Curiously enough, the next steps of cache-saving after "Saving files cache" were quite fast:

-- 0:sh -- time-stamp -- Dec/09/19 18:41:35 --
-- 0:sh -- time-stamp -- Dec/10/19  4:30:34 --
Saving chunks cache
-- 0:sh -- time-stamp -- Dec/10/19  4:30:36 --
-- 0:sh -- time-stamp -- Dec/10/19  4:32:37 --
Saving cache config              

Of course, this first one was face-palm time for me. But not the next one:

  2. Right after the above output, borg prints the final, usual --stats screen and then engages in another hours-long lapse (notice the "-- time-stamp --" lines):
------------------------------------------------------------------------------
Archive name: 20191204L210000
Archive fingerprint: 26dbb9244449d1366fca9d483957633aedb0f242bc52b08848e9c8d59ab6dacb
Time (start): Sun, 2019-12-08 17:27:29
Time (end):   Mon, 2019-12-09 18:39:54
Duration: 1 days 1 hours 12 minutes 24.83 seconds
Number of files: 64629709
Utilization of max. archive size: 25%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:               30.14 TB             25.16 TB             31.88 GB
All archives:               65.55 TB             55.09 TB             15.02 TB

                       Unique chunks         Total chunks
Chunk index:                10723381            157418476
------------------------------------------------------------------------------
-- 0:sh -- time-stamp -- Dec/10/19  4:32:46 --
-- 0:sh -- time-stamp -- Dec/10/19  9:40:23 --
terminating with success status, rc 0

I.e., it spent over 5 hours after printing the stats screen and before printing the final --show-rc message and finally terminating.

So, in total it spent ~25h actually working (doing I/O, updating the progress line, etc.) and then over 15 hours in the above two lapses.

Some further data: I ran borg under /usr/bin/time -lp to obtain final stats on memory use etc., and here's what it printed after it finished (the lines starting with "->" are my comments):

real    144773.31
user     23994.20
sys      24996.57
-> OK, so ~40h total real time (which tallies with the ~25h the program reported itself
-> as using in the `--stats` output, plus the ~10h at the "Saving files cache" lapse,
-> plus the ~5h at the right-before-finishing lapse).

18132713472  maximum resident set size
-> This is interesting. I had a `ps auwx | grep borg` loop running and getting logged in another
-> window, and the most it recorded in the RSS column for the `borg create` process was
-> 26521496 (KiB), i.e., almost 50% more than `/usr/bin/time -lp` reported.
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
 708805728  page reclaims
-> Now ~709M seems to be a *lot* of page reclaims. See below for my further observations on this.
        31  page faults
-> This is the number of "'hard' page faults" (ie, that needed I/O to be satisfied). 
-> The amount seems negligible.
         0  swaps
         0  block input operations
         0  block output operations
         7  messages sent
         7  messages received
         0  signals received
  69919545  voluntary context switches
  27475483  involuntary context switches
-> Again, these two seem like a lot. But considering the long run time of the program (during which
-> a lot of scheduling interrupts must have happened, which IIUTC generate involuntary context switches),
-> and the ton of syscalls, madvise() and others, that happened during its execution (which again
-> IIUTC would generate voluntary context switches), it is probably as expected.

I did some research on what "page reclaims" means and what I was able to cobble together is that these are " 'soft' page faults" that are satisfied by claiming a page from the list of free pages in the system (ie, without doing any swapping/paging I/O), so not too bad. But I wonder whether it is somehow connected to the ton of madvise() calls we saw, either as a cause or as a side-effect or co-effect.

I got a hunch that perhaps the ZFS ARC cache (which by default tries to claim most of the idle memory up to half the machine's memory, leaving around 1M free for immediate allocation and then freeing more automatically as needed, when that 1M gets used) has something to do with it.

Perhaps the Python interpreter and/or system runtime are seeing the system short on totally free RAM (because of this ZFS ARC effect) and are engaging in dramatic action to try and free some of it up?

I will go ahead and make another subsequent backup, but this time I will restrict the ZFS arc_max parameter (which controls the maximum RAM it ever grabs for its ARC cache) to a mere 4GB; on this 48GB machine that would leave ~40GB free for borg create. Let's hope it's enough to avoid this possible effect (if that is in fact what's happening).

For this, please wait for my next report in a day or so.

Cheers,
-- Durval.

@infectormp (Contributor)

We already have something like this #1871

@DurvalMenezes (Author)

We already have something like this #1871

Agreed, with due regard for the differences:

  • 14M files instead of 64M files in my case (ie, ~4.5x less)
  • 16GB RAM instead of 48GB RAM (~3 times less)
    -> So, the relation Number_Of_Files/RAM is 1.5x larger for my case;
  • Linux instead of MacOSX;
  • Borg version 1.0.x instead of 1.1.x;

Anyway, the similarities are tantalizing: in that other issue, the problem was also the subsequent backups taking over 24h (so they could not be done "daily"), and borg was also spending a lot of time in the "Saving files cache" step (though I don't think that @Phuzzy112, the OP, went as deep into this as I did -- perhaps they would have seen madvise() or the equivalent in their case too?). Also, it involved ZFS.
The solution seems to have been reducing the number of files processed at a time (by segmenting the backup) and possibly adding more RAM later on.

Anyway, that was over 3 years and a major borg version ago, but perhaps it goes to show that the problem is more widespread (across major borg versions and very different OSs, and very probably Python versions too).

@DurvalMenezes (Author)

So, my second subsequent backup finished during the night.

It seems that reducing the maximum memory ZFS grabs for its ARC to 4G (from the default of half the RAM, ie 24G) had a HUGE effect:

  • Total backup time was reduced from ~40h to ~15h(!);
  • The unavoidable "Saving files cache" at the end took only ~24m instead of 10h(!!);
  • The other "lapse" between printing the stats and the final "terminating with success status, rc 0" wasted only ~6m instead of 5h(!!!).

This was very surprising: ZFS is designed from the ground up to "feel" memory pressure and proactively release memory from its ARC back to the OS, in order to avoid exactly this. Perhaps Python and/or the system run-time is also trying to be proactive, and so, instead of putting pressure on ZFS and getting the memory it needs, it spends those hours going through its memory regions desperately trying to release memory it can ill afford to give up?

Anyway, I now see the main problem is with memory: borg is using far more than the 15GB my initial calculations (based on the formula from the borg docs) indicated. More on this:

  • Thanks to @textshell's advice while I was monitoring this backup run (and bitching on IRC about the memory use), I checked the $BORG_CACHE_DIR/chunks file size to see whether there had been a rehash, and there had not: the chunks file size remained exactly the same (725363206 bytes), which BTW has also been the same since the initial backup.

  • Nevertheless, the process RSS (as shown by ps auwx) grew progressively from ~9.4GB right after the "Initializing cache transaction: Reading files" at its start (ie, right before processing the first file) to ~28.4GB (ie, more than triple) by the time it was doing the "Saving files cache" at the end.

  • This tripling of RAM usage is really puzzling to me: the number of files didn't change significantly (64668671 this time vs 64629709 last time, i.e., only a 0.06% difference, and I strongly suspect -- borg shows no stats in this regard -- that the vast majority of the already-present files were unmodified). So, if it's not the chunks, and not the files cache, what could be responsible for using so much memory from the time borg finishes reading the files cache to the time it finishes doing the backup? Some kind of memory leak, perhaps?

  • What worries me most is that the total memory used (as shown by /usr/bin/time -lp) seems to be growing enormously from one subsequent backup to the next: from 18.1GB during the previous run (1st 'subsequent' backup after the initial one) to 29.1GB this time (2nd 'subsequent' backup), i.e., over 60%(!) more. As I changed the ZFS ARC limit this time, it could be that borg/Python/the system run-time now felt more "comfortable" growing into the resulting additional free memory...

I'm right now starting my 3rd subsequent backup and will keep ZFS arc_max at 4GB, let's see how it goes.

@horazont (Contributor)

Based on the data, I have a possible explanation. Before I write it down, here are a few caveats/assumptions:

  • Whatever data it is borg holds in RAM, it is mostly cold, i.e. not accessed, except while writing down the cache.
  • When the files cache is saved, most or all of those pages are touched, for example to write them to files.
  • OSX treats cold pages with a priority similar to (or maybe even lower than) ZFS ARC memory.

Now here’s what I think might be going on:

Based on the observation of the massively increased resident memory use with the reduced ARC size, it seems to me that borg on @DurvalMenezes' system and run needs ~30 GB to operate. A lot, but let's treat that as a given for now. Also, as mentioned above, I assume that much of this is cold memory pages from memory mappings or Python objects which are rarely touched.

I suggest that during the previous run (with the large ARC size), the cold pages of borg lost against the ARC memory in priority and got swapped out. Then, whenever it tried to write the cache, it had to iterate over all its pages, swapping them in painfully, which is why writing the cache took ages. At the same time, when swapping in pages, the OS had to decide whether to release ARC memory (potentially causing disk accesses) or to swap out other pages (of borg, most likely, since anything else swappable would already have been swapped out at this point), also causing disk accesses. Which is why everything was terribly slow here.

Then you reduced the ARC size and did a re-run. In this run, there was no pressure, since the ARC was small enough to leave room for the full borg process (~30 GiB) in RAM. Everything went smoothly, without swapping.

This explains all of the observations:

  • Massively increased memory use: only the observed memory use increased, since the tools you used only measure resident set size, and the resident set size was previously restricted by the memory pressure caused by competition with the ARC.

  • Extremely slow cache operations: touching all those pages caused massive swapping.

@ThomasWaldmann (Member)

BTW, IIRC Python has its own internal memory management: it gets bigger memory chunks from the OS when needed and then manages that memory internally to satisfy smaller allocations. IIRC, it is rather lazy about giving memory back to the OS (if it ever does; not sure).
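If it helps, CPython can dump the internal state of that allocator (pymalloc); comparing a dump taken before and after the save loop would show whether arenas are being created and released in large numbers during it. A tiny sketch (CPython-specific diagnostic, prints to stderr):

import sys

def dump_pymalloc_stats(label):
    # Prints pymalloc arena/pool/block statistics to stderr; the arena
    # counts are the interesting part when hunting for madvise()/free storms.
    print('--- pymalloc stats: %s ---' % label, file=sys.stderr)
    sys._debugmallocstats()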

@DurvalMenezes (Author) commented Dec 11, 2019

@ThomasWaldmann, thanks for the greatly detailed response! Much appreciated.

So, in my case (client only), I have just chunks cache and files cache (as I expected from my conversations on IRC).

Re: "going over the threshold for resizing": as the total size and the number of files hardly changed (and specially the chunks file on disk, which I understand reflects the chunks cache size, remains at the exact same size since the initial backup) then this should not be the case, right?

you see that the generation is part of the value, not of the key. so there is only one dict entry per abspath key (== per file location).

This is as I expected from my IRC conversations. It doesn't explain how the RSS used by the process is apparently growing by ~9GB each time I run it (i.e., for each "generation").

Let's see how this 3rd subsequent backup (running right now) ends -- will keep this issue posted.

@jdchristensen (Contributor)

Another strategy that might work for @DurvalMenezes would be to break up the tree being backed up into a number of sub-trees that are backed up separately. Maybe to separate repos. And you could use different home directories for the backups for each sub-tree, so the file caches would be independent. (That would also allow you to run the backups in parallel, which might or might not work well, depending on what is the bottleneck.)

There's still the mystery of why the RSS changed so much when the ARC was changed. I have never heard of python adjusting its memory usage based on memory pressure, so I suspect some swapping, but @DurvalMenezes thinks that wasn't happening.

Also, about borg memory usage: could there be unnecessary object creation causing garbage collection? Could there be a memory leak? Not sure how to test for these in python.
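For the leak question, one option that works in plain Python is the standard library's tracemalloc; whether it catches this particular growth depends on the extra memory actually being held by Python objects (it won't see pymalloc arenas kept around or memory held by C extensions). A minimal sketch:

import tracemalloc

tracemalloc.start(25)                 # keep up to 25 stack frames per allocation

before = tracemalloc.take_snapshot()
# ... run the suspect phase here, e.g. process one batch of files ...
after = tracemalloc.take_snapshot()

# Show the call sites whose retained memory grew the most in between.
for stat in after.compare_to(before, 'lineno')[:15]:
    print(stat)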

@ThomasWaldmann (Member)

@DurvalMenezes for non-client-server mode, the repo index is loaded into the client.

Splitting the files into multiple subsets and using multiple repos would help most.

That keeps the number of files per files cache (which is per repo and/or per user, if different users are used) and the number of chunks in the repo (which is also per repo, obviously) lower, and causes correspondingly lower memory usage.

BTW, I also had strange/fatal effects with ZFS on Linux and memory allocation -- in my case not with borg, but with ZFS somehow competing against KVM (on Proxmox VE). Thus, I usually set a max ARC cache limit to a not-too-high value (a few GBs).

@DurvalMenezes (Author) commented Dec 11, 2019

@ThomasWaldmann and @jdchristensen, thanks for the suggestions re: splitting the files between multiple repos.

I've seen something like that in other postings here on GitHub and briefly considered it, but I would rather not do it, as it would require me to start over from the beginning again and that would be rather painful (especially as I would be throwing away 1 month of my parallel rclone process struggling to copy the borg repo up to the cloud -- which is now almost finished).

I'm trying to get more memory for that machine (it can go all the way to 128GB with 3rd party modules -- almost 3x the current 48GB configuration), so it will gain me some time (and later, if we can't 'fix' it in borg, I think I can make it work by keeping a smaller number of backups in the repo before pruning).

But finding why/how is borg using up all this memory would really be best, for everyone IMHO.

@ThomasWaldmann (Member)

The hashindex entries are already quite dense: a 256-bit key and some 32-bit ints -- no fat to get rid of here.
The size of the hashindex is roughly proportional to the chunk count in the repo; not much we can do about that.

The Python dict's internal behaviour and the Python/kernel memory management are not under our control, so we have to live with them. The files cache size is proportional to the number of files (and also a bit to their size); not much we can do about that either.

What one can do as an emergency measure, if the memory needed is beyond the available RAM, is to switch off the files cache (--files-cache=disabled) -- but that has a huge performance impact, as every file will need to be read/chunked/hashed at backup time. So it's better than nothing, but not really good for huge amounts of data.

@ThomasWaldmann (Member)

Did somebody write that "out of borg" test program yet?

I have an idea how maybe that loop code could be optimized, but I need a way to test / measure the effect.

@DurvalMenezes (Author)

Did somebody write that "out of borg" test program yet?

I have an idea how maybe that loop code could be optimized, but I need a way to test / measure the effect.

@ThomasWaldmann, I think that, given the apparent effect of memory pressure on this issue, the program will need some way to simulate that too (perhaps by allocating memory along with the records read/written), or it will probably not be able to reproduce the problem.

Perhaps you have that in mind already... I'm just making sure it doesn't go by unnoticed.

@DurvalMenezes (Author)

Hello everyone, especially @ThomasWaldmann, @textshell and @horazont:

It seems I just found a surprising workaround for the issue: hold 2GB of extra memory in the ZFS ARC before starting the borg create process, then release it after borg starts slowing down.

For more details, see the paste of my announcement on the borg IRC channel: http://rn0.me/show/tnEllC18pOCm27Npt6C9/

If it keeps working during the next couple of borg create runs, I will post an update here.

@DurvalMenezes (Author)

Aaaand.... it worked! AGAIN! yay! ;-)

Here are more details on what I did:

  1. Once per hour, I calculated the speed of the process in files/second: I took the number shown in the progress line right before the "N", subtracted the same number from 1 hour before, and divided by 3600 (see the small sketch after this list). The initial speed was ~1500 files/second, and it stayed around that during the first 6 hours (a little less, around 1300 files/second, when the "D" number also increased -- indicating that borg had detected changed/new files and had to read/compress/dedup/write them into the repo, so that slowdown is reasonable).
  2. There was a dip to ~900 files/second with no increment in the "D" number at the 6th hour, but as it didn't quite go below 50%, I waited a little more and, sure enough, by the 7th hour it was back at 1520 files/second.
  3. At the 8th hour, speed really went down the drain: 523 files/second, well below the 50% mark. So I released 2GB of the 4GB total that the ZFS ARC was holding, and waited.
  4. The effect was not immediate: in the 9th hour, speed was still low (627 files/second), perhaps because there was also some increment in "D"; but at the 10th hour, it went through the roof: 1627 files/second, the highest I had measured.
  5. It kept that way, varying between 1371 and 1607 files/second, until the backup ended after the 15th hour.
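For reference, the speed figure in item 1 is just this arithmetic (a trivial sketch; the numbers in the comment are made up):

def files_per_second(n_now, n_one_hour_ago, seconds=3600):
    # N values copied from two successive hourly --progress lines
    return (n_now - n_one_hour_ago) / seconds

# e.g. files_per_second(12_340_000, 6_940_000) == 1500.0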

This borg create finished in 15h06m total and, most importantly, the aberrantly large 10h lapse right after "Saving files cache" and the 5h one right before "terminating with success status, rc 0" were MUCH smaller: these steps completed in ~24m and ~5m, respectively.

My hypothesis to explain the above is that, when "unused" memory (or rather, "unused + buffers", which does not include the ZFS ARC) is too small, either the Python interpreter or the system runtime (or both, acting in concert) try really hard to free as much of their memory as possible (perhaps triggering Python GC, resulting in many madvise() calls to the kernel, etc.), greatly slowing down the system. But when there's enough (e.g., 2GB+) in "unused + buffers", this stops happening (perhaps only after the current let's-desperately-try-to-free-some-memory-and/or-GC cycle finishes, which would explain the delay I observed in item 4 above).

To try and help clarify this, I will do as suggested by @textshell on IRC: tonight I will try to use py-spy to capture two 15-minute profiles of the running borg process, one right after it has slowed down (and before releasing the 2GB from the ARC) and the other after that, once it has sped up again.

If all goes well, I shall be posting them here tomorrow morning.

Cheers,
-- Durval.

@DurvalMenezes (Author) commented Dec 13, 2019

Hello everyone,

A new day, a new successful backup ;-) Exactly the same thing happened this time as in the last two runs: borg create started reasonably fast (as measured in files/sec), then after a few hours slowed down abruptly to less than 50% of its initial speed; I then reduced the ZFS ARC from 4 to 2 GB (thereby adding 2GB to the system's "unused" memory, where it stayed -- neither the borg create process nor anything else used that memory directly), and borg recovered its lost speed, also avoiding the lapses at "Saving files cache" and "terminating with success status, rc 0", and finished the backup in just 15h.

As stated previously, I tried to profile the process using py-spy, to no avail: it failed (after spending the indicated 15 minutes, presumably collecting samples) with the error "Failed to find python version from target process". I suspect this happens because it can't find the Python interpreter embedded in the borg "binary"; I have opened an issue with the py-spy developers to try and find a workaround for this: benfred/py-spy#207

If/when the py-spy developers (or anyone else; please feel free to chime in) tell me how to work around its inability to find borg's Python interpreter, I will try and take a few profiles; meanwhile, unfortunately I have my hands tied.

Cheers,
-- Durval.

@ThomasWaldmann (Member) commented Dec 15, 2019

Now try again without reducing ZFS ARC memory while borg is running and check if you see the same speed fluctuations (which could also be caused by different file sizes and hashtable state).

@ThomasWaldmann ThomasWaldmann removed this from the 1.1.11 milestone Dec 15, 2019
@DurvalMenezes (Author) commented Dec 17, 2019

@ThomasWaldmann I was expecting the issue to turn up again this weekend (when I would have a large enough window to wait out a 30h backup), but it did not happen: the speed in files/sec never dipped below 50%, so I didn't intervene by releasing the 2GB of cache, and the backup finished in a record 15.01h.

Coincidentally or not, this Monday (when I would not have had the time to run it once more without intervening, in case it needed intervention on the first run) the speed dip again failed to happen.

Not sure what could have changed; the volume of changed files/bytes is about the same from day to day, absolutely no new software has been installed nor any new programs started on the machine, etc.

I hate when problems just disappear all by themselves, but this seems to be what happened so far.

I'm pretty sure I can still reproduce the original problem (i.e., have borg create spend 30h on this normally-15h backup, with the two big lapses at "Saving files cache" and "terminating with success status") by rolling back the snapshots (or rather, cloning them into new datasets) of both the borg repo/cache and the origin dirtree for the backup, and restoring the ZFS arc_max parameter to its default of 0 (meaning: use up to half the free RAM on the machine, i.e., 24GB instead of the 4GB I limited it to after those first lapses). But it would have to be done on a weekend (so as not to interfere with the backup schedule), and I need some way to obtain diagnostic data for you developers that doesn't require changing the machine's environment too much (i.e., no overwriting of already-installed programs/libraries, etc.).

py-spy looks promising, and the author responded to the issue I opened, mentioning it's already resolved on master; I tried bringing up a Rust cross-compile environment so I could compile it on my Linux notebook to run on the MacOSX backup server, but to no avail. When a new release of py-spy comes out with a precompiled binary available, I may try reproducing the problem and using it to obtain a profile for you.

@ThomasWaldmann (Member)

When the hashtable gets fuller, speed drops. When it gets over the resize threshold, it will be enlarged and then be quicker again. The threshold is some sort of compromise of speed vs. memory needs.
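For illustration only (this is not borg's hashindex code, and the constants are invented): the kind of load-factor logic being described looks roughly like this:

def maybe_grow(used, capacity, max_load=0.75, growth_factor=1.5):
    # When the table is too full, lookups degrade (longer probe chains),
    # so it is enlarged and rehashed; a higher max_load saves memory but
    # costs speed near the threshold -- that is the compromise mentioned above.
    if used / capacity > max_load:
        return int(capacity * growth_factor)
    return capacity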

@DurvalMenezes (Author) commented Dec 17, 2019

When the hashtable gets fuller, speed drops. When it gets over the resize threshold, it will be enlarged and then be quicker again.

I checked for this with @textshell's help and it didn't seem to be the case: the file that would have doubled in size (named "chunks" IIRC) has kept the exact same size since my first borg create.

@ThomasWaldmann (Member)

When it resizes, it does not double in size, but grows by a smaller factor depending on the size it already has.
(This applies to hashindex-type data structures, borg's own hashtable implementation: the chunks cache and the repo index.)

Not sure how the Python dict (used for the files cache) resizes.

@DurvalMenezes (Author)

@ThomasWaldmann thanks for the explanation. But the "chunks" file size hasn't changed at all... that eliminates the hypothesis that it has been resized, right?

@ThomasWaldmann (Member)

If it has the exact same byte size, yes, then it likely has not been resized (there is some processing, hence the .compact file extension, but I guess one would at least see a minimal change).

@DurvalMenezes (Author) commented Dec 18, 2019

py-spy's author @benfred kindly produced a working py-spy binary for me (see benfred/py-spy#207 (comment) for more details), and I managed to capture a 15-minute profile of a working borg create during a "normal" (i.e., no many-hours lapse) run; please find the SVG file below (converted to PNG so GitHub would accept it).

When I see one of the big slowdowns/lapses again, I will capture another profile; please verify whether the details/layout/etc. of the one below are sufficient, or whether you need me to call py-spy with any other options/modes/etc.

[attached: py-spy flame graph of the borg create process (PNG)]

@ThomasWaldmann (Member)

Can't read all of it, but it seems to mostly do xattr stuff.

If you have a lot of unchanged files, it does not deal with content data (as it'll know they're unchanged), but it reads the metadata from the filesystem to create the archive's metadata stream. So: stat data, xattr metadata, ACL metadata (also stored in xattrs on Linux).

@textshell (Member)

I'm not sure if the SVG is better (e.g. tooltips/titles), but the PNG seems mostly clipped.
I'm not sure what @ThomasWaldmann prefers, but I'd try with --format=speedscope and maybe --native if we suspect that runtime parts are taking the time.

@DurvalMenezes (Author) commented Dec 19, 2019

Hello @ThomasWaldmann, @textshell:

Can't read all of it, but it seems to mostly do xattr stuff.

I'm not sure if the svg is better (e.g. tooltips/titles) but the png seems mostly clipped.

Yeah, I noticed that too. But the SVG was not any better (at least not with the default visualizer in my system). Will post the next SVG somewhere for you guys to check.

I'm not sure what @ThomasWaldmann prefers, but i'd try with --format=speedscope and maybe --native if we suspect runtime parts that take the time.

Just captured them all (flamegraph, speedscope and raw) during the latest borg create run (which has not slowed down); please use the links below to download them:

http://durval.com/xfer-only/borg_create_20191218L210000_at_20191219L105904_-_FLAMEGRAPH.svg
http://durval.com/xfer-only/borg_create_20191218L210000_at_20191219L105904_-_SPEEDSCOPE.dat
http://durval.com/xfer-only/borg_create_20191218L210000_at_20191219L105904_-_RAW.dat

Please confirm which of them is adequate, and if the parameters etc. are OK or need changing, so I can be ready the next time the slowdown occurs.

Thanks,
-- Durval.

@ThomasWaldmann (Member)

Just as a side note: the borg available via brew recently had a fundamental change:

It is not using the pyinstaller-made fat binary any more, but now installs borg in a more conventional way.
So, if you have a very recent python installed, it will now use that.

@DurvalMenezes (Author)

Just as a side note: the borg available via brew recently had a fundamental change:
It is not using the pyinstaller-made fat binary any more, but now installs borg in a more conventional way.
So, if you have a very recent python installed, it will now use that.

Thanks for the heads-up, @ThomasWaldmann. Just checked, and I already have a brew-installed python3.7.

So, the next time the issue turns up, I just need to make sure I have a new(er) brew-installed borg, and debugging will be much easier. It hasn't happened since I last reported, possibly because we upgraded the machine to 96GB RAM.

Will keep this issue posted.

@ThomasWaldmann ThomasWaldmann removed their assignment Jan 30, 2021
@ThomasWaldmann (Member)

@DurvalMenezes Is this still an issue with a current macOS, zfs, python and borg version (like 1.2.x)?

@DurvalMenezes (Author)

@ThomasWaldmann, thanks for inquiring back.

After the machine was upgraded to 96GB RAM, the issue never repeated itself, despite running borg-backup every single day in otherwise (as much as we could tell) the exact same conditions.

Alas, the machine (along with the entire network it ran on) was decommissioned last October when the company that owned it moved its entire production process 'to the cloud', so I'm no longer in a position to do any further tests :-/

Therefore, I'm closing this issue, and thank you again and the whole borg-backup community for the excellent software and support.
