Bounded resource usage (feature) #109
Comments
Phew, that's a hard case. Actually we are one of the rather memory-saving tools out there... First of all: you did not mention how you run rmlint. How much RAM do you have installed, and at what point (before traversal finishes, after it, or some other time) does it exceed the limit? You're still using the develop branch, right?
I use paranoid file compare, but that's the point: I don't completely trust hashes in such a case, except as a candidate speed-up. And the files are rather small; their vast number is the culprit. There are huge Subversion repositories there, I mean with full history, which make you appreciate git. I made backups at different times and then just copied them all over. Now there are many duplicates there, so it would free up much disk space. It would be a trivial SQL query to find all duplicate files in a database, and it wouldn't take so much memory. Then it would only need to compare files pair by pair, which shouldn't take much memory either. And yes, the footprint is 1.6GB of resident size and about 3GB of virtual size. Unfortunately, I have only 2GB of RAM and that feels very painful. Everything slows to a crawl and the system is almost unresponsive. I think it's the memory cache that was hurt most, so constant thrashing/swapping considerably slowed the comparison and the file system ops. And you never know it beforehand, so it was an unpleasant surprise when the OOM killer took down some system processes. Fortunately, it was not a server, or it would have been another story. The memory footprint was big from the start, but it got stuck at the matching stage, where it had found only about 100000 duplicates after eleven hours. I killed it after that; as you can guess, it wouldn't have finished even by the next day. Yes, that's on the develop branch.
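For illustration only - the table layout below is a hypothetical sketch, not anything rmlint actually provides - this is roughly what the SQL approach mentioned above could look like: with one metadata row per file, a single grouped query yields the duplicate candidates while SQLite keeps the working set on disk instead of in RAM.

```python
import sqlite3

# Hypothetical layout: one row of metadata per traversed file.
con = sqlite3.connect("files.db")
con.execute("CREATE TABLE IF NOT EXISTS files (path TEXT, size INTEGER, hash TEXT)")

# Files can only be duplicates of each other if they share a size
# (and, if available, a cheap pre-hash), so one grouped query yields
# the candidate sets:
candidates = con.execute(
    "SELECT size, hash, COUNT(*) AS n "
    "FROM files GROUP BY size, hash HAVING n > 1"
).fetchall()

# Each candidate group would then be verified pair by pair (byte-by-byte),
# which needs only two read buffers at a time, not the whole file list in RAM.
```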
By the way, the memory footprint is usually highest shortly after traversal and shrinks after that, since we get rid of unused structs once they are found. Regarding paranoid: the amount of memory used by paranoid mode is managed and by default capped at around 256MB. From my view, this is a general problem in the sense of "large data sets yield large resource usage".
You made a point here. The paths are really long for many files; maybe compressing them would help. Also, rmlint is not a black box - it's a tool with many options. I would consider using it just for building the candidate list and processing that with other options if it would save me some memory. And I think that running an SQL query each time to find the next candidates would do just that. After all, they do a good job at optimizing it.
Maybe I'll take a stab at that once I have the time. In the meantime, some statistical questions: ...
Total files: 5389618. BTW, I found that there are drastic differences in processing time between hot and cold memory caches. It took rmlint 37 minutes to build the list on a hot cache and 55 minutes when the cache was cold. So the memory footprint makes a very big difference.
Thanks for the numbers. The last number was just to verify how many paths land in the actual duplicate-finding phase. Well, it's no news that the build goes faster on consecutive runs. On btrfs this can be quite extreme; I saw a factor of around 20x between the first and second run. This has (almost) nothing to do with the memory footprint though.
Actually, it's the second run which was slower. The first one was after ...
I added two new options:
Here's a short test with 1 million files: http://i.imgur.com/Ck9FrFG.png Caveats: it does not work for paths piped into ...
It looks promising! I found one problem. Try this: ... Also, I already noticed that it doesn't work with ... I have a question though: when used with the empty-directory option it reads more than 700 MB per minute from disk, while ...
I use SQLite's ...
The slowness is due to the massive number of SELECTs the current pre-processing emits. When the path was still in memory this was no problem, but now every path is looked up several times for every hardlink it has (in your test case, a lot of them). Support for ... To your question: ...
That's unfortunate, because I have similar directories among the data. I expected rmlint not to look for hardlinks if it was told not to. In the case of looking for empty directories, I supposed that it would traverse file names only, without additional processing; it is much faster to just link them again. How can I turn off hardlink processing in that case? Another possibility would be storing and querying by the inode number (indexed) - see the sketch below.
BTW, why not make it another formatter, for consistency and functionality?
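As a hedged sketch of the inode-index idea above (the schema is made up for illustration; it is not rmlint's cache format): storing the device id and inode next to each path and indexing them would turn the per-hardlink path lookups into cheap point queries.

```python
import sqlite3

# Hypothetical cache layout, not rmlint's actual schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT, dev INTEGER, inode INTEGER)")
con.execute("CREATE INDEX files_dev_inode ON files (dev, inode)")

def hardlinks_of(dev, inode):
    # With the index this is a cheap point lookup instead of a scan
    # over all stored paths.
    return [row[0] for row in con.execute(
        "SELECT path FROM files WHERE dev = ? AND inode = ?", (dev, inode))]
```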
@BTW: What is "it" here? Edit: The problem with find and the progressbar is confirmed.
Sorry, I didn't realize that it was ambiguous; English is not my native language. I meant that making the sqlite cache a formatter instead of a special option would be more consistent with the other functionality and would allow specifying some parameters, e.g. a file name. It would also allow making that cache persistent, as it is with JSON. It really makes no sense to ignore hardlinks if it takes so much time to find them; I'd rather process them as ordinary files instead. The time it takes to compare them would be much less in that case. If I understand you correctly, there is no option for that. And why doesn't it compare just the inode numbers, which are already in memory, without querying the database? Also, it seems to me that storing inode numbers in the database alongside paths would help to optimize that case: it would be possible to query just the hardlinked files instead of all paths. And if it were a formatter, it would be similar logic to what is done now with JSON.
No problem, just a (more or less) friendly reminder. Making the cache a formatter is a valid discussion point, but I chose against it since I don't want people to rely on the cache internals. It's simply meant as a memory optimization that should be bound to one process only. It's true that for some people a separate sqlite formatter with more information in it might be useful, but I don't intend to write one due to my limited time (patches welcome though). Regarding hardlinks: finding files with identical inodes and device ids is not hard, indeed. What's slow in the current code is finding path doubles, i.e. files that are exactly the same (same inode, same device id, same parent inode and parent device id) because they were given twice. Example: ...
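To make the path-double definition above concrete, here is a minimal illustrative sketch (not the actual rmlint code) that keys each entry on (device, inode, parent device, parent inode) and keeps only the first occurrence:

```python
import os

def file_key(path):
    # A "path double" matches on its own inode/device and its parent's.
    st = os.lstat(path)
    parent = os.lstat(os.path.dirname(os.path.abspath(path)))
    return (st.st_dev, st.st_ino, parent.st_dev, parent.st_ino)

def drop_path_doubles(paths):
    seen, unique = set(), []
    for p in paths:
        key = file_key(p)
        if key not in seen:   # keep only the first occurrence of the same physical path
            seen.add(key)
            unique.append(p)
    return unique
```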
The progressbar bug should be fixed, by the way.
Thanks!
The part I mentioned is rewritten now. I updated the ...
Thank you! But something is not working right, I'm afraid. I did the same test ...
D'oh - sorry. Technically, it's the same issue as the one you reported. I just tried to delete some duplicated code lines and forgot to test them. Should be fixed now.
Excellent news! I think that the code is perfect for my purposes. I'll try to deduplicate my data with it; this will take a while. I'll report any bugs that I may find. But for now I'm opening another bug report about the progress bar.
While rmlint is still matching my data, I noticed that the last line of the progress bar is cut off. I think that the word ...
I'd be curious to see how https://github.com/SeeSpotRun/rmlint/tree/dir_tree compares for your use case too. It's a quick and dirty implementation of a different way of storing the file paths, using an n-ary tree. This effectively deduplicates common elements of the path strings, so if we already have /very/very/long/dir/path/file1 then adding /very/very/long/dir/path/file2 only requires ...
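A rough sketch of the idea (my own simplification in Python, not the dir_tree branch itself): each directory component becomes a node, so paths sharing a prefix reuse the existing nodes and only the final new component costs extra memory.

```python
class PathTrie:
    def __init__(self):
        self.root = {}

    def insert(self, path):
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})   # shared prefixes reuse existing nodes

    def paths(self, node=None, prefix=""):
        node = self.root if node is None else node
        for name, child in sorted(node.items()):
            full = prefix + "/" + name
            if child:
                yield from self.paths(child, full)
            else:
                yield full                      # leaf node = file entry

trie = PathTrie()
trie.insert("/very/very/long/dir/path/file1")
trie.insert("/very/very/long/dir/path/file2")  # only the "file2" component is new
print(list(trie.paths()))
```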
After running for 28 hours it crashed with ... @SeeSpotRun Thanks. That's certainly interesting, and I'll experiment with it a bit later, after I find a possible workaround for the crash.
It's hard to guess what caused this, since it's a problem in an external library. We do not check the return value of malloc, since we can't do much if it fails except report that it failed. One possible idea: can you try to run it on a few files larger than 4GB? Or even much larger than that?
In this case I'll recompile it with debug information and allow it to produce a core dump. Then it should be possible to analyze it in gdb to determine the exact place and conditions. But there is another problem which actually prevents me from doing that: I tried to limit the number of possible candidates by ... Yet another idea for optimization: the traversing stage took 51 minutes, but the ...
A core file might help, but no guarantees. What was the exact command you used to run?
Regarding ... Btw.: did you drop the fs caches between obtaining those numbers?
The command was ... The version by @SeeSpotRun has the CPU utilization problem too, but in a different context. Well, I understand that rmlint is not ... I didn't drop caches because I wanted to see if running rmlint after ...
Here's the cause. Run ...
That's a bit of a corner case: 700 large files, all of the same size. The memory manager for paranoid hashing doesn't currently contemplate that one. A quick fix at SeeSpotRun@5777983 should need about 1.4G to handle the 700 files.
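As a rough back-of-envelope check (the per-file increment size here is my assumption, not a number from the thread): if the paranoid comparison keeps about one 2 MiB read buffer in memory for each file of the 700-file size group, that is 700 × 2 MiB ≈ 1.4 GiB, which lines up with the figure above.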
Not at all. There are just (not so) big multivolume archives with the same volume sizes, and several copies of them. That's why rmlint is needed, isn't it? The actual volume size is half of that, but I couldn't make a test case which would account for other working memory, so I tweaked it slightly to clearly demonstrate the issue on a smaller subset.
During traversal a lot of data structures are built too, which go into that memory footprint.
Yes, it's true that this configuration might appear in the wild. But your setup is definitely a corner case, since you are on highly limited RAM. We appreciate your constructive criticism and test cases, but we're no request ...
Sorry, I'm not requesting anything. I'm just pointing at real issues, IMHO. You should decide what to do in each case. I can work around every single case by just excluding it from deduplication.
But the average user has no clue that such a data set is very demanding on resources. Actually, it only became obvious to me during the process of deduplication and bug hunting. So you should expect that such a setup is the norm for people with very old backups. Allow me to elaborate on this a little. Memory is expensive: my motherboard supports only up to 4 GB, so upgrading would require replacing the motherboard, CPU, memory and maybe even some peripheral devices. And even if I did, the old system would likely become some NAS box for my backups. But disks are cheap, so I'd prefer to buy an additional disk and copy the data to it. Servers are likely to be attached to a SAN with built-in online deduplication. That's why I believe that offline deduplication is for low-memory setups and won't be in high demand on premium systems. BTW, ...
All good points. It's certainly not such a corner case as I thought - I hadn't considered that multi-volume archives are always split into same-sized chunks.
Thank you for all your hard work! Your work will certainly benefit the product and all those who want to use it.
No worries; the improved mem manager is much better than the old one.
@vvs: Nice to hear it ran through. The ...
It doesn't matter, as long as it didn't crash or cause other nasty effects when memory was low.
I plan to run it again with other optimization options and will post the results here. Thanks again for all the work on this!
Using ...
Heh, no coincidence I believe. The numbers should be correct enough for a rule of thumb. For future reference, part of the discussion here also happened in SeeSpotRun#1.
I believe that the original issue has been fully addressed.
Consider this use case: I'm really stuck here trying to deduplicate over 5 million files. The memory usage goes far beyond the available RAM, and using a large swap won't help because processing takes eons. I think that using some database, e.g. sqlite, to hold the file metadata would be a lifesaver. This is similar to #25 but with a different scope: to limit memory usage and make rmlint scalable to a huge number of files.