Add crypto hash and byte-by-byte content checking - enhancement suggestion #153
I have been thinking and reading a bit about hashing and I would like to argue here that fclones should allow other hash options.
MetroHash works and it is fast, but its biggest weakness is that it is a one-person creation without easily available peer review. If you google it, there is very little out there to support the author's claims. That does not matter much for the prefix/suffix matching phases (even CRC would do) - they are the perfect place for a very fast hash. But not for the final full-file hash. To be honest, this is the single biggest objection I have to using fclones. I am not questioning MetroHash's quality, as I am not qualified to do so; it just looks like not many qualified people have investigated it. I have also noticed chatter on various forums that this is the one thing many people question about fclones. People want to trust proven and tested technology.
I was curious myself what the real-life speed of various established hashes is, so I ran single-threaded tests on my computer (a 2019 Intel laptop):
Indeed, they are much slower than non-crypto fast hashes. I tried e.g. xxh128 and it ran at about 5000 MB/s. I do not have a metro implementation at hand, but I guess it is in the same range. Among the crypto hashes I think the most suitable candidate would be SHA-512/256. SHA-1 and MD5 are broken and collisions can be created; the others are slower. A full SHA-512 output would dramatically increase memory usage for mathematically insignificant benefit. There is a good explanation in this paper: https://eprint.iacr.org/2010/548.pdf - the key message being that SHA-512 truncated to 256 bits is fast on 64-bit CPUs while keeping the digest size down. Truncated SHA-512 is already implemented in Rust: https://docs.rs/sha2/0.4.1/sha2/struct.Sha512Trunc256.html My computer has an 8-core CPU, so I also tested multithreading:
To put these numbers into the context of deduplication: a single thread is faster than any HDD can deliver and fast enough to crunch data coming from a NAS over a 10G network, and multiple threads are faster than my NVMe SSD. A 256-bit hash would make the hash index table grow and increase the time of both the hash generation and matching stages, but it would not be a big penalty. Slower than metro? Yes, but I think not by much on any modern computer used for file deduplication. I am only talking about the last deduplication step here - the earlier stages can use whatever.
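For context, the kind of single-thread numbers above can be reproduced with a few lines of Rust against the sha2 crate. This is just a rough sketch (recent sha2 releases expose the truncated variant as `Sha512_256`; older ones called it `Sha512Trunc256`), hashing an in-memory buffer so disk speed does not interfere:

```rust
// Rough sketch of measuring single-threaded SHA-512/256 throughput,
// assuming a recent `sha2` crate where the type is named `Sha512_256`.
use sha2::{Digest, Sha512_256};
use std::time::Instant;

fn main() {
    // 256 MiB of zeros stands in for file data already sitting in the page cache.
    let data = vec![0u8; 1 << 28];

    let start = Instant::now();
    let mut hasher = Sha512_256::new();
    hasher.update(&data);
    let digest = hasher.finalize(); // 32-byte (256-bit) output
    let elapsed = start.elapsed().as_secs_f64();

    println!(
        "SHA-512/256: {:.0} MB/s, digest is {} bytes",
        (data.len() as f64 / 1e6) / elapsed,
        digest.len()
    );
}
```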
This is not really needed (unless for paranoid people who do not get the math) when a good hash is used.
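For reference, the byte-by-byte check requested in this issue is conceptually very simple - a rough sketch (hypothetical file names, not fclones' code) could look like this:

```rust
// Toy byte-by-byte comparison of two candidate duplicates: read both files
// in fixed-size chunks and compare directly, no hashing involved.
use std::fs::File;
use std::io::{self, Read};

/// Reads as many bytes as possible into `buf`; returns 0 only at EOF.
fn read_full(r: &mut impl Read, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        let n = r.read(&mut buf[filled..])?;
        if n == 0 {
            break;
        }
        filled += n;
    }
    Ok(filled)
}

fn files_identical(path_a: &str, path_b: &str) -> io::Result<bool> {
    let mut a = File::open(path_a)?;
    let mut b = File::open(path_b)?;
    let (mut buf_a, mut buf_b) = (vec![0u8; 1 << 20], vec![0u8; 1 << 20]);

    loop {
        let n_a = read_full(&mut a, &mut buf_a)?;
        let n_b = read_full(&mut b, &mut buf_b)?;
        if n_a != n_b {
            // Lengths differ; in practice a dedup tool groups by size first.
            return Ok(false);
        }
        if n_a == 0 {
            return Ok(true); // both files exhausted, every chunk matched
        }
        if buf_a[..n_a] != buf_b[..n_b] {
            return Ok(false);
        }
    }
}

fn main() -> io::Result<()> {
    // Placeholder paths, just to make the sketch runnable.
    println!("{}", files_identical("copy1.bin", "copy2.bin")?);
    Ok(())
}
```

The obvious drawback is that this only compares one pair at a time; with groups larger than two, every file has to be re-read for each pairing, which is exactly why hashing wins for bigger duplicate groups.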
Other de-duplicators offer such options, e.g.:
An optional final-stage file-checking hash is, in my opinion, the most desirable missing fclones feature. Thanks to its robust multithreading implementation, fclones is clearly the fastest de-duplicator out there, which also means the impact of much slower hashes on overall deduplication speed would be minimal - on modern hardware the bottleneck is still mostly disk I/O. An alternative-hashes option would, in my opinion, convince many doubters and make fclones much more popular. IMHO this is the low-hanging fruit for a significant functionality improvement: people would get the option to balance their own requirements.
I tested fclones vs others on a 100 GB folder with many duplicates - it is actually the same 50 GB dataset twice, which puts stress on the group-by-contents phase, since all data has to be checked by reading the full file content. Only finding duplicates, without taking any action:
My thoughts: disk I/O is the bottleneck for this content-check phase - all programs have to read all the files and compare them either via a hash or directly. Interestingly, byte-by-byte comparison is not much slower than a fast hash, but in this case that is only because all duplicates are pairs. Any bigger groups would make hash comparison faster; e.g., when I quadruple the duplicates, the advantage of hash comparison becomes more apparent:
jdupes gets slower, while fclones and rdfind scale nicely. Calculating crypto hashes has its cost, but it should not make much difference if multithreading is utilised - fclones was 6x faster than rdfind, and on my machine multithreaded SHA-512/256 is 6x faster than a single thread, which means the end result would be similar with a much stronger hash function. Fclones' metro hash does not provide an advantage when multithreading is used - and it raises many questions. Fclones' strength is its multithreading: it makes looking for duplicates in a massive dataset much faster (this test dataset is small, so the advantage is not highlighted). You can see it in your NVMe SSD benchmark where 1.5M files are used.
Surprise, surprise - yadf leaves everybody behind: for quadruple duplicates it finishes in 43 s, 2x faster than fclones (for doubles it is the same). But it is only a proof of concept, and its author recommends fclones for real jobs.
In addition, a SHA-512/256 test on the latest MacBook Air (M2 ARM CPU) gives 1400 MB/s on a single thread. This is only to highlight that hash processing speed is becoming more and more irrelevant compared to disk I/O.
Yes, I agree adding more hashing algorithms is a good idea, in particular adding support for well-known cryptographic hashes. Metrohash was chosen not because it was the fastest, but because it was fast enough and I could easily compile it on most platforms, including ARM, without too much hassle with conditional compilation. Earlier I used one of the hashes from. As far as performance is concerned, hashing performance does not matter much for the wall-clock time, and I agree with your conclusion that I/O is what affects the run time the most. However, there is a bit more to efficiency than just the amount of time you have to wait. I actually don't want to load the CPU too much while searching for dupes (particularly on a fast SSD), because then my machine would be rendered useless for the duration of the search. This is what Ubuntu dejadup does whenever I'm backing up the system, and I literally hate it (somehow it consumes all my CPU and not so much I/O, fans go crazy, etc.). To summarize, there are two things here, and both would be really good to implement:
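As a rough illustration of what a user-selectable content hash could look like (not the actual fclones code - the enum and function names are made up, and the blake3 and sha2 crates are assumed as dependencies):

```rust
// Sketch of dispatching to a user-chosen hash: one enum for the algorithm,
// one function that streams a file through the chosen hasher and returns
// the digest bytes.
use sha2::{Digest, Sha512_256};
use std::fs::File;
use std::io::{self, Read};

enum HashAlgo {
    Blake3,
    Sha512_256,
}

fn hash_file(path: &str, algo: &HashAlgo) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; 1 << 20]; // 1 MiB read buffer

    // Each branch keeps its own hasher state and streams the file through it.
    match algo {
        HashAlgo::Blake3 => {
            let mut h = blake3::Hasher::new();
            loop {
                let n = file.read(&mut buf)?;
                if n == 0 { break; }
                h.update(&buf[..n]);
            }
            Ok(h.finalize().as_bytes().to_vec())
        }
        HashAlgo::Sha512_256 => {
            let mut h = Sha512_256::new();
            loop {
                let n = file.read(&mut buf)?;
                if n == 0 { break; }
                h.update(&buf[..n]);
            }
            Ok(h.finalize().to_vec())
        }
    }
}
```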
BTW: I'm curious how you got yadf to be so much faster. Can you share your repro steps? This is a surprise, because in my testing yadf was actually quite a bit slower than fclones, so maybe there is some edge case I missed. Did you evict the page cache before each run to make the comparison fair?
I use a 50 GB folder with 24k files - a mix of small and big. As I was interested in the speed of the last deduplication step, I made two or four copies of this folder, creating 24k duplicate groups and forcing the programs to read all content. In your tests https://github.com/pkolaczk/fclones#ssd-benchmark yadf is actually the second fastest on SSD and much slower on HDD. In yadf's own tests https://github.com/jRimbault/yadf#benchmarks yadf is fastest with a warm cache and a very close second with a cold cache - they only test on SSD and explicitly say: "I recommend fclones as it has more hardware heuristics, and in general more features. yadf on HDDs is terrible." In my tests yadf is significantly faster than fclones for both warm and cold cache (I tested both). Why - no idea. Maybe it scales better with the number of cores? You used a 4-core CPU, the yadf tests used 6 cores, and I used 8. Also, I use macOS while the two other tests were run on Ubuntu, and we all most likely used different SSDs. Too many factors to say for sure why :) Also, I tested the latest versions - in your tests yadf was v0.13, while now it is v1.0.
I agree regarding an application's impact on system performance. But dejadup is clearly not a well-written backup process if it consumes all resources when such a task absolutely does not need them. In general this is something that should be optional - run the program at full speed by default, but give the user the option to start it at lower priority. I do this with my own backups: during the day they run at very low priority, but the late-night run uses all available resources and does extra tasks like verification.
I would not change the default - definitely not for the prefix and suffix steps. Metro seems to be a good tradeoff, especially in light of your comment on overall system performance. EDIT: unless BLAKE3 provides similarly good tradeoffs. For the second point, as soon as the option is available, e.g. for 256-bit output hashes, it should be super easy to add some reasonable choices: SHA-512/256 - the fastest on 64-bit architectures. There is also a "new" kid on the block - BLAKE3 - which could maybe be an alternative to metro for the default full-content hashing? It is of course slower, since it is designed as a crypto hash, but a quick test and read tell me it is more than fast enough for deduplication: https://github.com/BLAKE3-team/BLAKE3 It was added to OpenZFS last year for file hashes, and they are very conservative about adding anything unproven - that alone is a good recommendation. On my machine it is more than 2x faster than SHA-512/256 on a single thread; with its built-in multithreading, and when hashing 8 files on 8 cores (b3 single-threaded), it reaches 10,000 MB/s.
I tested multiple times to make sure the files are in the cache, as I was interested in hash speed, not my disk speed. These numbers mean it would never really need 100% CPU to do its job (no fans spinning) - it is simply faster than even a fast NVMe SSD can deliver. All of these hashes are probably already implemented in Rust.
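For reference, the "built-in multithreading" mentioned above comes from the blake3 crate's optional rayon support. A minimal sketch (placeholder file name; assumes a recent blake3 release with the `rayon` feature enabled in Cargo.toml):

```rust
// Sketch: hash one large input across all cores using blake3's rayon support.
use std::fs;
use std::io;

fn main() -> io::Result<()> {
    // Placeholder path; reading the whole file keeps the example short,
    // a real tool would stream or memory-map it instead.
    let data = fs::read("big_file.bin")?;

    let mut hasher = blake3::Hasher::new();
    hasher.update_rayon(&data); // fans the work out over the rayon thread pool
    println!("blake3: {}", hasher.finalize());

    Ok(())
}
```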
I think any performance and efficiency questions can only be answered by testing. If you add a configurable hash option I am happy to test it thoroughly.
Actually, I have an idea. I looked at htop because yadf is faster in every test I tried... yadf is using all my 8 cores vs fclones only using 4, which explains why it is about 2x faster. It simply makes much better use of multiple cores. With fclones I can see that all 8 cores are used during the initial stages - but only 4 when the content hashes are calculated.
What happens if you force the number of threads to 8 with the threads option?
Anyway, I think this is worth opening a separate issue.
The good thing is that with proper parallelism and proper hashes, fclones can become the best.
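To make the parallelism point concrete, here is a toy sketch (not fclones' internals, just an illustration using the rayon and blake3 crates) of hashing a list of candidate files on all available cores:

```rust
// Sketch of per-file parallelism: each file in the candidate list is hashed
// on its own worker thread, so the contents stage can keep all cores busy
// (as long as the disk keeps up).
use rayon::prelude::*;
use std::fs;

fn main() {
    // Placeholder list of candidate duplicates.
    let paths = vec!["a.bin".to_string(), "b.bin".to_string(), "c.bin".to_string()];

    let hashes: Vec<(String, Option<blake3::Hash>)> = paths
        .par_iter()
        .map(|p| {
            // Errors are collapsed to None here to keep the sketch short.
            let hash = fs::read(p).ok().map(|bytes| blake3::hash(&bytes));
            (p.clone(), hash)
        })
        .collect();

    for (path, hash) in hashes {
        match hash {
            Some(h) => println!("{path}: {h}"),
            None => println!("{path}: unreadable"),
        }
    }
}
```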
For the new hash algorithms I can also test on ARM - RPi and Apple M2 - so there will be no surprises when released.
Hey, got an hour today and added more hashes :)
SHA256 will be faster if somebody runs a 32-bit CPU - the blake3 crate is really well done, with Intel and ARM optimizations. BTW - this is really great news!!! I am going to test it later today.
I could not wait - blake3 is a bit slower than metro for me, and memory usage is up (about 2.5x), but it is a 256-bit hash, so no issue. I tested it on 220 GB of duplicates: [2022-09-10 18:24:59.815] fclones: info: Found 75449 (220.6 GB) redundant files - metro: 75 s.
In some ways it is amazing that 220 GB can be read and processed in a little over a minute... computers are crazy fast nowadays.
It is a great milestone! Now we have a choice. Thank you again for this great project.
One thing: you are using Sha512_256, but it is called sha512 (which might be misleading) - sha512 and sha512-256 are not the same.
And with proper crypto hashes it is still much faster than jdupes :) which takes 190 s for my test data. BTW - the competition has noticed fclones :)
sha512 has a 512-bit output. Both are from the same family of hashes - SHA-2; actually very close relatives, but not the same. Their speed is the same - the difference is the memory needed to juggle 512-bit vs 256-bit hashes. As per the paper I linked, sha512-256 is, as of today, a very good practical balance, and 256 bits for file dedup is super safe. It is very frustrating to see so much software using sha256 because they assume that sha512 is always slower - and 32-bit hardware is a niche.
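A quick way to convince yourself that they really are different functions (a small sketch assuming the RustCrypto sha2 crate): SHA-512/256 uses different initial values than SHA-512, so the digests disagree from the first byte rather than being a simple truncation.

```rust
// Sketch: naive truncation of SHA-512 is not the same as SHA-512/256.
use sha2::{Digest, Sha512, Sha512_256};

fn main() {
    let msg = b"hello world";

    let full = Sha512::digest(msg); // 64-byte digest
    let truncated = &full[..32]; // naive truncation of SHA-512
    let proper = Sha512_256::digest(msg); // real SHA-512/256, 32 bytes

    // Different initialization vectors => completely different 32-byte values.
    assert_ne!(truncated, &proper[..]);
    println!("truncated SHA-512 differs from SHA-512/256: {}", truncated != &proper[..]);
}
```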
I have also tested on 32-bit ARM and Debian - this is usually where problems show up with a lot of modern software :) It compiles OK and all hashes work.
Another observation - at least on my computer blake3 is super fast, but when run on 8 cores it makes my computer very slow. The good thing is that, thanks to the options available in fclones, when I run it as ssd:8,6 all is fine... and deduplication is only marginally slower. Maybe in the future the default thread strategy could be better - as I said, new Mac computers now have up to 20 cores, and in such cases the I/O bottleneck makes it pointless to use all of them, even with their insanely fast SSDs (7 GB/s transfers). Anyway, it is a good problem to have.
It might be worth mentioning in the readme that some hashes are computationally more intensive and the user should plan accordingly. Again, thanks to fclones' options this is easy: even when ploughing through TBs of data I can use --cache and a lower number of threads during the day, then Ctrl-C before going to bed and change the strategy.
Some speed results: metro128 - 32 s. As expected, on a 32-bit CPU sha256 is faster than sha512, and blake3 is doing very well for such small hardware.
A last comment: for the sake of making the available set of hashes complete, I would also include sha3-256 https://en.wikipedia.org/wiki/SHA-3. Its speed is very hardware-dependent - probably slower on our laptops today, but things are changing quickly, and it is a standard some people might have to adhere to.
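If it helps, SHA3-256 exposes the same streaming Digest interface as the other RustCrypto hashes, so wiring it in should mostly be one more case in the dispatch - a tiny sketch assuming the `sha3` crate as a dependency:

```rust
// Sketch: SHA3-256 via the same Digest trait used by the sha2 types.
use sha3::{Digest, Sha3_256};

fn main() {
    let mut hasher = Sha3_256::new();
    hasher.update(b"some file contents");
    let digest = hasher.finalize(); // 32 bytes
    println!("sha3-256 digest is {} bytes", digest.len());
}
```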
Good catch, indeed.
Awesome - so in the future it will be easy to add any new hash regardless of its output size. It is indeed important to use the full output, not some truncated part.
I think I would keep sha512 - it is a bit of a paranoid hash (512 bits long) given the tradeoffs of speed and memory - but maybe somebody likes it. In the end, it is optional now.
I have tested the latest commit from the more-hashes branch, using the full hash output length, and so far it all works for me. I worried about peak memory usage for sha512, but the increase, at least for me, is minimal. Below are results for my dataset of 100k files - 260 GB, with 75k files (220 GB) being duplicates:

hash - max mem - time
metro128 - 55 MB - 77 s

The previous version of fclones (without the hash choice) also finishes in about 77 s, so there is no impact on the previous performance.
FYI - bug:
A correction to the last documentation commit: blake3 is a 256-bit hash, not 128.
Also, for consistency, maybe add some links to sha256 and sha512 - https://en.wikipedia.org/wiki/SHA-2 - and sha3-256 and sha3-512 - https://en.wikipedia.org/wiki/SHA-3.
Corrected. Is everything OK now? I think it's time to merge it.
I did the tests on the last commits - all works for me - Intel 64-bit 8-core CPU on macOS and 32-bit ARM on Debian. Some tests (Intel/macOS only) - it is my dataset and my setup, skewed heavily towards hashing speed rather than deduplication: time for a group run, max memory usage, approximated hashing speed
and fclones single-threaded:
It is fantastic that fclones now has options for different hashes. Speed all depends on the platform and the hash implementation; what is clear is that multithreading makes a massive difference on modern hardware. I had not expected that adding new hashes would require so much work - thank you.
And thank you for mentioning me in your release. Very kind.
Overall I think that the MetroHash you use is fine, but some of us are more paranoid than others, and sometimes there are regulatory requirements we have to follow.
What about making it optional to use e.g. SHA-512 (not SHA-256, as SHA-512 is faster on 64-bit CPUs), and, for the most paranoid, a simple byte-by-byte content compare?
ATM the fclones defaults are perfectly reasonable, but some of us would be happy to have an even bigger choice.