on 8 cores CPU fclones seems not to use all cores #158
Comments
Not very scientific, but it gives a good feel for what is going on. The same dataset:
yadf v1.0.0 - command to run
fclones v0.27.2 - command to run
It is a bit of a dirty animated GIF capture. To restart and sync both of them, refresh the browser window - #158 (comment). They are not exactly the same length and slowly get out of sync.
It is clear that both programs use all cores initially for grouping (lengths for fclones, which yadf does not do, then prefix and suffix for both, followed by crunching content hashes). What is also interesting is that even in the initial phase yadf uses all available CPU power, while fclones takes it easy. |
I think I should stop using it - hahah - as I only create issues:) |
Multi-core is becoming the norm: in the Apple world the basic MacBook Air has 10 cores and the Mac Studio, as of today, 20. It will obviously only grow in the future, so it is a good idea to use what is available well, whether that is 2 cores or 100. |
It's been great! Keep going! |
Be careful what you wish for. But I need a good deduplicator, so I will keep pestering for the time being. |
fclones has the biggest potential to be close to perfect |
Before, I thought that jdupes was the best (it is still great): fantastic attention to detail, and anyone trying to learn C should study its code. But they ignored the macOS world, think that SSDs are only used in some high-end hardware, and believe (there is no other word) that byte-by-byte file comparison is the best. So now the bet is on fclones:) It has a good base design: configurable parallelism with clever heuristics, which makes the defaults easier. And attention to detail similar to jdupes. The rest is about filling gaps. |
I think this is good news. I have tried with
Overall CPU utilisation is lower than for yadf, but now fclones finishes in the same time as yadf (75s in my test). With defaults it takes 150s for fclones to finish - 2 times slower. So the problem is with the thread heuristics. |
Implementing #159 would help to diagnose it and would let users make a more informed decision before going into tuning. IMHO it should show the available cores and the thread values used - either based on the decision made by fclones for the detected device(s) or on user-provided options |
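As a rough illustration of the kind of report asked for here, a minimal Rust sketch follows. `describe_parallelism` and its `user_threads` parameter are hypothetical names and not part of fclones; only `std::thread::available_parallelism` is a real standard-library call. The sketch does not reproduce fclones's per-device heuristics.

```rust
use std::thread;

/// Builds a one-line report of available cores and the thread count in use.
/// `user_threads` is a hypothetical stand-in for a user-provided option;
/// when it is `None`, the number of available cores is used as the default.
fn describe_parallelism(user_threads: Option<usize>) -> String {
    // Falls back to 1 if the platform cannot report its parallelism.
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    let (threads, source) = match user_threads {
        Some(n) => (n, "user option"),
        None => (cores, "default"),
    };
    format!("available cores: {cores}, threads used: {threads} ({source})")
}
```

Calling `describe_parallelism(Some(4))` would report 4 threads marked as a user option, next to whatever core count the machine exposes.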
I think I have an answer.
Line 210 in 7edbc68
as on macOS devices are named like this:
EDIT: OK - this code is only for Linux, but it should also be used for macOS, as going by names is wrong for this OS. It is "mentally" much closer to Linux than to Windows, but it has its quirks. |
That regex is a hack to get the device name from the partition name and is called on Linux only. On other systems it uses the raw device name it obtained from enumerating the device list. But maybe some of that device listing logic is wrong or some data is missing there, and then it can't find the device for each file. |
Yes, it does not help that I do not know Rust:) But I think the issue is exactly with the device listing logic. I did some crude debugging and I can see that on macOS the first device is always unknown - after that the list is correct. Do you use the first device from the list as "/"? |
Yes, the first device is the default one that is always unknown - this is just a fallback if no devices are retrieved from the system. But then there is the |
OK... so this might be the issue, because on macOS mount points are not always what they seem to be. APFS uses the concept of firmlinks, where one "partition" magically becomes one with another. It only applies to the operating system disk - the internal one. External disks work pretty much the same as in Linux. |
I think I could figure out what is going on if fclones had some debug output |
e.g., which device it assumed that folder used for |
Yeah, I'll add some |
If you could add the list of all enumerated devices, with as much of the info fclones is using as possible, that would help. I looked at the sysinfo crate and for macOS they do a bit of hacking, which might lead to fclones getting funny info. |
An APFS macOS disk has multiple volumes/"partitions".
The data of the folder /Users (and of any other user-writable folders) is stored in a "partition" called "Macintosh HD - Data", mounted at /System/Volumes/Data.
This is the mount point and partition name fclones discovers via sysinfo and correctly marks as SSD.
But from the user's perspective (and from fclones's), the folder Users has the path /Users/ - so for fclones any data there is on an unknown device.
This is the macOS file system way: multiple "partitions" are fused together via firmlinks into one filesystem. |
Later, when I have a moment, I will try to think about what would be the safest way to tackle it. Especially as fclones can also be used on older Macs where things are different. |
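The mismatch described above can be sketched in Rust. This is a simplified illustration, not fclones's actual code: the `Device` struct, `device_for_path`, and `sample_devices` are hypothetical, and the device table is an illustrative guess rather than real output. It assumes the device is chosen by matching mount points as path prefixes.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical, simplified stand-in for one entry of a device table.
#[derive(Debug, Clone, PartialEq)]
struct Device {
    name: String,
    mount_point: PathBuf,
}

/// Returns the device whose mount point is the longest prefix of `path`,
/// or `None` when no mount point matches (the "unknown device" case).
fn device_for_path<'a>(devices: &'a [Device], path: &Path) -> Option<&'a Device> {
    devices
        .iter()
        // `Path::starts_with` compares whole path components, not raw bytes.
        .filter(|d| path.starts_with(&d.mount_point))
        .max_by_key(|d| d.mount_point.components().count())
}

/// Illustrative device table (an assumption, not actual fclones output).
fn sample_devices() -> Vec<Device> {
    vec![
        Device {
            name: "Macintosh HD - Data".into(),
            mount_point: "/System/Volumes/Data".into(),
        },
        Device {
            name: "VM".into(),
            mount_point: "/System/Volumes/VM".into(),
        },
    ]
}
```

With this table, a lookup for `/Users/kptsky/FilesForDedup` finds no matching mount point, while the same data addressed as `/System/Volumes/Data/Users/kptsky` resolves to the Data volume.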
I think I got to the bottom of it. Testing my first Rust code now:) Will do a PR tomorrow. |
Here are the results of my debugging. The DiskDevices structure on macOS contains the following:
In this case it is a mix of system and external devices. We do not have to worry about VM, Preboot or Update - they are only used by the system. Nor about the external devices - I only wanted to see if their type is recognised correctly. The problem is with:
This is the device where all the users' data is... however, users never see or use this path, and neither does fclones. APFS uses firmlinks (in Apple's own words: "Bi-directional wormhole in path traversal. Firmlinks are used on the system volume to point to the user data on the data volume."). From the user's perspective the data device folders are part of the root filesystem. - https://www.swiftforensics.com/2019/10/macos-1015-volumes-firmlink-magic.html
This is very APFS-specific, but it is what makes fclones unable to recognise the correct device. E.g. when trying to deduplicate the folder /Users/kptsky/FilesForDedup there is no such device path in DiskDevices, and as a result the 'unknown' device is used. Solution? For APFS we have to help and point 'Macintosh HD - Data' to the root folder. Sending a PR with my proposed solution. |
macOS - '/System/Volumes/Data' DiskDevice path remapping
Fixes #158
On a macOS APFS disk all users' data is mounted at '/System/Volumes/Data', but it is fused transparently using firmlinks and presented as part of the root filesystem. The Data volume path for this DiskDevice has to be remapped to '/' so that fclones correctly recognises the device that the deduplicated files are on.
Ref:
https://www.swiftforensics.com/2019/10/macos-1015-volumes-firmlink-magic.html
https://eclecticlight.co/2020/01/23/catalina-boot-volumes/
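A minimal Rust sketch of the remapping idea described in this PR follows. It is not the merged code: `remap_mount_point`, `APFS_DATA_MOUNT`, and the `is_macos` parameter are hypothetical names chosen for illustration. Real code would more likely check `cfg!(target_os = "macos")` than take a flag; the flag is used here so the behaviour can be exercised on any platform.

```rust
use std::path::{Path, PathBuf};

/// Mount point of the APFS Data volume, as quoted in the PR description.
const APFS_DATA_MOUNT: &str = "/System/Volumes/Data";

/// Returns the mount point to use for path lookups.
/// On macOS the Data volume is remapped to "/" because firmlinks present
/// its contents as part of the root filesystem; every other mount point
/// is returned unchanged.
/// `is_macos` is a hypothetical parameter standing in for a platform check.
fn remap_mount_point(mount_point: &Path, is_macos: bool) -> PathBuf {
    if is_macos && mount_point == Path::new(APFS_DATA_MOUNT) {
        PathBuf::from("/")
    } else {
        mount_point.to_path_buf()
    }
}
```

After remapping, a path such as `/Users/kptsky/FilesForDedup` has the remapped mount point `/` as a prefix, so a prefix-based lookup would no longer fall through to the 'unknown' device.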
Thank you for merging it. Now it flies on macOS with defaults. |
Thank you for all the hard work on this. Awesome contribution! |
After thinking about this issue more, I came to the conclusion that the approach taken by fclones of identifying disks by file paths is fundamentally broken. I believe a much better way would be to use the device identifiers we already have in the |
As it is now it is good enough IMHO. I have looked at sysinfo source code for macOS part and it is full of hacks as well. Cross platform solutions are never easy:) |
When trying to find out why yadf is faster than fclones I have noticed much better CPU utilisation in yadf.
macOS 12.5.1
CPU: 2.3 GHz 8-Core Intel Core i9
SSD - internal NVME
and, not surprisingly, yadf usually finishes the duplicate search 2x faster than fclones
Actually, I have an idea. I looked at htop, because yadf is faster in every test I tried... yadf is using all my 8 cores vs fclones using only 4, which explains why it is about 2x faster
With fclones I can see that all 8 cores are used during the initial stages - but then only 4 when content hashes are calculated.
Originally posted by @kapitainsky in #153 (comment)