Any way to get more speed? #39
I am currently using

parpar -n -t30 -r10% -s1M -dpow2

as the parameters to ParPar. I am reading files from a RAID10 SSD array and writing to that same array. The CPU has 32 cores; I'm using 30 threads in ParPar to leave some CPU cycles for other stuff.

Is there anything I can enable/disable to get more speed out of ParPar in this scenario?
Okay, so here are my CPU details:
I haven't really tested it across 30 threads, but does it manage to actually utilise the CPU reasonably well? Your CPU flags seem to suggest a Xeon E5v2, which doesn't support AVX2. By default, it shouldn't be using any AVX2 method, and if you force it, it should crash with SIGILL. As for your question:
If you have the option of choosing a different CPU, those extensions definitely help a fair bit, as they double/quadruple the amount of data that can be processed at once.
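For reference, a quick way to confirm which SIMD extensions a Linux box actually supports before forcing a --method - this is just a generic check, nothing ParPar-specific:

```sh
# List the SIMD-related CPU flags; if avx2 is missing, forcing an AVX2
# method would be expected to fail with SIGILL.
grep -o -w -E 'sse2|ssse3|sse4_1|sse4_2|avx|avx2|avx512bw' /proc/cpuinfo | sort -u
```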
It pulls 100% CPU on all threads, so I am not sure if the -t30 actually does anything. That's fine, but it's certainly CPU bound. There's no AVX2 indeed, I was just wondering if it would help. I tested it and will try it on another machine I have access to (AMD EPYC) to see how the different options fare. Thanks!
That's interesting, does a lower number reduce CPU usage?
All EPYC chips support AVX2, but the first gen EPYC internally has half-width vectors (so it won't be much faster than SSE/AVX) - second/third gen have full width AVX2 units, so get a better gain there.
Actually... I tested the development version as well. So on master 5MB slices are faster, but on develop 1M slices are faster.

Development Version

- With -s1M --method shuffle-avx
- With -s5M --method shuffle-avx (slower)
- With -s1M --method xorjit-sse
- With -s5M --method xorjit-sse

Master Version

- With -s1M --method shuffle-avx
- With -s5M --method shuffle-avx (a lot faster than 1M slices with AVX)
- With -s1M --method xorjit-sse (18s faster than AVX)
- With -s5M --method xorjit-sse (about the same as AVX and much faster than 1M slices)

All in all, the development version is faster overall.

Development with 2 threads

Yes, -t2 only attaches to 2 threads indeed. So it does work, and the speed is about the same with both AVX and XOR-JIT. What the heck?
1M slices are now a lot slower with 2 threads, though, on both AVX and XOR-JIT.
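For what it's worth, a sweep like the above can be scripted; this is only a sketch (the 6GB.bin file name and the test output name are placeholders, and it assumes parpar accepts -o for the output base name, as in the project's examples):

```sh
#!/bin/sh
# Try each GF method and slice size combination and time it.
for method in shuffle-avx xorjit-sse; do
  for slice in 1M 5M; do
    echo "== --method $method -s$slice =="
    time parpar -n -r10% -s"$slice" --method "$method" -o test 6GB.bin
    rm -f test*.par2   # clean up between runs
  done
done
```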
With 4 threads, the dev version and 5M slices, I got a 7GB rar set (70 files) done in 26 sec (~1GB larger than the previous single test file above).
Thanks for all those tests and reporting back! A few things:
PAR2 requires computing the MD5 of each file - unfortunately there's no way to multi-thread a single MD5 calculation, but you can compute multiple MD5s at the same time. In your single file tests, MD5 (or the disk read) could be a bottleneck if recovery can be computed fast enough. But if you want to play with the idea, using the interface_redesign branch, on your single file test, you can replace ...
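Just to illustrate the "multiple MD5s at once" idea with standard tools (this isn't how ParPar does it internally; the four-way parallelism and the *.rar pattern are arbitrary):

```sh
# Each md5sum process is single-threaded, but running four at a time
# lets a multi-file job use several cores for hashing.
find . -maxdepth 1 -type f -name '*.rar' -print0 | xargs -0 -P4 -n1 md5sum
```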
Yeah, it's a dual-socket 8-core system with HT indeed.
Yep, sort of:
More tests:

- No recovery
- 5% recovery
- 5% recovery with 4 threads (more than 4 threads didn't improve things)
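As a rough back-of-the-envelope (my own numbers, assuming these runs used -s1M on the ~6GB file): 1MB slices give roughly 6,000 input slices, so 5% recovery is around 300 recovery slices, and the recovery computation scales roughly with input size × number of recovery slices - which is why the no-recovery run is so much cheaper than the 5% runs.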
I have played around with that option. When I make a split archive of that 6GB.bin file with 100MB parts and then run the same command, I get 14 seconds, and with 6 threads I get 11 seconds. So threading works better with more files indeed, which makes sense. But using all the threads in the system actually slowed it down to 20s instead of the 11s.
Using more than 6 threads didn't improve things. The scheduler might have put some threads on a different NUMA node.
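If the NUMA suspicion seems worth chasing, pinning the whole run to one node is a quick experiment. numactl is a separate Linux tool, not part of ParPar, and the node number, thread count and file names below are placeholders:

```sh
# Keep all ParPar threads and their memory allocations on NUMA node 0.
numactl --cpunodebind=0 --membind=0 \
  parpar -n -t6 -r10% -s5M -o test archive.part*.rar
```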
Thanks again for all those tests!
Interesting, maybe it's due to the second pass required, and shuffle-avx just being slower than xorjit-sse. From your new tests, it seems that the first pass cannot go below ~18 seconds (disk/hashing bottleneck), and the 4-thread test roughly matches this figure, so it's likely bottlenecking there. Also, doing a second pass appears to add a fair bit of time, which may be a key factor in the original 5M slice test getting roughly the same speed as the 1M slice test.
The option has little effect for single file tests - it's mostly beneficial if there are a lot of files and there's a hashing bottleneck.
...whereas in this case, the smaller hash queue likely gives the most benefit.
Thanks for the help, I got the speed at least tripled or so now :-)