Improve performance of odgi sort #445
The global atomic `term_updates` is the main bottleneck, since every thread increments it. Each step operation now updates a thread-local variable instead, and the global atomic is only updated in batches.
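For readers who want to see the idea in isolation, here is a minimal sketch of batching updates to a shared atomic counter. It is not the code from this PR: `worker`, `run_workers`, and the batch size `kFlushInterval` are hypothetical names and values chosen for illustration, and the real loop in odgi has more state than a single counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for the shared counter that the PR calls term_updates.
std::atomic<uint64_t> term_updates{0};

// Hypothetical batch size; the PR does not state which value it uses.
constexpr uint64_t kFlushInterval = 1000;

// Each worker counts its steps locally and only touches the shared
// atomic once per kFlushInterval steps, plus one final flush.
void worker(uint64_t n_steps) {
    uint64_t local_updates = 0;  // private to this thread, no contention
    for (uint64_t i = 0; i < n_steps; ++i) {
        // ... one step operation would run here ...
        if (++local_updates == kFlushInterval) {
            term_updates.fetch_add(local_updates, std::memory_order_relaxed);
            local_updates = 0;
        }
    }
    // Flush the remainder so that no step goes uncounted.
    term_updates.fetch_add(local_updates, std::memory_order_relaxed);
}

// Runs n_threads workers and returns the total number of counted steps.
uint64_t run_workers(unsigned n_threads, uint64_t steps_per_thread) {
    assert(n_threads > 0);
    term_updates.store(0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        threads.emplace_back(worker, steps_per_thread);
    }
    for (auto& th : threads) {
        th.join();
    }
    return term_updates.load();
}
```

The trade-off in this sketch is that the global value can lag behind the true count by up to `kFlushInterval - 1` steps per thread, so any check against the global counter becomes slightly less precise in exchange for far fewer contended writes.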
@nsmlzl I am wondering how you did this? I am currently running pggb on chr20 (63 Mb) and it takes more than 1h using 64 threads of an AMD EPYC™ 7002 Series processor. Chr6 has 171 Mb and is much more complex. In practice, I didn't notice such a large speedup in any of my runs. Did I miss something?
That's strange. Maybe you are hitting a different bottleneck? A quick test on my AMD Ryzen 5 laptop:
On my AMD Ryzen 7 laptop I can observe a speedup, too:
I am just surprised that Chr20 takes 1h using 128 threads, when @nsmlzl was able to sort Chr6 with 60 threads in only 14 minutes above!
I will double check the parameters in
odgi sort performed much worse with the progress-indicator activated (
The global atomic `term_updates` is a major bottleneck in odgi sort, since all threads increment it in each step operation. This implementation therefore improves performance by incrementing a thread-local counter instead. The global atomic counter is only updated periodically (updates of the global counter are batched) to prevent memory congestion.

This small change improved the sorting of the Chr6 dataset with 60 threads from 1h 50 minutes down to 14-15 minutes (`/usr/bin/time -v ./odgi sort -i chr6.og --threads 60 -Y -o tmp.og`).

We also experimented with other optimizations:
- `work_todo` & `snapshot_in_progress` less frequent
- `delta_max` less frequent

However, those changes did not lead to significant improvements, so they were not added to this PR. In general, the computation time fluctuates by around 2 minutes between runs, which makes it difficult to assess which changes lead to (slight) improvements.
The CPU utilization (measured with `/usr/bin/time -v`) is still a bit lower than the configured value. However, our profiling showed that odgi uses the configured number of threads during the actual SGD algorithm. Major parts of loading the .og graph file run single-threaded. For chr6, around 4 minutes (of ~14-15 minutes total computation) are spent in the single-threaded `for_each_path_handle` function in `XP::from_handle_graph_impl`.
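To make the effect of the single-threaded loading phase on the reported utilization concrete, here is a rough back-of-the-envelope helper. It rests on idealized assumptions that are not measurements from this PR: the loading phase keeps exactly one core busy, and the rest of the run keeps all configured threads fully busy. `expected_avg_cores` is a hypothetical name used only for this illustration.

```cpp
// Estimates the average number of busy cores over a whole run, which is
// what "Percent of CPU this job got" from /usr/bin/time -v reflects
// (CPU time divided by elapsed time).
//
// Assumptions (idealized, not measured):
//  - serial_min minutes run on exactly one core,
//  - the remaining minutes keep all `threads` cores fully busy.
double expected_avg_cores(double serial_min, double total_min, unsigned threads) {
    double parallel_min = total_min - serial_min;
    double core_minutes = serial_min * 1.0 + parallel_min * threads;
    return core_minutes / total_min;
}
```

With the figures quoted above (about 4 minutes serial out of about 14.5 minutes, 60 threads), `expected_avg_cores(4.0, 14.5, 60)` comes out at roughly 44 cores under these assumptions. This is only meant to show why a single-threaded loading phase pulls the reported average below the configured thread count even when the SGD phase itself is fully parallel.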