Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of odgi sort #445

Merged
merged 1 commit into from
Sep 15, 2022
Merged

Conversation

nsmlzl
Copy link
Contributor

@nsmlzl nsmlzl commented Sep 15, 2022

The global atomic term_updates is a major bottleneck in odgi sort since all threads are incrementing it in each step operation. Therefore, this implementation improves the performance by incrementing a thread local counter. The global atomic counter is only updated from time to time (update of the global counter is batched) to prevent memory congestion.

This small change improved the sorting of the Chr6 dataset with 60 threads from 1h 50 Minutes down to 14-15 Minutes (/usr/bin/time -v ./odgi sort -i chr6.og --threads 60 -Y -o tmp.og).

We also experimented with other optimizations:

However, those changes did not lead to significant improvements. Therefore, were not added to this PR. In general, the computation time fluctuates by around 2 minutes. This makes it difficult to assess which changes lead to (slight) improvements.

The CPU utilization (measured with /usr/bin/time -v) is still a bit lower than the configured value. However, our profiling showed that during the actual SGD algorithm odgi uses the configured number of threads. During major parts of loading the .og graph file, the CPU utilization is single-threaded. For chr6 around 4 minutes (total computation takes ~14-15 minutes) are spend on this single-threaded for_each_path_handle function in XP::from_handle_graph_impl.

The global atomic term_updates is the main bottleneck, since all
threads are incrementing it. Therefore, a thread local variable
is updated in each step operation. The global atomic is only updated
from time to time.
@subwaystation
Copy link
Member

🔥

@AndreaGuarracino
Copy link
Member

Hot

@AndreaGuarracino AndreaGuarracino merged commit 3c84734 into pangenome:master Sep 15, 2022
@subwaystation
Copy link
Member

@nsmlzl I am wondering how you did this? I am currently running pggb on chr20 (63Mb) and it takes more than 1h using 64 threads of an AMD EPYC™ 7002 Series Processors. Chr6 has 171 Mb and is much more complex.

In practice, I didn't notice such a large speedup during all my runs. Did I miss something?

@AndreaGuarracino
Copy link
Member

AndreaGuarracino commented Nov 17, 2022

That's strange. Maybe are you hitting a different bottleneck?

A quick test on my AMD Ryzen 5 laptop:

# New odgi sort
\time -v odgi_new sort -t 12 -i LPA.fa.gz.3503181.*.og -o sort.new.og -Y -x 1000
	Command being timed: "odgi_new sort -t 12 -i LPA.fa.gz.3503181.417fcdf.21d4386.smooth.final.og -o sort.new.og -Y -x 1000"
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.34
	Maximum resident set size (kbytes): 42648

# Old odgi sort
\time -v odgi_old sort -t 12 -i LPA.fa.gz.3503181.*.og -o sort.fast.og -Y -x 1000
	Command being timed: "odgi_old sort -t 12 -i LPA.fa.gz.3503181.417fcdf.21d4386.smooth.final.og -o sort.old.og -Y -x 1000"
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.19
	Maximum resident set size (kbytes): 42628

@subwaystation
Copy link
Member

subwaystation commented Nov 18, 2022

On my AMD Ryzen 7 laptop I can observe a speedup, too:

#OLD
	Command being timed: "odgi_ sort -i cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish.gfa.og -o cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish.gfa.og.Y -P -t 16 -Y"
	User time (seconds): 280.95
	System time (seconds): 1.23
	Percent of CPU this job got: 1375%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.52
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 492936
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 48499
	Voluntary context switches: 28422
	Involuntary context switches: 41413
	Swaps: 0
	File system inputs: 0
	File system outputs: 479944
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

# NEW
	Command being timed: "odgi sort -i cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish.gfa.og -o cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish.gfa.og.Y -P -t 16 -Y"
	User time (seconds): 181.48
	System time (seconds): 0.97
	Percent of CPU this job got: 1289%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.14
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 490920
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 13
	Minor (reclaiming a frame) page faults: 48695
	Voluntary context switches: 23315
	Involuntary context switches: 15934
	Swaps: 0
	File system inputs: 0
	File system outputs: 479944
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

I am just surprised for Chr20 it should take 1h using 128 threads, when above @nsmlzl was able to sort Chr6 with 60 threads in only 14 minutes!

@subwaystation
Copy link
Member

I will double check the parameters in smoothxg.

@nsmlzl
Copy link
Contributor Author

nsmlzl commented Nov 23, 2022

odgi sort performed much worse with the progress-indicator activated (-P argument). This was fixed with #458 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants