Improve performance of odgi sort #445

nsmlzl · 2022-09-15T14:35:05Z

The global atomic term_updates is a major bottleneck in odgi sort because every thread increments it in each step operation. This implementation improves performance by incrementing a thread-local counter instead. The global atomic counter is updated only periodically, in batches, which avoids memory contention.
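The batching idea described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name run_batched, the local counter name local_updates, and the batch size of 1000 are assumptions made for the example, and the SGD step itself is left out.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical batch size; the interval actually used by odgi is not stated here.
constexpr uint64_t kBatchSize = 1000;

// Starts n_threads workers that each perform steps_per_thread step operations
// and returns the final value of the shared counter.
uint64_t run_batched(unsigned n_threads, uint64_t steps_per_thread) {
    std::atomic<uint64_t> term_updates{0};  // shared counter, contended if touched every step
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&term_updates, steps_per_thread]() {
            uint64_t local_updates = 0;  // private to this thread, no contention
            for (uint64_t i = 0; i < steps_per_thread; ++i) {
                // ... one step operation would run here ...
                if (++local_updates == kBatchSize) {
                    // Publish a whole batch with a single atomic operation.
                    term_updates.fetch_add(local_updates, std::memory_order_relaxed);
                    local_updates = 0;
                }
            }
            // Flush the remainder so that no updates are lost.
            term_updates.fetch_add(local_updates, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return term_updates.load();
}
```

With this pattern the shared cache line is written once per batch instead of once per step. The trade-off is that readers of the global counter see a value that can lag behind by up to one batch per thread.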

This small change reduced the time to sort the Chr6 dataset with 60 threads from 1 h 50 min to 14-15 min (/usr/bin/time -v ./odgi sort -i chr6.og --threads 60 -Y -o tmp.og).

We also experimented with other optimizations:

Read the atomics work_todo and snapshot_in_progress less frequently

Update progress less frequently

Update delta_max less frequently

Use thread-local variables for eta, adj_theta, and cooling instead of reading global atomics

Use a different update interval

However, those changes did not lead to significant improvements, so they were not added to this PR. In general, the computation time fluctuates by around 2 minutes, which makes it difficult to assess which changes lead to (slight) improvements.
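To illustrate the first experiment in the list above (reading a shared flag less often), here is a minimal sketch. It is not part of this PR. The PollStats struct, the run_steps function, and the check_interval parameter are hypothetical names, and only the flag name work_todo is taken from the list.

```cpp
#include <atomic>
#include <cstdint>

// Counts how many steps ran and how often the shared flag was read.
struct PollStats {
    uint64_t steps;
    uint64_t flag_reads;
};

// Runs up to max_steps step operations, but reads the shared work_todo flag
// only once every check_interval steps (check_interval must be > 0).
PollStats run_steps(const std::atomic<bool>& work_todo,
                    uint64_t max_steps,
                    uint64_t check_interval) {
    PollStats stats{0, 0};
    while (stats.steps < max_steps) {
        if (stats.steps % check_interval == 0) {
            ++stats.flag_reads;
            // Stop as soon as a poll observes that the work is done.
            if (!work_todo.load(std::memory_order_relaxed)) break;
        }
        // ... one step operation would run here ...
        ++stats.steps;
    }
    return stats;
}
```

A worker using this pattern may run up to check_interval - 1 extra steps after the flag is cleared. Unlike a counter that every thread writes, a flag that is mostly read tends to stay shared in each core's cache, which may be one reason this kind of change yields little.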

The CPU utilization (measured with /usr/bin/time -v) is still a bit lower than the configured thread count would suggest. However, our profiling showed that odgi uses the configured number of threads during the actual SGD algorithm. Major parts of loading the .og graph file run single-threaded. For chr6, around 4 minutes (of ~14-15 minutes total) are spent in the single-threaded for_each_path_handle function in XP::from_handle_graph_impl.

The global atomic term_updates is the main bottleneck, since all threads are incrementing it. Therefore, a thread-local variable is updated in each step operation. The global atomic is only updated from time to time.

subwaystation · 2022-09-15T14:47:20Z

🔥

AndreaGuarracino · 2022-09-15T16:17:37Z

Hot

subwaystation · 2022-11-17T09:33:14Z

@nsmlzl I am wondering how you did this. I am currently running pggb on chr20 (63 Mb), and it takes more than 1h using 64 threads of an AMD EPYC™ 7002 Series processor. Chr6 has 171 Mb and is much more complex.

In practice, I didn't notice such a large speedup during all my runs. Did I miss something?

AndreaGuarracino · 2022-11-17T19:16:35Z

That's strange. Maybe you are hitting a different bottleneck?

A quick test on my AMD Ryzen 5 laptop:

# New odgi sort
\time -v odgi_new sort -t 12 -i LPA.fa.gz.3503181.*.og -o sort.new.og -Y -x 1000
	Command being timed: "odgi_new sort -t 12 -i LPA.fa.gz.3503181.417fcdf.21d4386.smooth.final.og -o sort.new.og -Y -x 1000"
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.34
	Maximum resident set size (kbytes): 42648

# Old odgi sort
\time -v odgi_old sort -t 12 -i LPA.fa.gz.3503181.*.og -o sort.fast.og -Y -x 1000
	Command being timed: "odgi_old sort -t 12 -i LPA.fa.gz.3503181.417fcdf.21d4386.smooth.final.og -o sort.old.og -Y -x 1000"
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.19
	Maximum resident set size (kbytes): 42628

subwaystation · 2022-11-18T08:21:12Z

On my AMD Ryzen 7 laptop I can observe a speedup, too:

# OLD
	Command being timed: "odgi_ sort -i cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish.gfa.og -o cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish.gfa.og.Y -P -t 16 -Y"
	User time (seconds): 280.95
	System time (seconds): 1.23
	Percent of CPU this job got: 1375%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.52
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 492936
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 48499
	Voluntary context switches: 28422
	Involuntary context switches: 41413
	Swaps: 0
	File system inputs: 0
	File system outputs: 479944
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

# NEW
	Command being timed: "odgi sort -i cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish.gfa.og -o cerevisiae.pan.fa.pggb-W-s50000-l150000-p90-n5-a0-K16-k8.seqwish.gfa.og.Y -P -t 16 -Y"
	User time (seconds): 181.48
	System time (seconds): 0.97
	Percent of CPU this job got: 1289%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.14
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 490920
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 13
	Minor (reclaiming a frame) page faults: 48695
	Voluntary context switches: 23315
	Involuntary context switches: 15934
	Swaps: 0
	File system inputs: 0
	File system outputs: 479944
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
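The timings reported above can be compared with a little arithmetic. The sketch below uses two hypothetical helpers, speedup and busy_cores. The second assumes that the "Percent of CPU" figure printed by time corresponds to (user time + system time) / elapsed time, expressed as a percentage.

```cpp
// Ratio of old to new wall-clock time; values above 1 mean the new run is faster.
double speedup(double old_seconds, double new_seconds) {
    return old_seconds / new_seconds;
}

// Average number of cores kept busy over the run:
// (user time + system time) / elapsed wall-clock time.
double busy_cores(double user_seconds, double system_seconds, double elapsed_seconds) {
    return (user_seconds + system_seconds) / elapsed_seconds;
}
```

For the cerevisiae graph, speedup(20.52, 14.14) is about 1.45. For the LPA test it is speedup(11.19, 4.34), about 2.58. busy_cores(280.95, 1.23, 20.52) gives about 13.75 for the old run, which matches the reported 1375%. The new run comes out at about 12.9 of the 16 configured threads.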

I am just surprised that Chr20 should take 1h using 128 threads when, above, @nsmlzl was able to sort Chr6 with 60 threads in only 14 minutes!

subwaystation · 2022-11-18T08:21:29Z

I will double check the parameters in smoothxg.

nsmlzl · 2022-11-23T09:43:53Z

odgi sort performed much worse with the progress indicator activated (-P argument). This was fixed with #458.

Update term_updates less frequent

9357122


nsmlzl mentioned this pull request Sep 15, 2022

Improve performance of odgi layout #446

Merged

AndreaGuarracino merged commit 3c84734 into pangenome:master Sep 15, 2022

subwaystation mentioned this pull request Sep 16, 2022

odgi sort/layout power up nf-core/pangenome#86

Merged

11 tasks
