
graph generation speedups #515
Merged · 5 commits merged into TravelMapping:master · Apr 12, 2022
Conversation

@yakra (Contributor) commented Apr 12, 2022

@jteresco, this doesn't make any of the changes I PMed you about on the forum.
What is here is mostly C++: various graph generation optimizations. The ones that I could also do in Python I did there too, though the results were underwhelming.


6c97b69 closes yakra#195.
Area graph vertex sets are populated in place instead of dealing with a bunch of temporaries, same idea as what #508 did with NMPs.

  • C++: cuts area graph time by ~28%. Though area graphs already took only 0.49-0.79 s depending on machine.
  • Python: saves about 1.5%.
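
Roughly, the in-place idea looks like this, as a minimal sketch; HGVertex, PlaceRadius and the proximity test are simplified placeholders here, not the actual siteupdate types:

```cpp
// Minimal sketch of populating an area graph's vertex set in place.
#include <cmath>
#include <unordered_set>
#include <vector>

struct HGVertex { double lat, lng; };

struct PlaceRadius {
    double lat, lng, r;                        // center & radius, simplified units
    bool contains(const HGVertex& v) const {   // placeholder proximity test
        double dlat = v.lat - lat, dlng = v.lng - lng;
        return std::sqrt(dlat * dlat + dlng * dlng) <= r;
    }
};

// Instead of building temporary sets and re-inserting their contents,
// test each vertex once and insert matches directly into the final set.
void area_vertices_in_place(std::vector<HGVertex>& all_vertices,
                            const PlaceRadius& area,
                            std::unordered_set<HGVertex*>& area_vertices)
{
    for (HGVertex& v : all_vertices)
        if (area.contains(v))
            area_vertices.insert(&v);
}
```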

33183f2 takes advantage of memory locality (and probably also faster iteration?) and speeds up:

  • initial vertex creation
  • getting matching sets for subgraphs

Some harmless redundancy is involved.
Writing subgraphs, this saves ~5-7% depending on machine @ 1 thread, tapering off as thread count increases. Hm. Wonder why that is. Cache line bouncing?
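
The gist, as a hedged sketch (the container choice and member names are illustrative, not the actual HighwayGraph internals):

```cpp
// Hedged sketch of the locality idea for vertex creation & matching sets.
#include <deque>
#include <string>
#include <unordered_set>

struct Region;                          // opaque placeholder

struct HGVertex {
    std::string unique_name;
    const Region* region = nullptr;
    // ... coordinates, incident edges, etc.
};

struct HighwayGraph {
    // One chunk-contiguous, pointer-stable container holding every vertex,
    // rather than a separate heap allocation per vertex scattered around memory.
    std::deque<HGVertex> vertices;

    // Getting a subgraph's matching set becomes a linear sweep over that
    // container: cache-friendly iteration with no pointer chasing through a map.
    std::unordered_set<HGVertex*> matching_vertices(const Region* rg) {
        std::unordered_set<HGVertex*> mv;
        for (HGVertex& v : vertices)
            if (v.region == rg)
                mv.insert(&v);
        return mv;
    }
};
```

A deque-style container keeps addresses stable as vertices are added, so HGVertex* pointers held elsewhere stay valid while still getting most of the locality benefit.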


f0d618a is a step toward the long-term goal of moving as much code as practical out of the main siteupdate.cpp file, whether by function calls or simple #includes.


12beccd uses a switch when writing subgraph vertices. Much more concise. Will help with readability when TravelMapping/EduTools#156 happens.
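
Something along these lines, sketched with assumed format names and a made-up visibility rule rather than the real writing code:

```cpp
// Sketch only: the format enum and the visibility rule are assumptions.
#include <fstream>
#include <string>

enum class format { simple, collapsed, traveled };   // assumed TM graph flavors

struct HGVertex { double lat, lng; std::string unique_name; bool visible; };

// One switch per vertex replaces near-duplicate per-format code paths;
// a future format (e.g. TravelMapping/EduTools#156) would just add a case.
void write_vertex(std::ofstream& tmg, const HGVertex& v, format fmt)
{
    switch (fmt) {
      case format::simple:
        tmg << v.unique_name << ' ' << v.lat << ' ' << v.lng << '\n';
        break;
      case format::collapsed:
      case format::traveled:
        // assume collapsed/traveled graphs only list vertices that
        // survive edge compression
        if (v.visible)
            tmg << v.unique_name << ' ' << v.lat << ' ' << v.lng << '\n';
        break;
    }
}
```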


f699eba 8f28a2c is 5 commits squashed together. Now we're getting into the good stuff!

  • V1a: store vertices in contiguous memory
    Writing subgraphs, this saves 5-9% @ 1 thread on the various Linux machines, tapering off as thread count increases.
    This sets the stage for the bit field I mentioned on the forum, but doesn't implement it just yet.
  • V1b: switches when writing master graph vertices & edges
  • T: allocate clinchedby_code once per graph; modify in place
    This avoids allocating a string for every edge, instead reusing one char* buffer for all the edges in a graph (see the sketch after this list).
    • Ubuntu: Saves ~9.5% on subgraphs, looking fairly flat WRT thread count.
    • CentOS: Much bigger impact here. At worst, lab1 is down 11.6% @ 1 thread. Savings increase with thread count up to ~10-12 threads, and then the time ratio relative to "V1b" starts inching back up a bit.
      My CentOS machines have older compilers without small string optimization. Across all traveled graphs, 34% of clinchedby_codes are < 16 bytes long, short enough that with SSO they never hit malloc in the first place. So this change only eliminates ~1.5x as many mallocs on CentOS as on Ubuntu; the mallocs alone can't explain the performance boost, and there has to be something else going on.
  • V1c: master vertex & edge counts in HighwayGraph object
    This is what should have happened in #258. Keeps track of these numbers during initial construction instead of needlessly making a 2nd pass thru the structure afterward to grab this data. Saves a bit of time Writing master TM graph files; cleans up the code a bit.
  • V1d: nix GRAPH & ADDGRAPH macros
    I always felt a little yucky about them. Plus, it's all done in a way that uses fewer lines of code.
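
For the "T" commit, a rough sketch of the buffer-reuse pattern; the edge layout and the character encoding below are stand-ins, not the real traveled-graph format:

```cpp
// Rough sketch: one cbycode buffer per graph, rewritten in place per edge.
#include <cstddef>
#include <fstream>
#include <vector>

struct HGEdge {
    std::vector<bool> clinched_by;   // one flag per traveler (assumed layout)
    // ... endpoints, route names, etc.
};

void write_traveled_edges(std::ofstream& tmg,
                          const std::vector<HGEdge>& edges,
                          size_t traveler_count)
{
    // One allocation, sized for the worst case, for the whole graph...
    char* cbycode = new char[traveler_count + 1];
    for (const HGEdge& e : edges) {
        // ...rewritten in place for each edge instead of constructing a new string.
        size_t len = 0;
        for (size_t t = 0; t < traveler_count; ++t)
            if (e.clinched_by[t])
                cbycode[len++] = char('0' + t % 10);   // placeholder encoding
        cbycode[len] = 0;
        tmg << cbycode << '\n';                        // plus the rest of the edge line
    }
    delete[] cbycode;
}
```

At most one heap allocation per graph instead of one (or more) per edge, which matters most on a standard library without SSO.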

Altogether, subgraph time is down 11.4% (lab5 @ 5 threads) to 44% (lab2 @ 8 threads).
Before, the record for subgraph processing was 6.3 s, set by lab5 @ 8 threads. Now, lab2 takes 4.4 s at 8 threads, despite its considerably slower clock. Must be its larger L3 cache.
Implementing the bit field will break the 4-second barrier. Other improvements in the pipeline also look quite promising, but haven't been formally tested yet.
[chart: Subg515a]
Hey, wait a sec -- where's bsdlab?
[chart: Subg515b]
...Oh. It's up here. :(
Some upcoming speedups or Clang's optimization flags (currently being tested) may put a dent in that processing time, but at this point I'm not holding my breath...

Finally, performance of several single-threaded tasks has also changed. Each cell shows before → after:

C++:

| task | BiggaTomato | lab1 | lab5 | lab2 | lab3 | lab4 | bsdlab |
|---|---|---|---|---|---|---|---|
| Setting up for graphs of highway data | .656 → .817 | .437 → .48 | .513 → .587 | .542 → .589 | .692 → .762 | .774 → .904 | .96 → 1.1540 |
| Creating unique names and vertices | 4.086 → 2.458 | 2.64 → 1.894 | 2.464 → 1.576 | 3.222 → 2.257 | 4.128 → 2.962 | 3.692 → 2.422 | 5.642 → 3.9840 |
| Creating edges | 1.476 → 1.455 | .928 → .843 | .995 → .967 | 1.165 → 1.067 | 1.422 → 1.382 | 1.55 → 1.482 | 2.488 → 2.4460 |
| Compressing collapsed edges | 1.107 → 1.079 | .661 → .616 | .77 → .743 | .798 → .758 | .968 → .952 | 1.154 → 1.122 | 1.34 → 1.3240 |
| Writing master TM graph files | 5.55 → 4.804 | 4.917 → 3.111 | 3.461 → 3.033 | 6.352 → 3.777 | 7.868 → 4.658 | 5.196 → 4.526 | 6.454 → 5.2480 |
| Setting up subgraphs | .199 → .168 | .109 → .084 | .128 → .091 | .121 → .089 | .164 → .13 | .182 → .126 | .166 → .1100 |
| combined | 13.074 → 10.781 | 9.692 → 7.028 | 8.331 → 6.997 | 12.2 → 8.537 | 15.242 → 10.846 | 12.548 → 10.582 | 17.05 → 14.266 |

Python:

| task | BiggaTomato | lab1 | lab5 | lab2 | lab3 | lab4 | bsdlab |
|---|---|---|---|---|---|---|---|
| Creating unique names and vertices | 22.86 → 22.66 | 11.68 → 11.56 | 10.51 → 10.42 | 13.77 → 13.67 | 19.24 → 19.14 | 15.36 → 15.34 | 23.98 → 23.4600 |
| Creating edges | 15.82 → 16.12 | 7.74 → 7.88 | 7.19 → 7.22 | 9.39 → 9.49 | 13.08 → 13.28 | 10.54 → 10.66 | 16.3 → 16.3600 |
| Compressing collapsed edges | 4.02 → 4.26 | 2.18 → 2.26 | 1.87 → 1.93 | 2.69 → 2.79 | 3.36 → 3.52 | 2.8 → 2.86 | 4.4 → 4.5600 |
| Writing master TM graph files | 29.66 → 28.8 | 17.64 → 17.31 | 14.12 → 13.76 | 22.17 → 21.65 | 28.26 → 27.74 | 21.5 → 21.0 | 36.3 → 35.2400 |
| subgraphs | 106.02 → 103.8 | 65.7 → 65.02 | 53.91 → 53.55 | 79.96 → 79.35 | 103.24 → 102.48 | 79.82 → 78.76 | 133.02 → 133.0800 |
| combined | 178.38 → 175.64 | 104.94 → 104.03 | 87.6 → 86.88 | 127.98 → 126.95 | 167.18 → 166.16 | 130.02 → 128.62 | 214 → 212.7 |

yakra added 5 commits April 7, 2022 09:02
Populate set in situ. Don't keep re-inserting into a series of temporaries.
V1a: store vertices in contiguous memory
V1b: switches in write_master_graphs_tmg()
T: allocate cbycode 1x per graph; modify in place
V1c: master v&e counts in HighwayGraph object
V1d: nix GRAPH & ADDGRAPH macros

Interactive rebase: 3b79e75b08d0548c30f03d867d170fd6c62eb1de
@yakra changed the title from "Graphs" to "graph generation speedups" on Apr 12, 2022
@jteresco (Contributor) commented:

Nice, thanks. I'll use this version in tonight's update.

@jteresco merged commit 7b34c0c into TravelMapping:master on Apr 12, 2022
@yakra (Contributor, Author) commented Apr 13, 2022

It makes me really happy to be able to prove myself wrong.
Sure, I did say siteupdate.py there, and most of the improvements are on the C++ side these days.
OTOH, technically Python did get a speed bump, however minuscule.

Coming up, the bit field and Clang's -O3 flag should yield a couple more decent speedups, but after that I think we're at the point of diminishing returns; I don't see much further optimization potential.


An interesting oddity I forgot to mention (that's not terribly important):

siteupdateST is a little bit (IIRC ~10%) slower writing subgraphs than regular (threaded) siteupdate -t 1.
Most tasks are faster in siteupdateST, I assume because we don't have whatever overhead is involved in executing & managing the threads.

All that's different for subgraphs is that:

I thought that just maybe the write_subgraphs_tmg process being interrupted by the preprocessing stage was causing some cache misses (but it shouldn't be that many, darnit!), so I set up siteupdateST to be more like the threaded version: fully populate the GraphListEntry vector, then process the whole thing with no interruptions via one call to SubgraphThread (as a regular function, not a thread). No difference. Subgraphs were the same speed as before, still slower than siteupdate -t 1.

So I have no explanation for this.
But again, not terribly important. Multithreaded siteupdate is faster, because multithreaded. So I'll happily use that version. :)

@jteresco (Contributor) commented:

I expect I'll finally make the switch to the C++ site update program next month for production, and take advantage of some of those cores on noreaster.

@yakra (Contributor, Author) commented Apr 13, 2022

I gotta say I'm really disappointed by its performance under FreeBSD. :(
At least, with subgraph generation, which I'd always kinda considered the "killer app" for multithreaded siteupdate. It simply. Does. Not. Scale.

However, it's by far the worst example out of the few tasks I've looked at for FreeBSD.

  • Processing traveler list files scales pretty nicely all the way to 24 threads, though it does lag behind lab3 & lab4 (same hardware but running CentOS & Ubuntu respectively) a bit.
  • Searching for near-miss points, the tail of the graph does start to trend upward a bit. Just barely. It's not as bad as it used to be. Below 16 threads, it performs comparably to lab4.
  • Reading waypoints for all routes scales pretty well, again lagging behind lab3 & 4. Interestingly, there's a bit of a bump up in processing time between 12 & 14 threads, though it stays flat after that. I wonder if that's related to the transition from physical to logical cores...
  • Creating per-traveler stats logs and augmenting data structure is kind of in the middle, comparable to lab3 & 4 up to 5 threads, when it starts to slowly lose ground. The line flattens out from 8 to 10 threads, and then... a mountain. We've hit a bottleneck.

On the positive side though, noreaster does nominally have 267% the memory bandwidth of bsdlab.
So while I'd still expect it to perform poorly compared to Linux, graph generation should at least scale a little bit better.
Or not. Ran a quick test of 1-5 threads on noreaster. Subgraphs are fastest @ 1 thread; processing time increases from there. ☹️

An idea I've been tossing around for the past few weeks is to create new charts, maybe of just lab3, lab4 & bsdlab to see how the different OSes compare on the same hardware. I'd update the charts for all threaded tasks (like I did before the gcc->clang transition) and we could see how FreeBSD stacks up.


One thing to be aware of before going to production:
Memory leaks? (Meme!)

#375 (comment)
Since writing that post, HighwayGraph components, Waypoints & colocation lists are properly cleaned up.
Regions, TravelerLists, HighwaySystems & by extension ConnectedRoutes, Routes, HighwaySegments & concurrency lists + whatever arrays are allocated on the heap for each, not so much.

Been taking my sweet time with writing proper destructors and calling delete out of a stubborn insistence on having the least possible impact on the Total run time figure at the end of siteupdate.log. 😁
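
The kind of destructor in question would look something like this, as a minimal sketch; the member lists are placeholders, and the real classes own quite a bit more than this:

```cpp
// Minimal sketch of cleaning up heap-allocated children in a destructor.
#include <vector>

struct Route          { /* waypoints, segments, ... */ };
struct ConnectedRoute { /* ... */ };

class HighwaySystem {
  public:
    std::vector<Route*> route_list;               // new'd during construction
    std::vector<ConnectedRoute*> con_route_list;

    ~HighwaySystem() {
        // Without this, every Route & ConnectedRoute allocated at startup is
        // only reclaimed when the process exits (what valgrind reports as
        // "in use at exit").
        for (Route* r : route_list) delete r;
        for (ConnectedRoute* cr : con_route_list) delete cr;
    }
};
```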

I wait till the program ends, and System Monitor says I magically get all those gigs of space back, and my RAM use is back where it was before I ran siteupdate. (Don't know if that's a compiler thing, or an OS thing, or how that works...)

  • valgrind says:
    ==24429== HEAP SUMMARY:
    ==24429==     in use at exit: 879,847,400 bytes in 19,037,811 blocks
    
    on lab2, CentOS, with 4 threads. That's a lot of RAM. Enough to notice.
  • Nonetheless, freshly booting up BiggaTomato (Ubuntu), System Monitor says I'm using about 800 MiB. After several runs of siteupdate, still at 800 MiB. OK, that's good!
  • Somewhat different story with bsdlab. From a fresh boot, I started a tmux session with htop in one pane. 525M in use. Then 2.16G, 2.76G, 3.01G, 3.24G, 3.34G, 3.35G, 3.37G, 3.38G...
    A little mysterious in that the first pass leaves us with a truckload more RAM in use, and subsequent passes use up smaller inconsistent amounts. It's not 839 MiB, but still.
    Edit: A 2nd attempt gets us from 526M to 2.40G from a fresh boot.
    A couple questions that may shed some light:
    • What does valgrind report for the heap summary under FreeBSD? 781,493,400 B.
    • What will htop report after running, say, siteupdate.py after a fresh boot? 1.82G.

I'm not sure what's going on here, or whether properly deleting everything will in fact help. But it can't hurt.
Making siteupdate "Valgrind clean" does seem like good practice.

Successfully merging this pull request may close these issues: PlaceRadius vertex search