
graph generation speedups #515
Merged · 5 commits merged into TravelMapping:master · Apr 12, 2022
Conversation

@yakra (Contributor) commented Apr 12, 2022

@jteresco, this doesn't make any of the changes I PMed you about on the forum.
What is here is mostly C++: various graph generation optimizations. The ones that I could also do in Python I did there too, though the results were underwhelming.


6c97b69 closes yakra#195.
Area graph vertex sets are populated in place instead of dealing with a bunch of temporaries, same idea as what #508 did with NMPs.

  • C++: cuts area graph time by ~28%. Though area graphs already took only 0.49-0.79 s depending on machine.
  • Python: saves about 1.5%.
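
Roughly, the in-place idea looks like this, as a minimal sketch; HGVertex, PlaceRadius and the proximity test are simplified placeholders here, not the actual siteupdate types:

```cpp
// Minimal sketch of populating an area graph's vertex set in place.
#include <cmath>
#include <unordered_set>
#include <vector>

struct HGVertex { double lat, lng; };

struct PlaceRadius {
    double lat, lng, r;                        // center & radius, simplified units
    bool contains(const HGVertex& v) const {   // placeholder proximity test
        double dlat = v.lat - lat, dlng = v.lng - lng;
        return std::sqrt(dlat * dlat + dlng * dlng) <= r;
    }
};

// Instead of building temporary sets and re-inserting their contents,
// test each vertex once and insert matches directly into the final set.
void area_vertices_in_place(std::vector<HGVertex>& all_vertices,
                            const PlaceRadius& area,
                            std::unordered_set<HGVertex*>& area_vertices)
{
    for (HGVertex& v : all_vertices)
        if (area.contains(v))
            area_vertices.insert(&v);
}
```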

33183f2 takes advantage of memory locality (and probably also faster iteration?) and speeds up:

  • initial vertex creation
  • getting matching sets for subgraphs

Some harmless redundancy is involved.
Writing subgraphs, this saves ~5-7% depending on machine @ 1 thread, tapering off as thread count increases. Hm. Wonder why that is. Cache line bouncing?
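
The gist, as a hedged sketch (the container choice and member names are illustrative, not the actual HighwayGraph internals):

```cpp
// Hedged sketch of the locality idea for vertex creation & matching sets.
#include <deque>
#include <string>
#include <unordered_set>

struct Region;                          // opaque placeholder

struct HGVertex {
    std::string unique_name;
    const Region* region = nullptr;
    // ... coordinates, incident edges, etc.
};

struct HighwayGraph {
    // One chunk-contiguous, pointer-stable container holding every vertex,
    // rather than a separate heap allocation per vertex scattered around memory.
    std::deque<HGVertex> vertices;

    // Getting a subgraph's matching set becomes a linear sweep over that
    // container: cache-friendly iteration with no pointer chasing through a map.
    std::unordered_set<HGVertex*> matching_vertices(const Region* rg) {
        std::unordered_set<HGVertex*> mv;
        for (HGVertex& v : vertices)
            if (v.region == rg)
                mv.insert(&v);
        return mv;
    }
};
```

A deque-style container keeps addresses stable as vertices are added, so HGVertex* pointers held elsewhere stay valid while still getting most of the locality benefit.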


f0d618a is a step toward the long-term goal of moving as much code as practical out of the main siteupdate.cpp file, whether by function calls or simple #includes.


12beccd uses a switch when writing subgraph vertices. Much more concise. Will help with readability when TravelMapping/EduTools#156 happens.
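
Something along these lines, sketched with assumed format names and a made-up visibility rule rather than the real writing code:

```cpp
// Sketch only: the format enum and the visibility rule are assumptions.
#include <fstream>
#include <string>

enum class format { simple, collapsed, traveled };   // assumed TM graph flavors

struct HGVertex { double lat, lng; std::string unique_name; bool visible; };

// One switch per vertex replaces near-duplicate per-format code paths;
// a future format (e.g. TravelMapping/EduTools#156) would just add a case.
void write_vertex(std::ofstream& tmg, const HGVertex& v, format fmt)
{
    switch (fmt) {
      case format::simple:
        tmg << v.unique_name << ' ' << v.lat << ' ' << v.lng << '\n';
        break;
      case format::collapsed:
      case format::traveled:
        // assume collapsed/traveled graphs only list vertices that
        // survive edge compression
        if (v.visible)
            tmg << v.unique_name << ' ' << v.lat << ' ' << v.lng << '\n';
        break;
    }
}
```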


f699eba 8f28a2c is 5 commits squashed together. Now we're getting into the good stuff!

  • V1a: store vertices in contiguous memory
    Writing subgraphs, this saves 5-9% @ 1 thread on the various Linux machines, tapering off as thread count increases.
    This sets the stage for the bit field I mentioned on the forum, but doesn't implement it just yet.
  • V1b: switches when writing master graph vertices & edges
  • T: allocate clinchedby_code once per graph; modify in place
    This avoids allocating a string for every edge, instead reusing one char* buffer for all the edges in a graph (see the sketch after this list).
    • Ubuntu: Saves ~9.5% on subgraphs, looking fairly flat WRT thread count.
    • CentOS: Much bigger impact here. At worst, lab1 is down 11.6% @ 1 thread. Savings increase with thread count up to ~10-12 threads, and then the time ratio relative to "V1b" starts inching back up a bit.
      My CentOS machines have older compilers without small string optimization. Across all traveled graphs, 34% of clinchedby_codes are < 16 bytes long, short enough that with SSO they never hit malloc in the first place. So this change only eliminates ~1.5x as many mallocs on CentOS as on Ubuntu; the mallocs alone can't explain the performance boost, and there has to be something else going on.
  • V1c: master vertex & edge counts in HighwayGraph object
    This is what should have happened in #258. Keeps track of these numbers during initial construction instead of needlessly making a 2nd pass thru the structure afterward to grab this data. Saves a bit of time Writing master TM graph files; cleans up the code a bit.
  • V1d: nix GRAPH & ADDGRAPH macros
    I always felt a little yucky about them. Plus, it's all done in a way that uses fewer lines of code.
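
For the "T" commit, a rough sketch of the buffer-reuse pattern; the edge layout and the character encoding below are stand-ins, not the real traveled-graph format:

```cpp
// Rough sketch: one cbycode buffer per graph, rewritten in place per edge.
#include <cstddef>
#include <fstream>
#include <vector>

struct HGEdge {
    std::vector<bool> clinched_by;   // one flag per traveler (assumed layout)
    // ... endpoints, route names, etc.
};

void write_traveled_edges(std::ofstream& tmg,
                          const std::vector<HGEdge>& edges,
                          size_t traveler_count)
{
    // One allocation, sized for the worst case, for the whole graph...
    char* cbycode = new char[traveler_count + 1];
    for (const HGEdge& e : edges) {
        // ...rewritten in place for each edge instead of constructing a new string.
        size_t len = 0;
        for (size_t t = 0; t < traveler_count; ++t)
            if (e.clinched_by[t])
                cbycode[len++] = char('0' + t % 10);   // placeholder encoding
        cbycode[len] = 0;
        tmg << cbycode << '\n';                        // plus the rest of the edge line
    }
    delete[] cbycode;
}
```

At most one heap allocation per graph instead of one (or more) per edge, which matters most on a standard library without SSO.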

Altogether, subgraph time is down 11.4% (lab5 @ 5 threads) to 44% (lab2 @ 8 threads).
Before, the record for subgraph processing was 6.3 s, set by lab5 @ 8 threads. Now, lab2 takes 4.4 s at 8 threads, despite its considerably slower clock. Must be its larger L3 cache.
Implementing the bit field will break the 4-second barrier. Other improvements in the pipeline also look quite promising, but haven't been formally tested yet.
[chart: Subg515a]
Hey, wait a sec -- where's bsdlab?
[chart: Subg515b]
...Oh. It's up here. :(
Some upcoming speedups or Clang's optimization flags (currently being tested) may put a dent in that processing time, but at this point I'm not holding my breath...

Finally, performance of several single-threaded tasks has also changed. Each cell shows before → after:

C++:

| task | BiggaTomato | lab1 | lab5 | lab2 | lab3 | lab4 | bsdlab |
|---|---|---|---|---|---|---|---|
| Setting up for graphs of highway data | .656 → .817 | .437 → .48 | .513 → .587 | .542 → .589 | .692 → .762 | .774 → .904 | .96 → 1.1540 |
| Creating unique names and vertices | 4.086 → 2.458 | 2.64 → 1.894 | 2.464 → 1.576 | 3.222 → 2.257 | 4.128 → 2.962 | 3.692 → 2.422 | 5.642 → 3.9840 |
| Creating edges | 1.476 → 1.455 | .928 → .843 | .995 → .967 | 1.165 → 1.067 | 1.422 → 1.382 | 1.55 → 1.482 | 2.488 → 2.4460 |
| Compressing collapsed edges | 1.107 → 1.079 | .661 → .616 | .77 → .743 | .798 → .758 | .968 → .952 | 1.154 → 1.122 | 1.34 → 1.3240 |
| Writing master TM graph files | 5.55 → 4.804 | 4.917 → 3.111 | 3.461 → 3.033 | 6.352 → 3.777 | 7.868 → 4.658 | 5.196 → 4.526 | 6.454 → 5.2480 |
| Setting up subgraphs | .199 → .168 | .109 → .084 | .128 → .091 | .121 → .089 | .164 → .13 | .182 → .126 | .166 → .1100 |
| combined | 13.074 → 10.781 | 9.692 → 7.028 | 8.331 → 6.997 | 12.2 → 8.537 | 15.242 → 10.846 | 12.548 → 10.582 | 17.05 → 14.266 |

Python:

| task | BiggaTomato | lab1 | lab5 | lab2 | lab3 | lab4 | bsdlab |
|---|---|---|---|---|---|---|---|
| Creating unique names and vertices | 22.86 → 22.66 | 11.68 → 11.56 | 10.51 → 10.42 | 13.77 → 13.67 | 19.24 → 19.14 | 15.36 → 15.34 | 23.98 → 23.4600 |
| Creating edges | 15.82 → 16.12 | 7.74 → 7.88 | 7.19 → 7.22 | 9.39 → 9.49 | 13.08 → 13.28 | 10.54 → 10.66 | 16.3 → 16.3600 |
| Compressing collapsed edges | 4.02 → 4.26 | 2.18 → 2.26 | 1.87 → 1.93 | 2.69 → 2.79 | 3.36 → 3.52 | 2.8 → 2.86 | 4.4 → 4.5600 |
| Writing master TM graph files | 29.66 → 28.8 | 17.64 → 17.31 | 14.12 → 13.76 | 22.17 → 21.65 | 28.26 → 27.74 | 21.5 → 21.0 | 36.3 → 35.2400 |
| subgraphs | 106.02 → 103.8 | 65.7 → 65.02 | 53.91 → 53.55 | 79.96 → 79.35 | 103.24 → 102.48 | 79.82 → 78.76 | 133.02 → 133.0800 |
| combined | 178.38 → 175.64 | 104.94 → 104.03 | 87.6 → 86.88 | 127.98 → 126.95 | 167.18 → 166.16 | 130.02 → 128.62 | 214 → 212.7 |

yakra added 5 commits April 7, 2022 09:02
Populate set in situ. Don't keep re-inserting into a series of temporaries.
V1a: store vertices in contiguous memory
V1b: switches in write_master_graphs_tmg()
T: allocate cbycode 1x per graph; modify in place
V1c: master v&e counts in HighwayGraph object
V1d: nix GRAPH & ADDGRAPH macros

Interactive rebase: 3b79e75b08d0548c30f03d867d170fd6c62eb1de
@yakra changed the title from "Graphs" to "graph generation speedups" on Apr 12, 2022
@jteresco (Contributor) commented:

Nice, thanks. I'll use this version in tonight's update.

@jteresco merged commit 7b34c0c into TravelMapping:master on Apr 12, 2022
@yakra (Contributor, Author) commented Apr 13, 2022

It makes me really happy to be able to prove myself wrong.
Sure, I did say siteupdate.py there, and most of the improvements are on the C++ side these days.
OTOH, technically Python did get a speed bump, however minuscule.

Coming up, the bit field and Clang's -O3 flag should yield a couple more decent speedups, but after that I think we're at the point of diminishing returns; I don't see much further optimization potential.


An interesting oddity I forgot to mention (that's not terribly important):

siteupdateST is a little bit (IIRC ~10%) slower writing subgraphs than regular (threaded) siteupdate -t 1.
Most tasks are faster in siteupdateST, I assume because we don't have whatever overhead is involved in executing & managing the threads.

All that's different for subgraphs is that:

I thought that just maybe the write_subgraphs_tmg process being interrupted by the preprocessing stage was causing some cache misses (but it shouldn't be that many, darnit!), so I set up siteupdateST to be more like the threaded version: fully populate the GraphListEntry vector, then process the whole thing with no interruptions via one call to SubgraphThread (as a regular function, not a thread). No difference. Subgraphs were the same speed as before, still slower than siteupdate -t 1.

So I have no explanation for this.
But again, not terribly important. Multithreaded siteupdate is faster, because multithreaded. So I'll happily use that version. :)

@jteresco (Contributor) commented:

I expect I'll finally make the switch to the C++ site update program next month for production, and take advantage of some of those cores on noreaster.

@yakra (Contributor, Author) commented Apr 13, 2022

I gotta say I'm really disappointed by its performance under FreeBSD. :(
At least, with subgraph generation, which I'd always kinda considered the "killer app" for multithreaded siteupdate. It simply. Does. Not. Scale.

However, it's by far the worst example out of the few tasks I've looked at for FreeBSD.

  • Processing traveler list files scales pretty nicely all the way to 24 threads, though it does lag behind lab3 & lab4 (same hardware but running CentOS & Ubuntu respectively) a bit.
  • Searching for near-miss points, the tail of the graph does start to trend upward a bit. Just barely. It's not as bad as it used to be. Below 16 threads, it performs comparably to lab4.
  • Reading waypoints for all routes scales pretty well, again lagging behind lab3 & 4. Interestingly, there's a bit of a bump up in processing time between 12 & 14 threads, though it stays flat after that. I wonder if that's related to the transition from physical to logical cores...
  • Creating per-traveler stats logs and augmenting data structure is kind of in the middle, comparable to lab3 & 4 up to 5 threads, when it starts to slowly lose ground. The line flattens out from 8 to 10 threads, and then... a mountain. We've hit a bottleneck.

On the positive side though, noreaster does nominally have 267% the memory bandwidth of bsdlab.
So while I'd still expect it to perform poorly compared to Linux, graph generation should at least scale a little bit better.
Or not. Ran a quick test of 1-5 threads on noreaster. Subgraphs are fastest @ 1 thread; processing time increases from there. ☹️

An idea I've been tossing around for the past few weeks is to create new charts, maybe of just lab3, lab4 & bsdlab to see how the different OSes compare on the same hardware. I'd update the charts for all threaded tasks (like I did before the gcc->clang transition) and we could see how FreeBSD stacks up.


One thing to be aware of before going to production:
Memory leaks? (Meme!)

#375 (comment)
Since writing that post, HighwayGraph components, Waypoints & colocation lists are properly cleaned up.
Regions, TravelerLists, HighwaySystems & by extension ConnectedRoutes, Routes, HighwaySegments & concurrency lists + whatever arrays are allocated on the heap for each, not so much.

Been taking my sweet time with writing proper destructors and calling delete out of a stubborn insistence on having the least possible impact on the Total run time figure at the end of siteupdate.log. 😁
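
The kind of destructor in question would look something like this, as a minimal sketch; the member lists are placeholders, and the real classes own quite a bit more than this:

```cpp
// Minimal sketch of cleaning up heap-allocated children in a destructor.
#include <vector>

struct Route          { /* waypoints, segments, ... */ };
struct ConnectedRoute { /* ... */ };

class HighwaySystem {
  public:
    std::vector<Route*> route_list;               // new'd during construction
    std::vector<ConnectedRoute*> con_route_list;

    ~HighwaySystem() {
        // Without this, every Route & ConnectedRoute allocated at startup is
        // only reclaimed when the process exits (what valgrind reports as
        // "in use at exit").
        for (Route* r : route_list) delete r;
        for (ConnectedRoute* cr : con_route_list) delete cr;
    }
};
```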

I wait till the program ends, and System Monitor says I magically get all those gigs of space back, and my RAM use is back where it was before I ran siteupdate. (Don't know if that's a compiler thing, or an OS thing, or how that works...)

  • valgrind says:
    ==24429== HEAP SUMMARY:
    ==24429==     in use at exit: 879,847,400 bytes in 19,037,811 blocks
    
    on lab2, CentOS, with 4 threads. That's a lot of RAM. Enough to notice.
  • Nonetheless, freshly booting up BiggaTomato (Ubuntu), System Monitor says I'm using about 800 MiB. After several runs of siteupdate, still at 800 MiB. OK, that's good!
  • Somewhat different story with bsdlab. From a fresh boot, I started a tmux session with htop in one pane. 525M in use. Then 2.16G, 2.76G, 3.01G, 3.24G, 3.34G, 3.35G, 3.37G, 3.38G...
    A little mysterious in that the first pass leaves us with a truckload more RAM in use, and subsequent passes use up smaller inconsistent amounts. It's not 839 MiB, but still.
    Edit: A 2nd attempt gets us from 526M to 2.40G from a fresh boot.
    A couple questions that may shed some light:
    • What does valgrind report for the heap summary under FreeBSD? 781,493,400 B.
    • What will htop report after running, say, siteupdate.py after a fresh boot? 1.82G.

I'm not sure what's going on here, or whether properly deleting everything will in fact help. But it can't hurt.
Making siteupdate "Valgrind clean" does seem like good practice.

Successfully merging this pull request may close these issues: PlaceRadius vertex search