graph generation speedups #515
Populate set in situ. Don't keep re-inserting into a series of temporaries.
V1a: store vertices in contiguous memory
V1b: switches in write_master_graphs_tmg()
T: allocate cbycode 1x per graph; modify in place
V1c: master v&e counts in HighwayGraph object
V1d: nix GRAPH & ADDGRAPH macros
Interactive rebase 3b79e75b08d0548c30f03d867d170fd6c62eb1de
Nice, thanks. I'll use this version in tonight's update.
It makes me really happy to be able to prove myself wrong. Coming up, the bit field and Clang's An interesting oddity I forgot to mention...
All that's different for subgraphs is that:
I thought that just maybe, the So I have no explanation for this.
I expect I'll finally make the switch to the C++ site update program next month for production, and take advantage of some of those cores on noreaster.
I gotta say I'm really disappointed by its performance under FreeBSD. :( However, it's by far the worst example out of the few tasks I've looked at for FreeBSD.
On the positive side though, noreaster does nominally have 267% the memory bandwidth of bsdlab.

An idea I've been tossing around for the past few weeks is to create new charts, maybe of just lab3, lab4 & bsdlab, to see how the different OSes compare on the same hardware. I'd update the charts for all threaded tasks (like I did before the gcc->clang transition) and we could see how FreeBSD stacks up.

One thing to be aware of before going to production: #375 (comment)

Been taking my sweet time with writing proper destructors and calling
I'm not sure what's going on here, or whether properly deleting everything will in fact help. But it can't hurt.
@jteresco, this doesn't make any of the changes I PMed you about on the forum.
What is here is mostly C++, various graph generation optimizations. The ones that I could also do in Python I did there too, though the results were underwhelming.
6c97b69 closes yakra#195.
Area graph vertex sets are populated in place instead of dealing with a bunch of temporaries, same idea as what #508 did with NMPs.
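A minimal sketch of the populate-in-place idea, for illustration only. `Vertex`, `Source`, and both function names are hypothetical stand-ins, not the project's actual types or the code in 6c97b69. The first version builds a temporary set per source and re-inserts it into the result; the second hands the destination set down and inserts into it directly.

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for illustration; not the project's real classes.
struct Vertex { int id; };
struct Source { std::vector<Vertex*> vertices; };

// Before: every source yields a temporary set, which is then copied
// element by element into the result (extra allocations and re-hashing).
std::unordered_set<Vertex*> collect_with_temporaries(const std::vector<Source>& sources) {
    std::unordered_set<Vertex*> result;
    for (const Source& s : sources) {
        std::unordered_set<Vertex*> tmp(s.vertices.begin(), s.vertices.end());
        result.insert(tmp.begin(), tmp.end());
    }
    return result;
}

// After: one destination set, populated in place by every source.
std::unordered_set<Vertex*> collect_in_place(const std::vector<Source>& sources) {
    std::unordered_set<Vertex*> result;
    for (const Source& s : sources)
        result.insert(s.vertices.begin(), s.vertices.end());
    return result;
}
```

Both functions return the same set; the second just skips the intermediate containers.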
33183f2 takes advantage of memory locality (and probably also faster iteration?) and speeds up:
Some harmless redundancy is involved.
Writing subgraphs, saves ~5-7% depending on machine @ 1 thread, tapering off as thread count increases. Hm. Wonder why that is. Cache line bouncing?
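As a generic illustration of the memory-locality effect (not the actual change in 33183f2, which isn't shown here), the sketch below contrasts iterating over individually heap-allocated objects with iterating over the same objects stored back to back. The `Vertex` fields and function names are assumptions made up for the example.

```cpp
#include <memory>
#include <vector>

// Hypothetical vertex record; field names are assumptions for this sketch.
struct Vertex {
    double lat;
    double lng;
    unsigned edge_count;
};

// Scattered: each Vertex is a separate heap allocation, so every step of
// the loop follows a pointer to a possibly distant address.
unsigned long total_edges_scattered(const std::vector<std::unique_ptr<Vertex>>& vertices) {
    unsigned long sum = 0;
    for (const auto& v : vertices) sum += v->edge_count;
    return sum;
}

// Contiguous: the Vertex objects sit next to each other in one array, so
// the loop walks memory sequentially and uses each cache line fully.
unsigned long total_edges_contiguous(const std::vector<Vertex>& vertices) {
    unsigned long sum = 0;
    for (const Vertex& v : vertices) sum += v.edge_count;
    return sum;
}
```

The two loops compute the same result; any speed difference depends on the allocator, the cache sizes, and the access pattern, so it has to be measured per machine.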
f0d618a is a step toward the long-term goal of moving as much code as practical out of the main siteupdate.cpp file, whether by function calls or simple `#include`s.

12beccd uses a `switch` when writing subgraph vertices. Much more concise. Will help with readability when TravelMapping/EduTools#156 happens.

f699eba8f28a2c is 5 commits squashed together. Now we're getting into the good stuff!

- V1a, store vertices in contiguous memory: Writing subgraphs, saves 5-9% @ 1 thread on the various Linux machines, tapering off as thread count increases. This sets the stage for the bit field I mentioned on the forum, but doesn't implement it just yet.
- V1b: `switch`es when writing master graph vertices & edges.
- T: allocate `clinchedby_code` once per graph; modify in place. This avoids having to allocate a string for every edge, instead reusing one `char*` for each edge in a graph. My CentOS machines have older compilers without small string optimization. Across all traveled graphs, 34% of `clinchedby_code`s are < 16 bytes long. Meaning, this improvement saves us only ~1.5x the mallocs on CentOS as on Ubuntu (with SSO, only the other 66% of the codes needed a malloc in the first place, and 1/0.66 ≈ 1.5). The mallocs alone can't explain the performance boost; there has to be something else going on.
- V1c, master v&e counts in HighwayGraph object: This is what should have happened in #258. Keeps track of these numbers during initial construction instead of needlessly making a 2nd pass thru the structure afterward to grab this data. Saves a bit of time Writing master TM graph files; cleans up the code a bit.
- V1d, nix GRAPH & ADDGRAPH macros: I always felt a little yucky about them. Plus, it's all done in a way that uses fewer lines of code.
All together, subgraph time is down 11.4% (lab5 @ 5 threads) to 44% (lab2 @ 8 threads).
Before, the record for subgraph processing was 6.3 s, set by lab5 @ 8 threads. Now, lab2 takes 4.4 s at 8 threads, despite its considerably slower clock. Must be its larger L3 cache.
Implementing the bit field will break the 4 second barrier. Other improvements in the pipeline additionally looked quite promising, but haven't been formally tested yet.
Hey, wait a sec -- where's bsdlab?
...Oh. It's up here. :(
Some upcoming speedups or Clang's optimization flags (currently being tested) may put a dent in that processing time, but at this point I'm not holding my breath...
Finally, performance of several single-threaded tasks has also changed. Top = before; bottom = after:
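C++:

| | | | | | | | |
|---|---|---|---|---|---|---|---|
| | .817 | .48 | .587 | .589 | .762 | .904 | 1.1540 |
| | 2.458 | 1.894 | 1.576 | 2.257 | 2.962 | 2.422 | 3.9840 |
| | 1.455 | .843 | .967 | 1.067 | 1.382 | 1.482 | 2.4460 |
| | 1.079 | .616 | .743 | .758 | .952 | 1.122 | 1.3240 |
| | 4.804 | 3.111 | 3.033 | 3.777 | 4.658 | 4.526 | 5.2480 |
| | .168 | .084 | .091 | .089 | .13 | .126 | .1100 |
| Total | 10.781 | 7.028 | 6.997 | 8.537 | 10.846 | 10.582 | 14.266 |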
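Python:

| | | | | | | | |
|---|---|---|---|---|---|---|---|
| | 22.66 | 11.56 | 10.42 | 13.67 | 19.14 | 15.34 | 23.4600 |
| | 16.12 | 7.88 | 7.22 | 9.49 | 13.28 | 10.66 | 16.3600 |
| | 4.26 | 2.26 | 1.93 | 2.79 | 3.52 | 2.86 | 4.5600 |
| | 28.8 | 17.31 | 13.76 | 21.65 | 27.74 | 21.0 | 35.2400 |
| | 103.8 | 65.02 | 53.55 | 79.35 | 102.48 | 78.76 | 133.0800 |
| Total | 175.64 | 104.03 | 86.88 | 126.95 | 166.16 | 128.62 | 212.7 |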