Learn C The Hard Way: A Learn Code The Hard Way Book
Chapter 42
Exercise 41: Using Cachegrind And Callgrind For Performance Tuning
In this exercise I'm going to give you a quick course in using two
Valgrind tools called callgrind and cachegrind. These two tools
analyze your program's execution and tell you which parts are running
slowly. The results are accurate because of the way Valgrind works,
and they help you spot problems such as lines of code that execute too
often, hot spots, memory access problems, and even CPU cache misses.
To do this exercise I'm going to use the bstree_tests unit tests you
just wrote to look for places to improve the algorithms used. Make sure
your version of this program runs without any Valgrind errors
and that your code is exactly like mine. I'll be using dumps of my code to
talk about how cachegrind and callgrind work.
42.1 Running Callgrind
To run callgrind you pass the --tool=callgrind option to valgrind, and
it will produce a callgrind.out.PID file (where PID is replaced with the
process ID of the program that ran). Once you've run it you can analyze
this callgrind.out file with a tool called callgrind_annotate, which
tells you which functions used the most instructions to run. Here's
an example of running callgrind on bstree_tests and then getting its
information:
__________________________________________________________________
Source 133: Callgrind On bstree_tests
$ valgrind --dsymutil=yes --tool=callgrind tests/bstree_tests
...
$ callgrind_annotate callgrind.out.1232
--------------------------------------------------------------------------------
Profile data file 'callgrind.out.1232' (creator: callgrind-3.7.0.SVN)
--------------------------------------------------------------------------------
I1 cache:
D1 cache:
LL cache:
Timerange: Basic block 0 - 1098689
Trigger: Program termination
Profiled target: tests/bstree_tests (PID 1232, part 1)
Events recorded: Ir
Events shown: Ir
Event sort order: Ir
Thresholds: 99
Include dirs:
User annotated:
Auto-annotation: off

--------------------------------------------------------------------------------
Ir
--------------------------------------------------------------------------------
4,605,808 PROGRAM TOTALS

--------------------------------------------------------------------------------
     Ir file:function
--------------------------------------------------------------------------------
670,486 src/lcthw/bstrlib.c:bstrcmp [tests/bstree_tests]
194,377 src/lcthw/bstree.c:BSTree_get [tests/bstree_tests]
 65,580 src/lcthw/bstree.c:default_compare [tests/bstree_tests]
 16,338 src/lcthw/bstree.c:BSTree_delete [tests/bstree_tests]
 13,000 src/lcthw/bstrlib.c:bformat [tests/bstree_tests]
 11,000 src/lcthw/bstrlib.c:bfromcstralloc [tests/bstree_tests]
  7,774 src/lcthw/bstree.c:BSTree_set [tests/bstree_tests]
  5,800 src/lcthw/bstrlib.c:bdestroy [tests/bstree_tests]
  2,323 src/lcthw/bstree.c:BSTreeNode_create [tests/bstree_tests]
  1,183 /private/tmp/pkg-build/coregrind//vg_preloaded.c:vg_cleanup_env [/usr/local/lib/valgrind/vgpreload_core-amd64-darwin.so]

$
__________________________________________________________________
I've removed the unit test run and the valgrind output since they're not
very useful for this exercise. What you should look at is the
callgrind_annotate output. It shows the number of
instructions run (which valgrind calls Ir) for each function, with the
functions sorted highest to lowest. You can usually ignore the header
data and jump straight to the list of functions.
__________________________________________________________________
Note 13: More OSX Annoyances
If you get a ton of "???:Image" lines and things that are not in
your program, then you're picking up junk from the OS. Just add
| grep -v "???" at the end to filter those out.
__________________________________________________________________
I can now do a quick breakdown of this output to figure out where to
look next:
1. Each line lists the number of Ir and the file:function that
executed them. The Ir is just the instruction count, and if you
make that lower then you have made it faster. There's some
complexity to that, but at first just focus on getting the Ir down.
2. The way to attack this is to look at your top functions, and then
see which one you think you can improve first. In this case, I'd
look at improving bstrcmp or BSTree_get. It's probably easier to
start with BSTree_get.
3. Some of these functions are just called from the unit test, so I
would just ignore those. Functions like bformat, bfromcstralloc,
and bdestroy fit this description.
4. I would also look for functions I can simply avoid calling. For
example, maybe I can just say BSTree only works with bstring keys,
and then I can just not use the callback system and remove
default_compare entirely.
At this point though, I only know that I want to look at BSTree_get to
improve it, and not the reason BSTree_get is slow. That is phase two of
the analysis.
42.2 Callgrind Annotating Source
I will next tell callgrind_annotate to dump out the bstree.c file and
annotate each line with the number of Ir it took. You get the annotated
source file by running:
__________________________________________________________________
Source 134: Callgrind Annotated BSTree_get
$ callgrind_annotate callgrind.out.1232 src/lcthw/bstree.c
...
__________________________________________________________________
Your output will have a big dump of the file's source, but I've cut out
the parts for BSTree_get and BSTree_getnode:
__________________________________________________________________
Source 135: Callgrind Annotated BSTree_get
--------------------------------------------------------------------------------
-- User-annotated source: src/lcthw/bstree.c
--------------------------------------------------------------------------------
     Ir
  2,453  static inline BSTreeNode *BSTree_getnode(BSTree *map, BSTreeNode *node, void *key)
      .  {
 61,853      int cmp = map->compare(node->key, key);
663,908  => src/lcthw/bstree.c:default_compare (14850x)
      .
 14,850      if(cmp == 0) {
      .          return node;
 24,794      } else if(cmp < 0) {
 30,623          if(node->left) {
      .              return BSTree_getnode(map, node->left, key);
      .          } else {
      .              return NULL;
      .          }
      .      } else {
 13,146          if(node->right) {
      .              return BSTree_getnode(map, node->right, key);
      .          } else {
      .              return NULL;
      .          }
      .      }
      .  }
      .
      .  void *BSTree_get(BSTree *map, void *key)
  4,912  {
 24,557      if(map->root == NULL) {
 14,736          return NULL;
      .      } else {
      .          BSTreeNode *node = BSTree_getnode(map, map->root, key);
  2,453          return node == NULL ? NULL : node->data;
      .      }
      .  }
__________________________________________________________________
Each line is shown with either the number of Ir (instructions) it ran,
or a period (.) to show that it's not important. What I'm looking for
is hotspots: lines with huge numbers of Ir that I can possibly
bring down. In this case, the output above shows that what
makes BSTree_getnode so expensive is that it calls default_compare,
which calls bstrcmp. I already know that bstrcmp is the worst running
function, so if I want to improve the speed of BSTree_getnode I should
work on that first.
I'll then look at bstrcmp the same way:
__________________________________________________________________
Source 136: Callgrind Annotated bstrcmp
 98,370  int bstrcmp (const_bstring b0, const_bstring b1) {
      .      int i, v, n;
      .
196,740      if (b0 == NULL || b1 == NULL || b0->data == NULL || b1->data == NULL ||
 32,790          b0->slen < 0 || b1->slen < 0) return SHRT_MIN;
 65,580      n = b0->slen; if (n > b1->slen) n = b1->slen;
 89,449      if (b0->slen == b1->slen && (b0->data == b1->data || b0->slen == 0))
      .          return BSTR_OK;
      .
 23,915      for (i = 0; i < n; i ++) {
163,642          v = ((char) b0->data[i]) - ((char) b1->data[i]);
      .          if (v != 0) return v;
      .          if (b0->data[i] == (unsigned char) '\0') return BSTR_OK;
      .      }
      .
      .      if (b0->slen > n) return 1;
      .      if (b1->slen > n) return -1;
      .      return BSTR_OK;
      .  }
__________________________________________________________________
The Ir counts for this function show two lines that take up most of the
execution. First, bstrcmp goes through a lot of trouble to make
sure that it is not given a NULL value. That's a good thing, so I want
to leave it alone, but I'd consider writing a different compare
function that was more "risky" and assumed it was never given a NULL.
The second is the loop that does the actual comparison; it seems
there's some optimization to be had in how the
characters of the two data buffers are compared.
42.3 Analyzing Memory Access With Cachegrind
What I want to do next is see how many times this bstrcmp function
accesses memory, either to read it or to write it. The tool for doing
that (and other things) is cachegrind, and you use it like this:
__________________________________________________________________
Source 137: Cachegrind On bstree_tests
$ valgrind --tool=cachegrind tests/bstree_tests
...
$ cg_annotate --show=Dr,Dw cachegrind.out.1316 | grep -v "???"
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 4194304 B, 64 B, 16-way associative
Command: tests/bstree_tests
Data file: cachegrind.out.1316
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Dr Dw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: off

--------------------------------------------------------------------------------
     Dr      Dw
--------------------------------------------------------------------------------
997,124 349,058 PROGRAM TOTALS

--------------------------------------------------------------------------------
     Dr      Dw file:function
--------------------------------------------------------------------------------
169,754  19,430 src/lcthw/bstrlib.c:bstrcmp
 67,548  27,428 src/lcthw/bstree.c:BSTree_get
 19,430  19,430 src/lcthw/bstree.c:default_compare
  5,420   2,383 src/lcthw/bstree.c:BSTree_delete
  2,000   4,200 src/lcthw/bstrlib.c:bformat
  1,600   2,800 src/lcthw/bstrlib.c:bfromcstralloc
  2,770   1,410 src/lcthw/bstree.c:BSTree_set
  1,200   1,200 src/lcthw/bstrlib.c:bdestroy

$
__________________________________________________________________
I tell valgrind to use the cachegrind tool, which runs bstree_tests and
produces a cachegrind.out.PID file, just like callgrind did. I then
use the program cg_annotate to get a similar output, but notice that
I'm telling it --show=Dr,Dw. This option says that I only want
the memory read (Dr) and memory write (Dw) counts for each function.
After that you get the usual header and then the Dr and Dw counts
for each file:function combination. I've edited this down so it shows
just the files, and I've removed any OS junk with | grep -v "???", so your
output may be a little different. What you see in my output is that
bstrcmp is the worst function for memory usage too, which is to be
expected since memory access is mostly all it does. I'm now going to
dump its annotated source to see where.
__________________________________________________________________
Source 138: Cachegrind Annotated bstrcmp
--------------------------------------------------------------------------------
-- User-annotated source: src/lcthw/bstrlib.c
--------------------------------------------------------------------------------
    Dr     Dw
     0 19,430  int bstrcmp (const_bstring b0, const_bstring b1) {
     .      .      int i, v, n;
     .      .
77,720      0      if (b0 == NULL || b1 == NULL || b0->data == NULL || b1->data == NULL ||
38,860      0          b0->slen < 0 || b1->slen < 0) return SHRT_MIN;
     0      0      n = b0->slen; if (n > b1->slen) n = b1->slen;
     0      0      if (b0->slen == b1->slen && (b0->data == b1->data || b0->slen == 0))
     .      .          return BSTR_OK;
     .      .
     0      0      for (i = 0; i < n; i ++) {
53,174      0          v = ((char) b0->data[i]) - ((char) b1->data[i]);
     .      .          if (v != 0) return v;
     .      .          if (b0->data[i] == (unsigned char) '\0') return BSTR_OK;
     .      .      }
     .      .
     .      .      if (b0->slen > n) return 1;
     .      .      if (b1->slen > n) return -1;
     .      .      return BSTR_OK;
     .      .  }
__________________________________________________________________
The surprising thing about this output is that the worst line of
bstrcmp isn't the character comparison like I thought. For memory
access it's the protective if-statement at the top that checks every
possible bad variable it could receive. That one if-statement does more
than twice as many memory accesses as the line that compares the
characters. If I were to make bstrcmp faster, I would definitely ditch
that check or do it once somewhere else.
Another option is to turn this check into an assert that only exists
when running in development, and then compile it out in production. I
now have enough evidence to say that this line is bad for this data
structure, so I can justify removing it.
What I don't want to do, however, is justify making this function less
defensive just to gain a few more cycles. In a real performance
improvement situation I would simply put this on a list and then dig
for other gains I could make in the program.
42.4 Judo Tuning
"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil."
(Donald Knuth)
In my opinion, this quote misses a major point about performance
tuning. Here Knuth is saying that when you performance tune matters:
if you do it at the beginning, then you'll cause all sorts of
problems. According to him, optimization should happen "sometime later",
or at least that's my guess. Who knows these days, really.
I'm going to declare this quote not necessarily wrong, but missing the
point, and instead I'm going to officially give my quote. You can quote
me on this:
"Use evidence to find the biggest optimizations that take the least
effort."
(Zed A. Shaw)
It doesn't matter when you try to optimize something; what matters is
how you figure out whether your optimization actually improved the
software, and how much effort you put into making it. With evidence you
can find the places in the code where just a little effort gets you big
improvements. Usually these places are just dumb decisions, as in
bstrcmp trying to check everything possible for a NULL value.
At a certain point you have tuned the code to where the only thing that
remains is tiny little micro-optimizations such as reorganizing
if-statements and special loops like Duff's Device. At this point, just
stop because there's a good chance that you'd gain more by redesigning
the software to just not do things.
This is something that programmers who are optimizing simply fail to
see. Many times the best way to do something faster is to find ways
to not do it at all. In the above analysis, I wouldn't try to make bstrcmp
faster; I'd try to find a way to not call bstrcmp so much. Maybe there's
a hashing scheme I can use that lets me do a sortable hash instead of
constantly doing bstrcmp. Maybe I can optimize it by trying the first
character first, and only calling bstrcmp when that isn't conclusive.
If after all that you can't do a redesign then start looking for little
micro-optimizations, but as you do them constantly confirm they improve
speed. Remember that the goal is to cause the biggest impact with the
least effort possible.
42.5 Using KCachegrind
The final section of this exercise is going to point you at a tool
called KCachegrind. This is a fantastic GUI for analyzing callgrind and
cachegrind output. I use it almost exclusively when I'm working on a
Linux or BSD computer, and I've actually switched to just coding on
Linux for projects because of KCachegrind.
Teaching you how to use it is outside the scope of this exercise, but
you should be able to understand how to use it after this exercise. The
output is nearly the same except KCachegrind lets you do the following:
1. Graphically browse the source and execution times doing various
sorts to find things to improve.
2. Analyze different graphs to visually see what's taking up the most
time and also what it is calling.
3. Look at the actual machine code assembler output so you can see
possible instructions that are happening, giving you more clues.
4. Visualize the jump patterns for loops and branches in the source
code, helping you find ways to optimize the code more easily.
You should spend some time getting KCachegrind installed and playing
with it.
42.6 Extra Credit
1. Read the callgrind manual and try some advanced options.
2. Read the cachegrind manual and also try some advanced options.
3. Use callgrind and cachegrind on all the unit tests and see if you
can find optimizations to make. Did you find some things that
surprised you? If not you probably aren't looking hard enough.
4. Use KCachegrind and see how it compares to doing the terminal
output like I'm doing here.
5. Now use these tools to do the Exercise 40 extra credits and
improvements.
__________________________________________________________________
Take An Online Video Course
You can sign up for a video course at:
http://www.udemy.com/learn-c-the-hard-way/
This course is currently being built at the same time that the book is
being built, but if you sign up now then you get early access to both
the videos and PDF of the book.
Related Books
You might want to check out these other books in the series:
1. Learn Ruby The Hard Way
2. Learn Regex The Hard Way
3. Learn SQL The Hard Way
4. Learn C The Hard Way
5. Learn Python The Hard Way
I'll be referencing other books shortly.
Copyright 2011 Zed A. Shaw. All Rights Reserved.