-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathlearn-c-the-hard-waych38.txt
602 lines (557 loc) · 20.5 KB
/
learn-c-the-hard-waych38.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
Learn C The Hard Way A Learn Code The Hard Way Book
* Book
* Comments
* Video Courses
* Related Books
[next] [prev] [prev-tail] [tail] [up]
Chapter 38
Exercise 37: Hashmaps
Hash Maps (Hashmaps, Hashes, or sometimes Dictionaries) are used
frequently in many dynamic programming for storing key/value data. A
Hashmap works by performing a "hashing" calculation on the keys to
produce an integer, then uses that integer to find a bucket to get or
set the value. It is a very fast practical data structure since it
works on nearly any data and they are easy to implement.
Here's an example of using a Hashmap (aka dict) in Python:
__________________________________________________________________
Source 115: ex37.py
1 fruit_weights = {'Apples': 10, 'Oranges': 100, 'Grapes': 1.0}
2
3 for key, value in fruit_weights.items():
4 print key, "=", value
__________________________________________________________________
Almost every modern language has something like this, so many people
end up writing code and never understand how this actually works. By
creating the Hashmap data structure in C I'll show you how this works.
I'll start with the header file so I can talk about the data structure.
__________________________________________________________________
Source 116: src/lcthw/hashmap.h
1 #ifndef _lcthw_Hashmap_h
2 #define _lcthw_Hashmap_h
3
4 #include <stdint.h>
5 #include <lcthw/darray.h>
6
7 #define DEFAULT_NUMBER_OF_BUCKETS 100
8
9 typedef int (*Hashmap_compare)(void *a, void *b);
10 typedef uint32_t (*Hashmap_hash)(void *key);
11
12 typedef struct Hashmap {
13 DArray *buckets;
14 Hashmap_compare compare;
15 Hashmap_hash hash;
16 } Hashmap;
17
18 typedef struct HashmapNode {
19 void *key;
20 void *data;
21 uint32_t hash;
22 } HashmapNode;
23
24 typedef int (*Hashmap_traverse_cb)(HashmapNode *node);
25
26 Hashmap *Hashmap_create(Hashmap_compare compare, Hashmap_hash);
27 void Hashmap_destroy(Hashmap *map);
28
29 int Hashmap_set(Hashmap *map, void *key, void *data);
30 void *Hashmap_get(Hashmap *map, void *key);
31
32 int Hashmap_traverse(Hashmap *map, Hashmap_traverse_cb traverse_cb)
;
33
34 void *Hashmap_delete(Hashmap *map, void *key);
35
36 #endif
__________________________________________________________________
The structure consists of a Hashmap that contains any number of
HashmapNode structs. Looking at Hashmap you can see that it is
structured like this:
DArray *buckets
A dynamic array that will be set to a fixed size of 100 buckets.
Each bucket will in turn contain a DArray that will actually
hold HashmapNode pairs.
Hashmap_compare compare
This is a comparison function that the Hashmap uses to actually
find elements by their key. It should work like all of the other
compare functions, and defaults to using bstrcmp so that keys
are just bstrings.
Hashmap_hash hash
This is the hashing function and it's responsible for taking a
key, processing its contents, and producing a single uint32_t
index number. You'll see the default one soon.
This almost tells you how the data is stored, but the bucketsDArray
isn't created yet. Just remember that it's kind of a two level mapping:
1. There are 100 buckets that make up the first level, and things are
in these buckets based on their hash.
2. Each bucket is a DArray that then contains HashmapNode structs
simply appended to the end as they're added.
The HashmapNode is then composed of these three elements:
void *key
The key for this key=value pair.
void *value
The value.
uint32_t hash
The calculated hash, which makes finding this node quicker since
we can just check the hash and skip any that don't match, only
checking they key if it's equal.
The rest of the header file is nothing new, so now I can show you the
implementation hashmap.c file:
__________________________________________________________________
Source 117: src/lcthw/hashmap.c
1 #undef NDEBUG
2 #include <stdint.h>
3 #include <lcthw/hashmap.h>
4 #include <lcthw/dbg.h>
5 #include <lcthw/bstrlib.h>
6
7 static int default_compare(void *a, void *b)
8 {
9 return bstrcmp((bstring)a, (bstring)b);
10 }
11
12 /**
13 * Simple Bob Jenkins's hash algorithm taken from the
14 * wikipedia description.
15 */
16 static uint32_t default_hash(void *a)
17 {
18 size_t len = blength((bstring)a);
19 char *key = bdata((bstring)a);
20 uint32_t hash = 0;
21 uint32_t i = 0;
22
23 for(hash = i = 0; i < len; ++i)
24 {
25 hash += key[i];
26 hash += (hash << 10);
27 hash ^= (hash >> 6);
28 }
29
30 hash += (hash << 3);
31 hash ^= (hash >> 11);
32 hash += (hash << 15);
33
34 return hash;
35 }
36
37
38 Hashmap *Hashmap_create(Hashmap_compare compare, Hashmap_hash hash)
39 {
40 Hashmap *map = calloc(1, sizeof(Hashmap));
41 check_mem(map);
42
43 map->compare = compare == NULL ? default_compare : compare;
44 map->hash = hash == NULL ? default_hash : hash;
45 map->buckets = DArray_create(sizeof(DArray *), DEFAULT_NUMBER_O
F_BUCKETS);
46 map->buckets->end = map->buckets->max; // fake out expanding it
47 check_mem(map->buckets);
48
49 return map;
50
51 error:
52 if(map) {
53 Hashmap_destroy(map);
54 }
55
56 return NULL;
57 }
58
59
60 void Hashmap_destroy(Hashmap *map)
61 {
62 int i = 0;
63 int j = 0;
64
65 if(map) {
66 if(map->buckets) {
67 for(i = 0; i < DArray_count(map->buckets); i++) {
68 DArray *bucket = DArray_get(map->buckets, i);
69 if(bucket) {
70 for(j = 0; j < DArray_count(bucket); j++) {
71 free(DArray_get(bucket, j));
72 }
73 DArray_destroy(bucket);
74 }
75 }
76 DArray_destroy(map->buckets);
77 }
78
79 free(map);
80 }
81 }
82
83 static inline HashmapNode *Hashmap_node_create(int hash, void *key,
void *data)
84 {
85 HashmapNode *node = calloc(1, sizeof(HashmapNode));
86 check_mem(node);
87
88 node->key = key;
89 node->data = data;
90 node->hash = hash;
91
92 return node;
93
94 error:
95 return NULL;
96 }
97
98
99 static inline DArray *Hashmap_find_bucket(Hashmap *map, void *key,
100 int create, uint32_t *hash_out)
101 {
102 uint32_t hash = map->hash(key);
103 int bucket_n = hash % DEFAULT_NUMBER_OF_BUCKETS;
104 check(bucket_n >= 0, "Invalid bucket found: %d", bucket_n);
105 *hash_out = hash; // store it for the return so the caller can
use it
106
107
108 DArray *bucket = DArray_get(map->buckets, bucket_n);
109
110 if(!bucket && create) {
111 // new bucket, set it up
112 bucket = DArray_create(sizeof(void *), DEFAULT_NUMBER_OF_B
UCKETS);
113 check_mem(bucket);
114 DArray_set(map->buckets, bucket_n, bucket);
115 }
116
117 return bucket;
118
119 error:
120 return NULL;
121 }
122
123
124 int Hashmap_set(Hashmap *map, void *key, void *data)
125 {
126 uint32_t hash = 0;
127 DArray *bucket = Hashmap_find_bucket(map, key, 1, &hash);
128 check(bucket, "Error can't create bucket.");
129
130 HashmapNode *node = Hashmap_node_create(hash, key, data);
131 check_mem(node);
132
133 DArray_push(bucket, node);
134
135 return 0;
136
137 error:
138 return -1;
139 }
140
141 static inline int Hashmap_get_node(Hashmap *map, uint32_t hash, DA
rray *bucket, void *key)
142 {
143 int i = 0;
144
145 for(i = 0; i < DArray_end(bucket); i++) {
146 debug("TRY: %d", i);
147 HashmapNode *node = DArray_get(bucket, i);
148 if(node->hash == hash && map->compare(node->key, key) == 0
) {
149 return i;
150 }
151 }
152
153 return -1;
154 }
155
156 void *Hashmap_get(Hashmap *map, void *key)
157 {
158 uint32_t hash = 0;
159 DArray *bucket = Hashmap_find_bucket(map, key, 0, &hash);
160 if(!bucket) return NULL;
161
162 int i = Hashmap_get_node(map, hash, bucket, key);
163 if(i == -1) return NULL;
164
165 HashmapNode *node = DArray_get(bucket, i);
166 check(node != NULL, "Failed to get node from bucket when it sh
ould exist.");
167
168 return node->data;
169
170 error: // fallthrough
171 return NULL;
172 }
173
174
175 int Hashmap_traverse(Hashmap *map, Hashmap_traverse_cb traverse_cb
)
176 {
177 int i = 0;
178 int j = 0;
179 int rc = 0;
180
181 for(i = 0; i < DArray_count(map->buckets); i++) {
182 DArray *bucket = DArray_get(map->buckets, i);
183 if(bucket) {
184 for(j = 0; j < DArray_count(bucket); j++) {
185 HashmapNode *node = DArray_get(bucket, j);
186 rc = traverse_cb(node);
187 if(rc != 0) return rc;
188 }
189 }
190 }
191
192 return 0;
193 }
194
195 void *Hashmap_delete(Hashmap *map, void *key)
196 {
197 uint32_t hash = 0;
198 DArray *bucket = Hashmap_find_bucket(map, key, 0, &hash);
199 if(!bucket) return NULL;
200
201 int i = Hashmap_get_node(map, hash, bucket, key);
202 if(i == -1) return NULL;
203
204 HashmapNode *node = DArray_get(bucket, i);
205 void *data = node->data;
206 free(node);
207
208 HashmapNode *ending = DArray_pop(bucket);
209
210 if(ending != node) {
211 // alright looks like it's not the last one, swap it
212 DArray_set(bucket, i, ending);
213 }
214
215 return data;
216 }
__________________________________________________________________
There's nothing very complicated in the implementation, but the
default_hash and Hashmap_find_bucket functions will need some
explanation. When you use Hashmap_create you can pass in any compare
and hash functions you want, but if you don't it uses the
default_compare and default_hash functions.
The first thing to look at is how default_hash does its thing. This is
a simple hash function called a "Jenkins hash" after Bob Jenkins. I got
if from the Wikipedia page for the algorithm. It simply goes through
each byte of the key to hash (a bstring) and works the bits so that the
end result is a single uint32_t. It does this with some adding and xor
operations.
There are many different hash functions, all with different properties,
but once you have one you need a way to use it to find the right
buckets. The Hashmap_find_bucket does it like this:
1. First it calls map->hash(key) to get the hash for the key.
2. It then finds the bucket using hash % DEFAULT_NUMBER_OF_BUCKETS,
that way every hash will always find some bucket no matter how big
it is.
3. It then gets the bucket, which is also a DArray, and if it's not
there it will create it. That depends on if the create variable
says too.
4. Once it has found the DArray bucket for the right hash, it returns
it, and also the hash_out variable is used to give the caller the
hash that was found.
All of the other functions then use Hashmap_find_bucket to do their
work:
1. Setting a key/value involves finding the bucket, then making a
HashmapNode, and then adding it to the bucket.
2. Getting a key involves finding the bucket, then finding the
HashmapNode that matches the hash and key you want.
3. Deleting an item again finds the bucket, finds where the requested
node is, and then removes it by swapping the last node into its
place.
The only other function that you should study is the Hashmap_travers.
This simply walks every bucket, and for any bucket that has possible
values, it calls the traverse_cb on each value. This is how you scan a
whole Hashmap for its values.
38.0.1 The Unit Test
Finally you have the unit test that is testing all of these operations:
__________________________________________________________________
Source 118: tests/hashmap_tests.c
1 #include "minunit.h"
2 #include <lcthw/hashmap.h>
3 #include <assert.h>
4 #include <lcthw/bstrlib.h>
5
6 Hashmap *map = NULL;
7 static int traverse_called = 0;
8 struct tagbstring test1 = bsStatic("test data 1");
9 struct tagbstring test2 = bsStatic("test data 2");
10 struct tagbstring test3 = bsStatic("xest data 3");
11 struct tagbstring expect1 = bsStatic("THE VALUE 1");
12 struct tagbstring expect2 = bsStatic("THE VALUE 2");
13 struct tagbstring expect3 = bsStatic("THE VALUE 3");
14
15 static int traverse_good_cb(HashmapNode *node)
16 {
17 debug("KEY: %s", bdata((bstring)node->key));
18 traverse_called++;
19 return 0;
20 }
21
22
23 static int traverse_fail_cb(HashmapNode *node)
24 {
25 debug("KEY: %s", bdata((bstring)node->key));
26 traverse_called++;
27
28 if(traverse_called == 2) {
29 return 1;
30 } else {
31 return 0;
32 }
33 }
34
35
36 char *test_create()
37 {
38 map = Hashmap_create(NULL, NULL);
39 mu_assert(map != NULL, "Failed to create map.");
40
41 return NULL;
42 }
43
44 char *test_destroy()
45 {
46 Hashmap_destroy(map);
47
48 return NULL;
49 }
50
51
52 char *test_get_set()
53 {
54 int rc = Hashmap_set(map, &test1, &expect1);
55 mu_assert(rc == 0, "Failed to set &test1");
56 bstring result = Hashmap_get(map, &test1);
57 mu_assert(result == &expect1, "Wrong value for test1.");
58
59 rc = Hashmap_set(map, &test2, &expect2);
60 mu_assert(rc == 0, "Failed to set test2");
61 result = Hashmap_get(map, &test2);
62 mu_assert(result == &expect2, "Wrong value for test2.");
63
64 rc = Hashmap_set(map, &test3, &expect3);
65 mu_assert(rc == 0, "Failed to set test3");
66 result = Hashmap_get(map, &test3);
67 mu_assert(result == &expect3, "Wrong value for test3.");
68
69 return NULL;
70 }
71
72 char *test_traverse()
73 {
74 int rc = Hashmap_traverse(map, traverse_good_cb);
75 mu_assert(rc == 0, "Failed to traverse.");
76 mu_assert(traverse_called == 3, "Wrong count traverse.");
77
78 traverse_called = 0;
79 rc = Hashmap_traverse(map, traverse_fail_cb);
80 mu_assert(rc == 1, "Failed to traverse.");
81 mu_assert(traverse_called == 2, "Wrong count traverse for fail.
");
82
83 return NULL;
84 }
85
86 char *test_delete()
87 {
88 bstring deleted = (bstring)Hashmap_delete(map, &test1);
89 mu_assert(deleted != NULL, "Got NULL on delete.");
90 mu_assert(deleted == &expect1, "Should get test1");
91 bstring result = Hashmap_get(map, &test1);
92 mu_assert(result == NULL, "Should delete.");
93
94 deleted = (bstring)Hashmap_delete(map, &test2);
95 mu_assert(deleted != NULL, "Got NULL on delete.");
96 mu_assert(deleted == &expect2, "Should get test2");
97 result = Hashmap_get(map, &test2);
98 mu_assert(result == NULL, "Should delete.");
99
100 deleted = (bstring)Hashmap_delete(map, &test3);
101 mu_assert(deleted != NULL, "Got NULL on delete.");
102 mu_assert(deleted == &expect3, "Should get test3");
103 result = Hashmap_get(map, &test3);
104 mu_assert(result == NULL, "Should delete.");
105
106 return NULL;
107 }
108
109 char *all_tests()
110 {
111 mu_suite_start();
112
113 mu_run_test(test_create);
114 mu_run_test(test_get_set);
115 mu_run_test(test_traverse);
116 mu_run_test(test_delete);
117 mu_run_test(test_destroy);
118
119 return NULL;
120 }
121
122 RUN_TESTS(all_tests);
__________________________________________________________________
The only thing to learn about this unit test is that at the top I use a
feature of bstring to create static strings to work with in the tests.
I use the tagbstring and bsStatic to create them on lines 7-13.
38.1 How To Improve It
This is a very simple implementation of Hashmap as are most of the
other data structures in this book. My goal isn't to give you insanely
great hyper speed well tuned data structures. Usually those are much
too complicated to discuss and only distract you from the real basic
data structure at work. My goal is to give you an understandable
starting point to then improve it or understand how they are
implemented.
In this case, there's some things you can do with this implementation:
1. You can use a sort on each bucket so that they are always sorted.
This increases your insert time, but decreases your find time
because you can then use a binary search to find each node. Right
now it's looping through all of the nodes in a bucket just to find
one.
2. You can dynamically size the number of buckets, or let the caller
specify the number for each Hashmap created.
3. You can use a better default_hash. There are tons of them.
4. This (and nearly every Hashmap is vulnerable to someone picking
keys that will fill only one bucket, and then tricking your program
into processing them. This then makes your program run slower
because it changes from processing a Hashmap to effectively
processing a single DArray. If you sort the nodes in the bucket
this helps, but you can also use better hashing functions, and for
the really paranoid add a random salt so that keys can't be
predicted.
5. You could have it delete buckets that are empty of nodes to save
space, or put empty buckets into a cache so you save on creating
and destroying them.
6. Right now it just adds elements even if they already exist. Write
an alternative set method that only adds it if it isn't set
already.
As usual you should go through each function and make it bullet proof.
The Hashmap could also use a debug setting for doing an invariant
check.
38.2 Extra Credit
1. Research the Hashmap implementation of your favorite programming
language to see what features they have.
2. Find out what the major disadvantages of a Hashmap are and how to
avoid them. For example, they do not preserve order without special
changes and they don't work when you need to find things based on
parts of keys.
3. Write a unit test that demonstrates the defect of filling a Hashmap
with keys that land in the same bucket, then test how this impact
performance. A good way to do this is to just reduce the number of
buckets to something stupid like 5.
[next] [prev] [prev-tail] [front] [up]
__________________________________________________________________
Please enable JavaScript to view the comments powered by Disqus.
Take An Online Video Course
You can sign up for a video course at:
http://www.udemy.com/learn-c-the-hard-way/
This course is currently being built at the same time that the book is
being built, but if you sign up now then you get early access to both
the videos and PDF of the book.
Related Books
You might want to check out these other books in the series:
1. Learn Ruby The Hard Way
2. Learn Regex The Hard Way
3. Learn SQL The Hard Way
4. Learn C The Hard Way
5. Learn Python The Hard Way
I'll be referencing other books shortly.
Copyright 2011 Zed A. Shaw. All Rights Reserved.