<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>CS 2150: 06-hashes slide set</title>
    <meta name="description" content="A set of slides for a course on Program and Data Representation">
    
    <meta name="apple-mobile-web-app-capable" content="yes" />
    <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
    <link rel="stylesheet" href="../slides/reveal.js/css/reveal.css">
    <link rel="stylesheet" href="../slides/reveal.js/css/theme/black.css" id="theme">
    <link rel="stylesheet" href="../slides/css/pdr.css">
    <!-- Code syntax highlighting -->
    <link rel="stylesheet" href="../slides/reveal.js/lib/css/zenburn.css">
    <!-- Printing and PDF exports -->
    <script>
      var link = document.createElement( 'link' );
      link.rel = 'stylesheet';
      link.type = 'text/css';
      link.href = window.location.search.match( /print-pdf/gi ) ? '../slides/reveal.js/css/print/pdf.css' : '../slides/reveal.js/css/print/paper.css';
      document.getElementsByTagName( 'head' )[0].appendChild( link );
    </script>
    <!--[if lt IE 9]>
	<script src="../slides/reveal.js/lib/js/html5shiv.js"></script>
	<![endif]-->
    <script type="text/javascript" src="../slides/js/dhtmlwindow.js"></script>
    <script type="text/javascript" src="../slides/js/canvas.js"></script>
    <link rel="stylesheet" href="../slides/css/dhtmlwindow.css" type="text/css">
    <style>.reveal li { font-size:93%; line-height:120%; }</style>
  </head>

  <body onload="canvasinit()">
    <div id="dhtmlwindowholder"><span style="display:none"></span></div>

    <div class="reveal">

      <!-- Any section element inside of this container is displayed as a slide -->
      <div class="slides">

	<section data-markdown id="cover"><script type="text/template">
# CS 2150
&nbsp;
### Program and Data Representation
&nbsp;
<center><small><a href="http://www.cs.virginia.edu/~asb">Aaron Bloomfield</a> (aaron@virginia.edu)<br><a href="http://www.cs.virginia.edu/~nn4pj">Rich Nguyen</a> (nn4pj@virginia.edu)<br><a href="http://www.cs.virginia.edu/~mrf8t">Mark Floryan</a> (mrf8t@virginia.edu)</small></center>
<center><small><a href="http://github.com/uva-cs/pdr">@github</a> | <a href="index.html">&uarr;</a> | <a href="daily-announcements.html?print-pdf"><img class="print" width="20" src="../slides/images/print-icon.png"></a></small></center>

&nbsp;  
&nbsp;  
## Hash Tables
	</script></section>
    <section>
<h2>CS 2150 Roadmap</h2>
<table class="wide">
  <tr><td colspan="3"><p class="center">Data Representation</p></td><td></td><td colspan="3"><p class="center">Program Representation</p></td></tr>
  <tr>
    <td class="top"><small>&nbsp;<br>&nbsp;<br>string<br>&nbsp;<br>&nbsp;<br>&nbsp;<br>int x[3]<br>&nbsp;<br>&nbsp;<br>&nbsp;<br>char x<br>&nbsp;<br>&nbsp;<br>&nbsp;<br>0x9cd0f0ad<br>&nbsp;<br>&nbsp;<br>&nbsp;<br>01101011</small></td>
    <!-- image adapted from http://openclipart.org/detail/3677/arrow-left-right-by-torfnase -->
    <td><img class="noborder" src="images/red-double-arrow.png" height="500" alt="vertical red double arrow"></td>
    <td class="top">&nbsp;<br>Objects<br>&nbsp;<br>Arrays<br>&nbsp;<br>Primitive types<br>&nbsp;<br>Addresses<br>&nbsp;<br>bits</td>
    <td>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</td>
    <td class="top"><small>&nbsp;<br>&nbsp;<br>Java code<br>&nbsp;<br>&nbsp;<br>C++ code<br>&nbsp;<br>&nbsp;<br>C code<br>&nbsp;<br>&nbsp;<br>x86 code<br>&nbsp;<br>&nbsp;<br>IBCM<br>&nbsp;<br>&nbsp;<br>hexadecimal</small></td>
    <!-- image adapted from http://openclipart.org/detail/3677/arrow-left-right-by-torfnase -->
    <td><img class="noborder" src="images/green-double-arrow.png" height="500" alt="vertical green double arrow"></td>
    <td class="top">&nbsp;<br>High-level language<br>&nbsp;<br>Low-level language<br>&nbsp;<br>Assembly language<br>&nbsp;<br>Machine code</td>
  </tr>
</table>
    </section>
	<section data-markdown><script type="text/template">
# Contents
&nbsp;  
[ADTs Covered So Far](#/adtssofar)  
[Hash Tables](#/hashtables)  
[Separate Chaining](#/separatechaining)  
[Open Addressing](#/openaddressing)  
[Miscellaneous](#/miscellaneous)  
	</script></section>


	<section>

	  <section id="adtssofar" data-markdown><script type="text/template">
# ADTs Covered So Far
	  </script></section>

	  <section data-markdown><script type="text/template">
## Lists
- Operations
  - find
  - insert
  - remove
  - findKth
- Implementations
  - Array (vector)
  - Linked list
	  </script></section>

	  <section data-markdown><script type="text/template">
## Lists
| | Array (vector) | Linked List |
|-|-|-|
| find | &Theta;(*n*) | &Theta;(*n*) |
| insert | &Theta;(*n*) worst case,<br>but often &Theta;(1) | &Theta;(1) |
| remove | &Theta;(*n*) | &Theta;(*n*) |
| findKth | &Theta;(1) | &Theta;(*n*) |

<center>The operations are <i>generally</i> linear-time operations</center>
	  </script></section>

	  <section data-markdown><script type="text/template">
## Stacks
- List with data handled last-in first-out
- Operations:
  - push
  - pop
  - top
- Implementations
  - Array (vector)
  - Linked list
	  </script></section>

	  <section data-markdown><script type="text/template">
## Stacks
| | Array (vector) | Linked List |
|-|-|-|
| push | &Theta;(*n*) worst case,<br>but often &Theta;(1) | &Theta;(1) |
| pop | &Theta;(1) | &Theta;(1) |
| top | &Theta;(1) | &Theta;(1) |

<center>The operations are <i>generally</i> constant-time operations</center>
	  </script></section>

	  <section data-markdown><script type="text/template">
## Queues
- First-in first-out list
- Operations:
  - enqueue
  - dequeue
- Implementations
  - Array (vector)
  - Linked lists
	  </script></section>

	  <section data-markdown><script type="text/template">
## Queues
| | Array (vector) | Linked List |
|-|-|-|
| enqueue | &Theta;(*n*) worst case,<br>but often &Theta;(1) | &Theta;(1) |
| dequeue | &Theta;(1) | &Theta;(1) |

<center>The operations are <i>generally</i> constant-time operations</center>
	  </script></section>

	  <section data-markdown><script type="text/template">
## Trees
- Goal is &Theta;(log *n*) runtime for most operations
  - Binary search trees
  - AVL Trees
  - Red-black trees
  - Splay trees
	  </script></section>

	  <section data-markdown><script type="text/template">
## Trees
| | BST | AVL | Red-black | Splay |
|-|-|-|-|-|
| find | &Theta;(*h*), where<br>log *n* < *h* &le; *n*-1;<br>worst case is &Theta;(*n*) | &Theta;(log *n*) | &Theta;(log *n*) | &Theta;(log *n*)<br>amortized |
| insert | &Theta;(*h*), where<br>log *n* < *h* &le; *n*-1;<br>worst case is &Theta;(*n*) | &Theta;(log *n*) | &Theta;(log *n*) | &Theta;(log *n*)<br>amortized |
| remove | &Theta;(*h*), where<br>log *n* < *h* &le; *n*-1;<br>worst case is &Theta;(*n*) | &Theta;(log *n*) | &Theta;(log *n*) | &Theta;(log *n*)<br>amortized |

<center>Operations on balanced trees are <i>generally</i> logarithmic-time</center>
	  </script></section>

	  <section data-markdown><script type="text/template">
## Is There Anything Faster?
- Fastest possible search using binary comparison: &Theta;(log *n*)
  - Rephrased: binary comparison searches are &Omega;(log *n*)
- We can do better: (almost) constant (&Theta;(1)) is possible with *hash tables*
- Hash tables (lookup table)
  - Standard set of operations: find, insert, delete
  - No ordering property!
    - Thus, no findMin or findMax
	  </script></section>

	</section>


	<section>

	  <section id="hashtables" data-markdown><script type="text/template">
# Hash Tables
	  </script></section>

	  <section data-markdown><script type="text/template">
## Key-value pairs
- Hash tables store key-value pairs
  - Each value has a specific key associated with it
  - Keys and values need not be the same type!
- Examples
  - Definitions: "set", "v.tr. 1. To put in a specified position..."
  - UVa e-mail redirects: "aaron@", "asb2t@cms.virginia.edu"
  - Anything that can be stored in a tree
    - Userid / IDnum pairs
    - Userid / lots_of_info_about_them_in_an_object pairs
	  </script></section>

	  <section>
<h2>Hash Tables</h2>
<table class="transparent"><tr><td>
<table class="transparent"><tr><td>
<div style="font-size:130%;line-height:110%">
<ul>
<li>Hash table<ul>
    <li>fixed-size <i>array</i>; the size is usually a prime number<ul>
	<li>Should be larger than the number of elements</li></ul></li></ul></li>
<li>Given a key space:</li>
</ul>
</div>
</td></tr>
<tr><td>
    <table class="transparent"><tr><td><img alt="blob" src="images/06-hashes/blob.png"></td><td class="middle">
	  <table class="transparent"><tr><td><div style="font-size:130%;line-height:110%">hash function</div></td></tr><tr><td><div style="font-size:130%;line-height:110%"><i>hash</i>(<i>k</i>)</div></td></tr><tr><td><div style="font-size:200%">&rarr;</div></td></tr></table>
    </td></tr></table>
</td></tr></table>
</td><td class="middle">
<table class="transparent">
<tr><td>&nbsp;</td><td></td></tr>
<tr><td>&nbsp;</td><td></td></tr>
<tr><td style="text-align:right;">hash</td><td style="text-align:left;">table</td></tr>
<tr><td>&nbsp;</td><td></td></tr>
<tr><td>0</td><td class="border" style="width:100px"></td></tr>
<tr><td>1</td><td class="border" style="width:100px"></td></tr>
<tr><td>2</td><td class="border" style="width:100px"></td></tr>
<tr><td>&nbsp;</td><td class="border" style="width:100px"></td></tr>
<tr><td>...</td><td class="border" style="width:100px"></td></tr>
<tr><td>&nbsp;</td><td class="border" style="width:100px"></td></tr>
<tr><td>tablesize&#8209;1</td><td class="border" style="border-bottom:medium solid;"></td></tr>
</table>
</td></tr></table>
	  </section>

	  <section data-markdown><script type="text/template">
## Hash function
- A hash function takes in a "thing"...
  - string, int, object, etc.
  - and returns an *unsigned* integer value
    - which is then mod'ed by the size of the hash table to yield a spot within the bounds of the hash table array
- Three *required* properties
  - Must be *deterministic*
    - Meaning it must return the same value each time for the same "thing"
  - Must be *fast*
  - Must be *evenly distributed*
  - Technically, only the first is required for *correctness*, but the other two are required for fast running times
	  </script></section>
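
	  <section data-markdown><script type="text/template">
## Hash function: a C++ sketch
A minimal sketch of the two steps just described, assuming string keys: compute an unsigned hash value, then mod it by the table size. The name `tableIndex` is made up for illustration; `std::hash` is the standard library's hash, used here only as a stand-in for whatever hash function you write.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: maps a key to a spot in a table of tableSize cells.
// Assumes tableSize > 0.
std::size_t tableIndex(const std::string& key, std::size_t tableSize) {
    std::size_t h = std::hash<std::string>{}(key);  // unsigned hash value
    return h % tableSize;                            // always in [0, tableSize - 1]
}
```

Because the hash value is unsigned, the mod can never produce a negative array index.
	  </script></section>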

	  <section>
<h2>Hash functions KLA</h2>
<ul>
<li>I'm going to hash all of you into 10 buckets (0-9) by your birthday<ul>
    <li>(you are welcome to make up another birthday, as long as you are consistent)</li></ul></li>
<li>The hash functions:<ul>
<li class="fragment" data-fragment-index="1">By the decade of your birth year<ul>
<li class="fragment" data-fragment-index="1"><i>hash</i>(<i>birthday</i>) = (<i>year</i>/10) % 10</li></ul></li>
<li class="fragment" data-fragment-index="2">By the last digit of your birth year<ul>
<li class="fragment" data-fragment-index="2"><i>hash</i>(<i>birthday</i>) = <i>year</i> % 10</li></ul></li>
<li class="fragment" data-fragment-index="3">By the last digit of your birth month<ul>
<li class="fragment" data-fragment-index="3"><i>hash</i>(<i>birthday</i>) = <i>month</i> % 10</li></ul></li>
<li class="fragment" data-fragment-index="4">By the last digit of your birth day<ul>
<li class="fragment" data-fragment-index="4"><i>hash</i>(<i>birthday</i>) = <i>day</i> % 10</li></ul></li>
</ul></li></ul>
	  </section>

	  <section>
<h2>Keys</h2>
<table class="wide">
  <tr>
    <td class="top">
<div style="width:400px;font-size:130%;line-height:110%">
<ul>
<li>How can we hash the keys if the keys can be anything?</li>
<li>The best a binary comparison can do is eliminate half of the elements, which gives &Theta;(log <i>n</i>)</li>
<li>We want &Theta;(1)</li>
<li>All keys are ultimately stored as bits, so we can do better!</li>
</ul></div>
</td>
<td style="width:75px"></td>
<td class="top">
&nbsp;<br><small>&quot;Hello&quot;</small><br>&nbsp;<br><small>['H','i',\0]</small><br>&nbsp;<br><small>3.14</small><br>&nbsp;<br><small>'x'</small><br>&nbsp;<br><small>0x42381a</small><br>&nbsp;<br><small>01001010</small></td>
    <!-- image adapted from http://openclipart.org/detail/3677/arrow-left-right-by-torfnase -->
    <td style="width:150px;vertical-align:top"><img class="noborder" src="images/red-double-arrow.png" height="500" alt="vertical red double arrow"></td>
    <td class="top">&nbsp;<br>&nbsp;<br><small>Objects</small><br>&nbsp;<br><small>Arrays</small><br>&nbsp;<br><small>Primitive types</small><br>&nbsp;<br><small>Addresses</small><br>&nbsp;<br><small>bits</small></td>
    <td>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</td>
  </tr>
</table>
	  </section>

	  <section data-markdown><script type="text/template">
## Lookup Table
| hash(key) | key |
|-|-|
| 000000 | "red" |
| 000001 | "orange" |
| 000010 | "blue" |
| 000011 | `null` |
| 000100 | "green" |
| 000101 | ... |
&nbsp;  

This can work, unless the key space is sparse, or we don't know the keys ahead of time.  But it's slow to look up a value in a table!
	  </script></section>

	  <section>
<h2>Example</h2>
<table class="transparent"><tr><td>
<div style="font-size:120%">
<ul>
<li>Key space: integers<br>&nbsp;</li>
<li>Table size: 10<br>&nbsp;</li>
<li><i>hash</i>(<i>k</i>) = <i>k</i> mod 10<ul>
    <li>Technically, <i>hash</i>(<i>k</i>) = <i>k</i>,<br>which is <i>then</i> mod'ed by<br>the table size of 10<br>&nbsp;</li>
</ul></li>
<li>Insert: 7, 18, 41, 34<br>&nbsp;</li>
<li>How do we find them?</li>
</ul></div>
</td><td style="width:100px"></td><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"></td></tr>
<tr><td>1</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="3">41</span></td></tr>
<tr><td>2</td><td class="border" style="width:100px"></td></tr>
<tr><td>3</td><td class="border" style="width:100px"></td></tr>
<tr><td>4</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">34</span></td></tr>
<tr><td>5</td><td class="border" style="width:100px"></td></tr>
<tr><td>6</td><td class="border" style="width:100px"></td></tr>
<tr><td>7</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="1">7</span></td></tr>
<tr><td>8</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">18</span></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;"></td></tr>
</table>
</td></tr></table>
	  </section>

	  <section data-markdown><script type="text/template">
## Table size issues...
- Why not just have a table of size 100?
  - And map each key directly to the location corresponding to its value?
- We assume that the key space is too large
  - Example: mapping social security numbers for students at UVa
    - There are not 999,999,999 students at UVa, even if taken across all time
- Do you see why find max and find min are not easy?
  - We have not preserved any ordering info
	  </script></section>

	  <section>
<h2>Another Example</h2>
<table class="transparent"><tr><td>
<div style="font-size:120%">
<ul>
<li>Key space: integers<br>&nbsp;</li>
<li>Table size: 6<br>&nbsp;</li>
<li><i>hash</i>(<i>k</i>) = <i>k</i> mod 6<br>&nbsp;</li>
<li>Insert: 7, 18, 41, 34, <span class='red'>12</span><br>&nbsp;</li>
<li>How do we find them?</li>
</ul></div>
</td><td style="width:100px"></td><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">18</span><span class="fragment" data-fragment-index="5"><span class="red"> 12</span></span></td></tr>
<tr><td>1</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="1">7</span></td></tr>
<tr><td>2</td><td class="border" style="width:100px"></td></tr>
<tr><td>3</td><td class="border" style="width:100px"></td></tr>
<tr><td>4</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">34</span></td></tr>
<tr><td>5</td><td class="border" style="border-bottom:medium solid;"><span class="fragment" data-fragment-index="3">41</span></td></tr>
</table>
</td></tr></table>
	  </section>

	  <section data-markdown><script type="text/template">
## Hash Table
- Hash function: *hash*: *key* &rarr; [0, *m*-1]
  - Really to any unsigned integer, which is then mod'ed by *m*, the table size
- Here, *hash*(*key*) = `firstletter`(*key*)<br>&nbsp;

| Location | Key | Value |
|-|-|-|
| 0 | "Alice" | "red" |
| 1 | "Bob" | "orange" |
| 2 | "Colleen" | "blue" |
| 3 | `null` | `null` |
| 4 | "Eve" | "green" |
| ... | ... | ... |
| *m*-1 | "Zeus" | "purple" |
	  </script></section>

	  <section data-markdown><script type="text/template">
## Hash Functions
- Required properties described earlier
  - Must be deterministic
  - Must be fast
  - Must be evenly distributed
    - This implies avoiding collisions
- A perfect hash function has:
  - No blanks (i.e., no empty cells)
  - No collisions
	  </script></section>

	  <section>
<h2>Sample String Hash Functions</h2>
<ul>
<li>Key space: strings</li>
<li>A string <i>s</i> is made up of characters <i>s<sub>i</sub></i></li>
<li>\( s = s_0s_1s_2s_3\ldots s_{k-1} \)</li>
</ul>
<p>&nbsp;</p>
<ol>
<li class="fragment">\( hash(s) = s_0 \mod table\_size \)<br>&nbsp;</li>
<li class="fragment">\( hash(s) = \left( \sum_{i=0}^{k-1}s_i \right) \mod table\_size \)<br>&nbsp;</li>
<li class="fragment">\( hash(s) = \left( \sum_{i=0}^{k-1}s_i*37^i \right) \mod table\_size \)<br>&nbsp;</li>
</ol>
	  </section>
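
	  <section data-markdown><script type="text/template">
## Sample String Hash Functions in C++
The three formulas can be sketched in C++ roughly as follows. The function names are made up for illustration, and hashing the empty string to 0 is an assumption, since the formulas do not cover that case.

```cpp
#include <string>

// 1. First character only.
unsigned int hashFirst(const std::string& s, unsigned int tableSize) {
    if (s.empty()) return 0;  // assumption: the empty string hashes to 0
    return static_cast<unsigned char>(s[0]) % tableSize;
}

// 2. Sum of all characters.
unsigned int hashSum(const std::string& s, unsigned int tableSize) {
    unsigned int sum = 0;
    for (unsigned char c : s) sum += c;
    return sum % tableSize;
}

// 3. Sum of s_i * 37^i. Unsigned overflow wraps around, which is
//    deterministic, so the same string always gets the same value.
unsigned int hashPoly(const std::string& s, unsigned int tableSize) {
    unsigned int sum = 0;
    unsigned int power = 1;  // holds 37^i
    for (unsigned char c : s) {
        sum += c * power;
        power *= 37;
    }
    return sum % tableSize;
}
```

Note that `hashSum` gives anagrams such as "ab" and "ba" the same value, while `hashPoly` usually does not.
	  </script></section>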

	  <section data-markdown><script type="text/template">
## Hash function notes
- They should always return an *unsigned* int
  - Otherwise your program will be trying to find a negative array index
- Integer overflow is fine, as long as it overflows *deterministically*
  - Meaning the same way each time
  - Overflow is especially likely with the last of the string hash functions presented on the previous slide
	  </script></section>

	  <section data-markdown><script type="text/template">
## Collision Resolution
- Collision: when two keys map to the same location in the hash table
- Two primary ways to resolve collisions:
  1. Separate Chaining (make each spot in the table a 'bucket' or a collection)
  2. Open Addressing, of which there are 3 types:
     - Linear probing
     - Quadratic probing
     - Double hashing
	  </script></section>

	</section>


	<section>

	  <section id="separatechaining" data-markdown><script type="text/template">
# Separate Chaining
	  </script></section>

	  <section>
<h2>Separate Chaining</h2>
<table class="transparent"><tr><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"></td></tr>
<tr><td>1</td><td class="border" style="width:100px"></td></tr>
<tr><td>2</td><td class="border" style="width:100px"></td></tr>
<tr><td>3</td><td class="border" style="width:100px"></td></tr>
<tr><td>4</td><td class="border" style="width:100px"></td></tr>
<tr><td>5</td><td class="border" style="width:100px"></td></tr>
<tr><td>6</td><td class="border" style="width:100px"></td></tr>
<tr><td>7</td><td class="border" style="width:100px"></td></tr>
<tr><td>8</td><td class="border" style="width:100px"></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;width:100px"></td></tr>
</table>
</td><td style="width:200px"></td><td class="top">
<div style="font-size:120%;line-height:110%">
<ul>
<li>All keys that map to the same hash value are kept in a &quot;bucket&quot;<ul>
  <li>This &quot;bucket&quot; is another data structure, typically a linked list</li></ul><br>&nbsp;</li>
<li><i>hash</i>(<i>k</i>) = <i>k</i> mod 10<br>&nbsp;</li>
<li>Insert: 10, 22, 107, 12, 42</li>
</ul></div>
</td></tr></table>
<script type="text/javascript">insertCanvas();</script>
	  </section>
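
	  <section data-markdown><script type="text/template">
## Separate Chaining: a C++ sketch
A minimal sketch of the idea, assuming unsigned integer keys and *hash*(*k*) = *k* mod *table_size*. The class `ChainedTable` and its method names are hypothetical, not part of any library.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical separate-chaining table: each cell holds a linked list (bucket).
// Assumes the table size is greater than zero.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t tableSize) : buckets(tableSize) {}

    // Put the key on the front of its bucket.
    void insert(unsigned int key) {
        buckets[key % buckets.size()].push_front(key);
    }

    // Walk the one bucket the key could be in.
    bool find(unsigned int key) const {
        for (unsigned int k : buckets[key % buckets.size()]) {
            if (k == key) return true;
        }
        return false;
    }

    std::size_t bucketSize(std::size_t i) const { return buckets[i].size(); }

private:
    std::vector<std::list<unsigned int>> buckets;
};
```

Inserting 10, 22, 107, 12, 42 into a table of size 10 leaves one key in bucket 0, three in bucket 2, and one in bucket 7.
	  </script></section>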

	  <section data-markdown><script type="text/template">
## Analysis of find
- Definition: The *load factor*, &lambda;, of a hash table is the number of elements divided by the table size
- For separate chaining, &lambda; is the average number of elements in a bucket
  - Average time on unsuccessful find: &lambda;
    - Average length of a list at *hash*(*k*)
  - Average time on successful find: 1 + (&lambda;/2)
    - One node, plus half the average length of a list (not including the item)
	  </script></section>
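
	  <section>
<h2>Load Factor: Worked Example</h2>
<p>A rough illustration of the formulas, reusing the sizes from the earlier separate chaining example and assuming the hash function distributes keys evenly:</p>
<ul>
<li>5 elements in a table of size 10: \( \lambda = 5 / 10 = 0.5 \)</li>
<li>Average unsuccessful find: \( \lambda = 0.5 \) nodes examined</li>
<li>Average successful find: \( 1 + \lambda / 2 = 1 + 0.25 = 1.25 \) nodes examined</li>
</ul>
	  </section>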

	  <section data-markdown><script type="text/template">
## Load factor
- How big should we make the hash table?
  - Well, we want "constant" time for find and insert...
- Possible sizes for hash table with separate chaining
  - &lambda; = 1
    - Make the table size the expected number of elements; average bucket size is 1
    - Also make it a prime number
  - &lambda; = 0.75
    - The default for Java's [Hashtable](http://docs.oracle.com/javase/7/docs/api/java/util/Hashtable.html), but it can be set to another value
    - Table will always be bigger than number of elements
      - This reduces the chance of a collision!
    - Good trade-off between memory use and time
  - &lambda; = 0.5
    - Uses more memory, but fewer collisions
	  </script></section>

	  <section data-markdown><script type="text/template">
## Separate Chaining: find()
- Note that we now have to keep each key in the chain, as well as the value!
- What is the worst case?
  - Hint: [Wikipedia](http://en.wikipedia.org/wiki/Hash_table) is wrong on this one...
  - In the worst case, every key could hash to the same spot!
    - With a linked list as the secondary data structure (by far the most common choice), a find is then a &Theta;(*n*) operation!
- What is the "hopeful" case?
	  </script></section>

	  <section data-markdown><script type="text/template">
## What data structure to use for the buckets?
- AVL & red-black trees will give the best running time
  - But that's a lot of overhead!
- Vectors are easier and simpler, but take up a *lot* of space
  - All those extra, unused, cells
  - Don't *ever* use vectors for this!
- Linked lists are quick and easy, and take up very little extra space
  - A find within a bucket is &Theta;(*n*) in the worst case!
  - Still faster *in practice* than trees due to having a very small number of items in the bucket
	  </script></section>

	  <section data-markdown><script type="text/template">
## Requirements for "Hopeful" Case
- Our ideal hash function and hash table:
  - Function *hash*(*k*) is well distributed for key space
    - For a randomly selected *k* &isin; *K*,
    - probability(*hash*(*k*) = i) = 1/*table_size*
  - Size of table scales linearly with number of elements
    - Expected bucket size is &Theta;(*num_elements* / *table_size*)
- Finding a good hash function can be tough
	  </script></section>

	  <section data-markdown><script type="text/template">
## Separate chaining insert is &Theta;(1)
- In an unsorted linked list, you can just put it on the front
- So all inserts into a separate chaining hash table that uses linked lists are actually constant time
  - If you were to *sort* the linked list, that would be linear time
  - And finds (and thus deletes) are still linear time
	  </script></section>

	</section>


	<section>

	  <section id="openaddressing" data-markdown><script type="text/template">
# Open Addressing
	  </script></section>

	  <section data-markdown><script type="text/template">
## Saving Memory
![separate chaining diagram](images/06-hashes/separate-chaining-diagram.png)
&nbsp;  
<center>Can we avoid the overhead of all those linked lists?</center>
	  </script></section>

	  <section data-markdown><script type="text/template">
## Three Types of Probing Strategies
- The three types:
  - Linear
  - Quadratic
  - Double hashing
- The general idea with all of them: if a spot is occupied, 'probe' (try) other spots in the table until a free one is found
  - How we determine where else to probe depends on which strategy we are using
	  </script></section>

	  <section>
<h2>Linear Probing</h2>
<table class="transparent"><tr><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="3">37</span></td></tr>
<tr><td>1</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">14</span></td></tr>
<tr><td>2</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="5">21</span></td></tr>
<tr><td>3</td><td class="border" style="width:100px"></td></tr>
<tr><td>4</td><td class="border" style="width:100px"></td></tr>
<tr><td>5</td><td class="border" style="width:100px"></td></tr>
<tr><td>6</td><td class="border" style="width:100px"></td></tr>
<tr><td>7</td><td class="border" style="width:100px"></td></tr>
<tr><td>8</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">27</span></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;width:100px"><span class="fragment" data-fragment-index="1">4</span></td></tr>
</table>
</td><td style="width:200px"></td><td class="top">
<ul>
<li>Check spots in this order:<ul>
    <li><i>hash</i>(<i>k</i>)</li>
    <li><i>hash</i>(<i>k</i>)+1</li>
    <li><i>hash</i>(<i>k</i>)+2</li>
    <li><i>hash</i>(<i>k</i>)+3</li>
    <li>etc.</li>
</ul>&nbsp;</li>
<li><i>hash</i>(<i>k</i>) = 3<i>k</i>+7<ul><li>Which is then mod'ed by the table size (10)</li><li>Result: <i>hash</i>(<i>k</i>) = (3<i>k</i>+7) mod 10</li></ul>&nbsp;</li>
<li>Insert: 4, 27, 37, 14, 21<ul>
    <li><i>hash</i>(<i>k</i>) values: 19, 88, 118, 49, 70, respectively</li>
</ul></li>
</ul>
</td></tr></table>
	  </section>

	  <section data-markdown><script type="text/template">
## Linear Probing
- With all open addressing schemes, we examine ('probe') the cells in the order:
  - *p*<sub>0</sub>(*k*), *p*<sub>1</sub>(*k*), *p*<sub>2</sub>(*k*), ...
  - where: *p*<sub>i</sub>(*k*) = (*hash*(*k*) + *f*(*i*)) mod *table_size*
- With *linear probing*, <span class="red">*f*(*i*) = *i*</span>
  - After searching spot *hash*(*k*) in the array, look in:
    - *hash*(*k*) + 1
    - *hash*(*k*) + 2
    - *hash*(*k*) + 3
    - etc.
	  </script></section>
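
	  <section data-markdown><script type="text/template">
## Linear Probing: a C++ sketch
A minimal sketch of insertion with *f*(*i*) = *i*, using *hash*(*k*) = 3*k* + 7 from the earlier linear probing example. The names `EMPTY` and `linearInsert` are made up for illustration, and using -1 as the empty marker assumes all keys are non-negative.

```cpp
#include <cstddef>
#include <vector>

const int EMPTY = -1;  // assumption: keys are non-negative, so -1 marks an empty cell

// Inserts key with linear probing. Returns the index used, or -1 if the table is full.
int linearInsert(std::vector<int>& table, int key) {
    const std::size_t m = table.size();
    if (m == 0) return -1;
    const std::size_t home = static_cast<std::size_t>(3 * key + 7) % m;
    for (std::size_t i = 0; i < m; ++i) {
        const std::size_t spot = (home + i) % m;  // p_i(k) = (hash(k) + i) mod m
        if (table[spot] == EMPTY) {
            table[spot] = key;
            return static_cast<int>(spot);
        }
    }
    return -1;  // every cell was probed and none was free
}
```

With a table of size 10, inserting 4, 27, 37, 14, 21 puts them in cells 9, 8, 0, 1, 2, matching the example.
	  </script></section>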

	  <section data-markdown><script type="text/template">
## Problems with Linear Probing
- Primary clustering
  - Large blocks of occupied cells
  - As the table fills, more attempts are required to resolve a collision
    - And thus slower lookup times
  - "Holes" when an element is removed
    - We'll see how to solve this later
  - When to stop looking?
	  </script></section>

	  <section data-markdown><script type="text/template">
## Quadratic Probing
- With all open addressing schemes, we examine ('probe') the cells in the order:
  - *p*<sub>0</sub>(*k*), *p*<sub>1</sub>(*k*), *p*<sub>2</sub>(*k*), ...
  - where: *p*<sub>i</sub>(*k*) = (*hash*(*k*) + *f*(*i*)) mod *table_size*
- With *quadratic probing*, <span class="red">*f*(*i*) = *i*<sup>2</sup></span>
  - After searching spot *hash*(*k*) in the array, look in:
    - *hash*(*k*) + 1
    - *hash*(*k*) + 4
    - *hash*(*k*) + 9
    - etc.
	  </script></section>

	  <section>
<h2>Quadratic Probing</h2>
<table class="transparent"><tr><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="3">14 </span></td></tr>
<tr><td>1</td><td class="border" style="width:100px"></td></tr>
<tr><td>2</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">37</span></td></tr>
<tr><td>3</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="5">22</span></td></tr>
<tr><td>4</td><td class="border" style="width:100px"></td></tr>
<tr><td>5</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="6">34</span></td></tr>
<tr><td>6</td><td class="border" style="width:100px"></td></tr>
<tr><td>7</td><td class="border" style="width:100px"></td></tr>
<tr><td>8</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">27</span></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;width:100px"><span class="fragment" data-fragment-index="1">4</span></td></tr>
</table>
</td><td style="width:200px"></td><td class="top">
<ul>
<li>Check spots in this order:<ul>
    <li><i>hash</i>(<i>k</i>)</li>
    <li><i>hash</i>(<i>k</i>)+1<sup>2</sup> = <i>hash</i>(<i>k</i>)+1</li>
    <li><i>hash</i>(<i>k</i>)+2<sup>2</sup> = <i>hash</i>(<i>k</i>)+4</li>
    <li><i>hash</i>(<i>k</i>)+3<sup>2</sup> = <i>hash</i>(<i>k</i>)+9</li>
    <li>etc.</li>
</ul>&nbsp;</li>
<li><i>hash</i>(<i>k</i>) = 3<i>k</i>+7<ul><li>Which is then mod'ed by the table size (10)</li><li>Result: <i>hash</i>(<i>k</i>) = (3<i>k</i>+7) mod 10</li></ul>&nbsp;</li>
<li>Insert: 4, 27, 14, 37, 22, 34<ul>
    <li><i>hash</i>(<i>k</i>) values: 19, 88, 49, 118, 73, 109, respectively</li>
</ul></li>
</ul>
</td></tr></table>
	  </section>
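
	  <section data-markdown><script type="text/template">
## Quadratic Probing: a C++ sketch
A minimal sketch of insertion with *f*(*i*) = *i*<sup>2</sup>, using *hash*(*k*) = 3*k* + 7 from the example. The names `EMPTY` and `quadraticInsert` are made up for illustration; using -1 as the empty marker assumes non-negative keys, and giving up after *table_size* probes is an arbitrary choice for this sketch.

```cpp
#include <cstddef>
#include <vector>

const int EMPTY = -1;  // assumption: keys are non-negative, so -1 marks an empty cell

// Inserts key with quadratic probing. Returns the index used,
// or -1 if no free cell was found within table.size() probes.
int quadraticInsert(std::vector<int>& table, int key) {
    const std::size_t m = table.size();
    if (m == 0) return -1;
    const std::size_t home = static_cast<std::size_t>(3 * key + 7) % m;
    for (std::size_t i = 0; i < m; ++i) {
        const std::size_t spot = (home + i * i) % m;  // p_i(k) = (hash(k) + i^2) mod m
        if (table[spot] == EMPTY) {
            table[spot] = key;
            return static_cast<int>(spot);
        }
    }
    return -1;
}
```

With a table of size 10, inserting 4, 27, 14, 37, 22, 34 puts them in cells 9, 8, 0, 2, 3, 5, matching the example. Unlike linear probing, a return of -1 here does not prove the table is full.
	  </script></section>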

	  <section data-markdown><script type="text/template">
## Double Hashing
- With all open addressing schemes, we examine ('probe') the cells in the order:
  - *p*<sub>0</sub>(*k*), *p*<sub>1</sub>(*k*), *p*<sub>2</sub>(*k*), ...
  - where: *p*<sub>i</sub>(*k*) = (*hash*(*k*) + *f*(*i*)) mod *table_size*
- With *double hashing*, <span class="red">*f*(*i*) = *i* \* hash<sub>2</sub>(*k*)</span>
  - Which means we have to define a *secondary* hash function!
  - After searching spot *hash*(*k*) in the array, look in:
    - *hash*(*k*) + 1 \* *hash*<sub>2</sub>(*k*)
    - *hash*(*k*) + 2 \* *hash*<sub>2</sub>(*k*)
    - *hash*(*k*) + 3 \* *hash*<sub>2</sub>(*k*)
    - etc.
	  </script></section>

	  <section>
<h2>Double Hashing</h2>
<table class="transparent"><tr><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="5">69</span></td></tr>
<tr><td>1</td><td class="border" style="width:100px"></td></tr>
<tr><td>2</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="6">60</span></td></tr>
<tr><td>3</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="3">58</span></td></tr>
<tr><td>4</td><td class="border" style="width:100px"></td></tr>
<tr><td>5</td><td class="border" style="width:100px"></td></tr>
<tr><td>6</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">49</span></td></tr>
<tr><td>7</td><td class="border" style="width:100px"></td></tr>
<tr><td>8</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">18</span></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;width:100px"><span class="fragment" data-fragment-index="1">89</span></td></tr>
</table>
</td><td style="width:200px"></td><td class="top">
<ul>
<li>Check spots in this order:<ul>
    <li><i>hash</i>(<i>k</i>)</li>
    <li><i>hash</i>(<i>k</i>) + 1 * <i>hash</i><sub>2</sub>(<i>k</i>)</li>
    <li><i>hash</i>(<i>k</i>) + 2 * <i>hash</i><sub>2</sub>(<i>k</i>)</li>
    <li><i>hash</i>(<i>k</i>) + 3 * <i>hash</i><sub>2</sub>(<i>k</i>)</li>
    <li>etc.</li>
</ul>&nbsp;</li>
<li><i>hash</i>(<i>k</i>) = <i>k</i><ul>
<li>The hash function was made simpler for this example...</li>
<li>Which is then mod'ed by the table size (10)</li>
<li>Result: <i>hash</i>(<i>k</i>) = <i>k</i> mod 10</li></ul></li>
<li><i>hash</i><sub>2</sub>(<i>k</i>) = 7 - (<i>k</i> mod 7)<br>&nbsp;</li>
<li>Insert: 89, 18, 58, 49, 69, 60</li>
</ul>
</td></tr></table>
	  </section>
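
	  <section data-markdown><script type="text/template">
## Double Hashing: a C++ sketch
A minimal sketch of insertion with *f*(*i*) = *i* \* *hash*<sub>2</sub>(*k*), using the two hash functions from the example. The names `EMPTY`, `hash2`, and `doubleHashInsert` are made up for illustration, and using -1 as the empty marker assumes non-negative keys.

```cpp
#include <cstddef>
#include <vector>

const int EMPTY = -1;  // assumption: keys are non-negative, so -1 marks an empty cell

// Secondary hash from the example: 7 - (k mod 7), always between 1 and 7.
std::size_t hash2(int key) {
    return 7 - static_cast<std::size_t>(key % 7);
}

// Inserts key with double hashing. Returns the index used,
// or -1 if no free cell was found within table.size() probes.
int doubleHashInsert(std::vector<int>& table, int key) {
    const std::size_t m = table.size();
    if (m == 0) return -1;
    const std::size_t home = static_cast<std::size_t>(key) % m;  // hash(k) = k mod m
    const std::size_t step = hash2(key);
    for (std::size_t i = 0; i < m; ++i) {
        const std::size_t spot = (home + i * step) % m;
        if (table[spot] == EMPTY) {
            table[spot] = key;
            return static_cast<int>(spot);
        }
    }
    return -1;
}
```

With a table of size 10, inserting 89, 18, 58, 49, 69, 60 puts them in cells 9, 8, 3, 6, 0, 2, matching the example.
	  </script></section>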

	  <section>
<h2>Double Hashing Thrashing</h2>
<table class="transparent"><tr><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="1">10</span></td></tr>
<tr><td>1</td><td class="border" style="width:100px"></td></tr>
<tr><td>2</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">12</span></td></tr>
<tr><td>3</td><td class="border" style="width:100px"></td></tr>
<tr><td>4</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="3">14</span></td></tr>
<tr><td>5</td><td class="border" style="width:100px"></td></tr>
<tr><td>6</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">16</span></td></tr>
<tr><td>7</td><td class="border" style="width:100px"></td></tr>
<tr><td>8</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="5">18</span></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;width:100px"></td></tr>
</table>
</td><td style="width:200px"></td><td class="top">
<ul>
<li><i>hash</i>(<i>k</i>) = <i>k</i> mod 10&nbsp;<ul>
<li>Same as the previous slide</li></ul>&nbsp;</li>
<li><i>hash</i><sub>2</sub>(<i>k</i>) = (<i>k</i> mod 5) +1<br>&nbsp;</li>
<li>Insert: 10, 12, 14, 16, 18, <span class='red'>36</span></li>
</ul>
</td></tr></table>
	  </section>

	  <section data-markdown><script type="text/template">
## Table size must be prime!
- The table size must always be a prime number
  - It will prevent the thrashing from the previous slide
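    - Thrashing can only occur when the secondary hash value shares a common factor (greater than 1) with the table size
    - The only factors of a prime number *p* are 1 and *p*
      - So every step size from 1 to *p* - 1 eventually probes every spot in the table
      - A step size of 1 is effectively linear probing, which is fine
      - A multiple of *p* will mod to zero, which is an invalid return value for a secondary hash function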
  - It will provide better distribution of the hash keys into the table
    - Less clustering, etc.
- A prime number table size does not remove the need for a good hash function!
	  </script></section>
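
	  <section data-markdown><script type="text/template">
## Double hashing: a code sketch
A minimal C++ sketch of the probe order from the double hashing slides, assuming non-negative integer keys and *hash*(*k*) = *k* mod table size. The names `EMPTY`, `hash2Good()`, `hash2Thrash()`, and `insertDouble()` are made up for illustration; this is not the lab code.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // assumed marker for an unused spot (keys are non-negative)

// Secondary hash functions from the two double hashing example slides
int hash2Good(int k) { return 7 - (k % 7); }    // always in 1..7, never zero
int hash2Thrash(int k) { return (k % 5) + 1; }  // always in 1..5, never zero

// Double hashing insert: probe hash(k) + i * step for i = 0, 1, 2, ...
// where hash(k) = k mod table size and step = hash2(k).
// Returns the spot used, or -1 if `size` probes found no empty spot.
int insertDouble(std::vector<int>& table, int k, int step) {
    int size = static_cast<int>(table.size());
    assert(size > 0 && step > 0);
    for (int i = 0; i < size; i++) {
        int spot = (k % size + i * step) % size;
        if (table[spot] == EMPTY) {
            table[spot] = k;
            return spot;
        }
    }
    return -1;
}
```

With a table of size 10 holding 10, 12, 14, 16, 18, the call `insertDouble(table, 36, hash2Thrash(36))` returns -1: the step is 2, so only the (full) even spots are ever probed. With a table of size 11, the same six inserts all succeed, and 36 lands in spot 9.
	  </script></section>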

	</section>


	<section>

	  <section id="miscellaneous" data-markdown><script type="text/template">
# Miscellaneous
	  </script></section>

	  <section data-markdown><script type="text/template">
## Rehashing
- Problem: when the table gets too full, running time for operations increases
- Solution: create a bigger table and hash all the items from the original table into the new table
  - The position in a table depends on the table size, so every element's position changes when the size does
  - This means we have to *rehash* each element: re-compute its hash value, mod it by the new table size, and insert it into the new table!
	  </script></section>

	  <section data-markdown><script type="text/template">
## Rehashing
- When to rehash?
  - When half full (&lambda; = 0.5)
  - When mostly full (&lambda; = 0.75)
    - Java's `Hashtable` does this by default
  - When an insertion fails
  - Some other threshold
- Cost of rehashing
  - Let's assume that the hash function computation takes constant time
  - We have to do *n* inserts, and if each key hashes to the same spot, then it will be a &Theta;(*n*<sup>2</sup>) operation!
    - Although it is not likely to ever run that slow
	  </script></section>
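
	  <section data-markdown><script type="text/template">
## Rehashing: a code sketch
A rough C++ sketch of the rehashing steps, assuming non-negative integer keys, *hash*(*k*) = *k* mod table size, and linear probing. Growing to the next prime at or above twice the old size is just one possible policy, and every name here (`EMPTY`, `nextPrime()`, `insertLinear()`, `rehash()`) is hypothetical.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // assumed marker for an unused spot

bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime that is >= n
int nextPrime(int n) {
    while (!isPrime(n)) n++;
    return n;
}

// Linear probing insert; assumes the table has at least one empty spot
void insertLinear(std::vector<int>& table, int k) {
    int size = static_cast<int>(table.size());
    assert(size > 0);
    int spot = k % size;
    while (table[spot] != EMPTY)
        spot = (spot + 1) % size;
    table[spot] = k;
}

// Build a table roughly twice as large and re-insert every key into it
std::vector<int> rehash(const std::vector<int>& old) {
    int newSize = nextPrime(2 * static_cast<int>(old.size()));
    std::vector<int> bigger(newSize, EMPTY);
    for (int k : old)
        if (k != EMPTY)
            insertLinear(bigger, k);  // uses k mod newSize, not the old size
    return bigger;
}
```

For a table of size 5 holding 10 and 15 (spots 0 and 1), `rehash()` returns a table of size 11 where 10 sits in spot 10 and 15 sits in spot 4: neither key keeps its old position.
	  </script></section>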

	  <section data-markdown><script type="text/template">
## Removing an element
- How to handle this?
- You could:
  - Rehash upon each delete, which is *very* expensive
  - Put in a 'placeholder' or 'sentinel' value
    - But the table fills up with these rather quickly
    - So perhaps rehash after a certain number of deletes
  - Disallow deletes entirely; you can do this for [lab 6](../labs/lab06/index.html)
- Hash tables are not an ideal data structure if you need to perform a lot of deletions
	  </script></section>
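
	  <section data-markdown><script type="text/template">
## Removing an element: a code sketch
A small C++ sketch of the placeholder approach, assuming non-negative integer keys, *hash*(*k*) = *k* mod table size, and linear probing. `EMPTY`, `DELETED`, `findKey()`, and `removeKey()` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;    // assumed marker: spot has never held a key
const int DELETED = -2;  // assumed placeholder: spot held a key that was removed

// Linear probing find: stops at EMPTY, but keeps going past DELETED,
// because a key inserted after a collision may sit beyond the placeholder.
// Returns the key's spot, or -1 if it is not in the table.
int findKey(const std::vector<int>& table, int k) {
    int size = static_cast<int>(table.size());
    assert(size > 0);
    for (int i = 0; i < size; i++) {
        int spot = (k % size + i) % size;
        if (table[spot] == EMPTY) return -1;
        if (table[spot] == k) return spot;
    }
    return -1;
}

// Replace the key with the placeholder rather than EMPTY
bool removeKey(std::vector<int>& table, int k) {
    int spot = findKey(table, k);
    if (spot == -1) return false;
    table[spot] = DELETED;
    return true;
}
```

Suppose a table of size 10 holds 89 in spot 9 and 49 in spot 0 (49 collided with 89 and wrapped around). After `removeKey(table, 89)`, `findKey(table, 49)` still returns 0; had spot 9 been reset to `EMPTY`, the search for 49 would have stopped there and failed.
	  </script></section>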

	  <section data-markdown><script type="text/template">
## Hashing: MD5
- [MD5: Message Digest 5](http://en.wikipedia.org/wiki/Md5)
- Given a string (or file contents, etc.) generate a 128-bit hash
  - 2<sup>128</sup> = 3.4*10<sup>38</sup> (coincidentally, this is also the [maximum finite value](03-numbers.html#/maxfloatvalue) of a `float`)
  - An MD5 hash is typically written as 32 hex digits: 16e28b7986fd74f65b061de89dc8b78e
- This could then be used as the key
  - After mod'ing it by the table size, of course
- (Was) good for checking if a download completed successfully
	  </script></section>
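
	  <section data-markdown><script type="text/template">
## MD5 as a key: a code sketch
A 128-bit value does not fit in a built-in C++ integer, so one way to mod it by the table size is to work through the hex string one digit at a time. The sketch below assumes the digest has already been computed elsewhere (no MD5 library is used here); `hexDigit()` and `digestToSpot()` are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Value of one hex digit; assumes c is one of 0-9, a-f, or A-F
int hexDigit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    assert(c >= 'A' && c <= 'F');
    return c - 'A' + 10;
}

// Reduce a hex digest mod tableSize, one digit at a time (Horner's rule),
// so the running value always stays below tableSize * 16
unsigned int digestToSpot(const std::string& hexDigest, unsigned int tableSize) {
    assert(tableSize > 0);
    unsigned long long spot = 0;
    for (char c : hexDigest)
        spot = (spot * 16 + hexDigit(c)) % tableSize;
    return static_cast<unsigned int>(spot);
}
```

As a quick check, `digestToSpot("ff", 10)` is 5, since 0xff is 255. The same call works unchanged on a full 32-digit digest.
	  </script></section>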

	  <section data-markdown><script type="text/template">
## Can you reverse an MD5 hash?
- Technically, no
  - A 129-bit file has 2<sup>129</sup> possibilities, and if you were to hash each one, it would go into 2<sup>128</sup> buckets
  - By the pigeonhole principle, there would be at least one hash value (pigeonhole) with multiple keys (pigeons), and you don't know which one
    - In reality, *many* (and eventually *all*) would have multiple keys
- But if a password is stored by its MD5 hash...
  - ... then there are enough online hash libraries that you can find at least one password that hashes to that value
  - Try Googling for [3858f62230ac3c915f300c664312c63f](https://www.google.com/search?q=3858f62230ac3c915f300c664312c63f)
- Plus there are lots of weaknesses in MD5...
	  </script></section>

	  <section data-markdown><script type="text/template">
## More hashing: SHA
- MD5 has been "broken"
  - One can generate two files that have the same hash; this is called a [collision attack](https://en.wikipedia.org/wiki/Collision_attack)
    - In fact, I have the students do something similar when I teach Defense Against the Dark Arts (albeit with a [weaker hashing algorithm](https://en.wikipedia.org/wiki/Crc32))
  - So it is useless for any security-related purposes
- [SHA (Secure Hash Algorithm)](http://en.wikipedia.org/wiki/Secure_Hash_Algorithm) is a family of algorithms; the more recent ones are much more secure
  - Same overall idea: it generates a hash value of up to 512 bits
  - SHA-1 has also been broken, but the more recent SHAs (SHA-2, SHA-3) are still considered secure
	  </script></section>

	</section>


      </div>

    </div>

    <div id="calibratediv" style="display:none">
      <div id="calibratecanvasdiv">
        <canvas id="calibratecanvas" width="300" height="300">Your browser does not support the canvas tag</canvas>
      </div>
      <p style="text-align:center">Click the center of the target<br><a href="#" onClick="calibratewin.close(); return false">Close window</a></p>
    </div>

    <script src="reveal.js/js/reveal.js"></script>
    <script src="js/settings.js"></script>

  </body>
</html>