BMZ documentation was finished
fc_botelho committed Jan 31, 2005
1 parent 8401ce6 commit 4951ded
Showing 13 changed files with 738 additions and 46 deletions.
289 changes: 255 additions & 34 deletions BMZ.t2t

Large diffs are not rendered by default.

24 changes: 20 additions & 4 deletions CHM.t2t
@@ -4,8 +4,11 @@ CHM Algorithm
%!includeconf: CONFIG.t2t

----------------------------------------

==The Algorithm==

----------------------------------------

==Memory Consumption==

Now we detail the memory consumption to generate and to store minimal perfect hash functions
@@ -23,9 +26,11 @@ following:
of 4 bytes that represent the vertices. As there are //n// edges, the
vector edges is stored in //8n// bytes.

+ **next**: given a vertex //v//, we can discover the edges that contain //v//
following its list of edges, which starts on first[//v//] and the next
edges are given by next[...first[//v//]...]. Therefore, the vectors first and next represent
+ **next**: given a vertex [figs/img139.png], we can discover the edges that
contain [figs/img139.png] following its list of edges, which starts on
first[[figs/img139.png]] and the next
edges are given by next[...first[[figs/img139.png]]...]. Therefore,
the vectors first and next represent
the linked lists of edges of each vertex. As there are two vertices for each edge,
when an edge is inserted in the graph, it must be inserted in the two linked lists
of the vertices in its composition. Therefore, there are //2n// entries of integer
@@ -47,12 +52,23 @@ As the value of constant //c// must be at least 2.09 we have:
|| //c// | Memory consumption to generate a MPHF |
| 2.09 | //33.00n + O(1)// |

| **Table 1:** Memory consumption to generate a MPHF using the CHM algorithm.
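
As an illustration of the first and next vectors described in the list above, the following Python sketch builds the per-vertex linked lists of edges. The class name ``Graph``, the ``EMPTY`` sentinel, and the convention of storing edge //e// in slots //e// and //e + n// are assumptions made for this example; they are not taken from the cmph source code.

```python
EMPTY = -1  # assumed sentinel marking the end of a linked list


class Graph:
    """Graph with n edges stored in three flat vectors (illustrative layout)."""

    def __init__(self, num_vertices, num_edges):
        self.n = num_edges
        # 2n entries: slot e and slot e + n hold the two endpoints of edge e.
        self.edges = [EMPTY] * (2 * num_edges)
        # first[v] is the slot where the list of edges of vertex v starts.
        self.first = [EMPTY] * num_vertices
        # next[slot] is the following slot in the same vertex list.
        self.next = [EMPTY] * (2 * num_edges)
        self.count = 0

    def _link(self, owner, other, slot):
        # Push `slot` onto the head of owner's list and record the other endpoint.
        self.edges[slot] = other
        self.next[slot] = self.first[owner]
        self.first[owner] = slot

    def add_edge(self, u, v):
        # Each edge is inserted in the lists of both of its vertices.
        e = self.count
        self.count += 1
        self._link(u, v, e)
        self._link(v, u, e + self.n)

    def neighbours(self, v):
        # Walk the list: start at first[v], then follow next[...].
        slot = self.first[v]
        while slot != EMPTY:
            yield self.edges[slot]
            slot = self.next[slot]


g = Graph(num_vertices=4, num_edges=3)
g.add_edge(0, 1)
g.add_edge(0, 2)
g.add_edge(2, 3)
print(sorted(g.neighbours(2)))  # vertices sharing an edge with vertex 2
```

With 4-byte integers, the 2n entries of the edges vector and the 2n entries of the next vector in this layout take 8n bytes each.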

Now we present the memory consumption to store the resulting function.
We only need to store the //g// function. Thus, we need //4cn// bytes.
Again we have:
|| //c// | Memory consumption to store a MPHF |
| 2.09 | //8.36n// |

| **Table 2:** Memory consumption to store a MPHF generated by the CHM algorithm.

----------------------------------------

==Experimental Results==

[CHM x BMZ comparison.html]

----------------------------------------

==Papers==

@@ -66,7 +82,7 @@ Again we have:


----------------------------------------
[Home index.html]
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html]
----------------------------------------

%!include: FOOTER.t2t
95 changes: 92 additions & 3 deletions COMPARISON.t2t
@@ -5,17 +5,106 @@ Comparison Between BMZ And CHM Algorithms

----------------------------------------

==Features==
==Characteristics==
Table 1 presents the main characteristics of the two algorithms.
The number of edges in the graph [figs/img27.png] is [figs/img236.png],
the number of keys in the input set [figs/img20.png].
The number of vertices of [figs/img32.png] is equal
to [figs/img12.png] and [figs/img237.png] for the BMZ algorithm and the CHM algorithm, respectively.
This measure is related to the amount of space needed to store the array [figs/img37.png].
This reduces the space required to store a function in the BMZ algorithm to [figs/img238.png] of the space required by the CHM algorithm.
The number of critical edges is [figs/img76.png] and 0 for the BMZ algorithm and the CHM algorithm,
respectively.
The BMZ algorithm generates random graphs that necessarily contain cycles, and the
CHM algorithm generates acyclic random graphs.
Finally, the CHM algorithm generates [order preserving functions concepts.html]
while the BMZ algorithm does not preserve order.

==Constructing Minimal Perfect Hash Functions==
%!include(html): ''TABLE1.t2t''
| **Table 1:** Main characteristics of the algorithms.

----------------------------------------

==Memory Consumption==

- Memory consumption to generate the minimal perfect hash function (MPHF):
|| Algorithm | //c// | Memory consumption to generate a MPHF |
| BMZ | 0.93 | //24.80n + O(1)// |
| BMZ | 1.15 | //26.42n + O(1)// |
| CHM | 2.09 | //33.00n + O(1)// |

| **Table 2:** Memory consumption to generate a MPHF using the algorithms BMZ and CHM.

- Memory consumption to store the resulting minimal perfect hash function (MPHF):
|| Algorithm | //c// | Memory consumption to store a MPHF |
| BMZ | 0.93 | //3.72n// |
| BMZ | 1.15 | //4.60n// |
| CHM | 2.09 | //8.36n// |

| **Table 3:** Memory consumption to store a MPHF generated by the algorithms BMZ and CHM.
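
The coefficients in Table 3 are consistent with storing //cn// entries of 4 bytes each, which is the //4cn// figure given for the CHM algorithm. The short Python check below reproduces them under that assumption; the function names are hypothetical and exist only for this example.

```python
BYTES_PER_ENTRY = 4  # assumption: each stored entry is a 4-byte integer


def storage_bytes(c, n):
    """Bytes needed to store the function for n keys: 4 * c * n."""
    return BYTES_PER_ENTRY * c * n


def storage_coefficient(c):
    """Per-key coefficient, rounded to two decimals as in the table."""
    return round(BYTES_PER_ENTRY * c, 2)


for algorithm, c in [("BMZ", 0.93), ("BMZ", 1.15), ("CHM", 2.09)]:
    print(f"{algorithm}  c = {c:.2f}  {storage_coefficient(c):.2f}n bytes")
```

The loop prints the coefficients 3.72, 4.60 and 8.36, matching the three rows of the table.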

----------------------------------------

==Run times==
We now present some experimental results to compare the BMZ and CHM algorithms.
The data consists of a collection of 100 million uniform resource locators
(URLs) collected from the Web.
The average length of a URL in the collection is 63 bytes.
All experiments were carried out on
a computer running the Linux operating system, version 2.6.7,
with a 2.4 gigahertz processor and
4 gigabytes of main memory.

Table 4 presents time measurements.
All times are in seconds.
The table entries represent averages over 50 trials.
The column labelled as [figs/img243.png] represents
the number of iterations to generate the random graph [figs/img32.png] in the
mapping step of the algorithms.
The next columns represent the run times
for the mapping plus ordering steps together and the searching
step for each algorithm.
The last column represents the percentage gain of the BMZ algorithm
over the CHM algorithm.

%!include(html): ''TABLE4.t2t''
| **Table 4:** Time measurements for BMZ and the CHM algorithm.

The mapping step of the BMZ algorithm is faster because
the expected number of iterations in the mapping step to generate [figs/img32.png] is
2.13 for the BMZ algorithm and 2.92 for the CHM algorithm
(see [[2 bmz.html#papers]] for details).
The graph [figs/img32.png] generated by the BMZ algorithm
has [figs/img12.png] vertices, against [figs/img237.png] for the CHM algorithm.
These two facts make the BMZ algorithm faster in the mapping step.
The time for the ordering step of the BMZ algorithm is approximately equal to
the time to check whether [figs/img32.png] is acyclic for the CHM algorithm.
The searching step of the CHM algorithm is faster, but overall
the BMZ algorithm is, on average, approximately 59% faster
than the CHM algorithm.
It is important to note that the searching step
does not dominate the run time of either algorithm,
and the experimental results clearly show
linear behavior for the searching step.

We now present run times for the BMZ algorithm using a [heuristic bmz.html#heuristic] that
reduces the space requirement
to any given value between [figs/img12.png] words and [figs/img13.png] words.
For example, for [figs/img244.png] and [figs/img6.png], the analytical expected number
of iterations is [figs/img245.png] and [figs/img246.png], respectively
(for [figs/img247.png], the number of iterations is 2.78 for [figs/img244.png] and 3.04
for [figs/img6.png]).
Table 5 presents the total times to construct a
function for [figs/img247.png], with an increase from [figs/img248.png] seconds
for [figs/img128.png] (see Table 4) to [figs/img249.png] seconds for [figs/img244.png] and
to [figs/img250.png] seconds for [figs/img6.png].

%!include(html): ''TABLE5.t2t''
| **Table 5:** Time measurements for BMZ tuned algorithm with [figs/img5.png] and [figs/img6.png].
----------------------------------------
[Home index.html]
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html]
----------------------------------------

%!include: FOOTER.t2t
56 changes: 56 additions & 0 deletions CONCEPTS.t2t
@@ -0,0 +1,56 @@
Minimal Perfect Hash Functions - Introduction


%!includeconf: CONFIG.t2t

----------------------------------------
==Basic Concepts==

Suppose [figs/img14.png] is a universe of //keys//.
Let [figs/img15.png] be a //hash function// that maps the keys from [figs/img14.png] to a given interval of integers [figs/img16.png].
Let [figs/img17.png] be a set of [figs/img8.png] keys from [figs/img14.png].
Given a key [figs/img18.png], the hash function [figs/img7.png] computes an
integer in [figs/img19.png] for the storage or retrieval of [figs/img11.png] in
a //hash table//.
Hashing methods for //non-static sets// of keys can be used to construct
data structures storing [figs/img20.png] and supporting membership queries
"[figs/img18.png]?" in expected time [figs/img21.png].
However, they involve a certain amount of wasted space owing to unused
locations in the table and wasted time to resolve collisions when
two keys are hashed to the same table location.

For //static sets// of keys it is possible to compute a function
to find any key in a table in one probe; such hash functions are called
//perfect//.
More precisely, given a set of keys [figs/img20.png], we shall say that a
hash function [figs/img15.png] is a //perfect hash function//
for [figs/img20.png] if [figs/img7.png] is an injection on [figs/img20.png],
that is, there are no //collisions// among the keys in [figs/img20.png]:
if [figs/img11.png] and [figs/img22.png] are in [figs/img20.png] and [figs/img23.png],
then [figs/img24.png].
Figure 1(a) illustrates a perfect hash function.
Since no collisions occur, each key can be retrieved from the table
with a single probe.
If [figs/img25.png], that is, the table has the same size as [figs/img20.png],
then we say that [figs/img7.png] is a //minimal perfect hash function//
for [figs/img20.png].
Figure 1(b) illustrates a minimal perfect hash function.
Minimal perfect hash functions totally avoid the problem of wasted
space and time. A perfect hash function [figs/img7.png] is //order preserving//
if the keys in [figs/img20.png] are arranged in some given order
and [figs/img7.png] preserves this order in the hash table.

| [figs/img26.png]
| **Figure 1:** (a) Perfect hash function. (b) Minimal perfect hash function.
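
The three definitions above can be checked mechanically on a small key set. The Python sketch below does so under the assumption that the hash function maps keys to the integers 0 to //m// - 1; the function names and the toy lookup table standing in for a hash function are hypothetical and serve only as an example.

```python
def is_perfect(h, keys, m):
    """True if h maps every key into 0..m-1 with no collisions."""
    values = [h(k) for k in keys]
    in_range = all(0 <= v < m for v in values)
    return in_range and len(set(values)) == len(values)


def is_minimal_perfect(h, keys):
    """True if h is perfect and the table has exactly one slot per key."""
    return is_perfect(h, keys, len(keys))


def is_order_preserving(h, keys):
    """True if h is injective and keeps the keys in their given order."""
    values = [h(k) for k in keys]
    return all(a < b for a, b in zip(values, values[1:]))


keys = ["jan", "feb", "mar"]                # keys in their given order
sparse = {"jan": 4, "feb": 0, "mar": 2}     # toy function into a table of size 5
dense = {"jan": 0, "feb": 1, "mar": 2}      # toy function into a table of size 3

print(is_perfect(sparse.__getitem__, keys, 5))        # perfect but not minimal
print(is_minimal_perfect(sparse.__getitem__, keys))
print(is_minimal_perfect(dense.__getitem__, keys))
print(is_order_preserving(dense.__getitem__, keys))
```

A real minimal perfect hash function computes these values from the key itself instead of looking them up in a dictionary; the dictionaries here only make the definitions easy to test.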

Minimal perfect hash functions are widely used for memory efficient
storage and fast retrieval of items from static sets, such as words in natural
languages, reserved words in programming languages or interactive systems,
uniform resource locators (URLs) in Web search engines, or item sets in
data mining techniques.

----------------------------------------
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html]
----------------------------------------

%!include: FOOTER.t2t
42 changes: 42 additions & 0 deletions CONFIG.t2t
@@ -1,4 +1,46 @@
%! style(html): DOC.css
%! PreProc(html): '^%html% ' ''
%! PreProc(txt): '^%txt% ' ''
%! PostProc(html): "&" "&"
%! PostProc(txt): " " " "
%! PostProc(html): 'ALIGN="middle" SRC="figs/img7.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img7.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img57.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img57.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img32.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img32.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img20.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img20.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img60.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img60.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img62.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img62.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img79.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img79.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img139.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img139.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img140.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img140.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img143.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img143.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img115.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img115.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img11.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img11.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img169.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img169.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img96.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img96.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img178.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img178.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img180.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img180.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img183.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img183.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img189.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img189.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img196.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img196.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img172.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img172.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img8.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img8.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img1.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img1.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img14.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img14.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img128.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img128.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img112.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img112.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img12.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img12.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img13.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img13.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img244.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img244.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img245.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img245.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img246.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img246.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img15.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img15.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img25.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img25.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img168.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img168.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img6.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img6.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img5.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img5.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img28.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img28.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img237.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img237.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img248.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img248.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img249.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img249.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img250.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img250.png"\1>'
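
To see what one of these PostProc rules does, the rule can be tried as an ordinary regular-expression substitution. The Python sketch below assumes the rule behaves like ``re.sub`` applied to the generated HTML; the sample ``IMG`` tags are hypothetical and not taken from the real output.

```python
import re

# Pattern and replacement copied from the img7.png rule above.
PATTERN = r'ALIGN="middle" SRC="figs/img7.png"(.*?)>'
REPLACEMENT = r'ALIGN="bottom" SRC="figs/img7.png"\1>'


def realign(html):
    """Apply the rule: switch img7.png from middle to bottom alignment."""
    return re.sub(PATTERN, REPLACEMENT, html)


# Hypothetical tags standing in for the generated HTML.
target = '<IMG WIDTH="14" ALIGN="middle" SRC="figs/img7.png" ALT="h">'
other = '<IMG WIDTH="14" ALIGN="middle" SRC="figs/img70.png" ALT="x">'

print(realign(target))  # alignment changed, remaining attributes kept by \1
print(realign(other))   # unchanged: the closing quote keeps img70.png from matching
```

Because the pattern names one image file, every image that needs bottom alignment gets its own rule, which is why the list is so long.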
33 changes: 33 additions & 0 deletions DOC.css
@@ -0,0 +1,33 @@
/* implement both fixed-size and relative sizes */
SMALL.XTINY { }
SMALL.TINY { }
SMALL.SCRIPTSIZE { }
BODY { font-size: 13px }
TD { font-size: 13px }
SMALL.FOOTNOTESIZE { font-size: 13px }
SMALL.SMALL { }
BIG.LARGE { }
BIG.XLARGE { }
BIG.XXLARGE { }
BIG.HUGE { }
BIG.XHUGE { }

/* heading styles */
H1 { }
H2 { }
H3 { }
H4 { }
H5 { }


/* mathematics styles */
DIV.displaymath { } /* math displays */
TD.eqno { } /* equation-number cells */


/* document-specific styles come next */
DIV.navigation { }
DIV.center { }
SPAN.textit { font-style: italic }
SPAN.arabic { }
SPAN.eqn-number { }
3 changes: 2 additions & 1 deletion FAQ.t2t
@@ -1,6 +1,7 @@
CMPH FAQ


%!includeconf: CONFIG.t2t

- How do I define the ids of the keys?
- You don't. The ids will be assigned by the algorithm creating the minimal
@@ -26,7 +27,7 @@ one is executed?


----------------------------------------
[Home index.html]
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html]
----------------------------------------

%!include: FOOTER.t2t
3 changes: 2 additions & 1 deletion GPERF.t2t
@@ -1,6 +1,7 @@
GPERF versus CMPH


%!includeconf: CONFIG.t2t

You might ask why cmph if [gperf http://www.gnu.org/software/gperf/gperf.html]
already works perfectly. Actually, gperf and cmph have different goals.
@@ -32,7 +33,7 @@ assigning ids to millions of documents), while the former is usually found in
the compiler programming area (detect reserved keywords).

----------------------------------------
[Home index.html]
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html]
----------------------------------------

%!include: FOOTER.t2t
1 change: 1 addition & 0 deletions LOGO.t2t
@@ -0,0 +1 @@
<a href="http://sourceforge.net"><img src="http://sourceforge.net/sflogo.php?group_id=96251&amp;type=1" width="88" height="31" border="0" alt="SourceForge.net Logo" /> </a>
7 changes: 4 additions & 3 deletions README.t2t
@@ -8,7 +8,8 @@ CMPH - C Minimal Perfect Hashing Library
==Description==

C Minimal Perfect Hashing Library is a portable LGPLed library to create and
to work with minimal perfect hash functions. The cmph library encapsulates the newest
to work with [minimal perfect hash functions concepts.html].
The cmph library encapsulates the newest
and most efficient algorithms (available in the literature) in an easy-to-use,
production-quality and fast API. The library is designed to work with big entries that
cannot fit in the main memory. It has been used successfully for constructing minimal perfect
@@ -54,7 +55,7 @@ of the distinguishable features of cmph:

- A new heuristic added to the bmz algorithm makes it possible to generate an MPHF with only
//24.6n + O(1)// bytes. The resulting function can be stored in //3.72n// bytes.
%html% [click here bmz.html] for details.
%html% [click here bmz.html#heuristic] for details.


----------------------------------------
@@ -173,5 +174,5 @@ Code is under the LGPL.

%!include: FOOTER.t2t

%!include(html): ''LOGO.html''
%!include(html): ''LOGO.t2t''
Last Updated: %%date(%c)