Commit 227121e

Add section on entropy bits calculation
1 parent 3e4ff97 commit 227121e

6 files changed

+49
-26
lines changed

Diff for: README.md

+34-11
@@ -13,6 +13,7 @@ Efficiently generate cryptographically strong random strings of specified entrop
1313
- [Custom Characters](#CustomCharacters)
1414
- [Efficiency](#Efficiency)
1515
- [Custom Bytes](#CustomBytes)
16+
- [Entropy Bits](#EntropyBits)
1617
- [Take Away](#TakeAway)
1718

1819
### Installation
@@ -203,15 +204,15 @@ Further commands can use the loaded modules:
203204
204205
### <a name="Overview"></a>Overview
205206
206-
`EntropyString` provides easy creation of randomly generated strings of specific entropy using various character sets. Such strings are needed when generating, for example, random IDs and you don't want the overkill of a GUID, or for ensuring that some number of items have unique identifiers.
207+
`EntropyString` provides easy creation of randomly generated strings of specific entropy using various character sets. Such strings are needed as unique identifiers, for example when generating random IDs without the overkill of a GUID.
207208
208-
A key concern when generating such strings is that they be unique. To truly guarantee uniqueness requires either deterministic generation (e.g., a counter) that is not random, or that each newly created random string be compared against all existing strings. When randomness is required, the overhead of storing and comparing all strings is often too onerous and a different tack is needed.
209+
A key concern when generating such strings is that they be unique. Guaranteed uniqueness, however, requires either deterministic generation (e.g., a counter) that is not random, or that each newly created random string be compared against all existing strings. When randomness is required, the overhead of storing and comparing strings is often too onerous and a different tack is chosen.
209210
210-
A common strategy is to replace the *guarantee of uniqueness* with a weaker but often sufficient *probabilistic uniqueness*. Specifically, rather than being absolutely sure of uniqueness, we settle for a statement such as *"there is less than a 1 in a billion chance that two of my strings are the same"*. This strategy requires much less overhead, but does require we have some manner of qualifying what we mean by, for example, *"there is less than a 1 in a billion chance that 1 million strings of this form will have a repeat"*.
211+
A common strategy is to replace the *guarantee of uniqueness* with a weaker but often sufficient *probabilistic uniqueness*. Specifically, rather than being absolutely sure of uniqueness, we settle for a statement such as *"there is less than a 1 in a billion chance that two of my strings are the same"*. This strategy requires much less overhead, but does require we have some manner of quantifying what we mean by *"there is less than a 1 in a billion chance that 1 million strings of this form will have a repeat"*.
211212
212-
Understanding probabilistic uniqueness requires some understanding of [*entropy*](https://en.wikipedia.org/wiki/Entropy_(information_theory)) and of estimating the probability of a [*collision*](https://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem) (i.e., the probability that two strings in a set of randomly generated strings might be the same). Happily, you can use `EntropyString` without a deep understanding of these topics.
213+
Understanding probabilistic uniqueness of random strings requires an understanding of [*entropy*](https://en.wikipedia.org/wiki/Entropy_(information_theory)) and of estimating the probability of a [*collision*](https://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem) (i.e., the probability that two strings in a set of randomly generated strings might be the same). The blog posting [Hash Collision Probabilities](http://preshing.com/20110504/hash-collision-probabilities/) provides an excellent overview of deriving an expression for calculating the probability of a collision in some number of hashes using a perfect hash with an N-bit output. The [Entropy Bits](#EntropyBits) section below describes how `EntropyString` takes this idea a step further to address a common need in generating unique identifiers.
213214
214-
We'll begin investigating `EntropyString` by considering our [Real Need](Read%20Need) when generating random strings.
215+
We'll begin investigating `EntropyString` and this common need by considering our [Real Need](#RealNeed) when generating random strings.
215216
216217
[TOC](#TOC)
217218
@@ -257,32 +258,32 @@ Not only is this statement more specific, there is no mention of string length.
257258
258259
How do you address this need using a library designed to generate strings of specified length? Well, you don't directly, because that library was designed to answer the originally stated need, not the real need we've uncovered. We need a library that deals with probabilistic uniqueness of a total number of some strings. And that's exactly what `EntropyString` does.
259260
260-
Let's use `EntropyString` to help this developer by generating 5 IDs:
261+
Let's use `EntropyString` to help this developer generate 5 hexadecimal IDs from a pool of a potential 10,000 IDs with a 1 in a million chance of a repeat:
261262
262263
```elixir
263264
iex(1)> import EntropyString
264265
EntropyString
265266
iex(2)> import EntropyString.CharSet, only: [charset16: 0]
266267
EntropyString.CharSet
267-
iex(3)> bits = entropy_bits(10000, 1000000)
268+
iex(3)> bits = entropy_bits(10000, 1.0e6)
268269
45.50699332842307
269270
iex(4)> for x <- :lists.seq(1,5), do: random_string(bits, charset16)
270271
["85e442fa0e83", "a74dc126af1e", "368cd13b1f6e", "81bf94e1278d", "fe7dec099ac9"]
271272
```
272273
273-
To generate the IDs, we first use
274+
Examining the above code,
274275
275276
```elixir
276-
bits = entropy_bits(10000, 1000000)
277+
bits = entropy_bits(10000, 1.0e6)
277278
```
278279
279-
to determine how much entropy is needed to generate a potential of _10000 strings_ while satisfying the probabilistic uniqueness of a _1 in a million risk_ of repeat. We can see from the output of the Elixir shell it's about **45.51** bits. Inside the list comprehension we used
280+
is used to determine how much entropy is needed to satisfy the probabilistic uniqueness of a **1 in a million** risk of repeat in a total of **10,000** strings. The shell output above shows the result is about **45.51** bits. Then
280281
281282
```elixir
282283
random_string(bits, charset16)
283284
```
284285
285-
to actually generate a random string of the specified entropy using hexadecimal (`charset16`) characters. Looking at the IDs, we can see each is 12 characters long. Again, the string length is a by-product of the characters used to represent the entropy we needed. And it seems the developer didn't really need 16 characters after all.
286+
is used to actually generate a random string of the specified entropy using hexadecimal (`charset16`) characters. Looking at the IDs, we can see each is 12 characters long. Again, the string length is a by-product of the characters used to represent the entropy we needed. And it seems the developer didn't really need 16 characters after all.
286287
287288
Finally, given that the strings are 12 hexadecimals long, each string actually has an information carrying capacity of 12 * 4 = 48 bits of entropy (a hexadecimal character carries 4 bits). That's fine. Assuming all characters are equally probable, a string can only carry entropy equal to a multiple of the amount of entropy represented per character. `EntropyString` produces the smallest strings that *exceed* the specified entropy.
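The length arithmetic above can be sketched as follows. This is a minimal illustration, not the library's implementation: `StringLengthSketch`, `chars_needed/2`, and `capacity/2` are hypothetical names, and the sketch assumes all characters in the set are equally probable.

```elixir
defmodule StringLengthSketch do
  # Hypothetical helper (not part of EntropyString): the fewest characters
  # whose combined capacity is at least `bits` of entropy.
  def chars_needed(bits, charset_size) do
    bits_per_char = :math.log2(charset_size)
    # `/` always yields a float, so Float.ceil/1 is safe here
    trunc(Float.ceil(bits / bits_per_char))
  end

  # Hypothetical helper: entropy capacity, in bits, of a string of `chars` characters.
  def capacity(chars, charset_size) do
    chars * :math.log2(charset_size)
  end
end

# About 45.51 bits of entropy rendered in hexadecimal (16 characters)
len = StringLengthSketch.chars_needed(45.50699332842307, 16)
IO.puts "characters: #{len}, capacity: #{StringLengthSketch.capacity(len, 16)} bits"
```

With 4 bits per hexadecimal character, 45.51 bits rounds up to 12 characters, which is why the IDs carry 48 bits rather than the 45.51 requested.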
288289
@@ -474,6 +475,28 @@ Note the number of bytes needed is dependent on the number of characters in the
474475
475476
[TOC](#TOC)
476477
478+
### <a name="EntropyBits"></a>Entropy Bits
479+
480+
Thus far we've avoided the mathematics behind the calculation of the entropy bits required for a specified risk of a repeat among some number of random strings. As noted in the [Overview](#Overview), the posting [Hash Collision Probabilities](http://preshing.com/20110504/hash-collision-probabilities/) derives an expression, based on the well-known [Birthday Problem](https://en.wikipedia.org/wiki/Birthday_problem#Approximations), for calculating the probability of a collision in some number of hashes (denoted by `k`) using a perfect hash with `M` possible outputs:
481+
482+
![Hash Collision Probability](images/HashCollision.png)
483+
484+
There are two slight tweaks to this equation as compared to the one in the referenced posting. `M` is used for the total number of possible hashes and an equation is formed by explicitly specifying that the expression in the posting is approximately equal to `1/n`.
485+
486+
More importantly, the above equation isn't in a form conducive to our entropy string needs. The equation was derived for a set number of possible hashes and yields a probability, which is fine for hash collisions but isn't quite right for calculating the bits of entropy needed for our random strings.
487+
488+
The first thing we'll change is to use `M = 2^N`, where `N` is the number of entropy bits. This simply states that the number of possible strings is equal to the number of possible values using `N` bits:
489+
490+
![N-Bit Collision Probability](images/NBitCollision.png)
491+
492+
Now we massage the equation to represent `N` as a function of `k` and `n`:
493+
494+
![Entropy Bits Equation](images/EntropyBits.png)
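
The three equation images are not reproduced in this view. As a hedged reconstruction, assuming they start from the simplest approximation in the referenced posting, `k^2 / (2M)`, the derivation would run roughly as below. The exact form used in the images is an assumption, but the last line does reproduce the 45.51 bits shown earlier for `k = 10000` and `n = 1.0e6`.

$$
\begin{aligned}
\frac{1}{n} &\approx \frac{k^2}{2M} \\
\frac{1}{n} &\approx \frac{k^2}{2^{N+1}} \qquad (M = 2^N) \\
2^{N+1} &\approx n\,k^2 \\
N &\approx 2\log_2 k + \log_2 n - 1
\end{aligned}
$$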
495+
496+
The final line represents the number of entropy bits `N` as a function of the number of potential strings `k` and the risk of repeat of 1 in `n`, exactly what we want. Furthermore, the equation is in a form that avoids really large numbers in calculating `N` since we immediately take a logarithm of each large value `k` and `n`.
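
The calculation can be sketched in a few lines of Elixir. This assumes the final equation has the form `N ≈ 2*log2(k) + log2(n) - 1`, which is inferred from the birthday approximation and the 45.51-bit result shown earlier rather than read from the image. `EntropyBitsSketch` and `bits/2` are hypothetical names, not the library's `entropy_bits/2`.

```elixir
defmodule EntropyBitsSketch do
  # Hypothetical helper (not the library's implementation).
  # total: number of potential strings (k)
  # risk:  the n in a "1 in n" chance of a repeat
  # Assumed formula: N ≈ 2*log2(k) + log2(n) - 1
  def bits(total, risk) when total > 0 and risk > 0 do
    # Logarithms are taken of k and n directly, so the product n*k^2
    # is never formed and no very large intermediate value appears.
    2 * :math.log2(total) + :math.log2(risk) - 1
  end
end

# 10,000 potential strings with a 1 in a million risk of repeat: about 45.51 bits
IO.puts EntropyBitsSketch.bits(10_000, 1.0e6)
```

Note that doubling the number of strings adds 2 bits, while halving the risk adds only 1, since `k` enters the expression squared.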
497+
498+
[TOC](#TOC)
499+
477500
### <a name="TakeAway"></a>Take Away
478501
479502
- You don't need random strings of length L.

Diff for: examples.exs

+14-14
@@ -9,6 +9,8 @@
99
# > iex
1010
#
1111

12+
alias EntropyString.CharSet
13+
1214
#--------------------------------------------------------------------------------------------------
1315
# Id
1416
# Predefined base 32 characters
@@ -23,31 +25,31 @@ IO.puts " Session ID: #{Id.session_id}\n"
2325
# Base64Id
2426
# Predefined URL and file system safe characters
2527
#--------------------------------------------------------------------------------------------------
26-
defmodule Base64Id, do: use EntropyString, charset: EntropyString.CharSet.charset64
28+
defmodule Base64Id, do: use EntropyString, charset: CharSet.charset64
2729

2830
IO.puts "Base64Id: Predefined URL and file system safe CharSet"
2931
IO.puts " Characters: #{Base64Id.charset}"
3032
IO.puts " Session ID: #{Base64Id.session_id}\n"
3133

3234
#--------------------------------------------------------------------------------------------------
33-
# HexId
35+
# Hex Id
3436
# Predefined hex characters
3537
#--------------------------------------------------------------------------------------------------
36-
defmodule HexId, do: use EntropyString, charset: EntropyString.CharSet.charset16
38+
defmodule Hex, do: use EntropyString, charset: CharSet.charset16
3739

38-
IO.puts "HexId: Predefined hex CharSet"
39-
IO.puts " Characters: #{HexId.charset}"
40-
IO.puts " Small ID: #{HexId.small_id}\n"
40+
IO.puts "Hex: Predefined hex CharSet"
41+
IO.puts " Characters: #{Hex.charset}"
42+
IO.puts " Small ID: #{Hex.small_id}\n"
4143

4244
#--------------------------------------------------------------------------------------------------
43-
# UpperHexId
45+
# Uppercase Hex Id
4446
# Uppercase hex characters
4547
#--------------------------------------------------------------------------------------------------
46-
defmodule UpperHexId, do: use EntropyString, charset: "0123456789ABCDEF"
48+
defmodule UpperHex, do: use EntropyString, charset: "0123456789ABCDEF"
4749

48-
IO.puts "UpperHexId: Upper case hex CharSet"
49-
IO.puts " Characters: #{UpperHexId.charset}"
50-
IO.puts " Medium ID: #{UpperHexId.medium_id}\n"
50+
IO.puts "UpperHex: Upper case hex CharSet"
51+
IO.puts " Characters: #{UpperHex.charset}"
52+
IO.puts " Medium ID: #{UpperHex.medium_id}\n"
5153

5254
#--------------------------------------------------------------------------------------------------
5355
# DingoSky
@@ -70,7 +72,7 @@ IO.puts " DingoSky ID: #{DingoSky.id}\n"
7072
# 256 entropy bit token
7173
#--------------------------------------------------------------------------------------------------
7274
defmodule MyServer do
73-
use EntropyString, charset: EntropyString.CharSet.charset64
75+
use EntropyString, charset: CharSet.charset64
7476

7577
@bits 256
7678

@@ -80,5 +82,3 @@ end
8082
IO.puts "MyServer: 256 entropy bit token"
8183
IO.puts " Characters: #{MyServer.charset}"
8284
IO.puts " MyServer Token: #{MyServer.token}\n"
83-
84-

Diff for: images/EntropyBits.png

30.2 KB

Diff for: images/HashCollision.png

4.31 KB

Diff for: images/NBitCollision.png

4.34 KB

Diff for: mix.exs

+1-1
@@ -3,7 +3,7 @@ defmodule EntropyString.Mixfile do
33

44
def project do
55
[ app: :entropy_string,
6-
version: "1.0.2",
6+
version: "1.0.3",
77
elixir: "~> 1.4",
88
deps: deps(),
99
