Skip to content

Latest commit

 

History

History
50 lines (30 loc) · 3.82 KB

README.md

File metadata and controls

50 lines (30 loc) · 3.82 KB

Generating Multivariate Hypergeometric Distribution

Let's consider the following setup:

  • We take a set having N number of elements.
  • We categorize these elements along some arbitrary requirement or requirements into m number of categories.
  • We know there's exactly n1, n2, ..., nm elements in each category, therefore ∑ni = N, (i=1,2,...,m).
  • We choose a sample size of K elements from the set above.

Where N, K, m ∈ ℕ0 and K ≤ N.

Multivariate hypergeometric distribution describes the probabilities of cases of this situation. These cases can be identified by number of elements of each category in the sample, let's note them as follows by k1, k2, ..., km, where ki ≤ ni, (i=1, 2, ..., m).
This is a generalisation of hypergeometric distribution, where m = 2.

Example:

In a poker game there's N = 52 card in a deck and m = 4 suits each has ni = 13 ranks. Each player holds K = 5 cards. To calculate the probability of a flush we have to find the following cases (k1,k2,k3,k4) = (5,0,0,0), (0,5,0,0), (0,0,5,0), (0,0,0,5).

Using the Classical Model

To calculate the probability of a case of this distribution we can use a calssic combinatoric formula:

P(k1,k2,...,km) = Cn1k1· Cn2k2· ... · Cnmkm/ CNK.

Where Cnk is the binomial coefficient or n over k.

Motivation

Although this is a well known formula it's has the disadvantage being computationally demanding both in terms of CPU usage and representation of partial and final results. And even aside that this only gives us the probability of a single case, therfore we still need to find a solution to enumerate through each of them. Even if it isn't that difficult, there are some tricky parts, e.g. when K > ni.

Using the Law of Total Probability

This project takes another approach and compaires it to the previous solution. Namely, it uses a lattice structure, where each level corespond to K = 1, 2, ..., N distribution and calculates the probabilites using the formula of total probability. The examined algorithms enumerates the cases of these distributions in lexicographic order and exploit this to find indices of the conditional events in the adjasent distribution. It uses an implicit method to keep track of the ranking function. The lattice can be defined as (ℕm, MAX, MIN), where MAX(A,B) = (max(ai,bi) | ai ∈ A, bi ∈ B, i=1,2,...,m) and MIN is defined similarly.

Achievement and Tradeoff

All in all this means that calaculateing a single probability can be done by m division and same amount of multiplication, but in exchange of storing all of the probabilites of two distribution (one being calculated and, the adjesent one). Therefore this method is suitable for problems that need all these numbers anyway.

Final Thoughts

This is rather a proof of concept. Showcases the gain using such method to generate a multivariate distribution. There still should be plenty of chance to optimise this algorithm. And I'll continue to study this as time allows it.