Generating Multivariate Hypergeometric Distribution

Let's consider the following setup:

  • We take a set of N elements.
  • We categorize these elements, according to some arbitrary requirement or requirements, into m categories.
  • We know there are exactly n1, n2, ..., nm elements in the respective categories, therefore ∑ni = N, (i = 1, 2, ..., m).
  • We choose a sample of K elements from the set above.

Where N, K, m ∈ ℕ₀ and K ≤ N.

The multivariate hypergeometric distribution describes the probabilities of the cases of this situation. These cases can be identified by the number of elements of each category in the sample; let us denote these counts by k1, k2, ..., km, where ki ≤ ni, (i = 1, 2, ..., m).
This is a generalisation of the hypergeometric distribution, which is the special case m = 2.

Example:

In a poker game there are N = 52 cards in a deck and m = 4 suits, each with ni = 13 ranks. Each player holds K = 5 cards. To calculate the probability of a flush we have to find the probabilities of the following cases: (k1, k2, k3, k4) = (5,0,0,0), (0,5,0,0), (0,0,5,0), (0,0,0,5).

Using the Classical Model

To calculate the probability of a single case of this distribution we can use a classic combinatorial formula:

P(k1, k2, ..., km) = C(n1, k1) · C(n2, k2) · ... · C(nm, km) / C(N, K).

Where C(n, k) is the binomial coefficient, read as "n choose k".
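
The formula maps directly to code. Below is a minimal sketch in Python (not the project's implementation; the name `classical_pmf` is made up for illustration), evaluated on the poker example above.

```python
from math import comb, prod

def classical_pmf(ks, ns):
    """Probability of drawing exactly ks[i] elements of category i when
    sampling sum(ks) elements from a population with category sizes ns,
    using the classical combinatorial formula above."""
    N, K = sum(ns), sum(ks)
    return prod(comb(n, k) for n, k in zip(ns, ks)) / comb(N, K)

# Poker example: one of the four flush cases, e.g. (5, 0, 0, 0).
p_one_suit = classical_pmf((5, 0, 0, 0), (13, 13, 13, 13))
print(p_one_suit)      # ~0.000495
print(4 * p_one_suit)  # ~0.00198, all four flush cases together
```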

Motivation

Although this is a well-known formula, it has the disadvantage of being computationally demanding, both in terms of CPU usage and the representation of partial and final results. Aside from that, it only gives us the probability of a single case, therefore we still need a way to enumerate all of them. Even if that isn't too difficult, there are some tricky parts, e.g. when K > ni.
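
To make that enumeration problem concrete, here is a small sketch (an illustration only, not the project's ranking scheme) that lists every valid case in lexicographic order while respecting the bound ki ≤ ni, which is what matters when K > ni.

```python
def compositions(K, ns):
    """Yield every case (k1, ..., km) with sum(ki) == K and 0 <= ki <= ns[i],
    in lexicographic order.  Illustration only."""
    def rec(prefix, remaining, rest):
        if not rest:
            if remaining == 0:
                yield tuple(prefix)
            return
        # ki may not exceed its category size, and it must leave few enough
        # elements for the remaining categories (the tricky part when K > ni).
        lo = max(0, remaining - sum(rest[1:]))
        hi = min(rest[0], remaining)
        for k in range(lo, hi + 1):
            yield from rec(prefix + [k], remaining - k, rest[1:])
    return rec([], K, list(ns))

# All cases of K = 5 cards over m = 4 suits of 13 ranks each:
cases = list(compositions(5, (13, 13, 13, 13)))
print(len(cases))           # 56 cases
print(cases[0], cases[-1])  # (0, 0, 0, 5) ... (5, 0, 0, 0)
```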

Using the Law of Total Probability

This project takes another approach and compares it to the previous solution. Namely, it uses a lattice structure, where each level corresponds to the distribution for K = 1, 2, ..., N, and it calculates the probabilities using the law of total probability. The examined algorithm enumerates the cases of these distributions in lexicographic order and exploits this ordering to find the indices of the conditional events in the adjacent distribution; an implicit method keeps track of the ranking function. The lattice can be defined as (ℕᵐ, MAX, MIN), where MAX(A, B) = (max(ai, bi) | ai ∈ A, bi ∈ B, i = 1, 2, ..., m) and MIN is defined similarly.
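
The sketch below shows the core idea under these assumptions; it is not the repository's code. In particular, the project tracks positions with an implicit ranking function, while this sketch simply keys the probabilities by the case tuples. Extending a sample of K - 1 elements by one element of category i has probability (ni - ki) / (N - (K - 1)), and summing these contributions over all predecessors is exactly the law of total probability.

```python
from collections import defaultdict

def next_level(prev, ns, K):
    """Build the distribution for sample size K from the full distribution
    `prev` for sample size K - 1 (a mapping case-tuple -> probability)."""
    N = sum(ns)
    cur = defaultdict(float)
    for ks, p in prev.items():
        for i, (k, n) in enumerate(zip(ks, ns)):
            if k < n:  # category i still has undrawn elements
                bigger = ks[:i] + (k + 1,) + ks[i + 1:]
                cur[bigger] += p * (n - k) / (N - (K - 1))
    return dict(cur)

ns = (13, 13, 13, 13)
level = {(0, 0, 0, 0): 1.0}    # K = 0: the empty sample is certain
for K in range(1, 6):          # climb the lattice up to K = 5
    level = next_level(level, ns, K)

print(level[(5, 0, 0, 0)])     # ~0.000495, matches the classical formula
```

Each target case receives at most m contributions of one multiplication and one division each, which is where the operation count in the next section comes from.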

Achievement and Tradeoff

All in all, this means that calculating a single probability takes m divisions and the same number of multiplications, but in exchange the probabilities of two whole distributions have to be stored (the one being calculated and the adjacent one). Therefore this method is suitable for problems that need all these numbers anyway.

Final Thoughts

This is rather a proof of concept: it showcases the gain of using such a method to generate a multivariate distribution. There is still plenty of room to optimise the algorithm, and I'll continue to study this as time allows.

About

Degree Thesis by Ákos Sülyi
