Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HyperGeometric Distribution to pymc3.distributions.discrete #3504 #4108

Closed
wants to merge 8 commits into from

Conversation

Harivallabha
Copy link
Contributor

@Harivallabha Harivallabha commented Sep 17, 2020

  • Support HyperGeometric Distribution as part of distributions/discrete
  • Added tests
  • Added docstrings

@twiecki @ricardoV94. Hope this helps. Please feel free to modify/add on top of this, if this is useful. If not, please ignore 😬

@Harivallabha
Copy link
Contributor Author

Harivallabha commented Sep 17, 2020

I can see that scipy.stats.hypergeom and the above pymc3.distributions.discrete.HyperGeometric give the same result for usual values.

However, when I test for something like: M = 3, n = 3, N = 4, and k = 1
scipy.stats.hypergeom.logpmf(1, 3, 3, 4) gives nan

And,
with pm.Model() as model:
z = pm.HyperGeometric('z', 3, 3, 4)
model.logp({'z':1}) gives -inf

@fonnesbeck
Copy link
Member

Thanks for the PR! Can you please run the Black formatter on this?

@Harivallabha
Copy link
Contributor Author

@fonnesbeck Sure! I should tell you that running the Black formatter on discrete.py, test_distributions.py, and test_distributions_random.py formats some code that I haven't touched as well 😅

@fonnesbeck
Copy link
Member

No problem. It likely slipped through on a previous PR.

@Harivallabha
Copy link
Contributor Author

Harivallabha commented Sep 17, 2020

Okay, done! The Black formatter has reached in and changed stuff in a lottt of places (Exa: change all ' ' to " "). So lemme know if I should revert this because this is probably not gonna be easy for you to review 😅

k = self.k
n = self.n
return bound(binomln(k, value) + binomln(N - k, n - value) - binomln(N, n),
0 <= k, k <= N, 0 <= n, 0 <= N, n - N + k <= value, 0 <= value,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are already testing that 0 <= k and k <= N, so you should not have to include 0 <= N

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, point! Removing the unnecessary condition.

@fonnesbeck
Copy link
Member

As for the difference between your implementation and SciPy's, theirs uses a different formulat for the logprob:

    def _logpmf(self, k, M, n, N):
        tot, good = M, n
        bad = tot - good
        result = (betaln(good+1, 1) + betaln(bad+1, 1) + betaln(tot-N+1, N+1) -
                  betaln(k+1, good-k+1) - betaln(N-k+1, bad-N+k+1) -
                  betaln(tot+1, 1))
        return result

So, you can either change yours to match this, or address the test another way.

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented Sep 17, 2020

Okay, done! The Black formatter has reached in and changed stuff in a lottt of places (Exa: change all ' ' to " "). So lemme know if I should revert this because this is probably not gonna be easy for you to review sweat_smile

FWIW, I think this is a fair point...would PyMC3 take a large PR that does nothing except for applying black everywhere and also adds it to CI / pre-commit? Because otherwise reviews do become hard. Conversely, once black has been applied everywhere, diffs do tend to become smaller.

EDIT

taken forward in #4109

@Harivallabha
Copy link
Contributor Author

@fonnesbeck Thanks! Modified the implementation to match the scipy one.

@fonnesbeck
Copy link
Member

Looks like you need a rebase to address the conflicts.

Copy link
Contributor

@tirthasheshpatel tirthasheshpatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @Harivallabha for implementing this distribution. I have been working on adding NegativeHypergeometric and MultivariateHypergeometric distributions to pymc3 so I thought I would help with this too. Hope these comments are helpful :)

Please tell me if I have missed anything.

plt.show()

======== =============================
Support :math:`x \in \mathbb{N}_{>0}`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this is right? On the Wikipedia page, support is given as x in [max(0, n - N + k), min(k, n)].


class HyperGeometric(Discrete):
R"""
Hypergeometric log-likelihood.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Hypergeometric log-likelihood.
Discrete hypergeometric distribution.

Isn't this better?

Comment on lines +863 to +866
The probability of x successes in a sequence of n Bernoulli
trials (That is, sample size = n) - where the population
size is N, containing a total of k successful individuals.
The process is carried out without replacement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The probability of x successes in a sequence of n Bernoulli
trials (That is, sample size = n) - where the population
size is N, containing a total of k successful individuals.
The process is carried out without replacement.
The probability of :math:`x` successes in a sequence of :math:`n` bernoulli
trials taken without replacement from a population of :math:`N` objects,
containing :math:`k` good (or successful or Type I) objects.

Nitpick. Not a blocking comment.

Comment on lines +890 to +891
Mean :math:`\dfrac{n.k}{N}`
Variance :math:`\dfrac{(N-n).n.k.(N-k)}{(N-1).N^2}`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Mean :math:`\dfrac{n.k}{N}`
Variance :math:`\dfrac{(N-n).n.k.(N-k)}{(N-1).N^2}`
Mean :math:`\dfrac{nk}{N}`
Variance :math:`\dfrac{(N-n)nk(N-k)}{(N-1)N^2}`

Comment on lines +906 to +908
self.N = N = tt.as_tensor_variable(intX(N))
self.k = k = tt.as_tensor_variable(intX(k))
self.n = n = tt.as_tensor_variable(intX(n))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.N = N = tt.as_tensor_variable(intX(N))
self.k = k = tt.as_tensor_variable(intX(k))
self.n = n = tt.as_tensor_variable(intX(n))
self.N = intX(N)
self.k = intX(k)
self.n = intX(n)

Nitpick. Not a blocking comment.

Comment on lines +868 to +870
The pmf of this distribution is
.. math:: f(x \mid N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}
.. plot::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The pmf of this distribution is
.. math:: f(x \mid N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}
.. plot::
The pmf of this distribution is
.. math:: f(x \mid N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}
.. plot::

Nitpick. Not a blocking comment

Comment on lines +921 to +924
specified).
Returns
-------
array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
specified).
Returns
-------
array
specified).
Returns
-------
array

Nitpick. Not a blocking comment

Comment on lines +933 to +935
Calculate log-probability of HyperGeometric distribution at specified value.
Parameters
----------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Calculate log-probability of HyperGeometric distribution at specified value.
Parameters
----------
Calculate log-probability of HyperGeometric distribution at specified value.
Parameters
----------

Nitpick. Not a blocking comment

Comment on lines +938 to +940
values are desired the values must be provided in a numpy array or theano tensor
Returns
-------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
values are desired the values must be provided in a numpy array or theano tensor
Returns
-------
values are desired the values must be provided in a numpy array or theano tensor
Returns
-------

Nitpick. Not a blocking comment

- betaln(n - value + 1, bad - n + value + 1)
- betaln(tot + 1, 1)
)
return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should mask the invalid entries with -inf before returning the result. I have used the inequality of support from the Wikipedia page. If that comment is wrong, please change the lower and upper bound conditions according to the formula of support you use.

Suggested change
return result
# value in [max(0, n - N + k), min(k, n)]
lower = tt.switch(tt.gt(n - N + k, 0), n - N + k, 0)
upper = tt.switch(tt.lt(k, n), k, n)
nonint_value = (value != intX(tt.floor(value)))
return bound(result, lower <= value, value <= upper, nonint_value)

There is also an additional condition checking whether all the values are integers. scipy checks it but haven't seen pymc3 use it anywhere. It should ideally be there but has multiple numerical problems (like 1e-17 is considered non integer value). Please avoid if you think the condition is better avoided.

@michaelosthege michaelosthege added this to the 3.10 milestone Nov 14, 2020
@Spaak
Copy link
Member

Spaak commented Nov 23, 2020

@Harivallabha thanks for the PR, we're trying to get this in release 3.10. For that, could you please:

@Spaak
Copy link
Member

Spaak commented Nov 24, 2020

Thanks, let's continue in #4108 then.

@Spaak Spaak closed this Nov 24, 2020
Spaak pushed a commit that referenced this pull request Dec 3, 2020
…#4249)

* Add HyperGeometric distribution to discrete.py; Add tests

* Add HyperGeo to distirbutions/__init__.py

* Fix minor linting issue

* Add ref_rand helper function. Clip lower in logp

* Fix bug. Now pymc3_matches_scipy runs without error but pymc3_random_discrete diverges from expected value

* passes match with scipy test in test_distributions.py but fails in test_distributions_random.py

* Modify HyperGeom.random; Random test still failing. match_with_scipy test passing

* rm stray print

* Fix failing random test by specifying domain

* Update pymc3/distributions/discrete.py

Remove stray newline

Co-authored-by: Tirth Patel <tirthasheshpatel@gmail.com>

* Add note in RELEASE-NOTES.md

Co-authored-by: Tirth Patel <tirthasheshpatel@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants