-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HyperGeometric Distribution to pymc3.distributions.discrete #3504 #4108
Conversation
I can see that However, when I test for something like: M = 3, n = 3, N = 4, and k = 1 And, |
Thanks for the PR! Can you please run the Black formatter on this? |
@fonnesbeck Sure! I should tell you that running the Black formatter on discrete.py, test_distributions.py, and test_distributions_random.py formats some code that I haven't touched as well 😅 |
…stributions_random.py
No problem. It likely slipped through on a previous PR. |
Okay, done! The Black formatter has reached in and changed stuff in a lottt of places (Exa: change all ' ' to " "). So lemme know if I should revert this because this is probably not gonna be easy for you to review 😅 |
pymc3/distributions/discrete.py
Outdated
k = self.k | ||
n = self.n | ||
return bound(binomln(k, value) + binomln(N - k, n - value) - binomln(N, n), | ||
0 <= k, k <= N, 0 <= n, 0 <= N, n - N + k <= value, 0 <= value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are already testing that 0 <= k and k <= N, so you should not have to include 0 <= N
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, point! Removing the unnecessary condition.
As for the difference between your implementation and SciPy's, theirs uses a different formulat for the logprob: def _logpmf(self, k, M, n, N):
tot, good = M, n
bad = tot - good
result = (betaln(good+1, 1) + betaln(bad+1, 1) + betaln(tot-N+1, N+1) -
betaln(k+1, good-k+1) - betaln(N-k+1, bad-N+k+1) -
betaln(tot+1, 1))
return result So, you can either change yours to match this, or address the test another way. |
FWIW, I think this is a fair point...would PyMC3 take a large PR that does nothing except for applying EDITtaken forward in #4109 |
@fonnesbeck Thanks! Modified the implementation to match the scipy one. |
Looks like you need a rebase to address the conflicts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @Harivallabha for implementing this distribution. I have been working on adding NegativeHypergeometric
and MultivariateHypergeometric
distributions to pymc3 so I thought I would help with this too. Hope these comments are helpful :)
Please tell me if I have missed anything.
plt.show() | ||
|
||
======== ============================= | ||
Support :math:`x \in \mathbb{N}_{>0}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this is right? On the Wikipedia page, support is given as x in [max(0, n - N + k), min(k, n)]
.
|
||
class HyperGeometric(Discrete): | ||
R""" | ||
Hypergeometric log-likelihood. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hypergeometric log-likelihood. | |
Discrete hypergeometric distribution. |
Isn't this better?
The probability of x successes in a sequence of n Bernoulli | ||
trials (That is, sample size = n) - where the population | ||
size is N, containing a total of k successful individuals. | ||
The process is carried out without replacement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The probability of x successes in a sequence of n Bernoulli | |
trials (That is, sample size = n) - where the population | |
size is N, containing a total of k successful individuals. | |
The process is carried out without replacement. | |
The probability of :math:`x` successes in a sequence of :math:`n` bernoulli | |
trials taken without replacement from a population of :math:`N` objects, | |
containing :math:`k` good (or successful or Type I) objects. |
Nitpick. Not a blocking comment.
Mean :math:`\dfrac{n.k}{N}` | ||
Variance :math:`\dfrac{(N-n).n.k.(N-k)}{(N-1).N^2}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mean :math:`\dfrac{n.k}{N}` | |
Variance :math:`\dfrac{(N-n).n.k.(N-k)}{(N-1).N^2}` | |
Mean :math:`\dfrac{nk}{N}` | |
Variance :math:`\dfrac{(N-n)nk(N-k)}{(N-1)N^2}` |
self.N = N = tt.as_tensor_variable(intX(N)) | ||
self.k = k = tt.as_tensor_variable(intX(k)) | ||
self.n = n = tt.as_tensor_variable(intX(n)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.N = N = tt.as_tensor_variable(intX(N)) | |
self.k = k = tt.as_tensor_variable(intX(k)) | |
self.n = n = tt.as_tensor_variable(intX(n)) | |
self.N = intX(N) | |
self.k = intX(k) | |
self.n = intX(n) |
Nitpick. Not a blocking comment.
The pmf of this distribution is | ||
.. math:: f(x \mid N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}} | ||
.. plot:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pmf of this distribution is | |
.. math:: f(x \mid N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}} | |
.. plot:: | |
The pmf of this distribution is | |
.. math:: f(x \mid N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}} | |
.. plot:: |
Nitpick. Not a blocking comment
specified). | ||
Returns | ||
------- | ||
array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specified). | |
Returns | |
------- | |
array | |
specified). | |
Returns | |
------- | |
array |
Nitpick. Not a blocking comment
Calculate log-probability of HyperGeometric distribution at specified value. | ||
Parameters | ||
---------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calculate log-probability of HyperGeometric distribution at specified value. | |
Parameters | |
---------- | |
Calculate log-probability of HyperGeometric distribution at specified value. | |
Parameters | |
---------- |
Nitpick. Not a blocking comment
values are desired the values must be provided in a numpy array or theano tensor | ||
Returns | ||
------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values are desired the values must be provided in a numpy array or theano tensor | |
Returns | |
------- | |
values are desired the values must be provided in a numpy array or theano tensor | |
Returns | |
------- |
Nitpick. Not a blocking comment
- betaln(n - value + 1, bad - n + value + 1) | ||
- betaln(tot + 1, 1) | ||
) | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should mask the invalid entries with -inf
before returning the result. I have used the inequality of support from the Wikipedia page. If that comment is wrong, please change the lower
and upper
bound conditions according to the formula of support you use.
return result | |
# value in [max(0, n - N + k), min(k, n)] | |
lower = tt.switch(tt.gt(n - N + k, 0), n - N + k, 0) | |
upper = tt.switch(tt.lt(k, n), k, n) | |
nonint_value = (value != intX(tt.floor(value))) | |
return bound(result, lower <= value, value <= upper, nonint_value) |
There is also an additional condition checking whether all the values are integers. scipy checks it but haven't seen pymc3 use it anywhere. It should ideally be there but has multiple numerical problems (like 1e-17
is considered non integer value). Please avoid if you think the condition is better avoided.
@Harivallabha thanks for the PR, we're trying to get this in release 3.10. For that, could you please:
|
Thanks, let's continue in #4108 then. |
…#4249) * Add HyperGeometric distribution to discrete.py; Add tests * Add HyperGeo to distirbutions/__init__.py * Fix minor linting issue * Add ref_rand helper function. Clip lower in logp * Fix bug. Now pymc3_matches_scipy runs without error but pymc3_random_discrete diverges from expected value * passes match with scipy test in test_distributions.py but fails in test_distributions_random.py * Modify HyperGeom.random; Random test still failing. match_with_scipy test passing * rm stray print * Fix failing random test by specifying domain * Update pymc3/distributions/discrete.py Remove stray newline Co-authored-by: Tirth Patel <tirthasheshpatel@gmail.com> * Add note in RELEASE-NOTES.md Co-authored-by: Tirth Patel <tirthasheshpatel@gmail.com>
@twiecki @ricardoV94. Hope this helps. Please feel free to modify/add on top of this, if this is useful. If not, please ignore 😬