Allow OrderedProbit distribution to take vector inputs #5418
Codecov Report
@@           Coverage Diff           @@
##             main    #5418   +/-   ##
=======================================
  Coverage   81.39%   81.39%
=======================================
  Files          82       82
  Lines       14213    14214    +1
=======================================
+ Hits        11568    11569    +1
  Misses       2645     2645
We can probably simplify this test quite a lot. We don't need to create a whole dataset with pandas; just pass vector inputs and then check that the categorical variable that is returned has the right inputs (shape and values):

categorical = pm.OrderedProbit.dist(
    cutpoints=np.array(...),
    eta=np.array(...),
    sigma=np.array(...),
)
p = categorical.owner.inputs[3].eval()
# Assert p is what we expected (shape and values)
We can also go one step further and check that it also works with matrix inputs, or combinations of scalar and vector inputs.
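For instance, a minimal sketch with made-up values (vector eta, scalar sigma; the assertions assume the vectorization in this PR works, and inputs[3] is the p parameter of the underlying Categorical, as in the snippet above):

import numpy as np
import pymc as pm

categorical = pm.OrderedProbit.dist(
    cutpoints=np.array([-2.0, 0.0, 2.0]),  # 3 cutpoints -> 4 categories
    eta=np.array([0.0, 1.0]),  # vector eta
    sigma=1.0,  # scalar sigma
)
p = categorical.owner.inputs[3].eval()
assert p.shape == (2, 4)  # one row of probabilities per eta
np.testing.assert_allclose(p.sum(-1), 1.0)  # each row is a valid probability vector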
Hi @ricardoV94, something like this?
Besides, how do we know that owner.inputs[3] is the right position for p?

That's just the position of the p parameter in the inputs of the Categorical RV.

I think the new tests are great. I just wonder what happens when the input has more dimensions (such as 2D cutpoints, or 2D eta/sigma). Do we get sensible p values? If it works, it would be nice to test it as well. Otherwise we can:
By the way, I like to use np.vectorize in these cases:

# Create vectorized function from base case (scalar eta, scalar sigma, and vector cutpoints)
import aesara
import aesara.tensor as at
import numpy as np

from pymc.distributions.dist_math import log_diff_normal_cdf, normal_lccdf, normal_lcdf

eta = at.scalar("eta")
sigma = at.scalar("sigma")
cutpoints = at.vector("cutpoints")

probits = eta - cutpoints
left = normal_lccdf(0, sigma, probits[0])
middle = log_diff_normal_cdf(0, sigma, probits[:-1], probits[1:])
right = normal_lcdf(0, sigma, probits[-1])
p = at.exp(at.concatenate([[left], middle, [right]]))

base_p = aesara.function([eta, sigma, cutpoints], p)
vec_p = np.vectorize(base_p, signature="(),(),(n)->(m)")

You can then use this to test the vectorized cases:

assert vec_p(eta=0, sigma=1, cutpoints=[-2, 0, 2]).shape == (4,)
assert vec_p(eta=np.zeros((5, 2)), sigma=np.ones((2, 5, 2)), cutpoints=[[-2, 0, 2], [-2, 0, 2]]).shape == (2, 5, 2, 4)

Of course, even better, we can test not only the shapes but that all the values are close to the expected, with np.testing.assert_allclose.
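For example (a sketch reusing vec_p from above and the inputs[3] trick from earlier, assuming the PR's vectorization):

import pymc as pm

dist = pm.OrderedProbit.dist(
    eta=np.zeros((5, 2)),
    sigma=np.ones((2, 5, 2)),
    cutpoints=np.array([[-2, 0, 2], [-2, 0, 2]]),
)
p = dist.owner.inputs[3].eval()  # the p parameter of the underlying Categorical
expected = vec_p(
    eta=np.zeros((5, 2)), sigma=np.ones((2, 5, 2)), cutpoints=[[-2, 0, 2], [-2, 0, 2]]
)
np.testing.assert_allclose(p, expected)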
Hi, yes, let's also test the 2D case. I would prefer to use the distribution itself, though; I think we should not repeat ourselves by rewriting the following code.

Thanks mate 👍
@danhphan the idea was not to skip using the distribution, but to use np.vectorize to build the expected values from the base case. This would not only test the shape of p but also its values.

For precedence, here is a test that uses a similar approach:

pymc/pymc/tests/test_distributions.py, lines 471 to 476 in 1a35a3d

pymc/pymc/tests/test_distributions.py, line 2169 in 1a35a3d

Edit: I can be convinced that it is overkill, and testing shapes is enough because there is no other way the values could have broadcasted. Just wanted to clarify what I was suggesting :)
By the way, do you want to address the same limitation of OrderedLogistic?
Hi @ricardoV94, thanks for the information. FYI, I tested vector inputs for OrderedLogistic and they already seem to work fine. To be safe, I will add a test to check different shapes of its inputs as well.

Cheers.
I see what you mean. We have tests where we check our pymc distribution (what we care about) matches the scipy reference directly in the basic situation (no vectorization). Given this, we can then use np.vectorize to cover the vectorized cases. Similarly here, you can see that we have a direct test for the base case. In a sense we don't need to test our helpers, because of this type of test triangulation.
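For example, a base-case check against scipy might look like this (a sketch with made-up values, assuming the parametrization from the snippet earlier: probits = eta - cutpoints with a zero-mean normal of standard deviation sigma):

import numpy as np
import pymc as pm
import scipy.stats as st

eta, sigma = 0.5, 1.0
cutpoints = np.array([-2.0, 0.0, 2.0])

# Reference category probabilities: successive differences of the normal CDF
cdf = st.norm(0, sigma).cdf(cutpoints - eta)
expected_p = np.diff(np.concatenate([[0.0], cdf, [1.0]]))

p = pm.OrderedProbit.dist(eta=eta, sigma=sigma, cutpoints=cutpoints).owner.inputs[3].eval()
np.testing.assert_allclose(p, expected_p)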
Definitely useful!
Hi @ricardoV94, I have simplified the tests. Let me know if anything needs to be updated. Thanks mate 👍
Looks great, just some small suggestions
(
    [[1.0, -2.0, 3.0], [1.0, 2.0, -4.0]],
    [-2.0, 0, 1.0],
    [[0.0, 2.0, -4.0], [-1.0, 1.0, 3.0]],
A lot of these sigma are negative. We should test with valid sigma values (> 0)
Also, a test with 2D cutpoints is missing.
Hi @ricardoV94, I will change the sigma values to all positive.

Also, cutpoints should always be 1-dimensional (to my understanding), as it represents the (n-1) cut points of a categorical feature with n categories. I am not sure if there are any cases that need 2D cutpoints.
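(For example, 3 cutpoints such as [-2, 0, 2] split the latent scale into 4 ordered categories.)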
Our distributions, when possible, can always be "batched". That means we can arbitrarily increase the dimensionality of the distribution by adding parameters with more dimensions. The last axes represent the parameters for each "atomic" distribution in the batch
For example, the Categorical distribution is happy to take 2D, 3D, ..., ND probability parameters, as long as they add up to 1 over the last axis:
pm.Categorical.dist(
    np.full((4, 2, 3), [
        [0., .1, .9],
        [.9, .1, 0],
    ])
).eval()
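Each draw from this then has shape (4, 2): one category per "atomic" distribution in the batch, with the last axis of the (4, 2, 3) probabilities consumed as the category probabilities.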
The same should apply here if possible. See this issue where we are pursuing this for all multivariate distributions: #5383
The reason why this is useful is vectorization. Specifying a (3, 3) shaped distribution with different cutpoints can be much more efficient than specifying 3 times a (3,) shaped distribution with the different cutpoints.
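As a sketch (hypothetical cutpoint values, just to illustrate the contrast):

import numpy as np
import pymc as pm

# One batched distribution: three sets of cutpoints at once
batched = pm.OrderedProbit.dist(
    eta=0.0,
    sigma=1.0,
    cutpoints=np.array([[-2, 0, 2], [-1, 0, 1], [-3, 0, 3]]),
)

# ...instead of three separate distributions in a Python-level loop
separate = [
    pm.OrderedProbit.dist(eta=0.0, sigma=1.0, cutpoints=np.array(c))
    for c in ([-2, 0, 2], [-1, 0, 1], [-3, 0, 3])
]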
This was exactly how this issue started by the way, just with the batching across sigma and eta, and fixed cutpoints. But it could have been the other way around.
Hi, yes, the batch dimension totally makes sense. My initial thought was that, for dealing with a large data set, batch_size should be managed in pm.Data (similar to DataLoader in PyTorch), although I have not checked pm.Data yet :)

Anyway, I will check the case of 2D cutpoints as well.
Maybe now my crazy example from before makes more sense?
OrderedProbit.dist(
    eta=np.zeros((5, 2)),
    sigma=np.ones((2, 5, 2)),
    cutpoints=[[-2, 0, 2], [-2, 0, 2]],
)  # shape should be (2, 5, 2, 4)
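(A note on why, assuming standard broadcasting: eta's (5, 2), sigma's (2, 5, 2), and the cutpoints' leading (2,) broadcast together to a (2, 5, 2) batch, and the 3 cutpoints imply 4 categories, hence the trailing 4.)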
Hi, it makes sense. I have added the tests for 2D cutpoints and positive sigma.

I already ran a push, but I am not sure why it has not updated in this PR: danhphan@9b311bf
Looks great! Thanks for your help @danhphan
Looking forward to your next PR :D
Hi @ricardoV94, many thanks for your support 💯 Cheers! 🍷 🍷
Congrats on number 2 @danhphan!
This PR allows the OrderedProbit distribution to take vector inputs, using advanced indexing (see #5216). It also covers the sigma parameter and adds tests for OrderedProbit in test_distributions_random.py.

Hi @ricardoV94, let me know if it needs any updates. Thanks