
Allow OrderedProbit distribution to take vector inputs #5418

Merged · 5 commits · Feb 6, 2022

Conversation

danhphan
Member

This PR allows the OrderedProbit distribution to take vector inputs, using advanced indexing, as requested in #5216

  • Fix the shape of the sigma parameter
  • Add test_vector_inputs for OrderedProbit in test_distributions_random.py

Hi @ricardoV94, let me know if anything needs updating. Thanks

@danhphan danhphan changed the title Allow OrderedProbit distribution to take vector inputs using advanced indexing Allow OrderedProbit distribution to take vector inputs Jan 29, 2022
@codecov

codecov bot commented Jan 29, 2022

Codecov Report

Merging #5418 (466a941) into main (0dca647) will increase coverage by 0.00%.
The diff coverage is n/a.

❗ Current head 466a941 differs from pull request most recent head 9b311bf. Consider uploading reports for the commit 9b311bf to get more accurate results

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #5418   +/-   ##
=======================================
  Coverage   81.39%   81.39%           
=======================================
  Files          82       82           
  Lines       14213    14214    +1     
=======================================
+ Hits        11568    11569    +1     
  Misses       2645     2645           
Impacted Files                         Coverage Δ
pymc/distributions/discrete.py         99.76% <ø> (ø)
pymc/distributions/shape_utils.py      96.73% <0.00%> (-2.14%) ⬇️
pymc/aesaraf.py                        90.17% <0.00%> (-0.03%) ⬇️
pymc/distributions/distribution.py     91.43% <0.00%> (+0.03%) ⬆️
pymc/model.py                          85.97% <0.00%> (+0.03%) ⬆️
pymc/sampling.py                       86.06% <0.00%> (+0.22%) ⬆️
pymc/sampling_jax.py                   98.30% <0.00%> (+0.84%) ⬆️

@ricardoV94 ricardoV94 left a comment
Member

We can probably simplify this test quite a lot. We don't need to actually create a whole dataset with pandas; just pass vector inputs and then check that the returned categorical variable has the right inputs (shape and values):

categorical = pm.OrderedProbit.dist(
    cutpoints=np.array(...),
    eta=np.array(...),
    sigma=np.array(...),
)

p = categorical.owner.inputs[3].eval()
# Assert p is what we expected (shape and values)

We can also go one step further and check that it works with matrix inputs, or with combinations of scalar and vector inputs.

@danhphan
Member Author

Hi @ricardoV94, something like this?

def test_vector_inputs(self):
    """
    Check the behaviour when providing vector inputs for the `eta` and `sigma`
    parameters (which use advanced indexing).
    """
    categorical = pm.OrderedProbit.dist(
        eta=0,
        cutpoints=np.array([-2.0, 0, 2.0]),
        sigma=1.0,
    )
    p = categorical.owner.inputs[3].eval()
    assert p.shape == (4,)

    categorical = pm.OrderedProbit.dist(
        eta=np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
        cutpoints=np.array([-2.0, 0, 2.0]),
        sigma=1,
    )
    p = categorical.owner.inputs[3].eval()
    assert p.shape == (5, 4)

    categorical = pm.OrderedProbit.dist(
        eta=np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
        cutpoints=np.array([-2.0, 0, 2.0]),
        sigma=np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    )
    p = categorical.owner.inputs[3].eval()
    assert p.shape == (5, 4)

Besides, how do we know that .owner.inputs[3] will be p?

@ricardoV94
Member

ricardoV94 commented Jan 29, 2022

Besides, how do we know that .owner.inputs[3] will be p?

That's just the position of the p input in the returned CategoricalRV variable. It's always the same position.
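For illustration, a minimal sketch of inspecting those inputs (this assumes the standard Aesara RandomVariable node layout of (rng, size, dtype, *dist_params), stated as my understanding rather than guaranteed API):

import numpy as np
import pymc as pm

categorical = pm.OrderedProbit.dist(
    eta=0,
    cutpoints=np.array([-2.0, 0, 2.0]),
    sigma=1.0,
)
# Aesara RandomVariable nodes take (rng, size, dtype, *dist_params) as inputs,
# so the single parameter of the underlying CategoricalRV, p, sits at index 3.
for i, inp in enumerate(categorical.owner.inputs):
    print(i, inp)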

I think the new tests are great. I just wonder what happens when the input has more dimensions (such as 2D cutpoints, or 2D eta/sigma). Do we get sensible p? Or does it just fail altogether?

If it works it would be nice to test it as well. Otherwise we can:

  1. Try to make it work
  2. Add an explicit check in the dist that raises an informative NotImplementedError for invalid input dimensions (a sketch follows below)
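A rough sketch of what option 2 could look like; the guard's name, placement, and message are hypothetical, not the implementation that was merged:

import numpy as np

def _check_ordered_probit_ndim(cutpoints, eta, sigma):
    # Hypothetical guard: raise an informative error for parameter
    # dimensionalities that the distribution cannot handle.
    for name, value in (("cutpoints", cutpoints), ("eta", eta), ("sigma", sigma)):
        if np.ndim(value) > 2:
            raise NotImplementedError(
                f"OrderedProbit does not support {name} with ndim={np.ndim(value)}"
            )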

@ricardoV94
Member

By the way, I like to use np.vectorize when testing that a function is broadcasting as expected. In your case something like this:

import aesara
import aesara.tensor as at
import numpy as np

from pymc.distributions.dist_math import log_diff_normal_cdf, normal_lccdf, normal_lcdf

# Create vectorized function from base case (scalar eta, scalar sigma, and vector cutpoints)
eta = at.scalar('eta')
sigma = at.scalar('sigma')
cutpoints = at.vector('cutpoints')

probits = eta - cutpoints
left = normal_lccdf(0, sigma, probits[0])
middle = log_diff_normal_cdf(0, sigma, probits[:-1], probits[1:])
right = normal_lcdf(0, sigma, probits[-1])
p = at.exp(at.concatenate([[left], middle, [right]]))

base_p = aesara.function([eta, sigma, cutpoints], p)
vec_p = np.vectorize(base_p, signature="(),(),(n)->(m)")

You can then use this vec_p to test arbitrary (valid) shapes of inputs:

assert vec_p(eta=0, sigma=1, cutpoints=[-2, 0, 2]).shape == (4,)
assert vec_p(eta=np.zeros((5, 2)), sigma=np.ones((2, 5, 2)), cutpoints=[[-2, 0, 2], [-2, 0, 2]]).shape == (2, 5, 2, 4)

Of course, even better, we can test not only the shapes but also that all the values are close to what is expected, with numpy.testing.assert_array_almost_equal.
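Putting the two together, a sketch of the full check, comparing the p extracted from the dist output against the vec_p reference above (the parameter values are just illustrative):

import numpy as np
import pymc as pm

eta = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sigma = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
cutpoints = np.array([-2.0, 0.0, 2.0])

categorical = pm.OrderedProbit.dist(eta=eta, sigma=sigma, cutpoints=cutpoints)
p = categorical.owner.inputs[3].eval()

# Check both shape and values against the broadcasting reference.
np.testing.assert_array_almost_equal(p, vec_p(eta, sigma, cutpoints))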

@danhphan
Member Author

Hi, yes, let's also test the 2D case.

I would prefer to use OrderedProbit.dist, as this is the goal of the test function. Besides, it may be better if the test is not too complicated :). Also, we don't seem to test other distributions this way (np.vectorize)?

I also think we should not repeat ourselves by rewriting the following code:

probits = eta - cutpoints
left = normal_lccdf(0, sigma, probits[0])
middle = log_diff_normal_cdf(0, sigma, probits[:-1], probits[1:])
right = normal_lcdf(0, sigma, probits[-1])
p = at.exp(at.concatenate([[left], middle, [right]]))

Thanks mate 👍

@ricardoV94
Member

ricardoV94 commented Jan 29, 2022

@danhphan the idea was not to skip using OrderedProbit.dist, but to call it and compare the values of p obtained from p = categorical.owner.inputs[3].eval() with those from the np.vectorize'd function.

This would not only test the shape of p (which you were already testing by hand) but also that the values of p are what is expected (e.g., that there was no weird mixing of the axes).

For precedent, here is a test that uses an np.vectorize'd function for the purpose of testing a logp:

def _dirichlet_logpdf(value, a):
    # scipy.stats.dirichlet.logpdf suffers from numerical precision issues
    return -betafn(a) + logpow(value, a - 1).sum()

dirichlet_logpdf = np.vectorize(_dirichlet_logpdf, signature="(n),(n)->()")

def test_dirichlet_vectorized(self, a, size):

Edit: I can be convinced that it is overkill, and that testing shapes is enough because there is no other way the values could have broadcasted. Just wanted to clarify what I was suggesting :)

@ricardoV94
Member

By the way, do you want to address the same limitation of OrderedLogit in this PR?

@danhphan
Member Author

Hi @ricardoV94, thanks for the information.

On the _dirichlet_logpdf(value, a) function, my guess is that it is used to check whether a pymc function produces results similar to scipy's. But shouldn't we also write a test for _dirichlet_logpdf(value, a) itself, to make sure it works as expected?

FYI, I tested vector inputs for OrderedLogit and it works fine. The error stems from the shape of the sigma parameter, which only _OrderedProbit has and OrderedLogit does not, so we do not need to fix anything in the OrderedLogit distribution.

To be safe, I will add a test to check different shapes of cutpoints and eta parameters for OrderedLogit.

Cheers.

@ricardoV94
Member

But shouldn't we also write a test for _dirichlet_logpdf(value, a) itself, to make sure it works as expected?

I see what you mean. We already have tests where we check that our pymc distribution (what we care about) matches the scipy reference directly in the basic situation (no vectorization).

Given this, we can then use _dirichlet_logpdf just to test that vectorization is working as expected. It would be very unlikely that our distribution would both match scipy in the basic case and match the vectorized helper if there was a bug in the latter.

Similarly here, you can see that we have a check_rv_params_match_pymc_params test where we check that one specific set of cutpoints, sigma, and eta gets converted to a specific expected p. It would be unlikely that both this check and the vectorized one would pass if there was a bug in the latter.

In a sense, we don't need to test our helpers because of this type of test triangulation.


To be safe, I will add a test to check different shapes of cutpoints and eta parameters for OrderedLogit.

Definitely useful!

@michaelosthege michaelosthege added this to the v4.0.0b3 milestone Jan 31, 2022
@danhphan
Member Author

danhphan commented Feb 4, 2022

Hi @ricardoV94, I have simplified test_shape_inputs to check scalar, vector, and matrix inputs for _OrderedProbit, and added a similar test for _OrderedLogistic.

Let me know if anything needs updating.

Thanks mate 👍

@ricardoV94 ricardoV94 left a comment
Member

Looks great, just some small suggestions

pymc/tests/test_distributions_random.py
(
    [[1.0, -2.0, 3.0], [1.0, 2.0, -4.0]],
    [-2.0, 0, 1.0],
    [[0.0, 2.0, -4.0], [-1.0, 1.0, 3.0]],
Member

A lot of these sigma values are negative. We should test with valid sigma values (> 0).

Member

A test with 2D cutpoints is also missing.

Member Author

Hi @ricardoV94, I will change the sigma values to all positive.

Also, cutpoints should always be 1-dimensional (to my understanding), as they represent the (n-1) cut points of a categorical feature with n categories. I am not sure if there are any cases that need 2D cutpoints.

@ricardoV94 ricardoV94 Feb 5, 2022
Member

Our distributions, when possible, can always be "batched". That means we can arbitrarily increase the dimensionality of the distribution by passing parameters with more dimensions; the last axes represent the parameters of each "atomic" distribution in the batch.

For example, the Categorical distribution is happy to take 2D, 3D, ..., ND probability parameters, as long as they add up to 1 over the last axis:

pm.Categorical.dist(
  np.full((4, 2, 3), [
    [0., .1, .9],
    [.9, .1, 0]
  ])
).eval()

The same should apply here if possible. See this issue where we are pursuing this for all multivariate distributions: #5383

The reason why this is useful is vectorization. Specifying one (3, 3)-shaped distribution with different cutpoints can be much more efficient than specifying a (3,)-shaped distribution 3 times with the different cutpoints.
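As a sketch of the contrast (the cutpoint values are hypothetical, and the batched call assumes the 2D-cutpoints support discussed here):

import numpy as np
import pymc as pm

cutpoints = np.array([[-2.0, 0.0, 2.0],
                      [-1.0, 0.0, 1.0],
                      [-3.0, 0.0, 3.0]])

# One batched call: a single RV whose batch dimension covers all three
# sets of cutpoints at once...
batched = pm.OrderedProbit.dist(eta=0.0, sigma=1.0, cutpoints=cutpoints)

# ...versus three separate RVs, one .dist call per set of cutpoints.
separate = [
    pm.OrderedProbit.dist(eta=0.0, sigma=1.0, cutpoints=row) for row in cutpoints
]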

Member

This was exactly how this issue started by the way, just with the batching across sigma and eta, and fixed cutpoints. But it could have been the other way around.

Member Author

Hi, yes, the batch dimension totally makes sense. My initial thought was that, for dealing with a large data set, batch_size would be managed in pm.Data (similar to DataLoader in pytorch), although I have not checked pm.Data yet :)

Anyway, I will check the case of 2D cutpoints as well.

Member

Maybe now my crazy example from before makes more sense?

OrderedProbit.dist(
    eta=np.zeros((5, 2)),
    sigma=np.ones((2, 5, 2)),
    cutpoints=[[-2, 0, 2], [-2, 0, 2]],
)  # shape should be (2, 5, 2, 4)

Member Author

Hi, it makes sense. I have added the tests for 2D cutpoints and positive sigma.

I have already pushed, but I'm not sure why it has not updated in this PR:
danhphan@9b311bf

@danhphan danhphan requested a review from ricardoV94 February 5, 2022 22:19
@ricardoV94 ricardoV94 left a comment
Member

Looks great! Thanks for your help @danhphan

Looking forward to your next PR :D

@danhphan
Member Author

danhphan commented Feb 6, 2022

Hi @ricardoV94, many thanks for your support 💯

Cheers! 🍷 🍷

@ricardoV94 ricardoV94 merged commit 18346ac into pymc-devs:main Feb 6, 2022
@twiecki
Member

twiecki commented Feb 7, 2022

Congrats on number 2 @danhphan!

@danhphan danhphan deleted the ordered-probit-vector-inputs branch February 24, 2022 11:50