Allow OrderedProbit distribution to take vector inputs #5418

danhphan · 2022-01-29T04:04:36Z

This PR allows the OrderedProbit distribution to take vector inputs, such as those produced by advanced indexing, as requested in #5216.

Fix the shape of the sigma parameter
Add test_vector_inputs for OrderedProbit in test_distributions_random.py

Hi @ricardoV94, let me know if anything needs updating. Thanks!

codecov · 2022-01-29T04:11:51Z

Codecov Report

Merging #5418 (466a941) into main (0dca647) will increase coverage by 0.00%.
The diff coverage is n/a.

❗ Current head 466a941 differs from pull request most recent head 9b311bf. Consider uploading reports for the commit 9b311bf to get more accurate results

@@           Coverage Diff           @@
##             main    #5418   +/-   ##
=======================================
  Coverage   81.39%   81.39%           
=======================================
  Files          82       82           
  Lines       14213    14214    +1     
=======================================
+ Hits        11568    11569    +1     
  Misses       2645     2645

Impacted Files	Coverage Δ
pymc/distributions/discrete.py	`99.76% <ø> (ø)`
pymc/distributions/shape_utils.py	`96.73% <0.00%> (-2.14%)`	⬇️
pymc/aesaraf.py	`90.17% <0.00%> (-0.03%)`	⬇️
pymc/distributions/distribution.py	`91.43% <0.00%> (+0.03%)`	⬆️
pymc/model.py	`85.97% <0.00%> (+0.03%)`	⬆️
pymc/sampling.py	`86.06% <0.00%> (+0.22%)`	⬆️
pymc/sampling_jax.py	`98.30% <0.00%> (+0.84%)`	⬆️

ricardoV94

We can probably simplify this test quite a lot. We don't need to actually create a whole dataset with pandas, just pass vector inputs and then check that the categorical variable that is returned has the right inputs (shape and values)

categorical = pm.OrderedProbit.dist(
  cutpoints=np.array(...),
  eta=np.array(...),
  sigma=np.array(...),
)

p = categorical.owner.inputs[3].eval()
# Assert p is what we expected (shape and values)

We can also go one step further and check if it is also working with matrix inputs, or combinations of scalar and vector inputs

danhphan · 2022-01-29T07:28:39Z

Hi @ricardoV94, something like this?

def test_vector_inputs(self):
    """ 
    Check that the `eta` and `sigma` parameters accept vector inputs, such as those produced by advanced indexing.
    """
    categorical = pm.OrderedProbit.dist(
        eta=0,
        cutpoints=np.array([-2.0, 0, 2.0]),
        sigma=1.0,
        )
    p = categorical.owner.inputs[3].eval()
    assert p.shape == (4,)

    categorical = pm.OrderedProbit.dist(
        eta=np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
        cutpoints=np.array([-2.0, 0, 2.0]),
        sigma=1,
        )
    p = categorical.owner.inputs[3].eval()
    assert p.shape == (5, 4)

    categorical = pm.OrderedProbit.dist(
        eta=np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
        cutpoints=np.array([-2.0, 0, 2.0]),
        sigma=np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
        )
    p = categorical.owner.inputs[3].eval()
    assert p.shape == (5, 4)

Besides, how do we know that .owner.inputs[3] will be p?

ricardoV94 · 2022-01-29T11:17:48Z

> Besides, how do we know that .owner.inputs[3] will be p?

That's just the position of the p input in the returned CategoricalRV variable. It's always the same position.

I think the new tests are great. I just wonder what happens when the input has more dimensions (such as 2D cutpoints, or 2D eta/sigma). Do we get sensible p? Or does it just fail altogether?

If it works it would be nice to test it as well. Otherwise we can:

Try to make it work
Add an explicit check in the dist with an informative NotImplementedError for invalid input dimensions
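The second option could look roughly like the sketch below. It is only an illustration: `validate_cutpoints` and its `allow_batched` flag are hypothetical names, not part of PyMC, and the check is written with plain NumPy rather than inside the distribution's `dist` method.

```python
import numpy as np


def validate_cutpoints(cutpoints, allow_batched=False):
    """Hypothetical helper: reject cutpoints whose dimensionality is unsupported."""
    cutpoints = np.asarray(cutpoints, dtype=float)
    if cutpoints.ndim == 0:
        raise ValueError("cutpoints must have at least one dimension")
    if cutpoints.ndim > 1 and not allow_batched:
        # Fail early with a clear message instead of an obscure shape error later on
        raise NotImplementedError(
            f"cutpoints with {cutpoints.ndim} dimensions are not supported; "
            "pass a 1D array of cutpoints."
        )
    return cutpoints


ok = validate_cutpoints([-2.0, 0.0, 2.0])  # 1D cutpoints pass through unchanged
```

If batched cutpoints turn out to work, the same helper would simply be called with `allow_batched=True`, or dropped altogether.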

ricardoV94 · 2022-01-29T12:12:12Z

By the way, I like to use np.vectorize when testing that a function is broadcasting as expected. In your case something like this:

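import aesara
import aesara.tensor as at
import numpy as np

# Import path assumed from the dist_math module used by pymc/distributions/discrete.py
from pymc.distributions.dist_math import log_diff_normal_cdf, normal_lccdf, normal_lcdf

# Create vectorized function from base case (scalar eta, scalar sigma, and vector cutpoints)
eta = at.scalar('eta')
sigma = at.scalar('sigma')
cutpoints = at.vector('cutpoints')

probits = eta - cutpoints
left = normal_lccdf(0, sigma, probits[0])
middle = log_diff_normal_cdf(0, sigma, probits[:-1], probits[1:])
right = normal_lcdf(0, sigma, probits[-1])
p = at.exp(at.concatenate([[left], middle, [right]]))

# Compile the symbolic p, then let NumPy loop over any leading (batch) dimensions
base_p = aesara.function([eta, sigma, cutpoints], p)
vec_p = np.vectorize(base_p, signature="(),(),(n)->(m)")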

You can then use this vec_p to test arbitrary (valid) input shapes.

assert vec_p(eta=0, sigma=1, cutpoints=[-2, 0, 2]).shape == (4,)
assert vec_p(eta=np.zeros((5, 2)), sigma=np.ones((2, 5, 2)), cutpoints=[[-2, 0, 2], [-2, 0, 2]]).shape == (2, 5, 2, 4)

Of course, even better, we can test not only the shapes but also that all the values are close to the expected ones, with numpy.testing.assert_array_almost_equal.
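The value check can be sketched without Aesara, as below. This is an illustration under assumptions: `base_p` reimplements the scalar base case with `math.erf`, and `batched_p` is a hypothetical stand-in for the p that the distribution computes; neither is PyMC code.

```python
import math

import numpy as np


def normal_cdf(x):
    """Standard normal CDF for a single float."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def base_p(eta, sigma, cutpoints):
    """Base case: scalar eta, scalar sigma, 1D cutpoints -> 1D probabilities."""
    cdf = [normal_cdf((eta - c) / sigma) for c in cutpoints]
    left = 1.0 - cdf[0]
    middle = [cdf[i] - cdf[i + 1] for i in range(len(cdf) - 1)]
    right = cdf[-1]
    return np.array([left, *middle, right])


# NumPy loops the base case over all broadcast leading dimensions
vec_p = np.vectorize(base_p, signature="(),(),(n)->(m)")


def batched_p(eta, sigma, cutpoints):
    """Hypothetical broadcasting implementation standing in for the distribution's p."""
    eta = np.asarray(eta, dtype=float)[..., None]
    sigma = np.asarray(sigma, dtype=float)[..., None]
    z = (eta - np.asarray(cutpoints, dtype=float)) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf, otypes=[float])(z / math.sqrt(2.0)))
    return np.concatenate(
        [1.0 - cdf[..., :1], cdf[..., :-1] - cdf[..., 1:], cdf[..., -1:]], axis=-1
    )


eta = np.linspace(-1.0, 1.0, 10).reshape(5, 2)
sigma = np.linspace(0.5, 2.0, 20).reshape(2, 5, 2)
cutpoints = np.array([[-2.0, 0.0, 2.0], [-1.0, 0.5, 3.0]])

expected = vec_p(eta, sigma, cutpoints)
actual = batched_p(eta, sigma, cutpoints)
np.testing.assert_array_almost_equal(actual, expected)
```

In a real test, `actual` would come from `categorical.owner.inputs[3].eval()` instead of `batched_p`.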

danhphan · 2022-01-29T12:44:45Z

Hi, yes, let's also test the 2D case.

I would prefer to use OrderedProbit.dist, as that is what the test is meant to exercise. Besides, it may be better if the test is not too complicated :). We also don't seem to test other distributions this way (with np.vectorize)?

I think we should not repeat ourselves by rewriting the following code.

probits = eta - cutpoints
left = normal_lccdf(0, sigma, probits[0])
middle = log_diff_normal_cdf(0, sigma, probits[:-1], probits[1:])
right = normal_lcdf(0, sigma, probits[-1])
p = at.exp(at.concatenate([[left], middle, [right]]))

Thanks mate 👍

ricardoV94 · 2022-01-29T13:02:23Z

@danhphan the idea was not to skip using OrderedProbit.dist, but to call it and compare the values of p obtained from p = categorical.owner.inputs[3].eval() with those from the np.vectorized function.

This would not only test the shape of p (which you were already testing by hand) but also that the values of p are what is expected (e.g., that there was no weird mixing of the axes).
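A small NumPy sketch of why a shape-only assertion can miss mixed axes is given below. It assumes sigma fixed at 1, and both `probit_p` and the transposed "bug" are hypothetical illustrations, not PyMC code. With four eta values and three cutpoints, p has shape (4, 4), so swapping the axes leaves the shape unchanged.

```python
import math

import numpy as np

# Elementwise standard normal CDF built from math.erf
phi = np.vectorize(lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))), otypes=[float])


def probit_p(eta, cutpoints):
    """Ordered probit category probabilities with sigma fixed at 1 (illustration only)."""
    cdf = phi(np.asarray(eta, dtype=float)[..., None] - np.asarray(cutpoints, dtype=float))
    return np.concatenate(
        [1.0 - cdf[..., :1], cdf[..., :-1] - cdf[..., 1:], cdf[..., -1:]], axis=-1
    )


eta = np.array([-1.0, 0.0, 1.0, 2.0])
cutpoints = np.array([-2.0, 0.0, 2.0])

good = probit_p(eta, cutpoints)  # rows index eta, columns index categories
bad = good.T                     # hypothetical bug: batch and category axes swapped

shape_check_passes = good.shape == bad.shape == (4, 4)
value_check_catches_bug = not np.allclose(good, bad)
```

Here the shape assertion passes for both arrays, while a comparison of values (or of the row sums, which should all be 1) exposes the swap.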

For precedent, here is a test that uses an np.vectorized function to test a logp:

pymc/pymc/tests/test_distributions.py

Lines 471 to 476 in 1a35a3d

    
def _dirichlet_logpdf(value, a):
    # scipy.stats.dirichlet.logpdf suffers from numerical precision issues
    return -betafn(a) + logpow(value, a - 1).sum()

dirichlet_logpdf = np.vectorize(_dirichlet_logpdf, signature="(n),(n)->()")

pymc/pymc/tests/test_distributions.py

Line 2169 in 1a35a3d

def test_dirichlet_vectorized(self, a, size):

Edit: I can be convinced that it is overkill, and that testing shapes is enough because there is no other way the values could have been broadcast. Just wanted to clarify what I was suggesting :)

ricardoV94 · 2022-01-29T13:07:31Z

By the way, do you want to address the same limitation of OrderedLogit in this PR?

danhphan · 2022-01-30T11:40:09Z

Hi @ricardoV94, thanks for the information.

Regarding the _dirichlet_logpdf(value, a) function, my guess is that it is used to check that a PyMC function produces results similar to SciPy's. But shouldn't we also write a test for _dirichlet_logpdf(value, a) itself to make sure it works as expected?

FYI, I tested vector inputs for OrderedLogit and they work fine. The error stems from the shape of the sigma parameter, which only _OrderedProbit has and OrderedLogit does not, so nothing needs fixing in the OrderedLogit distribution.

To be safe, I will add a test to check different shapes of cutpoints and eta parameters for OrderedLogit.

Cheers.

ricardoV94 · 2022-01-30T11:56:58Z

> But should we also need to write a test for _dirichlet_logpdf(value, a) itself to make sure it work as expected?

I see what you mean. We already have tests where we check that our PyMC distribution (what we care about) matches the SciPy reference directly in the basic situation (no vectorization).

Given this, we can then use _dirichlet_logpdf just to test that vectorization is working as expected. It would be very unlikely that our distribution would match both SciPy in the basic case and the vectorized helper if there was a bug in the latter.

Similarly here, you can see that we have a check_rv_params_match_pymc_params test where we check that one specific set of cutpoints, sigma, and eta gets converted to a specific expected p. It would be unlikely that both this check and the vectorized one would pass if there was a bug in the latter.

In a sense we don't need to test our helpers because of this type of test triangulation.

> To be safe, I will add a test to check different shapes of cutpoints and eta parameters for OrderedLogit.

Definitely useful!

danhphan · 2022-02-04T07:18:15Z

Hi @ricardoV94, I have simplified test_shape_inputs to check scalar, vector, and matrix inputs for _OrderedProbit, and added a similar test for _OrderedLogistic.

Let me know if anything needs updating.

Thanks mate 👍

ricardoV94

Looks great, just some small suggestions

pymc/tests/test_distributions_random.py

ricardoV94 · 2022-02-04T08:23:05Z

pymc/tests/test_distributions_random.py

+            (
+                [[1.0, -2.0, 3.0], [1.0, 2.0, -4.0]],
+                [-2.0, 0, 1.0],
+                [[0.0, 2.0, -4.0], [-1.0, 1.0, 3.0]],


Many of these sigma values are negative. We should test with valid sigma values (> 0).

Also, a test with 2D cutpoints is missing.

Hi @ricardoV94, I will change the sigma values to be all positive.

Also, to my understanding, cutpoints should always be 1-dimensional, as they represent the (n-1) cut points of a categorical feature with n categories. I am not sure whether there are any cases that need 2D cutpoints.

Our distributions, when possible, can always be "batched". That means we can arbitrarily increase the dimensionality of the distribution by adding parameters with more dimensions. The last axes represent the parameters for each "atomic" distribution in the batch.

For example, the Categorical distribution is happy to take 2D, 3D, ..., ND probability parameters, as long as they add up to 1 over the last axis:

pm.Categorical.dist(np.full((4, 2, 3), [[0.0, 0.1, 0.9], [0.9, 0.1, 0.0]])).eval()

The same should apply here if possible. See this issue where we are pursuing this for all multivariate distributions: #5383

The reason why this is useful is vectorization. Specifying a (3, 3) shaped distribution with different cutpoints can be much more efficient than specifying 3 times a (3,) shaped distribution with the different cutpoints.

This was exactly how this issue started by the way, just with the batching across sigma and eta, and fixed cutpoints. But it could have been the other way around.
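The batching convention described above can be illustrated with plain NumPy. The sketch below is not PyMC code: it builds the same (4, 2, 3) probability array, treats the last axis as the parameters of one "atomic" categorical, and draws one category per batch element with a simple inverse-CDF trick chosen here for illustration.

```python
import numpy as np

# Same parameter as in the Categorical example: the (2, 3) block is broadcast to (4, 2, 3)
p = np.full((4, 2, 3), [[0.0, 0.1, 0.9], [0.9, 0.1, 0.0]])

batch_shape = p.shape[:-1]   # (4, 2): one categorical per batch element
n_categories = p.shape[-1]   # 3: core (last) axis holds the probabilities

# Probabilities must sum to 1 over the last axis only
np.testing.assert_allclose(p.sum(axis=-1), 1.0)

# One draw per batch element: count how many cumulative thresholds the uniform exceeds
rng = np.random.default_rng(0)
u = rng.random(batch_shape + (1,))
draws = (u > np.cumsum(p, axis=-1)[..., :-1]).sum(axis=-1)
```

The draws have the batch shape (4, 2), which is the sense in which extra leading parameter dimensions increase the dimensionality of the distribution.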

Hi, yes, the batch dimension totally makes sense. My initial thought was that, when dealing with a large dataset, batch_size should be managed in pm.Data (similar to DataLoader in PyTorch), although I have not checked pm.Data yet :)

Anyways, I will check the case of 2d cutpoints as well.

Maybe now my crazy example from before makes more sense?

OrderedProbit.dist(eta=np.zeros((5, 2)), sigma=np.ones((2, 5, 2)), cutpoints=[[-2, 0, 2], [-2, 0, 2]])  # shape should be (2, 5, 2, 4)
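The expected shape in that example follows from ordinary broadcasting rules, which can be sketched as below. `expected_p_shape` is a hypothetical helper written for illustration: it broadcasts the shapes of eta and sigma against the batch part of cutpoints, then appends the number of categories (one more than the number of cutpoints).

```python
import numpy as np


def expected_p_shape(eta_shape, sigma_shape, cutpoints_shape):
    """Hypothetical helper: shape of p implied by broadcasting the batch dimensions."""
    # The last axis of cutpoints is the core dimension; everything before it is batch
    batch_shape = np.broadcast_shapes(eta_shape, sigma_shape, cutpoints_shape[:-1])
    n_categories = cutpoints_shape[-1] + 1
    return batch_shape + (n_categories,)


scalar_case = expected_p_shape((), (), (3,))
vector_case = expected_p_shape((5,), (5,), (3,))
crazy_case = expected_p_shape((5, 2), (2, 5, 2), (2, 3))
```

Combinations whose batch dimensions cannot be broadcast, such as eta of shape (5,) with cutpoints of shape (2, 3), make `np.broadcast_shapes` raise a `ValueError`.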

Hi, it makes sense. I have added the tests for 2D cutpoints and positive sigma.

I have already pushed, but I am not sure why it has not been updated in this PR:
danhphan@9b311bf

ricardoV94

Looks great! Thanks for your help @danhphan

Looking forward to your next PR :D

danhphan · 2022-02-06T08:03:36Z

Hi @ricardoV94 many thanks for your support 💯

Cheers! 🍷 🍷

twiecki · 2022-02-07T10:49:12Z

Congrats on number 2 @danhphan!

fix and add test_vector_inputs for OrderedProbit

cb75f13

danhphan changed the title ~~Allow OrderedProbit distribution to take vector inputs using advanced indexing~~ Allow OrderedProbit distribution to take vector inputs Jan 29, 2022

ricardoV94 reviewed Jan 29, 2022

simplify test_vector_inputs

8d1d9d9

michaelosthege added this to the v4.0.0b3 milestone Jan 31, 2022

danhphan added 2 commits February 4, 2022 17:25

simplify test_shape_inputs for _OrderedProbit

32f6c89

add test_shape_inputs for _OrderedLogistic

466a941

twiecki approved these changes Feb 4, 2022

ricardoV94 requested changes Feb 4, 2022

This was referenced Feb 4, 2022

allow OrderedProbit and OrderedLogit to take vector inputs #5216

Closed

Analysing ordinal data in PyMC pymc-devs/pymc-examples#277

Open

add 2d cutpoints and positive sigma

9b311bf

danhphan requested a review from ricardoV94 February 5, 2022 22:19

ricardoV94 approved these changes Feb 6, 2022

ricardoV94 merged commit 18346ac into pymc-devs:main Feb 6, 2022

danhphan deleted the ordered-probit-vector-inputs branch February 24, 2022 11:50
