
Enforce pred_var is always greater than zero on GRF #480

Merged · 12 commits · Aug 7, 2021

Conversation

@arose13 (Contributor) commented Jun 8, 2021

Ensure that `pred_var` is always greater than zero.

This prevents NaNs from appearing in some of the output values when constructing the confidence interval. The NaNs were previously created when the variance was converted to a standard deviation for scipy's distribution models.

PS: I also removed duplicated code that used to appear on lines 798 and 799.
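The failure mode described above can be sketched as follows. This is an illustration only, not the code from the PR: the variance values and the clipping epsilon are hypothetical, and the exact fix merged into `_base_grf.py` may differ.

```python
import numpy as np
from scipy import stats

point = np.array([1.0, 2.0, 3.0])
# Hypothetical variances: the middle value should be zero but came out
# as a tiny negative number due to floating-point rounding.
pred_var = np.array([0.04, -2e-28, 0.09])

with np.errstate(invalid="ignore"):
    sd_bad = np.sqrt(pred_var)  # sqrt of a negative value is nan
lb_bad, ub_bad = stats.norm.interval(0.95, loc=point, scale=sd_bad)
# The confidence interval for the second point is (nan, nan).

# Clipping the variance to a small positive floor (epsilon chosen here
# purely for illustration) restores finite intervals; note that scipy
# also treats scale == 0 as invalid, so a strictly positive floor is
# safer than clipping to exactly zero.
sd_ok = np.sqrt(np.maximum(pred_var, 1e-12))
lb_ok, ub_ok = stats.norm.interval(0.95, loc=point, scale=sd_ok)
```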

@vsyrgkanis (Collaborator) left a comment

Looks great! It's odd that the Bayesian debiasing wasn't already ensuring this, but maybe you were getting exact zeros here, which would also be problematic.

@arose13 (Contributor, Author) commented Jun 8, 2021

I was also surprised when looking through the `_predict_point_and_var` function, but the numbers I was getting out that were breaking things were variances like -2e-28 and -3e-31. So numerically they might actually be zero.
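A quick sanity check on why a value like -2e-28 can safely be treated as an exact zero (illustrative values only; the 1e-8 tolerance matches the assertion added later in this PR):

```python
import numpy as np

v = -2e-28  # a "negative variance" of the kind reported above
# This is many orders of magnitude below double-precision machine
# epsilon (~2.2e-16), so at the scale of any O(1) variance it is
# indistinguishable from an exact zero.
assert abs(v) < np.finfo(float).eps
assert np.isclose(v, 0.0, atol=1e-8)
clipped = max(v, 0.0)  # hence safe to clamp to zero
```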

@kbattocchi (Collaborator) left a comment

Thanks for this contribution! Looks good, although I added one minor suggestion.

Also, would it be possible to add a simple test where the current code fails but this change succeeds, so that we can make sure not to regress in the future?

Review thread on econml/grf/_base_grf.py (outdated, resolved)
@arose13 arose13 requested a review from kbattocchi June 22, 2021 01:34
@kbattocchi (Collaborator) left a comment

The new changes look good, thanks for contributing! I'm approving the PR because it looks good code-wise, but before we can merge it there are two issues:

  • A minor line-too-long linting problem
  • A real test failure when running the notebooks/Generalized Random Forests.ipynb notebook, where the assertion is triggered. Could you check whether this is just a case where we should use a slightly looser tolerance, or whether we're really getting big negative values there for some reason?

(there are also a couple of other random test failures that I suspect are sporadic and could be fixed by just rerunning)

@@ -793,10 +793,13 @@ def predict_full(self, X, interval=False, alpha=0.05):
         """
         if interval:
             point, pred_var = self._predict_point_and_var(X, full=True, point=True, var=True)
+            assert np.isclose(pred_var[pred_var < 0], 0, atol=1e-8).all(), '`pred_var` should not produce large negative values'
Collaborator review comment:

Unfortunately this line is failing our linting step because it's too long, so you'll need to break it up over two lines instead before we can merge.
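For reference, one way to split the assertion over two lines to satisfy the line-length limit (a sketch with illustrative data; the exact formatting in the merged commit may differ):

```python
import numpy as np

pred_var = np.array([0.01, -2e-28, 0.04])  # illustrative values
# Backslash continuation keeps both the condition and the message
# under the linter's line-length limit.
assert np.isclose(pred_var[pred_var < 0], 0, atol=1e-8).all(), \
    '`pred_var` should not produce large negative values'
```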

@arose13 (Contributor, Author) commented Jun 23, 2021

Some of the failed tests appear to be caused by a ModuleNotFoundError for a library called ipyparallel and by inexplicable kernel timeout errors.

@kbattocchi (Collaborator) commented

@arose13 The transient test failures are gone; the remaining notebook failures are due to triggering the assert within notebooks/Generalized Random Forests.ipynb, as I mentioned in my previous comment. Could you run through this notebook locally on your branch and see whether the assert is being triggered by negative values that are just barely too large for the current check, or by values that are far from zero?

@arose13 (Contributor, Author) commented Jun 29, 2021

This is the list of negative numbers that the GRF generates in cell 5 of the Generalized Random Forests notebook. They are not large, but they are orders of magnitude larger than the numbers that my dataset produced.

[-9.71148020e-05 -9.71148020e-05 -3.71913346e-04 -3.71913346e-04
 -1.81617759e-04 -1.81617759e-04 -1.90164152e-03 -1.90164152e-03
 -3.72337774e-05 -3.72337774e-05 -3.12026815e-03 -3.12026815e-03
 -2.87514374e-03 -2.87514374e-03 -1.64442561e-03 -1.64442561e-03
 -3.89969475e-03 -3.89969475e-03 -4.03462227e-03 -4.03462227e-03
 -3.57581040e-03 -3.57581040e-03 -4.10402059e-03 -4.10402059e-03
 -9.66940790e-05 -9.66940790e-05 -9.64839622e-05 -9.64839622e-05
 -2.56454874e-03 -2.56454874e-03 -7.59402132e-03 -7.59402132e-03
 -3.00155712e-03 -3.00155712e-03 -9.65701081e-04 -9.65701081e-04
 -3.91570169e-03 -3.91570169e-03 -1.14883637e-03 -1.14883637e-03
 -1.67404798e-03 -1.67404798e-03 -2.44639496e-03 -2.44639496e-03
 -4.44131189e-03 -4.44131189e-03 -1.78629255e-03 -1.78629255e-03
 -5.10844232e-03 -5.10844232e-03 -5.39320397e-03 -5.39320397e-03
 -3.21969607e-04 -3.21969607e-04 -1.89189299e-03 -1.89189299e-03
 -1.07363686e-03 -1.07363686e-03 -4.63204525e-04 -4.63204525e-04
 -6.20925242e-03 -6.20925242e-03 -5.48326105e-04 -5.48326105e-04
 -7.92044922e-03 -7.92044922e-03 -1.67513014e-03 -1.67513014e-03
 -1.91783295e-03 -1.91783295e-03 -2.66217332e-03 -2.66217332e-03
 -8.06691208e-03 -8.06691208e-03 -3.80646330e-03 -3.80646330e-03
 -1.02306806e-03 -1.02306806e-03 -6.63986327e-03 -6.63986327e-03
 -2.49315492e-03 -2.49315492e-03 -5.62818743e-03 -5.62818743e-03
 -4.81508894e-03 -4.81508894e-03 -1.57566769e-02 -1.57566769e-02
 -1.91329013e-03 -1.91329013e-03 -1.39286160e-03 -1.39286160e-03
 -4.21758085e-03 -4.21758085e-03 -3.99219257e-04 -3.99219257e-04
 -7.18139476e-03 -7.18139476e-03 -4.56547781e-03 -4.56547781e-03
 -5.17726669e-03 -5.17726669e-03 -3.46554222e-03 -3.46554222e-03
 -1.61704394e-02 -1.61704394e-02 -1.79248516e-02 -1.79248516e-02
 -4.90204604e-02 -4.90204604e-02 -2.02323367e-02 -2.02323367e-02
 -6.68202655e-03 -6.68202655e-03 -5.17581754e-02 -5.17581754e-02
 -1.38303209e-02 -1.38303209e-02 -5.87194234e-03 -5.87194234e-03
 -1.24390342e-02 -1.24390342e-02 -1.71015258e-02 -1.71015258e-02
 -7.49818211e-03 -7.49818211e-03 -2.96637026e-02 -2.96637026e-02
 -1.38563185e-02 -1.38563185e-02 -8.27955845e-02 -8.27955845e-02
 -8.46787177e-02 -8.46787177e-02 -5.53417005e-02 -5.53417005e-02
 -8.17552761e-02 -8.17552761e-02 -5.25008524e-02 -5.25008524e-02]

Let me know what you think.

@vsyrgkanis (Collaborator) replied, quoting the list above:

I'm confused: the cell below plots the confidence interval and seems to have no problem. Is this triggered by some change here?

Review thread on econml/grf/_base_grf.py (outdated, resolved)
@vsyrgkanis (Collaborator) left a comment

LGTM

@kbattocchi kbattocchi enabled auto-merge (squash) July 9, 2021 19:02
@kbattocchi kbattocchi enabled auto-merge (squash) August 7, 2021 13:50
@kbattocchi kbattocchi merged commit 2783e09 into py-why:master Aug 7, 2021