
Replace all occurrences of Pandas' get_dummies() with skLearn OneHotEncoder #1135

Conversation


@drawlinson (Contributor) commented on Jan 16, 2024:

An earlier issue #1111 observed inconsistent behaviour from RegressionEstimator subclasses when new data passed to the do() method had different rows than the originally fitted data, which caused categorical variables to be encoded inconsistently. This happens because the do() operator allows unseen data to be processed with an existing Estimator.

This issue occurs because categorical encoding used Pandas' get_dummies(), which does not allow additional data to be encoded with an existing encoder. An alternative, skLearn's OneHotEncoder, produces a fitted encoder object which can be reused to encode additional data consistently. skLearn is already a DoWhy dependency, so OneHotEncoder is preferred over get_dummies.
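To illustrate the difference (a toy example for this description, not code from the PR itself): get_dummies derives its columns from whatever frame it is given, whereas a OneHotEncoder fitted once maps any later data onto the same columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

fit_data = pd.DataFrame({"w": ["a", "b", "c", "a"]})
new_data = pd.DataFrame({"w": ["a", "a"]})  # e.g. rows later passed to do()

# get_dummies re-derives the columns from each frame, so the two frames
# end up encoded with different column sets.
pd.get_dummies(fit_data["w"]).columns.tolist()  # ['a', 'b', 'c']
pd.get_dummies(new_data["w"]).columns.tolist()  # ['a']

# A OneHotEncoder is fitted once and reused, so later data is always
# mapped onto the same columns as the data used for fitting.
encoder = OneHotEncoder(handle_unknown="ignore").fit(fit_data[["w"]])
encoder.transform(new_data[["w"]]).toarray()    # shape (2, 3), matches the fit
```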

This change goes further and replaces all occurrences of get_dummies with OneHotEncoder, so that if functionality to process additional data is added to other classes in future (e.g. via the do operator), the consistency bug won't happen again.

  • The features added to RegressionEstimator for remembering a set of encoders have been moved into the base class CausalEstimator.
  • All CausalEstimator subclasses call reset_encoders() on each fit(), implementing the lifecycle assumption that fit() implies entirely new data, so any existing encoders should be discarded (a simplified sketch of this lifecycle appears after this list).
  • get_dummies was also used by the UnobservedCommonCause Refuter, but this usage has no side-effects and references to the encoded data are not retained. It was replaced simply for consistency with the skLearn approach.
  • get_dummies was also used by the do-sampler's propensity score utility function binarize_discrete. Elsewhere in these utility functions, skLearn's LabelEncoder is used, so for consistency this occurrence is also replaced by skLearn OneHotEncoder.
  • Added a test to verify that CausalEstimator.estimate_effect() is consistent on data with permuted rows, without calling fit() again.
  • Added a test to verify that the result of CausalEstimator.do(x) is consistent on data with permuted rows. This is the scenario that fails when encoded with get_dummies() and originally motivated this Issue/PR; the original failure isn't reproduced, however.
  • Noticed that RegressionEstimator._do was inconsistent with the base class CausalEstimator._do: the arguments were reversed. The base class interface is retained and RegressionEstimator has been changed to match.
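To make the encoder lifecycle above concrete, here is a simplified, hypothetical sketch; only reset_encoders() is a name from this PR, while the class and the _encode() helper are illustrative rather than the actual DoWhy implementation:

```python
from sklearn.preprocessing import OneHotEncoder

class EncoderLifecycleSketch:
    """Illustrative only: shows the fit()/estimate lifecycle, not DoWhy code."""

    def __init__(self):
        self._encoders = {}

    def reset_encoders(self):
        # fit() implies entirely new data, so previously fitted encoders
        # are discarded.
        self._encoders = {}

    def fit(self, data, categorical_columns):
        self.reset_encoders()
        self._data = data
        self._categorical_columns = categorical_columns
        # ... fit the underlying model on self._encode(data) ...
        return self

    def _encode(self, data):
        # Reuse the encoder fitted earlier if there is one; otherwise fit
        # one now and remember it.
        if "categoricals" not in self._encoders:
            self._encoders["categoricals"] = OneHotEncoder(
                handle_unknown="ignore"
            ).fit(data[self._categorical_columns])
        return self._encoders["categoricals"].transform(
            data[self._categorical_columns]
        )

    def estimate_effect(self, data=None):
        # Later calls (and do()) reuse the stored encoders, so new or
        # reordered rows are encoded consistently with the fitted data.
        encoded = self._encode(self._data if data is None else data)
        # ... compute the effect estimate from `encoded` ...
        return encoded
```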

After the argument swap, all these changes are also heavily exercised by existing tests, each time an Estimator is created and fitted, or an effect is estimated.

…es of Pandas' get_dummies with skLearn's OneHotEncoder. Encoder lifespan: Reuses encoders for new estimate_effect() calls, and replaces existing encoders on CausalEstimator.fit(). Additional uses of get_dummies without side-effects or consistent encoding issues in do-Sampler Propensity Scores utilities also replaced for consistency.

Signed-off-by: DAVID RAWLINSON <dave@causalwizard.app>
@drawlinson (Contributor Author) commented:

Hi @amit-sharma, are you able to take a look at this one? Thanks!

@amit-sharma (Member) left a comment:


Thanks for your patience, @drawlinson. The PR looks good. I have just one comment, which I added inline.
Also, can you add at least a few tests showing that the encoder remains the same if a user calls estimate_effect twice without calling fit? This is a fairly large PR, so it will be good to have tests for at least a few of the estimators that are changed.

drawlinson and others added 2 commits March 5, 2024 17:30
…bug in arg order for RegressionEstimator._do().

Signed-off-by: DAVID RAWLINSON <dave@causalwizard.app>
@drawlinson (Contributor Author) commented:

@amit-sharma I added some tests which aim to verify that encoding is consistent despite permuting the data row order. It was a bit tricky working within the interfaces of the Estimator classes - I focused on estimate_effect() and do(x). With Regression estimators the effects of common causes are additive, so the ATE is almost unchanged despite changes in these variables! To check the consistency of these variables' encoding, I used the do() operator, whose result is affected by the common causes.
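Roughly, the property the new tests check looks like this (a standalone sketch with plain sklearn and made-up data, not the actual DoWhy test code): a model fitted once should give the same prediction for each row when the rows are merely reordered and the same fitted encoder is reused.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "w": rng.choice(["a", "b", "c"], size=n),  # categorical common cause
    "t": rng.binomial(1, 0.5, size=n),         # treatment
})
df["y"] = 2.0 * df["t"] + 1.5 * (df["w"] == "b") + rng.normal(size=n)

# Fit the encoder and the model once, on the original row order.
encoder = OneHotEncoder(handle_unknown="ignore").fit(df[["w"]])

def features(d):
    # treatment column plus the common cause encoded with the *fitted* encoder
    return np.column_stack([d[["t"]].to_numpy(), encoder.transform(d[["w"]]).toarray()])

model = LinearRegression().fit(features(df), df["y"])

# Permute the rows and reuse the same fitted encoder: predictions must
# agree row for row once the permutation is undone.
perm = rng.permutation(n)
df_perm = df.iloc[perm].reset_index(drop=True)
assert np.allclose(model.predict(features(df)),
                   model.predict(features(df_perm))[np.argsort(perm)])
```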

In the process I discovered that the RegressionEstimator implementation of do() has a seemingly long-standing bug where the order of the arguments is reversed:

CausalEstimator base class (treatment_value, dataframe):
def _do(self, x, data_df=None):

RegressionEstimator (dataframe, treatment_value):
def _do(self, data_df: pd.DataFrame, treatment_val):

I've fixed RegressionEstimator to match the base class interface. I searched for all instances of _do( and only needed to fix the implementation of estimate_effect in the Regression estimator.

Changed from:
effect_estimate = self._do(data, treatment_value) - self._do(data, control_value)

to:
effect_estimate = self._do(treatment_value, data) - self._do(control_value, data)

I'm sorry this has turned into a big PR but hopefully it's worth it!

@drawlinson (Contributor Author) commented:

The "Build docs" check appears to be failing due to lack of disk space in the worker environment.

@amit-sharma (Member) left a comment:


Thank you so much, @drawlinson. The PR looks great now, merging.

Also, great that you were able to fix the bug. Much appreciated!

@amit-sharma merged commit 65f3031 into py-why:main on Mar 26, 2024
29 of 30 checks passed