-
Notifications
You must be signed in to change notification settings - Fork 939
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent encoding with pandas get_dummies causes prediction and effect estimation errors #1111
Comments
PR here #1112 |
I think this is resolve with merge of PR 1112 . However, its also an option to remove use of get_dummies entirely if this is considered desirable, using the same util function to replace each occurrence. I can make another PR to complete that process. Keen for thoughts on that... there would be some simplification by pushing encoders down to the base CausalEstimator and things would be more consistent between estimators. |
That's a great idea @drawlinson . Yeah, it is better to remove use of |
I'll get onto that ASAP! |
@amit-sharma PR now available to complete the job... #1135 |
Resolved with PR #1135 |
Describe the bug
DoWhy mostly aims to model effects within an existing, finite set of data, but generalization to unseen data is an important aspect of model validation which many user (including myself) might want to do. DoWhy has 2 ways to apply an existing model to new data.
CausalModel.estimate_effect(fit_estimator=False)
This calls through to
CausalEstimator.estimate_effect
, which seems to assume the model has already been fitted or trained to a dataset and does not retrain it.CausalModel.do(X, ..., fit_estimator=<any value>)
Although this function takes a
fit_estimator
argument, it always assumes the model is fitted to the entire dataset before evaluating the effect on the provided new data, given treatment is X. Why does new data matter?i. DoWhy uses Pandas' get_dummies to automatically encode any dataframe column with non-numerical datatypes:
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
ii.
get_dummies
output depends on the order in which specific values are encountered in the data.iii. The order in which specific values are encountered in the data is not guaranteed on any dataset other than the original
iv.
get_dummies
will therefore return inconsistent encodings when the rows of the data are shuffled or simply different (unseen / generalization dataset). This doesn't necessarily cause a crash; but it can do if the number of unique values in a categorical column differs between training and validation data. More commonly, the encodings are simply different, leading to incorrect outputs from the model.Steps to reproduce the behavior
For example if a column C in data contains strings X and Y:
C = [X, Y, X, X, Y, ...]
... then initially these values might be encoded as:
X: 0,1
Y: 1,0
If they are presented in a different order:
C = [Y, Y, X, X, Y, ...]
... the same strings would be encoded as:
X: 1,0
Y: 0,1
I notice this issue has come up before, because in
regression_estimator.py
interventional_outcomes
function, lines 203-221 the entire original treatment column is prefixed to the new treatment column to ensure it is encoded consistently. The data is then removed before being passed to_build_features()
. However, this only fixes the treatment column; the other columns (such as common cause and effect modifier variables) do not have any mechanism to ensure consistent encoding.Expected behavior
Categorical data types should always be encoded consistently to achieve valid inference.
Version information:
Additional context
Consistent encoding of shuffled or different data is not possible using get_dummies(), because it does not allow an encoder to be re-applied to additional data.
However, SciKit-Learn does have a one-hot encoder whose behaviour is very similar to get_dummies() and does allow the encoder to be re-applied to new data:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
DoWhy already has a dependency on scikit learn, so no additional packages are required.
Since encoding depends on a specific encoder object, the lifecycle of this object must be considered. I suggest that encoders are a property of CausalEstimator objects which are re-created during
CausalEstimator.fit()
and persist until the nextfit()
. This means that estimation of effects afterfit()
would reuse existing encoder[s], ensuring encoding is consistent.I've made a PR #1112 which fixes all RegressionEstimator subclasses as described above.
Note as a side effect of this fix, it is easy to create a
CausalEstimator.predict()
method which allows model inference on arbitrary data; thedo()
method is then simplified to just assignment of the treatment value and a call topredict()
:In addition the get_dummies calls can be replaced with a new util function which approximates the original get_dummies interface, e.g.
OLD
self._observed_common_causes = pd.get_dummies(self._observed_common_causes, drop_first=True)
NEW
self._observed_common_causes = self._encode(self._observed_common_causes, "observed_common_causes")
The same fix could be applied to all uses of get_dummies in DoWhy (there are 13 remaining in other CausalEstimators) but since this PR is already complex and all tests pass, I thought it was worth getting some thoughts on the fix first...
The text was updated successfully, but these errors were encountered: