Error of crossfit folds splits with DynamicDML #900
Comments
Have you tried passing a StratifiedKFold object or creating your own CV splitter? That could help you out in the meantime.
Hi @TimCosemans, thanks for your suggestions!
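As a rough illustration of the "own CV splitter" idea, here is a minimal sketch that builds explicit (train, test) index lists while keeping every group inside a single test fold. The helper name `group_folds` and the toy group labels are hypothetical, and the round-robin assignment is only one possible choice, not something econml prescribes.

```python
from collections import defaultdict


def group_folds(groups, n_folds):
    """Yield (train_idx, test_idx) lists, keeping each group in one test fold."""
    # Map each group label to the row indices that belong to it.
    rows_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        rows_by_group[g].append(i)
    # Assign groups to folds round-robin, in order of first appearance.
    fold_of_group = {g: k % n_folds for k, g in enumerate(rows_by_group)}
    for fold in range(n_folds):
        test = [i for g, rows in rows_by_group.items()
                if fold_of_group[g] == fold for i in rows]
        train = [i for g, rows in rows_by_group.items()
                 if fold_of_group[g] != fold for i in rows]
        yield train, test


# Toy data: 4 hypothetical municipalities with 3 rows each.
groups = ["a"] * 3 + ["b"] * 3 + ["c"] * 3 + ["d"] * 3
folds = list(group_folds(groups, n_folds=2))
for train, test in folds:
    print("train:", train, "test:", test)
```

The resulting list could then be tried as `cv=folds`. Whether DynamicDML accepts plain index lists rather than arrays there is an assumption worth checking against the econml documentation; converting each list with `np.asarray` would be the cautious option.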
Hi,
I am estimating the effect of high levels of particulate matter (PM2.5) on excess deaths, using daily panel data for 25 municipalities. My treatment is binary: T=1 when the PM2.5 level is high and T=0 when it is low. The outcome is also binary: Y=1 for excess deaths and Y=0 for non-excess deaths.
I am using the DynamicDML class to fit my model, but I get this error message: "AttributeError: Provided crossfit folds contain training splits that don't contain all treatments". However, 50% of the observations have T=1, which I would expect to be enough to obtain balanced crossfit folds.
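An overall 50/50 treatment share does not by itself guarantee that every training split contains both treatment values, since splits are formed from subsets of rows. The sketch below shows one way to test a set of candidate splits for that condition. The helper `splits_missing_treatments` and the toy data are hypothetical, and this is not necessarily the exact check econml performs internally.

```python
def splits_missing_treatments(T, splits):
    """Return indices of splits whose training rows lack some treatment level."""
    all_levels = set(T)
    missing = []
    for k, (train_idx, _test_idx) in enumerate(splits):
        # Collect the treatment levels seen in this training split.
        if {T[i] for i in train_idx} != all_levels:
            missing.append(k)
    return missing


# Toy treatment vector: exactly 50% ones overall.
T = [1, 1, 1, 1, 0, 0, 0, 0]

# Three hypothetical candidate splits (not a valid partition, just examples).
candidate_splits = [
    ([0, 1, 2, 3], [4, 5, 6, 7]),  # training rows are all T=1
    ([4, 5, 6, 7], [0, 1, 2, 3]),  # training rows are all T=0
    ([0, 1, 4, 5], [2, 3, 6, 7]),  # training rows contain both levels
]

bad = splits_missing_treatments(T, candidate_splits)
print(bad)  # -> [0, 1]
```

Running the same helper on the real `T` and on the splits produced by whichever splitter is in use would show which splits, if any, trigger the condition described in the error message.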
Here is my code, using econml version 0.15 and dowhy version 0.10.1.
dataset_pm_deaths.csv
```python
import dowhy
import econml
from dowhy import CausalModel
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
import scipy.stats as stats
from itertools import product
from econml.utilities import WeightedModelWrapper
from sklearn.model_selection import train_test_split
from econml.panel.dml import DynamicDML

data_all = pd.read_csv("D:/dataset_pm_deaths.csv")
# Copy the slice so the column assignments below modify an independent frame
data = data_all[data_all['Year'] >= 2009].copy()

# Binarize the treatment at the median: 1 = high PM2.5, 0 = low PM2.5
median_pm25 = data['PM25'].median()
data['PM25'] = (data['PM25'] >= median_pm25).astype(int)

# Standardize the pollutant covariates, ignoring missing values
data.BC = stats.zscore(data.BC, nan_policy='omit')
data.DMS = stats.zscore(data.DMS, nan_policy='omit')
data.PM = stats.zscore(data.PM, nan_policy='omit')
data.OC = stats.zscore(data.OC, nan_policy='omit')
data.SO2 = stats.zscore(data.SO2, nan_policy='omit')
data.SO4 = stats.zscore(data.SO4, nan_policy='omit')

# Keep only the modelling columns and drop rows with missing values
data0 = data[['excess', 'PM25', 'cod_munici',
              'BC', 'DMS', 'PM', 'OC', 'SO2', 'SO4', 'Temperature', 'lead1_PM25']]
data0 = data0.dropna()

Y = data0.excess.to_numpy()
T = data0.PM25.to_numpy()
percentage_high_PM25 = np.mean(T == 1) * 100
W = data0[['BC', 'DMS', 'PM', 'OC', 'SO2', 'SO4', 'Temperature']].to_numpy().reshape(-1, 7)
X = data0[['Temperature', 'lead1_PM25']].to_numpy().reshape(-1, 2)
groups = data0.cod_munici.to_numpy()  # one group per municipality

estimate0 = DynamicDML(discrete_treatment=True,
                       featurizer=PolynomialFeatures(degree=3),
                       linear_first_stages=False, cv=3, random_state=123)
estimate0.fit(Y=Y, T=T, X=X, W=W, inference='auto', groups=groups)  # HERE IS THE ERROR
```
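Because `dropna()` runs before fitting, municipalities may end up with different numbers of rows, and some may show only one treatment value. A small summary of the panel shape can help narrow this down. The sketch below uses a hypothetical helper `panel_summary` on toy data; it assumes, without having verified it against econml 0.15, that the number of rows per group is relevant to how DynamicDML forms its periods and folds.

```python
from collections import Counter, defaultdict


def panel_summary(groups, T):
    """Summarize rows per group, treatment levels per group, and balance."""
    rows_per_group = Counter(groups)
    levels_per_group = defaultdict(set)
    for g, t in zip(groups, T):
        levels_per_group[g].add(t)
    # The panel is balanced when every group has the same number of rows.
    balanced = len(set(rows_per_group.values())) == 1
    return dict(rows_per_group), dict(levels_per_group), balanced


# Toy data: three hypothetical municipalities with unequal row counts.
toy_groups = [1, 1, 1, 2, 2, 3, 3, 3]
toy_T = [1, 0, 1, 1, 1, 0, 0, 1]

rows, levels, balanced = panel_summary(toy_groups, toy_T)
print(rows)      # rows per municipality
print(levels)    # treatment values observed in each municipality
print(balanced)  # whether all municipalities have equally many rows
```

With the real data, the call would be `panel_summary(groups, T)` on the arrays built above.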