Stability of ATE estimates #931

Open

ankur-tutlani opened this issue Nov 27, 2024 · 4 comments

@ankur-tutlani
ankur-tutlani commented Nov 27, 2024

I have a situation where I am getting different ATE estimates with the same input dataset and the same random seed. If I run today, the average ATE is around 1; if I run again a few hours later, it increases to 4 or more. This is for the same treatment (T1) and control value (T0) combination. What could potentially be wrong here? I have one confounder, one treatment, and one outcome column, all continuous. I tried manually passing folds in the cv argument, but the estimates are still not stable. I have also tried passing the input dataset in a specific order, but again the results are not the same. I observed this with both DML and KernelDML. What should be changed here to get more stable ATE estimates?

import numpy as np
import pandas as pd
import xgboost as xgb
from econml.dml import KernelDML

model_y = xgb.XGBRegressor(random_state=578, max_depth=3, n_estimators=100)
model_t = xgb.XGBRegressor(random_state=578, max_depth=3, n_estimators=100)

data1 = data1.sort_values(['Y','X']).reset_index(drop=True)

def get_folds(data1, n_splits=10):
    # Calculate the size of each fold
    fold_size = len(data1) // n_splits

    # Create a list to store the (train, test) indices for each fold
    folds = []

    # Generate the folds
    for i in range(n_splits):
        start_index = i * fold_size
        if i == n_splits - 1:  # Last fold takes the remaining data
            end_index = len(data1)
        else:
            end_index = (i + 1) * fold_size
        test_indices = list(range(start_index, end_index))
        train_indices = list(range(0, start_index)) + list(range(end_index, len(data1)))
        folds.append((train_indices, test_indices))

    return folds

class CustomCV:
    # Thin wrapper so the precomputed folds can be passed as the cv argument
    def __init__(self, folds):
        self.folds = folds

    def __iter__(self):
        return iter(self.folds)
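
For reference, the precomputed folds can also be passed directly to the cv argument as a plain list of (train_indices, test_indices) tuples, as is done with KFold later in this thread. A minimal sketch (assuming treatment_var and common_causes2 are defined as in the loop below):

folds = get_folds(data1, n_splits=10)
dml = KernelDML(model_y=model_y, model_t=model_t, discrete_treatment=False,
                random_state=123, cv=folds, mc_iters=10)
dml.fit(Y=data1['Y'], T=data1[[treatment_var]], X=data1[common_causes2])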

'common_causes2' contains one continuous variable.

db_to_fill = pd.DataFrame()  # accumulates one row per treatment/control combination

for treatment_var in treatment_vars:
    for i in range(10):
        median_dict = median_dicts[i]
        results = []

        for seed in [256344, 196334, 256154, 190331, 783475, 206736, 114695, 414272, 468332, 147567]:

            np.random.seed(seed)
            data1_shuffled = data1.sample(frac=1, random_state=seed).reset_index(drop=True).copy()
            custom_cv = CustomCV(get_folds(data1_shuffled, n_splits=10))

            dml = KernelDML(model_y=model_y, model_t=model_t, discrete_treatment=False,
                            random_state=seed, cv=custom_cv, mc_iters=10)
            dml.fit(Y=data1_shuffled['Y'], T=data1_shuffled[[treatment_var]], X=data1_shuffled[common_causes2])
            dmlestimate = dml.ate(X=data1_shuffled[common_causes2],
                                  T0=median_dict[treatment_var]['control_value'],
                                  T1=median_dict[treatment_var]['treatment_value'])
            results.append(dmlestimate)
        average_result = np.mean(results)

        foodb = pd.DataFrame({'estimatevalue': average_result}, index=[0])
        foodb['treatment_var'] = treatment_var
        foodb['control_value'] = median_dict[treatment_var]['control_value']
        foodb['treatment_value'] = median_dict[treatment_var]['treatment_value']
        db_to_fill = pd.concat([db_to_fill, foodb], ignore_index=True)

For DML, I am using the following final model.

from econml.sklearn_extensions.linear_model import StatsModelsRLM
model_final = StatsModelsRLM(fit_intercept=True)

The average value differs a lot between runs, even though there is not much variation within "results" in a single run. For example, sometimes I get "results" in the range of 1 to 3; other times they are in the range of 8 to 9, which drives the "average_result" value to differ significantly between runs. This is for the same treatment (T1) and control value (T0) combination. For instance, at one point with T0=20 and T1=25 the average value is 2, while a few hours later, with the same T0=20 and T1=25, it is 10.
I am running this on a Databricks cluster.
Is there anything wrong in the arguments specified above?

econml-0.15.1

@kbattocchi
Collaborator

I don't immediately see anything wrong; I'm not super familiar with Databricks, but I wonder if maybe they don't guarantee that rows are returned in the same order, or if it's possible that additional rows are added over time?
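
One way to rule out either of those possibilities, assuming data1 is the pandas DataFrame passed to fit, is to fingerprint the input before each run and compare the values (illustrative sketch, not part of the code above):

import pandas as pd

# Hash the exact content and row order of the input; if this number differs
# between runs, the data itself changed rather than the estimator.
fingerprint = pd.util.hash_pandas_object(data1, index=True).sum()
print(len(data1), fingerprint)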

@ankur-tutlani
Author

Thanks for the response. I sorted the dataframe to ensure the order remains consistent before running DML or KernelDML:

data1=data1.sort_values(['Y','X']).reset_index(drop=True)

I am not sure what you mean by additional rows being added over time; can you clarify? data1 is a pandas DataFrame.

One thing I observed is that nuisance_scores_y and nuisance_scores_t are not consistent. If I run 10-fold cross-validation along with 10 mc_iters, I expect the output (nuisance_scores_y and nuisance_scores_t) to be a list of length 10, but sometimes it is 10 and sometimes it is less, like 3 or 7. Also, the values in the list elements (nuisance_scores_y and nuisance_scores_t) vary significantly across different runs for the same seed and the same treatment and control combination.
What can explain this behavior?

@kbattocchi
Collaborator

That behavior is very strange: the nuisance scores should always be a nested list where the length of the outer list is mc_iters and the length of each element is the number of folds, and the logic which creates those lists is straightforward (and covered by our tests).
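
A quick check of that expected structure, assuming dml is the fitted estimator from above with mc_iters=10 and 10 folds (illustrative sketch):

# Outer list length should equal mc_iters; each inner list should hold one
# score per cross-fitting fold.
assert len(dml.nuisance_scores_y) == 10                                      # mc_iters
assert all(len(fold_scores) == 10 for fold_scores in dml.nuisance_scores_y)  # folds
assert len(dml.nuisance_scores_t) == 10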

@ankur-tutlani
Author

Thanks. The difference in lengths was due to cancelling the job run before it fully finished. When I let it finish, the lengths are consistent.

Another thing I observed: when I pass custom folds in the cv argument, the results for nuisance_scores_y and nuisance_scores_t are the same across Monte Carlo iterations, meaning the same values are repeated in the outer list elements, although there is variation in the values of the inner list elements (representing the cross-validation folds).

import numpy as np
from sklearn.model_selection import KFold
from econml.dml import DML

kf = KFold(n_splits=5, random_state=125, shuffle=True)
folds = [(np.array(train_idx), np.array(test_idx))
         for train_idx, test_idx in kf.split(data1[['treatment','x']].to_numpy(), data1['y'].to_numpy())]
np.random.seed(123)
dml = DML(model_y=model_y, model_t=model_t, model_final=model_final,
          discrete_treatment=False, random_state=4512, cv=folds, mc_iters=3)
dml.fit(Y=data1['y'], T=data1[treatment_var], X=data1[common_causes2])

These are the results for dml.nuisance_scores_y. As can be seen, the same values are repeated across the outer list elements:

[[0.01255048137244219,
   -4.889173034464886,
   0.012428938743914486,
   0.014486015197641922,
   0.009125106072896516],
  [0.01255048137244219,
   -4.889173034464886,
   0.012428938743914486,
   0.014486015197641922,
   0.009125106072896516],
  [0.01255048137244219,
   -4.889173034464886,
   0.012428938743914486,
   0.014486015197641922,
   0.009125106072896516]]

I tested the same with the cv=5 argument; in that case the results are different.

np.random.seed(123)
dml = DML(model_y=model_y, model_t=model_t, model_final=model_final,
          discrete_treatment=False, random_state=4512, cv=5, mc_iters=3)
dml.fit(Y=data1['y'], T=data1[treatment_var], X=data1[common_causes2])

Results for dml.nuisance_scores_y using cv=5:

[[0.014724579720036157,
   -3.1868181753587885,
   0.013263713693401846,
   0.010903846148672902,
   0.017286660720682967],
  [0.016891460402227887,
   0.007728961450303995,
   0.015531610923186445,
   0.014271989045886335,
   -5.869680788776508],
  [0.019077224609050147,
   0.009566622312765172,
   0.013149839839521782,
   0.015059794095885981,
   -4.469895765394618]]

As can be seen, results vary across inner as well as outer list elements.

Question: how are the different sets of 5 folds created across the different mc_iters? If I specify random_state, then for a particular seed there should be only one set of 5 train/test index folds. Since the results above differ across mc_iters, different folds must be getting generated. If that understanding is correct, how are these folds generated? Are they derived from the random seed specified? I am asking because the output above sometimes changes when I run again with the same data and seed. This pattern is observed for both dml.nuisance_scores_y and dml.nuisance_scores_t.

I observe some differences in the outputs of dml.nuisance_scores_y and dml.nuisance_scores_t when I run with the same seed and the same input dataset (passed in a fixed sorted order). This leads to different dml.residuals_ output, which in turn leads to different dml.model_final_.coef_ and dml.model_final_.intercept_, and hence to different ATE estimates on the same data when running the same code again.
These are the models I am currently using:

import xgboost as xgb
from econml.sklearn_extensions.linear_model import StatsModelsRLM

model_y = xgb.XGBRegressor(random_state=578, max_depth=3, n_estimators=100)
model_t = xgb.XGBRegressor(random_state=578, max_depth=3, n_estimators=100)
model_final = StatsModelsRLM(fit_intercept=True)

np.random.seed(123)
dml = DML(model_y=model_y, model_t=model_t, model_final=model_final,
          discrete_treatment=False, random_state=4512, cv=5, mc_iters=3)
dml.fit(Y=data1['y'], T=data1[treatment_var], X=data1[common_causes2])

dml.ate(X=data1[common_causes2], T0=median_dict[treatment_var]['control_value'],
        T1=median_dict[treatment_var]['treatment_value'])
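
A simple way to localize where two runs diverge, using the objects above (illustrative sketch, assuming everything is already defined in the same session):

import numpy as np

# Fit twice with identical inputs and compare the intermediate pieces:
# if the nuisance scores already differ, the instability comes from the
# first-stage cross-fitting rather than from the final model.
def fit_once():
    est = DML(model_y=model_y, model_t=model_t, model_final=model_final,
              discrete_treatment=False, random_state=4512, cv=5, mc_iters=3)
    est.fit(Y=data1['y'], T=data1[treatment_var], X=data1[common_causes2])
    return est

a, b = fit_once(), fit_once()
print(np.allclose(a.nuisance_scores_y, b.nuisance_scores_y))    # nuisance stage
print(np.allclose(a.model_final_.coef_, b.model_final_.coef_))  # final stage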

Is there anything wrong in the logic being used? How can I achieve stability in the ATE estimates?
