Estimation with carefully chosen missing values #1295

Open
AshwinParanjape opened this issue Jan 12, 2025 · 1 comment
Labels
question Further information is requested

Comments

@AshwinParanjape

Ask your question
I couldn't find information on whether DoWhy supports estimation when certain carefully chosen pieces of data are missing. Consider the "voucher as instrumental variable" example from the tutorial. The problem assumes that each row in the dataframe has values for all 3 observed variables (voucher, education, income). However, the estimand is

### Estimand : 1
Estimand name: iv
Estimand expression:
 ⎡                                            -1⎤
 ⎢    d              ⎛    d                  ⎞  ⎥
E⎢──────────(income)⋅⎜──────────([education])⎟  ⎥
 ⎣d[voucher]         ⎝d[voucher]             ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→income then ¬(U →→{voucher})
Estimand assumption 2, Exclusion: If we remove {voucher}→{education}, then ¬({voucher}→income)

which means that if I can estimate the associations voucher -> income and voucher -> education separately, I should be able to estimate the effect of education on income, even if no single row contains both education and income.
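To make that reasoning explicit, here is a short derivation under assumptions that are not stated in the issue: the structural equations are linear (as in the simulated data below), voucher is independent of ability and of the noise terms, and both samples come from the same population. The coefficient names are illustrative.

```latex
\begin{aligned}
\text{education} &= \alpha_0 + \pi\,\text{voucher} + \delta\,\text{ability} + \varepsilon_E \\
\text{income} &= \beta_0 + \beta\,\text{education} + \gamma\,\text{ability} + \varepsilon_Y \\
\frac{d}{d\,\text{voucher}}\,\mathbb{E}[\text{education}\mid\text{voucher}] &= \pi \\
\frac{d}{d\,\text{voucher}}\,\mathbb{E}[\text{income}\mid\text{voucher}] &= \beta\pi \\
\frac{\beta\pi}{\pi} &= \beta \qquad (\pi \neq 0)
\end{aligned}
```

Each of the two slopes involves only one pair of variables, (voucher, education) or (voucher, income), which is why two separate samples could in principle be enough.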

Expected behavior
If I construct a dataset as follows:

import numpy as np
import pandas as pd

# Illustrative parameters (assumed; the issue does not state them)
n_points = 1000
education_ability = 1
education_voucher = 2
income_ability = 2
income_education = 4

# Voucher -> Education dataset (ability and income are not observed)

# confounder
ability = np.random.normal(0, 3, size=n_points)

# instrument
voucher = np.random.normal(2, 1, size=n_points)

# treatment
education = (np.random.normal(5, 1, size=n_points)
             + education_ability * ability
             + education_voucher * voucher)

df1 = pd.DataFrame({'education': education, 'voucher': voucher})

# Voucher -> Income dataset (education and ability are not observed),
# drawn as a fresh, independent sample

# confounder
ability = np.random.normal(0, 3, size=n_points)

# instrument
voucher = np.random.normal(2, 1, size=n_points)

# treatment
education = (np.random.normal(5, 1, size=n_points)
             + education_ability * ability
             + education_voucher * voucher)

# outcome
income = (np.random.normal(10, 3, size=n_points)
          + income_ability * ability
          + income_education * education)

df2 = pd.DataFrame({'voucher': voucher, 'income': income})

# Stack the two samples: df1 rows have income = NaN, df2 rows have education = NaN
df = pd.concat([df1, df2], ignore_index=True)

I expect to be able to run the estimation procedure and get the correct result:

from dowhy import CausalModel

# Step 1: Model
model = CausalModel(
    data=df,
    treatment='education',
    outcome='income',
    common_causes=['U'],
    instruments=['voucher'],
)

# Step 2: Identify
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)

# Step 3: Estimate
estimate = model.estimate_effect(identified_estimand,
                                 method_name="iv.instrumental_variable")
print(estimate)

But the estimate is nan:

Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
No such variable(s) found!

### Estimand : 2
Estimand name: iv
Estimand expression:
 ⎡                                            -1⎤
 ⎢    d              ⎛    d                  ⎞  ⎥
E⎢──────────(income)⋅⎜──────────([education])⎟  ⎥
 ⎣d[voucher]         ⎝d[voucher]             ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→income then ¬(U →→{voucher})
Estimand assumption 2, Exclusion: If we remove {voucher}→{education}, then ¬({voucher}→income)

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: iv
Estimand expression:
 ⎡                                            -1⎤
 ⎢    d              ⎛    d                  ⎞  ⎥
E⎢──────────(income)⋅⎜──────────([education])⎟  ⎥
 ⎣d[voucher]         ⎝d[voucher]             ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→income then ¬(U →→{voucher})
Estimand assumption 2, Exclusion: If we remove {voucher}→{education}, then ¬({voucher}→income)

## Realized estimand
Realized estimand: Wald Estimator
Realized estimand type: EstimandType.NONPARAMETRIC_ATE
Estimand expression:
  ⎡   d            ⎤  
 E⎢────────(income)⎥  
  ⎣dvoucher        ⎦  
──────────────────────
 ⎡   d               ⎤
E⎢────────(education)⎥
 ⎣dvoucher           ⎦
Estimand assumption 1, As-if-random: If U→→income then ¬(U →→{voucher})
Estimand assumption 2, Exclusion: If we remove {voucher}→{education}, then ¬({voucher}→income)
Estimand assumption 3, treatment_effect_homogeneity: Each unit's treatment ['education'] is affected in the same way by common causes of ['education'] and ['income']
Estimand assumption 4, outcome_effect_homogeneity: Each unit's outcome ['income'] is affected in the same way by common causes of ['education'] and ['income']

Target units: ate

## Estimate
Mean value: nan
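
One plausible explanation for the nan, sketched below: the stacked dataframe has no complete rows, so any NaN-propagating aggregation over full columns returns nan. Whether DoWhy's estimator actually uses `np.cov` on full columns is an assumption here, not something confirmed in this issue. The column names mirror the example above, and the data is random filler.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Two disjoint samples stacked the same way as in the issue
df1 = pd.DataFrame({"education": rng.normal(size=n), "voucher": rng.normal(size=n)})
df2 = pd.DataFrame({"voucher": rng.normal(size=n), "income": rng.normal(size=n)})
df = pd.concat([df1, df2], ignore_index=True)

# No row has education, voucher, and income all observed
complete_rows = len(df.dropna())

# np.cov does not skip NaN, so a single missing value poisons the result
cov_yz = np.cov(df["income"].to_numpy(), df["voucher"].to_numpy())[0, 1]
cov_xz = np.cov(df["education"].to_numpy(), df["voucher"].to_numpy())[0, 1]
naive_ratio = cov_yz / cov_xz

print(complete_rows, cov_yz, cov_xz, naive_ratio)
```

Dropping incomplete rows up front would not help either, since that leaves an empty dataframe.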

Version information:

  • DoWhy version [e.g. 0.12]

Additional context
I'm new to both the theory of structural causal modeling and its implementation in DoWhy. Perhaps I'm wrong about this being estimable, or perhaps there is a way to specify "unknown" variables in DoWhy that I'm not aware of. Please forgive me in either case.

AshwinParanjape added the question label Jan 12, 2025
@AshwinParanjape
Author

While trying to understand the source code, I found the place where I believe missing values could be masked out. I created a PR as a proof of concept: #1296
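
As a rough illustration of what masking could look like (this is not the code in the PR, and `pairwise_slope` and `masked_wald` are hypothetical helpers, not DoWhy functions), each slope of the Wald ratio can be computed from only the rows where its own pair of columns is observed. The coefficients in the simulation are assumed values.

```python
import numpy as np
import pandas as pd


def pairwise_slope(df, x, y):
    """OLS slope of y on x, using only rows where both columns are observed."""
    pair = df[[x, y]].dropna()
    cov = np.cov(pair[x].to_numpy(), pair[y].to_numpy())
    return cov[0, 1] / cov[0, 0]


def masked_wald(df, treatment, outcome, instrument):
    """Hypothetical helper: Wald ratio built from pairwise-complete slopes."""
    reduced_form = pairwise_slope(df, instrument, outcome)
    first_stage = pairwise_slope(df, instrument, treatment)
    return reduced_form / first_stage


rng = np.random.default_rng(0)
n = 20_000
true_effect = 4.0  # assumed coefficient of education on income


def draw_sample():
    """Draw one independent sample from the assumed linear model."""
    ability = rng.normal(0, 3, size=n)
    voucher = rng.normal(2, 1, size=n)
    education = rng.normal(5, 1, size=n) + 1.0 * ability + 2.0 * voucher
    income = rng.normal(10, 3, size=n) + 2.0 * ability + true_effect * education
    return voucher, education, income


v1, e1, _ = draw_sample()  # sample 1: income is discarded
v2, _, y2 = draw_sample()  # sample 2: education is discarded

df = pd.concat(
    [
        pd.DataFrame({"education": e1, "voucher": v1}),
        pd.DataFrame({"voucher": v2, "income": y2}),
    ],
    ignore_index=True,
)

estimate = masked_wald(df, "education", "income", "voucher")
print(estimate)
```

The sketch divides slopes rather than raw covariances because the two samples have different empirical variances of voucher, so the variance terms no longer cancel as they do in the single-sample Wald estimator.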
