Estimation with carefully chosen missing values #1295

AshwinParanjape · 2025-01-12T23:57:07Z

Ask your question
I couldn't find any information on whether DoWhy supports estimation when certain carefully chosen pieces of data are missing. Consider the "voucher as instrumental variable" example from the tutorial. It assumes that every row in the dataframe has values for all three observed variables (voucher, education, income). However, the estimand is

### Estimand : 1
Estimand name: iv
Estimand expression:
 ⎡                                            -1⎤
 ⎢    d              ⎛    d                  ⎞  ⎥
E⎢──────────(income)⋅⎜──────────([education])⎟  ⎥
 ⎣d[voucher]         ⎝d[voucher]             ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→income then ¬(U →→{voucher})
Estimand assumption 2, Exclusion: If we remove {voucher}→{education}, then ¬({voucher}→income)

which means that if I can estimate the association voucher -> income and the association voucher -> education, I should be able to estimate the effect of education on income.
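For the linear case, this reasoning can be written as the Wald ratio. The symbols Z (voucher), T (education) and Y (income) are shorthand introduced here, not notation from the tutorial:

```latex
\hat{\beta}_{\mathrm{IV}}
  = \frac{\widehat{\operatorname{Cov}}(Z, Y)}{\widehat{\operatorname{Cov}}(Z, T)}
  = \frac{\widehat{\operatorname{Cov}}(Z, Y) \,/\, \widehat{\operatorname{Var}}(Z)}
         {\widehat{\operatorname{Cov}}(Z, T) \,/\, \widehat{\operatorname{Var}}(Z)}
```

The numerator needs only (voucher, income) pairs and the denominator needs only (voucher, education) pairs, so in principle the two slopes can be estimated on different samples from the same population.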

Expected behavior
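If I construct a dataset as follows:

import numpy as np
import pandas as pd

# The simulation constants are not defined in this snippet.
# The values below are assumed for illustration only.
n_points = 1000
education_abilty = 1
education_voucher = 2
income_abilty = 2
income_education = 4

# Voucher -> Education dataset (ability and income are not known)

# confounder
ability = np.random.normal(0, 3, size=n_points)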

# instrument
voucher = np.random.normal(2, 1, size=n_points)

# treatment
education = (np.random.normal(5, 1, size=n_points)
             + education_abilty * ability
             + education_voucher * voucher)


data = np.stack([education, voucher]).T
df1 = pd.DataFrame(data, columns = ['education', 'voucher'])

# Voucher -> Income dataset (education and ability are not known)

# confounder
ability = np.random.normal(0, 3, size=n_points)

# instrument
voucher = np.random.normal(2, 1, size=n_points)

# treatment (needed to generate income, but not stored in df2)
education = (np.random.normal(5, 1, size=n_points)
             + education_abilty * ability
             + education_voucher * voucher)

# outcome
income = np.random.normal(10, 3, size=n_points) + income_abilty * ability + income_education * education

data = np.stack([voucher, income]).T
df2 = pd.DataFrame(data, columns = ['voucher', 'income'])

df = pd.concat([df1, df2], ignore_index=True)
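To make the resulting missingness pattern explicit, here is a minimal sketch with two tiny hypothetical frames that share the column layout of df1 and df2 (the numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2, two rows each.
df1 = pd.DataFrame({"education": [11.0, 9.5], "voucher": [2.1, 1.4]})
df2 = pd.DataFrame({"voucher": [2.5, 1.9], "income": [55.0, 48.0]})

# Stacking frames with different columns fills the gaps with NaN.
df = pd.concat([df1, df2], ignore_index=True)

# Count missing cells per column: education is missing in the rows from df2,
# income is missing in the rows from df1, voucher is always observed.
missing_per_column = df.isna().sum()
print(missing_per_column)

# No row has all three variables observed at once.
complete_rows = len(df.dropna())
print(complete_rows)
```

Any estimator that needs complete rows therefore has nothing to work with, even though each pair (voucher, education) and (voucher, income) is fully observed on its own block of rows.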

I expect to be able to run the estimation procedure and get the correct result:

from dowhy import CausalModel

# Step 1: Model
model = CausalModel(
        data = df,
        treatment='education',
        outcome='income',
        common_causes=['U'],
        instruments=['voucher']
        )

# Step 2: Identify
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)

# Step 3: Estimate
estimate = model.estimate_effect(identified_estimand,
        method_name="iv.instrumental_variable")

print(estimate)

But the estimate is nan:

Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
No such variable(s) found!

### Estimand : 2
Estimand name: iv
Estimand expression:
 ⎡                                            -1⎤
 ⎢    d              ⎛    d                  ⎞  ⎥
E⎢──────────(income)⋅⎜──────────([education])⎟  ⎥
 ⎣d[voucher]         ⎝d[voucher]             ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→income then ¬(U →→{voucher})
Estimand assumption 2, Exclusion: If we remove {voucher}→{education}, then ¬({voucher}→income)

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: iv
Estimand expression:
 ⎡                                            -1⎤
 ⎢    d              ⎛    d                  ⎞  ⎥
E⎢──────────(income)⋅⎜──────────([education])⎟  ⎥
 ⎣d[voucher]         ⎝d[voucher]             ⎠  ⎦
Estimand assumption 1, As-if-random: If U→→income then ¬(U →→{voucher})
Estimand assumption 2, Exclusion: If we remove {voucher}→{education}, then ¬({voucher}→income)

## Realized estimand
Realized estimand: Wald Estimator
Realized estimand type: EstimandType.NONPARAMETRIC_ATE
Estimand expression:
  ⎡   d            ⎤  
 E⎢────────(income)⎥  
  ⎣dvoucher        ⎦  
──────────────────────
 ⎡   d               ⎤
E⎢────────(education)⎥
 ⎣dvoucher           ⎦
Estimand assumption 1, As-if-random: If U→→income then ¬(U →→{voucher})
Estimand assumption 2, Exclusion: If we remove {voucher}→{education}, then ¬({voucher}→income)
Estimand assumption 3, treatment_effect_homogeneity: Each unit's treatment ['education'] is affected in the same way by common causes of ['education'] and ['income']
Estimand assumption 4, outcome_effect_homogeneity: Each unit's outcome ['income'] is affected in the same way by common causes of ['education'] and ['income']

Target units: ate

## Estimate
Mean value: nan
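For reference, the ratio in the realized estimand above can be computed by hand from two separate samples. The following is a minimal sketch in plain NumPy, not DoWhy code. The coefficient values, the seeded generator, and the helper names `simulate` and `slope` are assumptions introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 200_000

# Coefficients assumed for illustration (names spelled as in the issue).
education_abilty = 1.0
education_voucher = 2.0
income_abilty = 2.0
income_education = 4.0  # the effect we hope to recover


def simulate(n):
    """Draw one sample from the voucher/education/income model."""
    ability = rng.normal(0, 3, size=n)   # unobserved confounder
    voucher = rng.normal(2, 1, size=n)   # instrument
    education = (rng.normal(5, 1, size=n)
                 + education_abilty * ability
                 + education_voucher * voucher)
    income = (rng.normal(10, 3, size=n)
              + income_abilty * ability
              + income_education * education)
    return voucher, education, income


def slope(x, y):
    """OLS slope of y on x, i.e. Cov(x, y) / Var(x)."""
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]


# Sample 1 keeps only (voucher, education); sample 2 keeps only (voucher, income).
voucher1, education1, _ = simulate(n_points)
voucher2, _, income2 = simulate(n_points)

first_stage = slope(voucher1, education1)   # d education / d voucher
reduced_form = slope(voucher2, income2)     # d income / d voucher
wald_estimate = reduced_form / first_stage

print(f"first stage:  {first_stage:.3f}")
print(f"reduced form: {reduced_form:.3f}")
print(f"Wald ratio:   {wald_estimate:.3f}")
```

With these assumed coefficients the ratio should land close to `income_education`, which is the result the issue expects from the stacked dataframe.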

Version information:

DoWhy version [e.g. 0.12]

Additional context
I'm new to both the theory of structural causal modeling and its implementation in DoWhy. Perhaps I'm wrong about this being estimable, or perhaps there is a way to specify "unknown" variables in DoWhy that I'm not aware of. Please forgive me in either case.

AshwinParanjape · 2025-01-13T00:54:25Z

While trying to understand the source code, I found the place where I believe we should be able to mask out missing values. I created a PR as a proof of concept: #1296
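The idea of masking can be illustrated outside DoWhy. The sketch below is not the code from the PR and makes no claim about DoWhy internals. It only shows, under assumed coefficients, that plain NumPy arithmetic propagates NaN on the stacked frame, while pandas' `Series.cov` drops incomplete pairs and so evaluates each term on the rows where it is observed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100_000
true_effect = 4.0  # assumed effect of education on income


def simulate(size):
    """Draw (voucher, education, income); coefficients are illustrative."""
    ability = rng.normal(0, 3, size=size)
    voucher = rng.normal(2, 1, size=size)
    education = rng.normal(5, 1, size=size) + 1.0 * ability + 2.0 * voucher
    income = rng.normal(10, 3, size=size) + 2.0 * ability + true_effect * education
    return voucher, education, income


v1, e1, _ = simulate(n)
v2, _, i2 = simulate(n)
df = pd.concat(
    [pd.DataFrame({"education": e1, "voucher": v1}),
     pd.DataFrame({"voucher": v2, "income": i2})],
    ignore_index=True,
)

# Naive: NumPy does not skip NaN, so the covariance is NaN.
naive_numerator = np.cov(df["voucher"].to_numpy(), df["income"].to_numpy())[0, 1]

# Masked: Series.cov excludes pairs where either value is missing.
numerator = df["voucher"].cov(df["income"])       # uses the df2 rows only
denominator = df["voucher"].cov(df["education"])  # uses the df1 rows only
masked_estimate = numerator / denominator

print(f"naive numerator: {naive_numerator}")
print(f"masked estimate: {masked_estimate:.3f}")
```

Because the two covariances come from different rows, this is a two-sample estimate. It relies on both blocks of rows being drawn from the same population, which holds by construction in the simulation.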

AshwinParanjape added the question Further information is requested label Jan 12, 2025
