Better handling of treated input in RegressionDiscontinuity #440

Open
2 tasks
drbenvincent opened this issue Feb 28, 2025 · 3 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

Comments

@drbenvincent
Collaborator

When doing regression discontinuity analysis, e.g.

result = cp.RegressionDiscontinuity(
    df,
    formula="y ~ 1 + x + treated + x:treated",
    model=cp.pymc_models.LinearRegression(sample_kwargs={"random_seed": seed}),
    treatment_threshold=0.5,
)

it looks like treated has to be of type bool. A mysterious error arises if it is instead coded as integer 0s and 1s.

  • Add an extra data validation step
  • Add a test to check that we get an exception if we provide ints
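
A minimal sketch of what these two tasks could look like, assuming a standalone helper rather than CausalPy's actual internals. The function name `validate_treated_column` and the choice of `TypeError` are hypothetical; the check itself uses only `pandas.api.types.is_bool_dtype`.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype


def validate_treated_column(df: pd.DataFrame, column: str = "treated") -> None:
    """Hypothetical validation step: raise if the treatment indicator is not boolean."""
    if column not in df.columns:
        raise KeyError(f"Column '{column}' not found in the dataframe.")
    if not is_bool_dtype(df[column]):
        raise TypeError(
            f"Column '{column}' must have dtype bool, got {df[column].dtype}. "
            f"Convert it first, e.g. df['{column}'] = df['{column}'].astype(bool)."
        )


def test_int_treated_raises() -> None:
    """Test sketch: integer-coded 0/1 input should be rejected."""
    df = pd.DataFrame({"treated": [0, 1, 1, 0]})
    try:
        validate_treated_column(df)
    except TypeError:
        return
    raise AssertionError("Expected a TypeError for an integer treated column")


test_int_treated_raises()
```

In a real test suite the `try`/`except` would more likely be written with `pytest.raises`; it is spelled out here only to keep the sketch dependency-free beyond pandas.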
@drbenvincent drbenvincent added bug Something isn't working good first issue Good for newcomers labels Feb 28, 2025
@inhandan

inhandan commented Mar 7, 2025

I'd like to solve this. Can you provide a code snippet and the error message? Please include the definition of df, in particular the treated column.

@drbenvincent
Collaborator Author

Hi @inhandan. Here's a MWE to reproduce the bug:

import causalpy as cp
import pandas as pd
import numpy as np

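seed = 42
rng = np.random.default_rng(seed)  # seed the data generation so the example is reproducible

threshold = 0.5
x = rng.uniform(0, 1, 100)
treated = np.where(x > threshold, 1, 0)  # integer 0/1 indicator, which triggers the bug
y = 2 * x + treated + rng.normal(0, 1, 100)
df = pd.DataFrame({"x": x, "treated": treated, "y": y})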

assert df["treated"].dtype == "int64"

result = cp.RegressionDiscontinuity(
    df,
    formula="y ~ 1 + x + treated + x:treated",
    model=cp.pymc_models.LinearRegression(sample_kwargs={"random_seed": seed}),
    treatment_threshold=threshold,
)

But the bug disappears if we set treated as categorical (df["treated"] = pd.Categorical(df["treated"])) or bool (df["treated"] = df["treated"].astype(bool))

I guess the best option is to throw a warning if treated is not categorical or boolean. That puts the onus on the user to ensure the data is being entered correctly. This would probably also be safer and less error-prone than trying to coerce treated to categorical or bool.
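
As a rough illustration of the warning-based option, here is a sketch that leaves the data untouched and only warns. The helper name `warn_if_treated_not_bool_or_categorical` and the use of `UserWarning` are assumptions made for this example, not part of CausalPy.

```python
import warnings

import pandas as pd
from pandas.api.types import is_bool_dtype


def warn_if_treated_not_bool_or_categorical(
    df: pd.DataFrame, column: str = "treated"
) -> bool:
    """Hypothetical check: warn (without coercing) when the dtype looks wrong.

    Returns True when the column is boolean or categorical, False otherwise.
    """
    dtype = df[column].dtype
    ok = is_bool_dtype(dtype) or isinstance(dtype, pd.CategoricalDtype)
    if not ok:
        warnings.warn(
            f"Column '{column}' has dtype {dtype}; expected bool or categorical. "
            f"Consider df['{column}'].astype(bool) or pd.Categorical(df['{column}']).",
            UserWarning,
            stacklevel=2,
        )
    return ok


# Integer-coded input triggers the warning; the two conversions from the MWE do not.
df = pd.DataFrame({"treated": [0, 1, 1, 0]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    int_ok = warn_if_treated_not_bool_or_categorical(df)
    bool_ok = warn_if_treated_not_bool_or_categorical(
        df.assign(treated=df["treated"].astype(bool))
    )
    cat_ok = warn_if_treated_not_bool_or_categorical(
        df.assign(treated=pd.Categorical(df["treated"]))
    )
```

Returning a flag instead of modifying the dataframe keeps the decision about how to convert the column with the user, which is the point of warning rather than coercing.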

@HPCurtis
Contributor

@inhandan have you made any progress with this issue?
