Skip to content

API change for the SyntheticControl experiment class #460

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 31 commits into
base: main
Choose a base branch
from

Conversation

drbenvincent
Copy link
Collaborator

@drbenvincent drbenvincent commented Apr 21, 2025

  • Towards Upgrade synthetic control to model multiple treated units #456
  • This does not yet enable multiple treated units in synthetic control experiments. But it implements important prep work which will enable it. The changes focus on:
  • API changes
  • Bit of a spaghetti situation, but I had to switch from storing dataframes to xarray.DataArrays. This helps with the broadcasting. The model functions get varied input depending on the situation (e.g. experiment), so it was getting complicated. Xarray simplifies broadcasting in functions like PyMCModel.calculate_impact. (A bunch of time was used trying different solutions before it became clear that the xarray approach was the easiest).
  • Because the API is changing, I elected to get rid of some legacy backward compatibility/ depreciation handling stuff. So when we do the next release, this will include breaking API changes and abandoning backward compatibility with an old API. I'm not fussed about this, we are currently at version 0.x, so people should expect the API to change until we reach 1.x.

Remaining taks:

  • Check the multi cell geolift notebook is working as expected
  • Fix failing doctest
  • Resolve error below Using a linear regression model with synthetic control is not a planned use case. If it becomes important then we can deal with that at the time.

Remaining bug, not captured by tests

In the scikit-learn synthetic control notebook, we are getting an error in the second part where we call

result = cp.SyntheticControl(
    df,
    treatment_time,
    control_units=["a", "b", "c", "d", "e", "f", "g"],
    treated_units=["actual"],
    model=LinearRegression(positive=True),
)

There is a broadcasting issue resulting in this:
Screenshot 2025-04-23 at 10 34 27

So I need to think about if doing this (linear model with synthetic control experiment) even make sense. If it is, then I need to add a test to catch this error because it's not covered by tests currently, and fix it.


📚 Documentation preview 📚: https://causalpy--460.org.readthedocs.build/en/460/

Sorry, something went wrong.

@drbenvincent drbenvincent marked this pull request as draft April 21, 2025 10:08
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link

codecov bot commented Apr 23, 2025

Codecov Report

Attention: Patch coverage is 97.43590% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.40%. Comparing base (273daa2) to head (2fcb3ba).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
causalpy/experiments/synthetic_control.py 95.65% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #460      +/-   ##
==========================================
- Coverage   94.67%   94.40%   -0.27%     
==========================================
  Files          32       29       -3     
  Lines        2196     2075     -121     
==========================================
- Hits         2079     1959     -120     
+ Misses        117      116       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@drbenvincent drbenvincent marked this pull request as ready for review May 7, 2025 08:58
@drbenvincent drbenvincent requested a review from NathanielF May 7, 2025 19:29
Copy link
Contributor

@NathanielF NathanielF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a quick pass but looks broadly sensible. Just trying to understand how extensive you want to make the API change? Asked some questions but mostly clarifying for my own understanding.

I can see how the API change makes it easier to allow for a matrix of weights in the WeightedFitter.

My one questions was whether you wanted to allow multiple "equations" for the different treatment groups i.e. generate different synthetic controls based on a varying set of inputs. Because If so maybe patsy could still work.... but you'd have to more flexible in the PyMC model than in the API ....

# turn into xarray.DataArray's
self.X = xr.DataArray(
self.X,
dims=["obs_ind", "coeffs"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dims of X as coeffs? Should surely be covariates... coeffs is the output, and while they should be 1:1 naming is a little misleading no?

self.model.fit(X=self.pre_X, y=self.pre_y, coords=COORDS)
COORDS = {
# key must stay as "coeffs" unless we can find a way to auto identify
# the predictor dimension name. "coeffs" is assumed by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok fair enough. I guess easier to be consistent across code base

:param control_units:
A list of control units to be used in the experiment
:param treated_units:
A list of treated units to be used in the experiment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sorry. i forgot this was synthetic control. So treated as target makes sense. The idea would be to enable multiple treatment groups? Do you want to enable different "equations" for the different treatment units? Or would it be just the same fixed controls for each potential treatment unit?

@drbenvincent
Copy link
Collaborator Author

Thanks for the review.

I can see how the API change makes it easier to allow for a matrix of weights in the WeightedFitter.

My one questions was whether you wanted to allow multiple "equations" for the different treatment groups i.e. generate different synthetic controls based on a varying set of inputs. Because If so maybe patsy could still work.... but you'd have to more flexible in the PyMC model than in the API ....

My original idea was to see if patsy allowed something like treated1, treated 2, treated 3 ~ control1 + ... + controlN, but it doesn't do that.

The rationale for going for a list of treated and a list of control units was to a) allow multiple treated units, b) allow the weights in the pymc model to change from a 1D vector to a 2D array. Crucially, if all treated units are predicted by the same set of control units then this would be a relatively trivial change to the shape of the Dirichlet distributed weights.

I have to admit, I didn't think of your idea of providing a list of formulas. This is interesting:

  • ✅ Retains the same basic formula interface
  • ✅ The user gets extra ability to add domain knowledge into the modelling process. They could have prior knowledge that some control outlets make sense to go into the pool for some treated units but not others. You could just let the data decide, but having that ability (to effectively set some coefficients at zero by not including a control unit as a predictor) to provide domain knowledge into the modelling is appealing.
  • ❌ It would be slightly harder to implement in the WeightedSumFitter class. Probably not that hard though - you could probably just iterate over the treated units, building up the model with the right length 1-dimensional Dirichlet distributions as you go.

@NathanielF
Copy link
Contributor

Yes, i think you have the suggestion right. I think it potentially gives you more flexibility. I'm trying to do the same here: pymc-labs/pymc-marketing#1654

But it does kind of move the complexity into the model class as you have to parse and construct the relevant components.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants