API change for the `SyntheticControl` experiment class #460

drbenvincent · 2025-04-21T10:08:28Z

Towards Upgrade synthetic control to model multiple treated units #456
This does not yet enable multiple treated units in synthetic control experiments. But it implements important prep work which will enable it. The changes focus on:
API changes
Bit of a spaghetti situation, but I had to switch from storing dataframes to `xarray.DataArray`s. This helps with broadcasting. The model functions receive different inputs depending on the situation (e.g. the experiment), so it was getting complicated. Xarray simplifies broadcasting in functions like `PyMCModel.calculate_impact`. (I spent a fair amount of time trying different solutions before it became clear that the xarray approach was the easiest.)
Because the API is changing, I elected to remove some legacy backward-compatibility and deprecation handling. So the next release will include breaking API changes and will drop backward compatibility with the old API. I'm not fussed about this: we are currently at version 0.x, so people should expect the API to change until we reach 1.x.
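
As a rough illustration of why named dimensions help with broadcasting, the sketch below (not code from this PR) multiplies a design matrix by a weight array and lets xarray align the two on the shared `coeffs` dimension. The `obs_ind` and `coeffs` names follow the diff in this PR; the `treated_units` dimension, the unit labels, and all data values are assumptions made up for the example.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
control_names = ["a", "b", "c"]  # hypothetical control units

# Design matrix: one row per observation, one column per predictor.
X = xr.DataArray(
    rng.normal(size=(5, 3)),
    dims=["obs_ind", "coeffs"],
    coords={"obs_ind": np.arange(5), "coeffs": control_names},
)

# Hypothetical weights: one row of weights per treated unit.
weights = xr.DataArray(
    np.full((2, 3), 1 / 3),
    dims=["treated_units", "coeffs"],
    coords={"treated_units": ["t1", "t2"], "coeffs": control_names},
)

# xarray matches the arrays on the dimension name "coeffs", so no manual
# reshaping or transposing is needed before the weighted sum.
mu = (X * weights).sum("coeffs")
print(mu.sizes)  # one value per (obs_ind, treated_units) pair
```

With plain NumPy arrays or dataframes, the same operation depends on getting the axis order right for every input shape, which is where the complexity came from.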

Remaining tasks:

Check the multi cell geolift notebook is working as expected
Fix failing doctest
~~Resolve error below~~ Using a linear regression model with synthetic control is not a planned use case. If it becomes important then we can deal with that at the time.

Remaining bug, not captured by tests

In the scikit-learn synthetic control notebook, we are getting an error in the second part where we call

```python
result = cp.SyntheticControl(
    df,
    treatment_time,
    control_units=["a", "b", "c", "d", "e", "f", "g"],
    treated_units=["actual"],
    model=LinearRegression(positive=True),
)
```

There is a broadcasting issue resulting in this:

So I need to think about whether doing this (a linear model with a synthetic control experiment) even makes sense. If it does, then I need to add a test to catch this error, because it's not currently covered by tests, and fix it.
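
For context, here is a minimal sketch of one common way a silent broadcasting problem can arise, together with the kind of shape assertion a test could use to catch it. This is an assumption for illustration only, not a diagnosis of the actual error above; the helper `check_impact_shape` and all values are hypothetical.

```python
import numpy as np

n_obs = 4
y = np.arange(n_obs, dtype=float)                       # shape (4,)
y_pred = np.arange(n_obs, dtype=float).reshape(-1, 1)   # shape (4, 1)

# Subtracting a column vector from a 1D vector broadcasts to a square
# matrix instead of raising an error.
bad_impact = y - y_pred
print(bad_impact.shape)

# Dropping the trailing axis first gives the intended elementwise result.
good_impact = y - y_pred.squeeze(axis=1)
print(good_impact.shape)


def check_impact_shape(impact, n_obs):
    """Hypothetical test helper: fail loudly on an unexpected shape."""
    if impact.shape != (n_obs,):
        raise ValueError(f"expected shape ({n_obs},), got {impact.shape}")
    return impact


check_impact_shape(good_impact, n_obs)
```

A shape check of this kind in the integration tests would turn a silent mis-broadcast into an explicit failure.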

📚 Documentation preview 📚: https://causalpy--460.org.readthedocs.build/en/460/

review-notebook-app · 2025-04-21T10:08:39Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

codecov · 2025-04-23T13:38:27Z

Codecov Report

Attention: Patch coverage is 97.43590% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.40%. Comparing base (273daa2) to head (2fcb3ba).
Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| causalpy/experiments/synthetic_control.py | 95.65% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #460      +/-   ##
==========================================
- Coverage   94.67%   94.40%   -0.27%     
==========================================
  Files          32       29       -3     
  Lines        2196     2075     -121     
==========================================
- Hits         2079     1959     -120     
+ Misses        117      116       -1


NathanielF

This was a quick pass but looks broadly sensible. Just trying to understand how extensive you want to make the API change? Asked some questions but mostly clarifying for my own understanding.

I can see how the API change makes it easier to allow for a matrix of weights in the WeightedFitter.

My one question was whether you wanted to allow multiple "equations" for the different treatment groups, i.e. generate different synthetic controls based on a varying set of inputs. Because if so, maybe patsy could still work... but you'd have to be more flexible in the PyMC model than in the API...

NathanielF · 2025-05-07T20:14:32Z

causalpy/experiments/diff_in_diff.py

+        # turn into xarray.DataArray's
+        self.X = xr.DataArray(
+            self.X,
+            dims=["obs_ind", "coeffs"],


The dims of X are named `coeffs`? Surely that should be `covariates`... `coeffs` is the output, and while the two should map 1:1, the naming is a little misleading, no?

NathanielF · 2025-05-07T20:17:38Z

causalpy/experiments/synthetic_control.py

-            self.model.fit(X=self.pre_X, y=self.pre_y, coords=COORDS)
+            COORDS = {
+                # key must stay as "coeffs" unless we can find a way to auto identify
+                # the predictor dimension name. "coeffs" is assumed by


Ah ok fair enough. I guess easier to be consistent across code base

NathanielF · 2025-05-07T20:24:02Z

causalpy/experiments/synthetic_control.py

+    :param control_units:
+        A list of control units to be used in the experiment
+    :param treated_units:
+        A list of treated units to be used in the experiment


Ok, sorry, I forgot this was synthetic control. So having the treated units as the target makes sense. The idea would be to enable multiple treatment groups? Do you want to enable different "equations" for the different treatment units? Or would it be just the same fixed controls for each potential treatment unit?

drbenvincent · 2025-05-09T10:13:02Z

Thanks for the review.

> I can see how the API change makes it easier to allow for a matrix of weights in the WeightedFitter.
>
> My one question was whether you wanted to allow multiple "equations" for the different treatment groups, i.e. generate different synthetic controls based on a varying set of inputs. Because if so, maybe patsy could still work... but you'd have to be more flexible in the PyMC model than in the API...

My original idea was to see if patsy allowed something like `treated1, treated2, treated3 ~ control1 + ... + controlN`, but it doesn't do that.

The rationale for going for a list of treated units and a list of control units was to a) allow multiple treated units, and b) allow the weights in the PyMC model to change from a 1D vector to a 2D array. Crucially, if all treated units are predicted by the same set of control units, then this would be a relatively trivial change to the shape of the Dirichlet-distributed weights.
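
To make the shape change concrete, here is a minimal sketch using NumPy draws rather than a PyMC model. It only illustrates the array shapes involved; the sizes, the flat prior, and the variable names are assumptions for the example, not taken from the implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_control, n_treated = 10, 7, 3  # hypothetical sizes

# Control unit observations: one column per control unit.
X = rng.normal(size=(n_obs, n_control))

# One treated unit: a single weight vector over the control units.
w_1d = rng.dirichlet(np.ones(n_control))
mu_1d = X @ w_1d
print(w_1d.shape, mu_1d.shape)

# Several treated units sharing the same controls: one weight vector
# per treated unit, each summing to one.
w_2d = rng.dirichlet(np.ones(n_control), size=n_treated)
mu_2d = X @ w_2d.T
print(w_2d.shape, mu_2d.shape)
```

Because every treated unit draws on the same control columns, the only structural change is the extra leading dimension on the weights.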

I have to admit, I didn't think of your idea of providing a list of formulas. This is interesting:

✅ Retains the same basic formula interface
✅ The user gets extra ability to add domain knowledge into the modelling process. They could have prior knowledge that some control outlets make sense to go into the pool for some treated units but not others. You could just let the data decide, but having that ability (to effectively set some coefficients at zero by not including a control unit as a predictor) to provide domain knowledge into the modelling is appealing.
❌ It would be slightly harder to implement in the WeightedSumFitter class. Probably not that hard though - you could probably just iterate over the treated units, building up the model with the right length 1-dimensional Dirichlet distributions as you go.
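
A rough sketch of the iteration idea above, again using NumPy draws in place of a PyMC model. The plain string splitting stands in for real formula parsing (it is not patsy), and the formulas, unit names, and the `parse_formula` helper are all hypothetical.

```python
import numpy as np


def parse_formula(formula):
    """Split 'treated ~ c1 + c2' into ('treated', ['c1', 'c2'])."""
    lhs, rhs = formula.split("~")
    return lhs.strip(), [term.strip() for term in rhs.split("+")]


# Hypothetical user input: each treated unit gets its own pool of controls.
formulas = [
    "t1 ~ a + b + c",
    "t2 ~ b + c + d + e",
]

rng = np.random.default_rng(0)
weights = {}
for formula in formulas:
    treated, controls = parse_formula(formula)
    # One 1D Dirichlet per treated unit, sized to its own control pool.
    draw = rng.dirichlet(np.ones(len(controls)))
    weights[treated] = dict(zip(controls, draw))

for treated, unit_weights in weights.items():
    print(treated, sorted(unit_weights))
```

Leaving a control unit out of a formula is what fixes its weight at zero for that treated unit, which is how the domain knowledge enters the model.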

NathanielF · 2025-05-09T11:54:35Z

Yes, I think you have the suggestion right. I think it potentially gives you more flexibility. I'm trying to do the same here: pymc-labs/pymc-marketing#1654

But it does kind of move the complexity into the model class as you have to parse and construct the relevant components.

drbenvincent added 3 commits April 21, 2025 10:16

initial efforts

7bbff4f

remove print statement

7ece785

obs_indx -> obs_ind (see #459)

98127fd

drbenvincent marked this pull request as draft April 21, 2025 10:08

drbenvincent added 25 commits April 21, 2025 11:10

update API in tests for SyntheticControl class

82d041d

Merge branch 'main' into sc-api-change

3bbabee

tidy up + fixes

1f4d17e

get deprecation tests working again

182aac0

bug fixes

876c154

fix bug with SyntheticControl.get_plot_data_bayesian

3d29fef

use new API in scikit-learn integration test

3eb24a9

update the pymc synthetic control notebooks

bd9beaa

remove test_api_stability

ed62f0a

fix bugs

a148ec3

bug fixing

15454b2

remove api backward compatibility and deprecation tests

b77c3f0

more deprecation removal

c89a147

add additional asserts to integration tests to detect shape problems

fad78d8

remove asserts which weren't doing the job I intended

2137091

start embracing xarray to handle broadcasting

45f1b1a

formatting

2726484

store data in xarray objects in more experiments

b920207

attempt to make LinearRegression doctest pass

a28c5da

revert a change which seems no longer required

642f651

fix some failing tests

a17f5c9

all tests now passing 🎉 (one failing doctest)

9941c02

update api calls in multi cell geolift notebook

6db026e

undo plot colour change that I made when debugging

b49ed7e

final doctest now passes 😍

1f753e9

drbenvincent added 3 commits May 7, 2025 09:19

update notebook on scikit-learn synthetic control example

d855557

rerun synthetic control notebook

c92c117

fix minor issue in multi-cell geolift notebook

2fcb3ba

drbenvincent marked this pull request as ready for review May 7, 2025 08:58

drbenvincent requested a review from NathanielF May 7, 2025 19:29

NathanielF reviewed May 7, 2025
