feat: Causal DoubleMLEstimator (#8) #1715

dylanw-oss · 2022-11-12T01:05:21Z

What changes are proposed in this pull request?

Add package 'com.microsoft.azure.synapse.ml.causal' and implementation LinearDMLEstimator

How is this patch tested?

I have written tests

Does this PR change any dependencies?

No.

Does this PR add a new feature? If so, have you added samples on website?

Yes.

Add package 'com.microsoft.azure.synapse.ml.causal' and implementation LinearDMLEstimator

acrolinxatmsft1 · 2022-11-12T01:05:27Z

Acrolinx Scorecards

A minimum Acrolinx score of 80 is required.

Click the scorecard links for each article to review the Acrolinx feedback on grammar, spelling, punctuation, writing style, and terminology.

Article	Acrolinx score	Word and Phrases Score	Correctness Score	Scorecard	Processed
website/docs/documentation/estimators/causal/_causalInferenceDML.md				link	⚠️
website/docs/documentation/estimators/estimators_causal.md				link	⚠️

More information about Acrolinx

github-actions · 2022-11-12T01:05:34Z

Hey @dylanw-oss 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

fix: Fix LightGBM crashes with empty partitions
feat: Make HTTP on Spark back-offs configurable
docs: Update Spark Serving usage
build: Add codecov support
perf: improve LightGBM memory usage
refactor: make python code generation rely on classes
style: Remove nulls from CNTKModel
test: Add test coverage for CNTKModel
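
As a rough local check before pushing, the prefix rule above can be sketched in Python. The prefix list is taken only from the examples in this comment, and the helper name `has_semantic_prefix` is hypothetical. The repository's actual check may accept a different set.

```python
import re

# Prefixes copied from the example commit messages above. Assumption: the
# repository's real check may allow more (or fewer) prefixes than these.
SEMANTIC_PREFIXES = ("fix", "feat", "docs", "build", "perf", "refactor", "style", "test")

# A title passes if it starts with "<prefix>: " followed by a non-space character.
_PATTERN = re.compile(r"^(?:%s): \S" % "|".join(SEMANTIC_PREFIXES))


def has_semantic_prefix(title: str) -> bool:
    """Return True if the title starts with one of the known semantic prefixes."""
    return _PATTERN.match(title) is not None


if __name__ == "__main__":
    for title in ("feat: Causal DoubleMLEstimator (#8)", "Causal DoubleMLEstimator"):
        print(title, "->", has_semantic_prefix(title))
```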

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.

dylanw-oss · 2022-11-12T01:06:26Z

/azp run

azure-pipelines · 2022-11-12T01:06:32Z

Commenter does not have sufficient privileges for PR 1715 in repo microsoft/SynapseML

dylanw-oss · 2022-11-12T01:10:08Z

@serena-ruan, @mhamilton723, can anyone help to give me permission to run a pipeline?

memoryz · 2022-11-12T02:37:50Z

/azp run

azure-pipelines · 2022-11-12T02:38:00Z

Azure Pipelines successfully started running 1 pipeline(s).

codecov-commenter · 2022-11-12T02:46:55Z

Codecov Report

Merging #1715 (60f09e2) into master (4a25954) will decrease coverage by 0.57%.
The diff coverage is 32.25%.

@@            Coverage Diff             @@
##           master    #1715      +/-   ##
==========================================
- Coverage   86.51%   85.94%   -0.58%     
==========================================
  Files         273      276       +3     
  Lines       14420    14571     +151     
  Branches      769      754      -15     
==========================================
+ Hits        12476    12523      +47     
- Misses       1944     2048     +104

Impacted Files	Coverage Δ
...ft/azure/synapse/ml/causal/DoubleMLEstimator.scala	`8.97% <8.97%> (ø)`
.../azure/synapse/ml/causal/ResidualTransformer.scala	`28.12% <28.12%> (ø)`
...osoft/azure/synapse/ml/causal/DoubleMLParams.scala	`71.05% <71.05%> (ø)`
...azure/synapse/ml/core/schema/SchemaConstants.scala	`100.00% <100.00%> (ø)`
...microsoft/azure/synapse/ml/train/AutoTrainer.scala	`100.00% <100.00%> (ø)`
...osoft/azure/synapse/ml/train/TrainClassifier.scala	`84.32% <100.00%> (+0.11%)`	⬆️
...rosoft/azure/synapse/ml/train/TrainRegressor.scala	`92.30% <100.00%> (+1.92%)`	⬆️

serena-ruan · 2022-11-14T07:39:56Z

> @serena-ruan, @mhamilton723, can anyone help to give me permission to run a pipeline?

Hi @dylanw-oss, thanks for this PR! Could you raise a request to join this team: https://github.com/orgs/microsoft/teams/synapseml so that @mhamilton723 can add you? Then you can run /azp run to trigger the pipeline.

mhamilton723 · 2022-11-14T14:11:11Z

core/src/main/scala/com/microsoft/azure/synapse/ml/causal/LinearDMLEstimator.scala

+  }
+
+  @DeveloperApi
+  override def transformSchema(schema: StructType): StructType =


This transform schema doesn't look right. Are you sure this doesn't add any info to the DataFrame?

LinearDMLEstimator transform does nothing by design and isn't supposed to be called by the end user.
Previously, I had it throw an exception, but that doesn't pass fuzzing tests, so I changed it to return the original dataset. Given that, I don't think transform schema needs to add anything; please correct me if I'm missing something.

I believe there is a way to use this model in a natural way and perform a regression. In particular, you can think of this pipeline as estimating a prediction variable in two steps. The first is the debiasing operation, where you map a DataFrame to its residuals. The second is the prediction of the target residuals.

To form the actual prediction target, you first use your baseline estimate of the target from step 1, then add your predicted residual from step 2.

To give a little more info here:

First use your learned residual models to map the inputs to their residuals, then use your treatment effect model to map the residuals to the treatment. Then append that treatment as the prediction column. (If I'm missing something here, perhaps we can chat to help clarify.)
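
The two-step idea discussed here (map inputs to residuals, then regress residual on residual) can be sketched in plain Python. This is only an illustration under assumptions, not the Scala implementation in this PR: it uses a single confounder, linear nuisance models, and no cross-fitting, and the names `fit_line`, `residuals`, and `estimate_effect` are hypothetical.

```python
# Illustrative sketch only; not SynapseML code. All names are hypothetical.


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def residuals(xs, ys):
    """Step 1: remove the part of ys that a linear model in xs explains."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def estimate_effect(confounder, treatment, outcome):
    """Step 2: regress outcome residuals on treatment residuals (no intercept)."""
    t_res = residuals(confounder, treatment)
    y_res = residuals(confounder, outcome)
    return sum(t * y for t, y in zip(t_res, y_res)) / sum(t * t for t in t_res)


# Synthetic, noise-free data: the confounder drives both treatment and outcome.
confounder = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
wiggle = [1.0, -1.0, 2.0, -2.0, 1.0, -1.0]  # treatment variation not explained by the confounder
treatment = [2.0 * x + w for x, w in zip(confounder, wiggle)]
outcome = [3.0 * t + 5.0 * x for t, x in zip(treatment, confounder)]

theta = estimate_effect(confounder, treatment, outcome)
print("estimated treatment effect:", theta)
```

Because the synthetic outcome is built with a treatment coefficient of 3.0 and no noise, the residual-on-residual slope should recover that coefficient up to floating-point error. The output of this procedure is a single effect estimate rather than a per-row prediction.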

@memoryz, Jason, did you sync with our data scientist on whether this is feasible?

If there are no objections, I'll set it as by design.

@mhamilton723, I discussed the feedback in depth with our data scientist @sarahshy, and she confirmed that there is no meaningful natural transformation we can do here. We can implement a natural transformation as you suggested, but the result won't be meaningful and interpretable. Therefore, I suggest we resolve this item as "by design". I can schedule a meeting with @sarahshy if you still have concerns.

mhamilton723

Lovely work! Really excited to see this go in. Please feel free to chat if any of these comments don't make sense!

acrolinxatmsft1 · 2022-12-06T20:10:45Z

Acrolinx Scorecards

A minimum Acrolinx score of 80 is required.

Click the scorecard links for each article to review the Acrolinx feedback on grammar, spelling, punctuation, writing style, and terminology.

Article	Acrolinx score	Word and Phrases Score	Correctness Score	Scorecard	Processed
website/docs/documentation/estimators/causal/_causalInferenceDML.md	100	100	100	link	✅
website/docs/documentation/estimators/estimators_causal.md	100	100	100	link	✅

More information about Acrolinx

dylanw-oss · 2022-12-06T21:46:43Z

/azp run

azure-pipelines · 2022-12-06T21:46:54Z

Azure Pipelines successfully started running 1 pipeline(s).

memoryz

LGTM now.

dylanw-oss · 2022-12-08T19:02:11Z

/azp run

azure-pipelines · 2022-12-08T19:02:21Z

Azure Pipelines successfully started running 1 pipeline(s).

memoryz · 2022-12-16T18:26:24Z

/azp run

azure-pipelines · 2022-12-16T18:26:36Z

Azure Pipelines successfully started running 1 pipeline(s).

memoryz · 2022-12-16T22:13:07Z

@mhamilton723 ready for merge. :)

feat: Causal dmlestimator (#8)

08c79cc

Add package 'com.microsoft.azure.synapse.ml.causal' and implementation LinearDMLEstimator