feat: [DistributionBalanceMeasure] Add implementation + unit tests for custom reference distribution #1885
feat: ONNX model inference on Spark (microsoft#1152)
…om reference distribution
Hey @ms-kashyap 👋! We use semantic commit messages to streamline the release process. Examples of commit messages with semantic prefixes:
To test your commit locally, please follow our guide on building from source.
core/src/main/scala/com/microsoft/azure/synapse/ml/exploratory/DistributionBalanceMeasure.scala
.../test/scala/com/microsoft/azure/synapse/ml/exploratory/DistributionBalanceMeasureSuite.scala
…use isDefined instead of isEmpty
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report
Coverage Diff

| | master | #1885 | +/- |
|---|---|---|---|
| Coverage | 86.95% | 80.98% | -5.98% |
| Files | 301 | 301 | |
| Lines | 15565 | 15586 | +21 |
| Branches | 797 | 805 | +8 |
| Hits | 13535 | 12622 | -913 |
| Misses | 2030 | 2964 | +934 |
... and 20 files with indirect coverage changes
… pyspark error with ArrayMapParam)
… to use it (fixes testGettersAndSetters failure)
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Related Issues/PRs
Original PR for distribution balance measures - #1218
What changes are proposed in this pull request?
Context
Distribution balance measures are distance measures that can be computed between two distributions, typically between a reference and observed distribution.
These measures can help us build responsible AI products, for example by checking whether a column (such as Gender, Ethnicity, or any other column) is heavily skewed towards one value, which may lead to a biased model.
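As a rough illustration of what such a distance measure looks like, here is a minimal plain-Python sketch that compares the observed distribution of a column against a uniform reference. It is not the SynapseML implementation: the helper names (`empirical_distribution`, `uniform_reference`, `total_variation_distance`, `kl_divergence`) are hypothetical, and the two measures shown are common examples chosen for illustration, not a statement of which measures the library computes.

```python
import math
from collections import Counter


def empirical_distribution(values):
    """Map each unique value to its observed relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def uniform_reference(observed):
    """Uniform distribution over the unique values seen in the column."""
    n = len(observed)
    return {value: 1.0 / n for value in observed}


def total_variation_distance(observed, reference):
    """Half the sum of absolute probability differences (0 = identical)."""
    keys = set(observed) | set(reference)
    return 0.5 * sum(
        abs(observed.get(k, 0.0) - reference.get(k, 0.0)) for k in keys
    )


def kl_divergence(observed, reference):
    """KL(observed || reference); the direction is an assumption of this sketch.

    Assumes every observed value has a non-zero reference probability.
    """
    return sum(
        p * math.log(p / reference[v]) for v, p in observed.items() if p > 0
    )


# Hypothetical column skewed towards "A".
column = ["A", "A", "A", "B"]
observed = empirical_distribution(column)
reference = uniform_reference(observed)

tvd = total_variation_distance(observed, reference)
kl = kl_divergence(observed, reference)
print(f"total variation distance: {tvd:.4f}")
print(f"KL divergence: {kl:.4f}")
```

Both values are zero for a perfectly balanced column and grow as the column becomes more skewed towards one value.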
This PR
Until now, these distance measures were computed against a uniform reference distribution, meaning each unique value in a column is expected to appear equally often.
This PR allows users to specify a custom reference distribution in the form of a `<string, double>` map that contains `<feature value, feature probability>` pairs.
This will be particularly helpful for measuring data drift, where the custom reference distribution is the distribution of a baseline dataset column and the observed distribution is the distribution of the latest-versioned dataset column.
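The data drift scenario can be sketched in plain Python as follows. This is an illustrative sketch only, not the API added by this PR: the function names, the sample columns, and the validation rule (probabilities are non-negative and sum to 1) are all assumptions made for the example. The custom reference is a plain dict from feature value to probability, mirroring the `<feature value, feature probability>` map described above.

```python
from collections import Counter


def empirical_distribution(values):
    """Map each unique value to its observed relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def validate_reference(reference, tol=1e-9):
    """Assumed sanity check: probabilities are non-negative and sum to 1."""
    if any(p < 0 for p in reference.values()):
        raise ValueError("reference probabilities must be non-negative")
    if abs(sum(reference.values()) - 1.0) > tol:
        raise ValueError("reference probabilities must sum to 1")


def total_variation_distance(observed, reference):
    """Half the sum of absolute differences; missing values count as 0."""
    keys = set(observed) | set(reference)
    return 0.5 * sum(
        abs(observed.get(k, 0.0) - reference.get(k, 0.0)) for k in keys
    )


# Hypothetical baseline and latest versions of the same dataset column.
baseline_column = ["F", "F", "F", "M"]
latest_column = ["F", "F", "M", "M"]

# Custom reference: the distribution of the baseline column.
custom_reference = empirical_distribution(baseline_column)
validate_reference(custom_reference)

observed = empirical_distribution(latest_column)

# Distance to the baseline reveals drift ...
drift = total_variation_distance(observed, custom_reference)

# ... while a uniform reference over the same values reports no imbalance.
uniform = {value: 1.0 / len(observed) for value in observed}
versus_uniform = total_variation_distance(observed, uniform)

print(f"distance to baseline reference: {drift}")
print(f"distance to uniform reference: {versus_uniform}")
```

In this example the latest column is perfectly balanced, so the uniform reference sees nothing unusual, while the baseline reference shows that the share of each value has shifted.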
How is this patch tested?
Does this PR change any dependencies?
Does this PR add a new feature? If so, have you added samples on website?
- `website/docs/documentation` folder. Make sure you choose the correct class `estimators/transformers` and namespace.
- `DocTable` points to correct API link.
- `yarn run start` to make sure the website renders correctly.
- `<!--pytest-codeblocks:cont-->` before each python code blocks to enable auto-tests for python samples.
- `WebsiteSamplesTests` job pass in the pipeline.