docs: Introduce Responsible AI section on website (Interpretability + DataBalanceAnalysis) #1241

Merged · 25 commits · Nov 6, 2021

Commits
8f29786
Merge pull request #1 from Azure/master
ms-kashyap Aug 11, 2021
e2b68cc
Merge remote-tracking branch 'upstream/master'
ms-kashyap Sep 2, 2021
84d1441
Merge remote-tracking branch 'upstream/master'
ms-kashyap Oct 18, 2021
9b5a187
Merge remote-tracking branch 'upstream/master'
ms-kashyap Oct 21, 2021
eab5405
[DataBalanceAnalysis] Add doc and sample notebook
ms-kashyap Oct 22, 2021
33c55e8
Clear outputs in sample notebook
ms-kashyap Oct 22, 2021
3fb604e
Merge from upstream-master
ms-kashyap Oct 27, 2021
75a50c9
Address jasowang PR comments
ms-kashyap Oct 28, 2021
152d295
Merge branch 'master' into kapat/docs
ms-kashyap Oct 31, 2021
b12d500
Merge branch 'master' into kapat/docs
ms-kashyap Nov 2, 2021
4c22452
Merge remote-tracking branch 'upstream/master' into kapat/docs
kashmoneygt Nov 2, 2021
de72151
[DataBalanceAnalysis] Update notebook and doc
kashmoneygt Nov 2, 2021
7333586
Merge remote-tracking branch 'origin/kapat/docs' into kapat/docs
kashmoneygt Nov 2, 2021
5041f5f
[Databricks E2E Tests] Upgrade DBR from 8.3 to 9.1 LTS
kashmoneygt Nov 2, 2021
3ec3bb7
Merge branch 'master' into kapat/docs
ms-kashyap Nov 3, 2021
24b8394
Merge branch 'master' into kapat/docs
ms-kashyap Nov 3, 2021
ad41e11
[Databricks E2E Tests] Revert DBR from 9.1 LTS to 8.3
kashmoneygt Nov 3, 2021
941eb38
Merge remote-tracking branch 'upstream/master' into kapat/docs
kashmoneygt Nov 3, 2021
f665cea
Fix broken link
kashmoneygt Nov 3, 2021
89b55ca
[website] Model Interpretability -> Responsible AI
kashmoneygt Nov 4, 2021
eeecd03
Move ModelInterpretability - Snow Leopard Detection.md to examples
ms-kashyap Nov 4, 2021
6b5c296
Get latest from upstream/master
ms-kashyap Nov 4, 2021
70ee581
Merge remote-tracking branch 'upstream/master' into kapat/docs
ms-kashyap Nov 5, 2021
7565de5
Host DataBalanceAnalysis-AdultCensusIncome cell outputs in blob inste…
ms-kashyap Nov 5, 2021
3b3593b
Replace ModelInterpretability-SnowLeopardDetection with Interpretabil…
ms-kashyap Nov 5, 2021
663 changes: 0 additions & 663 deletions notebooks/Data Balance Analysis - Adult Census Income.ipynb

This file was deleted.

641 changes: 641 additions & 0 deletions notebooks/DataBalanceAnalysis - Adult Census Income.ipynb

Large diffs are not rendered by default.

8 changes: 5 additions & 3 deletions website/docs/examples/about.md
@@ -25,6 +25,7 @@ sidebar_label: About
- Train and evaluate a flight delay prediction system ([Regression - Flight Delays])
- Finding anomalous data access patterns using the Access Anomalies package of CyberML ([CyberML - Anomalous Access Detection])
- Model interpretation ([Interpretability - Tabular SHAP Explainer], [Interpretability - Image Explainers], [Interpretability - Text Explainers])
- Perform Data Balance Analysis to determine how well features and feature values are represented in your dataset ([DataBalanceAnalysis - Adult Census Income])


[Classification - Adult Census]: ../classification/Classification%20-%20Adult%20Census "Classification - Adult Census"
@@ -47,9 +48,10 @@ sidebar_label: About

[CyberML - Anomalous Access Detection]: ../CyberML%20-%20Anomalous%20Access%20Detection "CyberML - Anomalous Access Detection"

[Interpretability - Tabular SHAP Explainer]: ../model_interpretability/Interpretability%20-%20Tabular%20SHAP%20explainer "Interpretability - Tabular SHAP Explainer"
[Interpretability - Tabular SHAP Explainer]: ../responsible_ai/Interpretability%20-%20Tabular%20SHAP%20explainer "Interpretability - Tabular SHAP Explainer"

[Interpretability - Image Explainers]: ../model_interpretability/Interpretability%20-%20Image%20Explainers "Interpretability - Image Explainers"
[Interpretability - Image Explainers]: ../responsible_ai/Interpretability%20-%20Image%20Explainers "Interpretability - Image Explainers"

[Interpretability - Text Explainers]: ../model_interpretability/Interpretability%20-%20Text%20Explainers "Interpretability - Text Explainers"
[Interpretability - Text Explainers]: ../responsible_ai/Interpretability%20-%20Text%20Explainers "Interpretability - Text Explainers"

[DataBalanceAnalysis - Adult Census Income]: ../responsible_ai/DataBalanceAnalysis%20-%20Adult%20Census%20Income "DataBalanceAnalysis - Adult Census Income"

Large diffs are not rendered by default.

@@ -0,0 +1,191 @@
---
title: Interpretability - Explanation Dashboard
hide_title: true
status: stable
---
## Interpretability - Explanation Dashboard

In this example, as in the "Interpretability - Tabular SHAP explainer" notebook, we use Kernel SHAP to explain a tabular classification model built from the Adult Census dataset, and then visualize the explanation in the ExplanationDashboard from https://github.com/microsoft/responsible-ai-widgets.

First we import the packages and define some UDFs we will need later.


```python
import pyspark
from synapse.ml.explainers import *
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.sql.types import *
from pyspark.sql.functions import *
import pandas as pd

vec_access = udf(lambda v, i: float(v[i]), FloatType())
vec2array = udf(lambda vec: vec.toArray().tolist(), ArrayType(FloatType()))
```

Now let's read the data and train a simple binary classification model.


```python
df = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/AdultCensusIncome.parquet")

labelIndexer = StringIndexer(inputCol="income", outputCol="label", stringOrderType="alphabetAsc").fit(df)
print("Label index assignment: " + str(set(zip(labelIndexer.labels, [0, 1]))))

training = labelIndexer.transform(df)
display(training)
categorical_features = [
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
]
categorical_features_idx = [col + "_idx" for col in categorical_features]
categorical_features_enc = [col + "_enc" for col in categorical_features]
numeric_features = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]

strIndexer = StringIndexer(inputCols=categorical_features, outputCols=categorical_features_idx)
onehotEnc = OneHotEncoder(inputCols=categorical_features_idx, outputCols=categorical_features_enc)
vectAssem = VectorAssembler(inputCols=categorical_features_enc + numeric_features, outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", weightCol="fnlwgt")
pipeline = Pipeline(stages=[strIndexer, onehotEnc, vectAssem, lr])
model = pipeline.fit(training)
```

After the model is trained, we randomly select some observations to be explained.


```python
explain_instances = model.transform(training).orderBy(rand()).limit(5).repartition(200).cache()
display(explain_instances)
```

We create a TabularSHAP explainer, set the input columns to all the features the model takes, and specify the model and the target output column we are trying to explain. In this case, we are explaining the "probability" output, a vector of length 2, and we only look at the class 1 probability. Set targetClasses to `[0, 1]` if you want to explain the class 0 and class 1 probabilities at the same time. Finally, we sample 100 rows from the training data as background data, which Kernel SHAP uses to integrate out features.


```python
shap = TabularSHAP(
    inputCols=categorical_features + numeric_features,
    outputCol="shapValues",
    numSamples=5000,
    model=model,
    targetCol="probability",
    targetClasses=[1],
    backgroundData=broadcast(training.orderBy(rand()).limit(100).cache()),
)

shap_df = shap.transform(explain_instances)
```

Once we have the resulting dataframe, we extract the class 1 probability of the model output, the SHAP values for the target class, the original features, and the true label, then convert it to a pandas dataframe for visualization.
For each observation, the first element in the SHAP values vector is the base value (the mean output of the background dataset), and each following element is the SHAP value for the corresponding feature.


```python
shaps = (
    shap_df.withColumn("probability", vec_access(col("probability"), lit(1)))
    .withColumn("shapValues", vec2array(col("shapValues").getItem(0)))
    .select(["shapValues", "probability", "label"] + categorical_features + numeric_features)
)

shaps_local = shaps.toPandas()
shaps_local.sort_values("probability", ascending=False, inplace=True, ignore_index=True)
pd.set_option("display.max_colwidth", None)
shaps_local
```

We can visualize the explanation in the [interpret-community format](https://github.com/interpretml/interpret-community) in the ExplanationDashboard from https://github.com/microsoft/responsible-ai-widgets.


```python
import pandas as pd
import numpy as np

features = categorical_features + numeric_features
features_with_base = ["Base"] + features

rows = shaps_local.shape[0]

local_importance_values = shaps_local[['shapValues']]
eval_data = shaps_local[features]
true_y = np.array(shaps_local[['label']])
```


```python
list_local_importance_values = local_importance_values.values.tolist()
converted_importance_values = []
bias = []
for classarray in list_local_importance_values:
    for rowarray in classarray:
        converted_list = rowarray.tolist()
        bias.append(converted_list[0])
        # remove the bias from local importance values
        del converted_list[0]
        converted_importance_values.append(converted_list)
```

When running on Azure Synapse Analytics, follow the instructions in [Package management - Azure Synapse Analytics | Microsoft Docs](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries) to install the ["raiwidgets"](https://pypi.org/project/raiwidgets/) and ["interpret-community"](https://pypi.org/project/interpret-community/) packages.


```python
!pip install --upgrade raiwidgets
```


```python
!pip install --upgrade interpret-community
```


```python
from interpret_community.adapter import ExplanationAdapter
adapter = ExplanationAdapter(features, classification=True)
global_explanation = adapter.create_global(converted_importance_values, eval_data, expected_values=bias)
```


```python
# view the global importance values
global_explanation.global_importance_values
```


```python
# view the local importance values
global_explanation.local_importance_values
```


```python
class wrapper(object):
    def __init__(self, model):
        self.model = model

    def predict(self, data):
        sparkdata = spark.createDataFrame(data)
        # Use the stored Spark ML model rather than relying on a global.
        return self.model.transform(sparkdata).select('prediction').toPandas().values.flatten().tolist()

    def predict_proba(self, data):
        sparkdata = spark.createDataFrame(data)
        prediction = self.model.transform(sparkdata).select('probability').toPandas().values.flatten().tolist()
        proba_list = [vector.values.tolist() for vector in prediction]
        return proba_list
```


```python
# view the explanation in the ExplanationDashboard
from raiwidgets import ExplanationDashboard
ExplanationDashboard(global_explanation, wrapper(model), dataset=eval_data, true_y=true_y)
```

Your results will look like:

<img src="https://mmlspark.blob.core.windows.net/graphics/rai-dashboard.png" />
@@ -76,7 +76,7 @@ shap = TabularSHAP(
    model=model,
    targetCol="probability",
    targetClasses=[1],
    backgroundData=training.orderBy(rand()).limit(100).cache(),
    backgroundData=broadcast(training.orderBy(rand()).limit(100).cache()),
)

shap_df = shap.transform(explain_instances)
2 changes: 1 addition & 1 deletion website/docs/features/onnx/about.md
@@ -45,5 +45,5 @@ MMLSpark now includes a Spark transformer to bring a trained ONNX model to Apac

## Example

- [Interpretability - Image Explainers](/docs/examples/model_interpretability/Interpretability%20-%20Image%20Explainers)
- [Interpretability - Image Explainers](/docs/examples/responsible_ai/Interpretability%20-%20Image%20Explainers)
- [ONNX - Inference on Spark](/docs/features/onnx/ONNX%20-%20Inference%20on%20Spark)
@@ -1,5 +1,7 @@
---
title: Data Balance Analysis on Spark
hide_title: true
sidebar_label: Data Balance Analysis
description: Learn how to do Data Balance Analysis on Spark to determine how well features and feature values are represented in your dataset.
---

@@ -17,7 +19,7 @@ In summary, Data Balance Analysis, used as a step for building ML models, has the

## Examples

* [Data Balance Analysis - Adult Census Income](https://github.com/microsoft/SynapseML/blob/master/notebooks/Data%20Balance%20Analysis%20-%20Adult%20Census%20Income.ipynb)
* [Data Balance Analysis - Adult Census Income](../../../examples/responsible_ai/DataBalanceAnalysis%20-%20Adult%20Census%20Income)

## Usage

@@ -173,22 +175,22 @@ This involves under-sampling from the majority class and over-sampling from the minority

1. Under-sampling may remove valuable information.
2. Over-sampling may cause overfitting and poor generalization on the test set.

![Bar chart undersampling and oversampling](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_SamplingBar.png)
![Bar chart undersampling and oversampling](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_SamplingBar.png)

There are smarter techniques for under-sampling and over-sampling in the literature, many of which are implemented in Python's [imbalanced-learn](https://imbalanced-learn.org/stable/) package.

For example, we can cluster the records of the majority class and under-sample by removing records from each cluster, thus seeking to preserve information.
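
As a rough illustration, the cluster-then-remove idea is available in imbalanced-learn as `ClusterCentroids`. This is a minimal sketch on synthetic data; the dataset and variable names are illustrative, not part of SynapseML:

```python
from collections import Counter

from imblearn.under_sampling import ClusterCentroids
from sklearn.datasets import make_classification

# Synthetic, imbalanced binary dataset (roughly 90% majority class).
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# Cluster the majority class and keep centroids instead of raw records,
# seeking to preserve the class's overall structure while shrinking it.
X_resampled, y_resampled = ClusterCentroids(random_state=42).fit_resample(X, y)
print("After:", Counter(y_resampled))
```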

One under-sampling technique uses Tomek links: pairs of very close instances of opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, facilitating classification. A similar way to under-sample the majority class is Near-Miss: it first calculates the distances between the points in the larger class and the points in the smaller class, and when two points from different classes are very close to each other in the distribution, it eliminates the point of the larger class, thereby balancing the distribution. Both are sketched after the figure below.

![Tomek Links](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_TomekLinks.png)
![Tomek Links](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_TomekLinks.png)
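
A minimal sketch of both under-samplers via imbalanced-learn's public API, reusing the `X` and `y` arrays from the sketch above:

```python
from imblearn.under_sampling import NearMiss, TomekLinks

# Tomek links: drop the majority-class member of each cross-class
# nearest-neighbor pair, widening the margin between the classes.
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# Near-Miss (version 1): keep the majority-class points whose average
# distance to the closest minority-class points is smallest.
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)
```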

In over-sampling, instead of creating exact copies of the minority-class records, we can introduce small variations into those copies, creating more diverse synthetic samples. This technique is called SMOTE (Synthetic Minority Oversampling Technique). It randomly picks a point from the minority class, computes the k nearest neighbors for that point, and adds synthetic points between the chosen point and its neighbors.

![Synthetic Samples](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_SyntheticSamples.png)
![Synthetic Samples](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_SyntheticSamples.png)
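
A minimal SMOTE sketch with imbalanced-learn, again reusing `X` and `y` from above (`k_neighbors=5` is the library default, shown here for clarity):

```python
from imblearn.over_sampling import SMOTE

# SMOTE: synthesize new minority-class points by interpolating between a
# randomly chosen minority point and one of its k nearest minority neighbors.
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
```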

### Reweighting

There is an expected and an observed value in each table cell; the weight is essentially the expected value divided by the observed value. This extends easily to multiple features with more than two groups each. The weights are then incorporated into the loss function during model training.

![Reweighting](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_Reweight.png)
![Reweighting](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_Reweight.png)
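
A minimal pandas sketch of this expected-over-observed weighting (it follows the Kamiran-Calders reweighing scheme); the toy column names are illustrative, and the expected count for a cell assumes the feature and label are independent:

```python
import pandas as pd

# Toy dataset: one sensitive feature and a binary label.
df = pd.DataFrame({
    "sex": ["M", "M", "M", "M", "F", "F", "F", "M"],
    "label": [1, 1, 1, 0, 0, 0, 1, 0],
})

n = len(df)
p_group = df["sex"].value_counts(normalize=True)    # P(group)
p_label = df["label"].value_counts(normalize=True)  # P(label)
observed = df.groupby(["sex", "label"]).size()      # observed count per cell

# weight(cell) = expected count / observed count,
# where expected = P(group) * P(label) * N.
weights = {
    (g, y): (p_group[g] * p_label[y] * n) / count
    for (g, y), count in observed.items()
}
df["weight"] = [weights[(g, y)] for g, y in zip(df["sex"], df["label"])]
print(df)
```

The resulting per-row weights can then be passed to a trainer's sample-weight parameter (for example, the `weightCol` of Spark ML's `LogisticRegression`, as in the notebook above).
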
@@ -1,7 +1,7 @@
---
title: Model Interpretation on Spark
hide_title: true
sidebar_label: About
sidebar_label: Model Interpretation on Spark
---

# Model Interpretation on Spark
@@ -26,9 +26,9 @@ Both explainers extend from `org.apache.spark.ml.Transformer`. After setting up

To see examples of model interpretability on Spark in action, take a look at these sample notebooks:

- [Tabular SHAP explainer](/docs/examples/model_interpretability/Interpretability%20-%20Tabular%20SHAP%20explainer)
- [Image explainers](/docs/examples/model_interpretability/Interpretability%20-%20Image%20Explainers)
- [Text explainers](/docs/examples/model_interpretability/Interpretability%20-%20Text%20Explainers)
- [Tabular SHAP explainer](../../../examples/responsible_ai/Interpretability%20-%20Tabular%20SHAP%20explainer)
- [Image explainers](../../../examples/responsible_ai/Interpretability%20-%20Image%20Explainers)
- [Text explainers](../../../examples/responsible_ai/Interpretability%20-%20Text%20Explainers)

| | Tabular models | Vector models | Image models | Text models |
|------------------------|-----------------------------|---------------------------|-------------------------|-----------------------|
32 changes: 21 additions & 11 deletions website/notebookconvert.py
@@ -1,39 +1,47 @@
import os
import re


def add_header_to_markdown(folder, md):
    name = md[:-3]
    with open(os.path.join(folder, md), 'r+', encoding='utf-8') as f:
    with open(os.path.join(folder, md), "r+", encoding="utf-8") as f:
        content = f.read()
        f.truncate(0)
        content = re.sub(r'style=\"[\S ]*?\"', '', content)
        content = re.sub(r'<style[\S \n.]*?</style>', '', content)
        content = re.sub(r"style=\"[\S ]*?\"", "", content)
        content = re.sub(r"<style[\S \n.]*?</style>", "", content)
        f.seek(0, 0)
        f.write("---\ntitle: {}\nhide_title: true\nstatus: stable\n---\n".format(name) + content)
        f.close()


def convert_notebook_to_markdown(file_path, outputdir):
    print("Converting {} into markdown \n".format(file_path))
    convert_cmd = 'jupyter nbconvert --output-dir=\"{}\" --to markdown \"{}\"'.format(outputdir, file_path)
    print(f"Converting {file_path} into markdown")
    convert_cmd = f'jupyter nbconvert --output-dir="{outputdir}" --to markdown "{file_path}"'
    os.system(convert_cmd)
    print()


def convert_allnotebooks_in_folder(folder, outputdir):

    dic = {
        "CognitiveServices - Overview": os.path.join(outputdir, "features"),
        "Classification": os.path.join(outputdir, "examples", "classification"),
        "CognitiveServices": os.path.join(outputdir, "examples", "cognitive_services"),
        "DataBalanceAnalysis": os.path.join(outputdir, "examples", "responsible_ai"),
        "DeepLearning": os.path.join(outputdir, "examples", "deep_learning"),
        "Interpretability": os.path.join(outputdir, "examples", "model_interpretability"),
        "Interpretability - Image Explainers": os.path.join(outputdir, "features", "responsible_ai"),
        "Interpretability - Explanation Dashboard": os.path.join(outputdir, "examples", "responsible_ai"),
        "Interpretability - Tabular SHAP explainer": os.path.join(outputdir, "examples", "responsible_ai"),
        "Interpretability - Text Explainers": os.path.join(outputdir, "examples", "responsible_ai"),
        "ModelInterpretability": os.path.join(outputdir, "examples", "responsible_ai"),
"Regression": os.path.join(outputdir, "examples", "regression"),
"TextAnalytics": os.path.join(outputdir, "examples", "text_analytics"),
"HttpOnSpark": os.path.join(outputdir, "features", "http"),
"LightGBM": os.path.join(outputdir, "features", "lightgbm"),
"ModelInterpretability": os.path.join(outputdir, "features", "model_interpretability"),
"ONNX": os.path.join(outputdir, "features", "onnx"),
"SparkServing": os.path.join(outputdir, "features", "spark_serving"),
"Vowpal Wabbit": os.path.join(outputdir, "features", "vw")
}
"Vowpal Wabbit": os.path.join(outputdir, "features", "vw"),
}

    for nb in os.listdir(folder):
        if nb.endswith(".ipynb"):
@@ -44,7 +52,7 @@ def convert_allnotebooks_in_folder(folder, outputdir):
            if nb.startswith(k):
                finaldir = v
                break

        if not os.path.exists(finaldir):
            os.mkdir(finaldir)

@@ -55,11 +63,13 @@ def convert_allnotebooks_in_folder(folder, outputdir):
        convert_notebook_to_markdown(os.path.join(folder, nb), finaldir)
        add_header_to_markdown(finaldir, md)


def main():
    cur_path = os.getcwd()
    folder = os.path.join(cur_path, "notebooks")
    outputdir = os.path.join(cur_path, "website", "docs")
    convert_allnotebooks_in_folder(folder, outputdir)

if __name__ == '__main__':

if __name__ == "__main__":
    main()