docs: Introduce Responsible AI section on website (Interpretability + DataBalanceAnalysis) #1241

Merged · 25 commits · Nov 6, 2021

Commits
8f29786
Merge pull request #1 from Azure/master
ms-kashyap Aug 11, 2021
e2b68cc
Merge remote-tracking branch 'upstream/master'
ms-kashyap Sep 2, 2021
84d1441
Merge remote-tracking branch 'upstream/master'
ms-kashyap Oct 18, 2021
9b5a187
Merge remote-tracking branch 'upstream/master'
ms-kashyap Oct 21, 2021
eab5405
[DataBalanceAnalysis] Add doc and sample notebook
ms-kashyap Oct 22, 2021
33c55e8
Clear outputs in sample notebook
ms-kashyap Oct 22, 2021
3fb604e
Merge from upstream-master
ms-kashyap Oct 27, 2021
75a50c9
Address jasowang PR comments
ms-kashyap Oct 28, 2021
152d295
Merge branch 'master' into kapat/docs
ms-kashyap Oct 31, 2021
b12d500
Merge branch 'master' into kapat/docs
ms-kashyap Nov 2, 2021
4c22452
Merge remote-tracking branch 'upstream/master' into kapat/docs
kashmoneygt Nov 2, 2021
de72151
[DataBalanceAnalysis] Update notebook and doc
kashmoneygt Nov 2, 2021
7333586
Merge remote-tracking branch 'origin/kapat/docs' into kapat/docs
kashmoneygt Nov 2, 2021
5041f5f
[Databricks E2E Tests] Upgrade DBR from 8.3 to 9.1 LTS
kashmoneygt Nov 2, 2021
3ec3bb7
Merge branch 'master' into kapat/docs
ms-kashyap Nov 3, 2021
24b8394
Merge branch 'master' into kapat/docs
ms-kashyap Nov 3, 2021
ad41e11
[Databricks E2E Tests] Revert DBR from 9.1 LTS to 8.3
kashmoneygt Nov 3, 2021
941eb38
Merge remote-tracking branch 'upstream/master' into kapat/docs
kashmoneygt Nov 3, 2021
f665cea
Fix broken link
kashmoneygt Nov 3, 2021
89b55ca
[website] Model Interpretability -> Responsible AI
kashmoneygt Nov 4, 2021
eeecd03
Move ModelInterpretability - Snow Leopard Detection.md to examples
ms-kashyap Nov 4, 2021
6b5c296
Get latest from upstream/master
ms-kashyap Nov 4, 2021
70ee581
Merge remote-tracking branch 'upstream/master' into kapat/docs
ms-kashyap Nov 5, 2021
7565de5
Host DataBalanceAnalysis-AdultCensusIncome cell outputs in blob inste…
ms-kashyap Nov 5, 2021
3b3593b
Replace ModelInterpretability-SnowLeopardDetection with Interpretabil…
ms-kashyap Nov 5, 2021
663 changes: 0 additions & 663 deletions notebooks/Data Balance Analysis - Adult Census Income.ipynb

This file was deleted.

641 changes: 641 additions & 0 deletions notebooks/DataBalanceAnalysis - Adult Census Income.ipynb

Large diffs are not rendered by default.

8 changes: 5 additions & 3 deletions website/docs/examples/about.md
@@ -25,6 +25,7 @@ sidebar_label: About
- Train and evaluate a flight delay prediction system ([Regression - Flight Delays])
- Finding anomalous data access patterns using the Access Anomalies package of CyberML ([CyberML - Anomalous Access Detection])
- Model interpretation ([Interpretability - Tabular SHAP Explainer], [Interpretability - Image Explainers], [Interpretability - Text Explainers])
- Perform Data Balance Analysis to determine how well features and feature values are represented in your dataset ([DataBalanceAnalysis - Adult Census Income])


[Classification - Adult Census]: ../classification/Classification%20-%20Adult%20Census "Classification - Adult Census"
@@ -47,9 +48,10 @@ sidebar_label: About

[CyberML - Anomalous Access Detection]: ../CyberML%20-%20Anomalous%20Access%20Detection "CyberML - Anomalous Access Detection"

[Interpretability - Tabular SHAP Explainer]: ../model_interpretability/Interpretability%20-%20Tabular%20SHAP%20explainer "Interpretability - Tabular SHAP Explainer"
[Interpretability - Tabular SHAP Explainer]: ../responsible_ai/Interpretability%20-%20Tabular%20SHAP%20explainer "Interpretability - Tabular SHAP Explainer"

[Interpretability - Image Explainers]: ../model_interpretability/Interpretability%20-%20Image%20Explainers "Interpretability - Image Explainers"
[Interpretability - Image Explainers]: ../responsible_ai/Interpretability%20-%20Image%20Explainers "Interpretability - Image Explainers"

[Interpretability - Text Explainers]: ../model_interpretability/Interpretability%20-%20Text%20Explainers "Interpretability - Text Explainers"
[Interpretability - Text Explainers]: ../responsible_ai/Interpretability%20-%20Text%20Explainers "Interpretability - Text Explainers"

[DataBalanceAnalysis - Adult Census Income]: ../responsible_ai/DataBalanceAnalysis%20-%20Adult%20Census%20Income "DataBalanceAnalysis - Adult Census Income"

Large diffs are not rendered by default.

@@ -0,0 +1,191 @@
---
title: Interpretability - Explanation Dashboard
hide_title: true
status: stable
---
## Interpretability - Explanation Dashboard

In this example, as in the "Interpretability - Tabular SHAP explainer" notebook, we use Kernel SHAP to explain a tabular classification model built from the Adult Census dataset, and then visualize the explanation in the ExplanationDashboard from https://github.com/microsoft/responsible-ai-widgets.

First we import the packages and define some UDFs we will need later.


```python
import pyspark
from synapse.ml.explainers import *
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.sql.types import *
from pyspark.sql.functions import *
import pandas as pd

vec_access = udf(lambda v, i: float(v[i]), FloatType())
vec2array = udf(lambda vec: vec.toArray().tolist(), ArrayType(FloatType()))
```

Now let's read the data and train a simple binary classification model.


```python
df = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/AdultCensusIncome.parquet")

labelIndexer = StringIndexer(inputCol="income", outputCol="label", stringOrderType="alphabetAsc").fit(df)
print("Label index assignment: " + str(set(zip(labelIndexer.labels, [0, 1]))))

training = labelIndexer.transform(df)
display(training)
categorical_features = [
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
]
categorical_features_idx = [col + "_idx" for col in categorical_features]
categorical_features_enc = [col + "_enc" for col in categorical_features]
numeric_features = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]

strIndexer = StringIndexer(inputCols=categorical_features, outputCols=categorical_features_idx)
onehotEnc = OneHotEncoder(inputCols=categorical_features_idx, outputCols=categorical_features_enc)
vectAssem = VectorAssembler(inputCols=categorical_features_enc + numeric_features, outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", weightCol="fnlwgt")
pipeline = Pipeline(stages=[strIndexer, onehotEnc, vectAssem, lr])
model = pipeline.fit(training)
```

After the model is trained, we randomly select some observations to be explained.


```python
explain_instances = model.transform(training).orderBy(rand()).limit(5).repartition(200).cache()
display(explain_instances)
```

We create a TabularSHAP explainer, set the input columns to all the features the model takes, and specify the model and the target output column we are trying to explain. In this case, we are explaining the "probability" output, a vector of length 2, and we only look at the class 1 probability. Set targetClasses to `[0, 1]` if you want to explain the class 0 and class 1 probabilities at the same time. Finally, we sample 100 rows from the training data as background data, which Kernel SHAP uses to integrate out features.


```python
shap = TabularSHAP(
    inputCols=categorical_features + numeric_features,
    outputCol="shapValues",
    numSamples=5000,
    model=model,
    targetCol="probability",
    targetClasses=[1],
    backgroundData=broadcast(training.orderBy(rand()).limit(100).cache()),
)

shap_df = shap.transform(explain_instances)
```

Once we have the resulting dataframe, we extract the class 1 probability of the model output, the SHAP values for the target class, the original features, and the true label, then convert it to a pandas dataframe for visualization.
For each observation, the first element in the SHAP values vector is the base value (the mean output of the background dataset), and each following element is the SHAP value for the corresponding feature.


```python
shaps = (
    shap_df.withColumn("probability", vec_access(col("probability"), lit(1)))
    .withColumn("shapValues", vec2array(col("shapValues").getItem(0)))
    .select(["shapValues", "probability", "label"] + categorical_features + numeric_features)
)

shaps_local = shaps.toPandas()
shaps_local.sort_values("probability", ascending=False, inplace=True, ignore_index=True)
pd.set_option("display.max_colwidth", None)
shaps_local
```

We can visualize the explanation in the [interpret-community format](https://github.com/interpretml/interpret-community) in the ExplanationDashboard from https://github.com/microsoft/responsible-ai-widgets.


```python
import pandas as pd
import numpy as np

features = categorical_features + numeric_features
features_with_base = ["Base"] + features

rows = shaps_local.shape[0]

local_importance_values = shaps_local[['shapValues']]
eval_data = shaps_local[features]
true_y = np.array(shaps_local[['label']])
```


```python
list_local_importance_values = local_importance_values.values.tolist()
converted_importance_values = []
bias = []
for classarray in list_local_importance_values:
    for rowarray in classarray:
        converted_list = rowarray.tolist()
        bias.append(converted_list[0])
        # remove the bias from local importance values
        del converted_list[0]
        converted_importance_values.append(converted_list)
```

When running on Azure Synapse Analytics, follow the instructions in [Package management - Azure Synapse Analytics | Microsoft Docs](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries) to install the ["raiwidgets"](https://pypi.org/project/raiwidgets/) and ["interpret-community"](https://pypi.org/project/interpret-community/) packages.


```python
!pip install --upgrade raiwidgets
```


```python
!pip install --upgrade interpret-community
```


```python
from interpret_community.adapter import ExplanationAdapter
adapter = ExplanationAdapter(features, classification=True)
global_explanation = adapter.create_global(converted_importance_values, eval_data, expected_values=bias)
```


```python
# view the global importance values
global_explanation.global_importance_values
```


```python
# view the local importance values
global_explanation.local_importance_values
```


```python
class wrapper(object):
    def __init__(self, model):
        self.model = model

    def predict(self, data):
        sparkdata = spark.createDataFrame(data)
        # Use the stored Spark ML model rather than relying on a global.
        return self.model.transform(sparkdata).select('prediction').toPandas().values.flatten().tolist()

    def predict_proba(self, data):
        sparkdata = spark.createDataFrame(data)
        prediction = self.model.transform(sparkdata).select('probability').toPandas().values.flatten().tolist()
        proba_list = [vector.values.tolist() for vector in prediction]
        return proba_list
```


```python
# view the explanation in the ExplanationDashboard
from raiwidgets import ExplanationDashboard
ExplanationDashboard(global_explanation, wrapper(model), dataset=eval_data, true_y=true_y)
```

Your results will look like:

<img src="https://mmlspark.blob.core.windows.net/graphics/rai-dashboard.png" />
@@ -76,7 +76,7 @@ shap = TabularSHAP(
    model=model,
    targetCol="probability",
    targetClasses=[1],
    backgroundData=training.orderBy(rand()).limit(100).cache(),
    backgroundData=broadcast(training.orderBy(rand()).limit(100).cache()),
)

shap_df = shap.transform(explain_instances)
2 changes: 1 addition & 1 deletion website/docs/features/onnx/about.md
@@ -45,5 +45,5 @@ MMLSpark now includes a Spark transformer to bring a trained ONNX model to Apac

## Example

- [Interpretability - Image Explainers](/docs/examples/model_interpretability/Interpretability%20-%20Image%20Explainers)
- [Interpretability - Image Explainers](/docs/examples/responsible_ai/Interpretability%20-%20Image%20Explainers)
- [ONNX - Inference on Spark](/docs/features/onnx/ONNX%20-%20Inference%20on%20Spark)
@@ -1,5 +1,7 @@
---
title: Data Balance Analysis on Spark
hide_title: true
sidebar_label: Data Balance Analysis
description: Learn how to do Data Balance Analysis on Spark to determine how well features and feature values are represented in your dataset.
---

@@ -17,7 +19,7 @@ In summary, Data Balance Analysis, used as a step for building ML models, has the

## Examples

* [Data Balance Analysis - Adult Census Income](https://github.com/microsoft/SynapseML/blob/master/notebooks/Data%20Balance%20Analysis%20-%20Adult%20Census%20Income.ipynb)
* [Data Balance Analysis - Adult Census Income](../../../examples/responsible_ai/DataBalanceAnalysis%20-%20Adult%20Census%20Income)

## Usage

@@ -173,22 +175,22 @@ This involves under-sampling from the majority class and over-sampling from the minority

1. Under-sampling may remove valuable information.
2. Over-sampling may cause overfitting and poor generalization on the test set.

![Bar chart undersampling and oversampling](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_SamplingBar.png)
![Bar chart undersampling and oversampling](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_SamplingBar.png)

There are smarter techniques for under-sampling and over-sampling in the literature, many of which are implemented in Python's [imbalanced-learn](https://imbalanced-learn.org/stable/) package.

For example, we can cluster the records of the majority class and under-sample by removing records from each cluster, thus seeking to preserve information.
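
As a rough illustration, the cluster-then-remove idea is available in imbalanced-learn as `ClusterCentroids`. This is a minimal sketch on synthetic data; the dataset and variable names are illustrative, not part of SynapseML:

```python
from collections import Counter

from imblearn.under_sampling import ClusterCentroids
from sklearn.datasets import make_classification

# Synthetic, imbalanced binary dataset (roughly 90% majority class).
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# Cluster the majority class and keep centroids instead of raw records,
# seeking to preserve the class's overall structure while shrinking it.
X_resampled, y_resampled = ClusterCentroids(random_state=42).fit_resample(X, y)
print("After:", Counter(y_resampled))
```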

One under-sampling technique uses Tomek links: pairs of very close instances of opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, facilitating classification. A similar way to under-sample the majority class is Near-Miss: it first calculates the distances between the points in the larger class and the points in the smaller class, and when two points from different classes are very close to each other in the distribution, it eliminates the point of the larger class, thereby balancing the distribution. Both are sketched after the figure below.

![Tomek Links](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_TomekLinks.png)
![Tomek Links](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_TomekLinks.png)
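
A minimal sketch of both under-samplers via imbalanced-learn's public API, reusing the `X` and `y` arrays from the sketch above:

```python
from imblearn.under_sampling import NearMiss, TomekLinks

# Tomek links: drop the majority-class member of each cross-class
# nearest-neighbor pair, widening the margin between the classes.
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# Near-Miss (version 1): keep the majority-class points whose average
# distance to the closest minority-class points is smallest.
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)
```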

In over-sampling, instead of creating exact copies of the minority-class records, we can introduce small variations into those copies, creating more diverse synthetic samples. This technique is called SMOTE (Synthetic Minority Oversampling Technique). It randomly picks a point from the minority class, computes the k nearest neighbors for that point, and adds synthetic points between the chosen point and its neighbors.

![Synthetic Samples](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_SyntheticSamples.png)
![Synthetic Samples](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_SyntheticSamples.png)
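
A minimal SMOTE sketch with imbalanced-learn, again reusing `X` and `y` from above (`k_neighbors=5` is the library default, shown here for clarity):

```python
from imblearn.over_sampling import SMOTE

# SMOTE: synthesize new minority-class points by interpolating between a
# randomly chosen minority point and one of its k nearest minority neighbors.
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
```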

### Reweighting

There is an expected and an observed value in each table cell; the weight is essentially the expected value divided by the observed value. This extends easily to multiple features with more than two groups each. The weights are then incorporated into the loss function during model training.

![Reweighting](https://mmlspark.blob.core.windows.net/graphics/exploratory/DataBalanceAnalysis_Reweight.png)
![Reweighting](https://mmlspark.blob.core.windows.net/graphics/responsible_ai/DataBalanceAnalysis_Reweight.png)
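
A minimal pandas sketch of this expected-over-observed weighting (it follows the Kamiran-Calders reweighing scheme); the toy column names are illustrative, and the expected count for a cell assumes the feature and label are independent:

```python
import pandas as pd

# Toy dataset: one sensitive feature and a binary label.
df = pd.DataFrame({
    "sex": ["M", "M", "M", "M", "F", "F", "F", "M"],
    "label": [1, 1, 1, 0, 0, 0, 1, 0],
})

n = len(df)
p_group = df["sex"].value_counts(normalize=True)    # P(group)
p_label = df["label"].value_counts(normalize=True)  # P(label)
observed = df.groupby(["sex", "label"]).size()      # observed count per cell

# weight(cell) = expected count / observed count,
# where expected = P(group) * P(label) * N.
weights = {
    (g, y): (p_group[g] * p_label[y] * n) / count
    for (g, y), count in observed.items()
}
df["weight"] = [weights[(g, y)] for g, y in zip(df["sex"], df["label"])]
print(df)
```

The resulting per-row weights can then be passed to a trainer's sample-weight parameter (for example, the `weightCol` of Spark ML's `LogisticRegression`, as in the notebook above).
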
@@ -1,7 +1,7 @@
---
title: Model Interpretation on Spark
hide_title: true
sidebar_label: About
sidebar_label: Model Interpretation on Spark
---

# Model Interpretation on Spark
@@ -26,9 +26,9 @@ Both explainers extend from `org.apache.spark.ml.Transformer`. After setting up

To see examples of model interpretability on Spark in action, take a look at these sample notebooks:

- [Tabular SHAP explainer](/docs/examples/model_interpretability/Interpretability%20-%20Tabular%20SHAP%20explainer)
- [Image explainers](/docs/examples/model_interpretability/Interpretability%20-%20Image%20Explainers)
- [Text explainers](/docs/examples/model_interpretability/Interpretability%20-%20Text%20Explainers)
- [Tabular SHAP explainer](../../../examples/responsible_ai/Interpretability%20-%20Tabular%20SHAP%20explainer)
- [Image explainers](../../../examples/responsible_ai/Interpretability%20-%20Image%20Explainers)
- [Text explainers](../../../examples/responsible_ai/Interpretability%20-%20Text%20Explainers)

| | Tabular models | Vector models | Image models | Text models |
|------------------------|-----------------------------|---------------------------|-------------------------|-----------------------|
32 changes: 21 additions & 11 deletions website/notebookconvert.py
@@ -1,39 +1,47 @@
import os
import re


def add_header_to_markdown(folder, md):
    name = md[:-3]
    with open(os.path.join(folder, md), 'r+', encoding='utf-8') as f:
    with open(os.path.join(folder, md), "r+", encoding="utf-8") as f:
        content = f.read()
        f.truncate(0)
        content = re.sub(r'style=\"[\S ]*?\"', '', content)
        content = re.sub(r'<style[\S \n.]*?</style>', '', content)
        content = re.sub(r"style=\"[\S ]*?\"", "", content)
        content = re.sub(r"<style[\S \n.]*?</style>", "", content)
        f.seek(0, 0)
        f.write("---\ntitle: {}\nhide_title: true\nstatus: stable\n---\n".format(name) + content)
        f.close()


def convert_notebook_to_markdown(file_path, outputdir):
    print("Converting {} into markdown \n".format(file_path))
    convert_cmd = 'jupyter nbconvert --output-dir=\"{}\" --to markdown \"{}\"'.format(outputdir, file_path)
    print(f"Converting {file_path} into markdown")
    convert_cmd = f'jupyter nbconvert --output-dir="{outputdir}" --to markdown "{file_path}"'
    os.system(convert_cmd)
    print()


def convert_allnotebooks_in_folder(folder, outputdir):

    dic = {
        "CognitiveServices - Overview": os.path.join(outputdir, "features"),
        "Classification": os.path.join(outputdir, "examples", "classification"),
        "CognitiveServices": os.path.join(outputdir, "examples", "cognitive_services"),
        "DataBalanceAnalysis": os.path.join(outputdir, "examples", "responsible_ai"),
        "DeepLearning": os.path.join(outputdir, "examples", "deep_learning"),
        "Interpretability": os.path.join(outputdir, "examples", "model_interpretability"),
        "Interpretability - Image Explainers": os.path.join(outputdir, "features", "responsible_ai"),
        "Interpretability - Explanation Dashboard": os.path.join(outputdir, "examples", "responsible_ai"),
        "Interpretability - Tabular SHAP explainer": os.path.join(outputdir, "examples", "responsible_ai"),
        "Interpretability - Text Explainers": os.path.join(outputdir, "examples", "responsible_ai"),
        "ModelInterpretability": os.path.join(outputdir, "examples", "responsible_ai"),
"Regression": os.path.join(outputdir, "examples", "regression"),
"TextAnalytics": os.path.join(outputdir, "examples", "text_analytics"),
"HttpOnSpark": os.path.join(outputdir, "features", "http"),
"LightGBM": os.path.join(outputdir, "features", "lightgbm"),
"ModelInterpretability": os.path.join(outputdir, "features", "model_interpretability"),
"ONNX": os.path.join(outputdir, "features", "onnx"),
"SparkServing": os.path.join(outputdir, "features", "spark_serving"),
"Vowpal Wabbit": os.path.join(outputdir, "features", "vw")
}
"Vowpal Wabbit": os.path.join(outputdir, "features", "vw"),
}

    for nb in os.listdir(folder):
        if nb.endswith(".ipynb"):
@@ -44,7 +52,7 @@ def convert_allnotebooks_in_folder(folder, outputdir):
            if nb.startswith(k):
                finaldir = v
                break

        if not os.path.exists(finaldir):
            os.mkdir(finaldir)

@@ -55,11 +63,13 @@ def convert_allnotebooks_in_folder(folder, outputdir):
        convert_notebook_to_markdown(os.path.join(folder, nb), finaldir)
        add_header_to_markdown(finaldir, md)


def main():
    cur_path = os.getcwd()
    folder = os.path.join(cur_path, "notebooks")
    outputdir = os.path.join(cur_path, "website", "docs")
    convert_allnotebooks_in_folder(folder, outputdir)

if __name__ == '__main__':

if __name__ == "__main__":
    main()