docs: Adding notebook for ICE explainer #1318

Merged (12 commits) on Jan 14, 2022

ezherdeva
Contributor

No description provided.

@ezherdeva
Contributor Author

@memoryz
You didn't appear as a reviewer for some reason.

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@@ -0,0 +1 @@
{"cells":[{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"b7488bd3-b1a1-4b4b-a3be-52e447c4a46c","showTitle":false,"title":""}},"source":["## Partial Dependence (PDP) and Individual Conditional Expectation (ICE) plots"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"6d7a6880-7982-41f0-b768-893b50a5fc96","showTitle":false,"title":""}},"source":["In this example, we train a classification model with the Adult Census Income dataset. Then we treat the model as a blackbox model and calculate the PDP and ICE plots for some selected categorical and numeric features. \n","\n","This dataset can be used to predict whether annual income exceeds $50,000/year or not based on demographic data from the 1994 U.S. Census. The dataset we're reading contains 32,561 rows and 14 columns/features.\n","\n","[More info on the dataset here](https://archive.ics.uci.edu/ml/datasets/Adult)\n","\n","We will train a classification model with the target income > 50K.\n","\n","---\n","\n","The **Partial Dependence Plot (PDP)** function at a particular feature value represents the average prediction if we force all data points to assume that feature value.\n","\n","**Individual Conditional Expectation (ICE)** plots display one line per instance that shows how the instance’s prediction changes when a feature changes. One line represents the predictions for one instance if we vary the feature of interest.\n","\n","PDP and ICE plots visualize and help to analyze the interaction between the target response and a set of input features of interest. When you are building a machine learning model, it is essential to understand model behavior and how certain features influence the overall prediction. 
One of the most popular use-cases is analyzing feature importance.\n","\n","---\n","Python dependencies:\n","\n","matplotlib==3.2.2\n","\n","numpy==1.19.2"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"90785380-3bea-4b75-81ea-60203b3daf8c","showTitle":false,"title":""}},"outputs":[],"source":["from pyspark.ml import Pipeline\n","from pyspark.ml.classification import GBTClassifier\n","from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder\n","import pyspark.sql.functions as F\n","from pyspark.ml.evaluation import BinaryClassificationEvaluator\n","\n","from synapse.ml.explainers import ICETransformer\n","\n","import matplotlib.pyplot as plt"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"74344307-ac58-4e04-8d80-9fb9d24a5803","showTitle":false,"title":""}},"source":["### Read and prepare the dataset"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"f42818e8-2361-4a46-aa6b-f94367f91dd1","showTitle":false,"title":""}},"outputs":[],"source":["df = spark.read.parquet(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/AdultCensusIncome.parquet\")\n","display(df)"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"073cae48-beea-4a24-9c3a-2dde9815a4fb","showTitle":false,"title":""}},"outputs":[],"source":["categorical_features = [\"race\", \"workclass\", \"marital-status\", \"education\", \"occupation\", \"relationship\", \"native-country\", \"sex\"]\n","numeric_features = [\"age\", \"education-num\", \"capital-gain\", \"capital-loss\", \"hours-per-week\"]\n","string_indexer_outputs = [feature + \"_idx\" for feature in categorical_features]\n","one_hot_encoder_outputs = [feature + \"_enc\" for feature in categorical_features]\n","\n","pipeline = Pipeline(stages=[\n"," 
StringIndexer().setInputCol(\"income\").setOutputCol(\"label\").setStringOrderType(\"alphabetAsc\"),\n"," StringIndexer().setInputCols(categorical_features).setOutputCols(string_indexer_outputs),\n"," OneHotEncoder().setInputCols(string_indexer_outputs).setOutputCols(one_hot_encoder_outputs),\n"," VectorAssembler(inputCols=one_hot_encoder_outputs+numeric_features, outputCol=\"features\"),\n"," GBTClassifier(weightCol=\"fnlwgt\")])"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"8258a1fd-0725-421f-8506-20cd256f28c6","showTitle":false,"title":""}},"outputs":[],"source":["display(df.groupBy(\"education-num\").count())"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"81d6bb74-96e9-42ee-b3db-e42c0ea88376","showTitle":false,"title":""}},"source":["### Fit the model and view the predictions"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"94b1aca1-c46e-4ad5-9f80-19016ad99e49","showTitle":false,"title":""}},"outputs":[],"source":["model = pipeline.fit(df)"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"85fcf25c-b3ad-4afd-a455-cef1c9b21cb4","showTitle":false,"title":""}},"source":["Check that model makes sense and has reasonable output. 
For this, we will check the model performance by calculating the ROC-AUC score."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"bd5b14c1-990d-48b4-b611-46787beed673","showTitle":false,"title":""}},"outputs":[],"source":["data = model.transform(df)\n","display(data.select('income', 'probability', 'prediction'))"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"45d0782a-dc16-4a37-977c-b3d426d650c8","showTitle":false,"title":""}},"outputs":[],"source":["eval_auc = BinaryClassificationEvaluator(labelCol=\"label\", rawPredictionCol=\"prediction\")\n","eval_auc.evaluate(data)"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"8471ec63-9a87-4169-b35e-84f3a54c143a","showTitle":false,"title":""}},"source":["## PDP\n","\n","\\\\(X_S\\\\) - set of input features of interest, \\\\(X_C\\\\) - its complement.\n","\n","The partial dependence of the response \\\\(f\\\\) at a point \\\\(x_S\\\\) is defined as:\n","\n","$$ pd_{X_S}(x_S) = \\mathsf{E}_{X_C} [f(x_S, X_C)] = \\int f(x_S, x_C) p(x_C) dx_C$$\n","\n","where \\\\(f(x_S, x_C)\\\\) is the response function for a given sample whose values are defined by \\\\(x_S\\\\) for the features in \\\\(X_S\\\\) (i.e. the features you want to explain), and by \\\\(x_C\\\\) for the features in \\\\(X_C\\\\) (i.e. features that are not being analyzed).\n","\n","The computation method estimates the above integral by computing an average over the dataset \\\\(X\\\\):\n","\n","$$pd_{X_S}(x_S) \\approx \\frac{1}{n_{samples}} \\sum_{i=1}^{n_{samples}} f(x_S, x_C^{(i)}) $$\n","\n","where \\\\(x_C^{(i)}\\\\) is the value of the i-th sample for the features in \\\\(X_C\\\\). 
For each value of \\\\(x_S\\\\), this method requires a full pass over the dataset \\\\(X\\\\).\n","\n","---\n","\n","We will show how features \"sex\", \"education\", \"workclass\", \"occupation\" (categorical features) and \"education-num\" and \"age\" (numeric features) affect the predicted probability that income exceeds $50,000/year.\n","\n","--- \n","\n","Source: https://christophm.github.io/interpretable-ml-book/pdp.html"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"422b70ee-483d-470b-9105-7da2cefb8fec","showTitle":false,"title":""}},"source":["### Set up the transformer for PDP"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"df3b3215-5418-4c3e-8931-300cb2481884","showTitle":false,"title":""}},"outputs":[],"source":["pdp = ICETransformer(model=model, targetCol=\"probability\", kind=\"average\", targetClasses=[1]).\\\n"," setCategoricalFeatures(categorical_features).\\\n"," setNumericFeatures(numeric_features).setNumSamples(50)"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"68774ede-f4ee-4437-bc40-16c7a2192098","showTitle":false,"title":""}},"source":["The PDP transformer is a Spark transformer. Its **transform** function returns a DataFrame with one row and one column per feature to explain; each column contains the dependence for that feature in the format feature_value -> dependence (in our case, probability)."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"d79ebaea-69ef-4c07-8bbb-157d6b86ee9b","showTitle":false,"title":""}},"outputs":[],"source":["output_pdp = pdp.transform(df)\n","display(output_pdp)"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"1b025798-2650-4d68-b5cd-136596d343cd","showTitle":false,"title":""}},"source":["### 
Visualization"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"e160c8bf-c4ff-4442-9e6c-f7615212b151","showTitle":false,"title":""}},"outputs":[],"source":["# Helper functions for visualization\n","\n","def get_pandas_df_from_column(df, col_name):\n"," keys_df = df.select(F.explode(F.map_keys(F.col(col_name)))).distinct()\n"," keys = list(map(lambda row: row[0], keys_df.collect()))\n"," key_cols = list(map(lambda f: F.col(col_name).getItem(f).alias(str(f)), keys))\n"," final_cols = key_cols\n"," pandas_df = df.select(final_cols).toPandas()\n"," return pandas_df\n","\n","def plot_dependence_for_categorical(df, col, col_int=True, figsize=(20, 5)):\n"," dict_values = {}\n"," col_names = list(df.columns)\n","\n"," for col_name in col_names:\n"," dict_values[col_name] = df[col_name][0].toArray()[0]\n"," marklist= sorted(dict_values.items(), key=lambda x: int(x[0]) if col_int else x[0]) \n"," sortdict=dict(marklist)\n","\n"," fig = plt.figure(figsize = figsize)\n"," plt.bar(sortdict.keys(), sortdict.values())\n","\n"," plt.xlabel(col, size=13)\n"," plt.ylabel(\"Dependence\")\n"," plt.title(\"\")\n"," plt.show()\n"," \n","def plot_dependence_for_numeric(df, col, col_int=True, figsize=(20, 5)):\n"," dict_values = {}\n"," col_names = list(df.columns)\n","\n"," for col_name in col_names:\n"," dict_values[col_name] = df[col_name][0].toArray()[0]\n"," marklist= sorted(dict_values.items(), key=lambda x: int(x[0]) if col_int else x[0]) \n"," sortdict=dict(marklist)\n","\n"," fig = plt.figure(figsize = figsize)\n","\n"," \n"," plt.plot(list(sortdict.keys()), list(sortdict.values()))\n","\n"," plt.xlabel(col, size=13)\n"," plt.ylabel(\"Dependence\")\n"," plt.ylim(0.0)\n"," plt.title(\"\")\n"," plt.show()\n"," "]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"2ebe88c8-0176-47da-9c9a-88c1dcfef123","showTitle":false,"title":""}},"source":["#### Example 1: 
\"Age\"\n","\n","We can observe a non-linear dependency. The predicted probability of high income grows rapidly between ages 24 and 38, drops slightly after 58, and remains stable from 68 onward."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"332ea73e-d47f-4559-9c21-b9e7ccf6ded6","showTitle":false,"title":""}},"outputs":[],"source":["display(output_pdp)"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"7f1386da-bd06-4dd6-9368-a031c982650c","showTitle":false,"title":""}},"outputs":[],"source":["df_education_num = get_pandas_df_from_column(output_pdp, 'age_dependence')\n","plot_dependence_for_numeric(df_education_num, 'age')"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"d1da8c5b-b4cb-455f-9289-f646bff56c1d","showTitle":false,"title":""}},"source":["#### Example 2: \"marital-status\"\n","\n","According to the result, the model treats \"Married-civ-spouse\" as one category and all others as a second category. This looks reasonable, taking into account that GBT has a tree structure.\n","\n","If the model instead picked \"Divorced\" as one category and all other values as the second category, that would most likely indicate an error or some bias in the data."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"4bbc72bf-307a-4d20-98f1-73673d81bb03","showTitle":false,"title":""}},"outputs":[],"source":["df_marital_status = get_pandas_df_from_column(output_pdp, 'marital-status_dependence')\n","plot_dependence_for_categorical(df_marital_status, 'marital-status', False, figsize=(30, 5))"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"16b524bf-7587-461d-aa50-a88972cf2bb3","showTitle":false,"title":""}},"source":["#### Example 3: \"capital-gain\"\n","\n","Firstly we run PDP with default parameters for rangeMin and rangeMax. 
We can see that this representation is not useful: it is not granular enough, because the range was computed dynamically from the data. That is why we set rangeMin = 0 and rangeMax = 10000 to visualize a more granular interpretation of the part we're interested in.\n","\n","On the second graph we can observe how capital-gain affects the dependence."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"2d076473-9760-4634-8465-aa90cb2a40dc","showTitle":false,"title":""}},"outputs":[],"source":["df_capital_gain = get_pandas_df_from_column(output_pdp, 'capital-gain_dependence')\n","plot_dependence_for_numeric(df_capital_gain, 'capital-gain_dependence')"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"36853ee2-2b55-4cd5-ab53-b2090d8cb2c5","showTitle":false,"title":""}},"outputs":[],"source":["pdp_cap_gain = ICETransformer(model=model, targetCol=\"probability\", kind=\"average\", targetClasses=[1]).\\\n"," setNumericFeatures([{\"name\": \"capital-gain\", \"numSplits\": 20, \"rangeMin\": 0.0, \"rangeMax\": 10000.0}]).\\\n"," setNumSamples(50)\n","\n","output_pdp_cap_gain = pdp_cap_gain.transform(df)"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"0854f823-00a4-428a-bb0e-9457ca010774","showTitle":false,"title":""}},"outputs":[],"source":["df_capital_gain = get_pandas_df_from_column(output_pdp_cap_gain, 'capital-gain_dependence')\n","plot_dependence_for_numeric(df_capital_gain, 'capital-gain_dependence')"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"773be71d-25ba-45ea-bf8f-98a9722e7da6","showTitle":false,"title":""}},"source":["### Conclusions\n","\n","**Advantages:**\n","\n","1) Plots are intuitive.\n","\n","2) PDPs perfectly represent how the feature influences the prediction on average (for 
uncorrelated features).\n","\n","3) Plots are easy to implement.\n","\n","**Disadvantages:**\n","\n","1) The realistic maximum number of features in a partial dependence function is two.\n","\n","2) Some PD plots do not show the feature distribution.\n","\n","3) The assumption of independence is the biggest issue with PD plots.\n","\n","4) PD plots only show the average marginal effects."]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"e65a4bca-e276-4ecd-b3ed-8171b4b41867","showTitle":false,"title":""}},"source":["## ICE\n","\n","\\\\(X_S\\\\) - set of input features of interest, \\\\(X_C\\\\) - its complement.\n","\n","\n","The equivalent to a PDP for individual data instances is called individual conditional expectation (ICE) plot. A PDP is the average of the lines of an ICE plot.\n","\n","The values for a line (and one instance) can be computed by keeping all other features the same, creating variants of this instance by replacing the feature’s value with values from a grid and making predictions with the black box model for these newly created instances. 
\n","\n","For each instance in $$ \\{ (x_{S}^{(i)},x_{C}^{(i)}) \\}_{i=1}^N$$ the curve \\\\(\\hat{f}_S^{(i)}\\\\) is plotted against \\\\(x_S^{(i)} \\\\), while \\\\( x_C^{(i)}\\\\) remains fixed.\n","\n","---\n","\n","\n","We will show the same features as for PDP to show a difference: \"sex\", \"education\", \"workclass\", \"occupation\" (categorical features) and \"education-num\" and \"age\" (numeric features)\n","\n","---\n","Source: https://christophm.github.io/interpretable-ml-book/ice.html"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"1fe87d15-5a3f-4c40-b28d-3e751da22900","showTitle":false,"title":""}},"source":["### Set up the transformer for ICE"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"b2c4d0a7-f7ab-4afa-b6fd-4eac4fc8b27b","showTitle":false,"title":""}},"outputs":[],"source":["ice = ICETransformer(model=model, targetCol=\"probability\", targetClasses=[1]).\\\n"," setCategoricalFeatures(categorical_features).setNumericFeatures(numeric_features).setNumSamples(50)"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"a2623d89-237f-4057-8057-9c2075d0c4fd","showTitle":false,"title":""}},"outputs":[],"source":["output = ice.transform(df)\n","display(output)"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"b6272501-44f2-4fff-9abe-b96c8129f943","showTitle":false,"title":""}},"source":["### Visualization"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"6f3dfaa4-a1e0-4308-a702-973325d4c58f","showTitle":false,"title":""}},"outputs":[],"source":["# Helper functions for visualization\n","from math import pi\n","\n","from collections import defaultdict\n","\n","def plot_ice_numeric(df, col, col_int=True, figsize=(20, 10)):\n"," dict_values = 
defaultdict(list)\n"," col_names = list(df.columns)\n"," num_instances = df.shape[0]\n"," \n"," instances_y = {}\n"," i = 0\n","\n"," for col_name in col_names:\n"," for i in range(num_instances):\n"," dict_values[i].append(df[col_name][i].toArray()[0])\n"," \n"," fig = plt.figure(figsize = figsize)\n"," for i in range(num_instances):\n"," plt.plot(col_names, dict_values[i], \"k\")\n"," \n"," \n"," plt.xlabel(col, size=13)\n"," plt.ylabel(\"Dependence\")\n"," plt.ylim(0.0)\n"," \n"," \n"," \n","def plot_ice_categorical(df, col, col_int=True, figsize=(20, 10)):\n"," dict_values = defaultdict(list)\n"," col_names = list(df.columns)\n"," num_instances = df.shape[0]\n"," \n"," angles = [n / float(df.shape[1]) * 2 * pi for n in range(df.shape[1])]\n"," angles += angles [:1]\n"," \n"," instances_y = {}\n"," i = 0\n","\n"," for col_name in col_names:\n"," for i in range(num_instances):\n"," dict_values[i].append(df[col_name][i].toArray()[0])\n"," \n"," fig = plt.figure(figsize = figsize)\n"," ax = plt.subplot(111, polar=True)\n"," plt.xticks(angles[:-1], col_names)\n"," \n"," for i in range(num_instances):\n"," values = dict_values[i]\n"," values += values[:1]\n"," ax.plot(angles, values, \"k\")\n"," ax.fill(angles, values, 'teal', alpha=0.1)\n","\n"," plt.xlabel(col, size=13)\n"," plt.show()\n"," "]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"856f3cbb-fc02-4279-9b16-059afa79e8d2","showTitle":false,"title":""}},"source":["#### Example 1: Numeric feature: \"Age\"\n","\n","All curves seem to follow the same course, so there are no obvious interactions. 
That means that the PDP is already a good summary of the relationship between the displayed feature and the predicted probability of income > 50K."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"8d794aca-2a9e-4791-a2a3-95f294731ebc","showTitle":false,"title":""}},"outputs":[],"source":["col_name = 'age_dependence'\n","age_dep = get_pandas_df_from_column(output, col_name)\n","\n","plot_ice_numeric(age_dep, col_name, figsize=(30, 10))"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"5cf03cf9-b753-4a45-92ac-7d1ac04449a2","showTitle":false,"title":""}},"source":["Helper function"]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"9599a289-c8e7-4643-9f37-7134b185d391","showTitle":false,"title":""}},"outputs":[],"source":["def overlay_ice_with_pdp(df_ice, df_pdp, col, col_int=True, figsize=(20, 5)):\n"," dict_values = defaultdict(list)\n"," col_names_ice = list(df_ice.columns)\n"," num_instances = df_ice.shape[0]\n"," \n"," instances_y = {}\n"," i = 0\n","\n"," for col_name in col_names_ice:\n"," for i in range(num_instances):\n"," dict_values[i].append(df_ice[col_name][i].toArray()[0])\n"," \n"," fig = plt.figure(figsize = figsize)\n"," for i in range(num_instances):\n"," plt.plot(col_names_ice, dict_values[i], \"k\")\n"," \n"," dict_values = {}\n"," col_names = list(df_pdp.columns)\n","\n"," for col_name in col_names:\n"," dict_values[col_name] = df_pdp[col_name][0].toArray()[0]\n"," marklist= sorted(dict_values.items(), key=lambda x: int(x[0]) if col_int else x[0]) \n"," sortdict=dict(marklist)\n"," \n"," plt.plot(col_names_ice, list(sortdict.values()), \"r\", linewidth=5)\n"," \n"," \n"," \n"," plt.xlabel(col, size=13)\n"," plt.ylabel(\"Dependence\")\n"," plt.ylim(0.0)\n"," plt.show()\n"," 
"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"97996bd4-074b-4e05-bc11-10620415480e","showTitle":false,"title":""}},"source":["This shows that the PDP is the average of the ICE curves. The red line is the PDP; the black lines are the ICE curves."]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"36bfc955-b6c8-4e9d-a28e-42c7e9889692","showTitle":false,"title":""}},"outputs":[],"source":["col_name = 'age_dependence'\n","# Recompute the age PDP here: df_education_num was last assigned the capital-gain PDP above.\n","pdp_age = get_pandas_df_from_column(output_pdp, col_name)\n","overlay_ice_with_pdp(age_dep, pdp_age, col=col_name, figsize=(30, 10))"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"d0a80081-a3f6-41d0-8f6d-b686220dbeaa","showTitle":false,"title":""}},"source":["#### Example 2: Categorical feature: \"occupation\""]},{"cell_type":"code","execution_count":null,"metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"1fee7eb9-6e89-480c-b864-fed2cfbf33f5","showTitle":false,"title":""}},"outputs":[],"source":["col_name = 'occupation_dependence'\n","occupation_dep = get_pandas_df_from_column(output, col_name)\n","\n","\n","plot_ice_categorical(occupation_dep, col_name, figsize=(30, 10))"]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"4f6494ac-6bf5-408c-9ac2-c44b54afba38","showTitle":false,"title":""}},"source":["### Conclusions\n","\n","\n","**Advantages:**\n","\n","1) Plots are intuitive to understand. One line represents the predictions for one instance if we vary the feature of interest.\n","\n","2) ICE curves can uncover more complex relationships.\n","\n","**Disadvantages:**\n","\n","1) ICE curves can only display one feature meaningfully - otherwise you should overlay multiple surfaces.\n","\n","2) Some points in the lines might be invalid data points according to the joint feature distribution. 
This is caused by correlations between features.\n","\n","3) In ICE plots it might not be easy to see the average."]},{"cell_type":"markdown","metadata":{"application/vnd.databricks.v1+cell":{"inputWidgets":{},"nuid":"b974545e-371b-41b0-b9a4-adb3036d9eb8","showTitle":false,"title":""}},"source":["## Summary\n","\n","Partial dependence plots (PDP) and Individual Conditional Expectation (ICE) plots can be used to visualize and analyze the interaction between the target response and a set of input features of interest.\n","\n","Both PDPs and ICEs assume that the input features of interest are independent from the complement features, and this assumption is often violated in practice.\n","\n","PDP shows the dependence on average, but if you want to observe the dependence for individual instances, you can use ICE.\n","\n","Using the examples above, we showed how useful such plots can be for analyzing how a machine learning model makes its predictions, which features were important, and how to interpret the results."]}],"metadata":{"application/vnd.databricks.v1+notebook":{"dashboards":[],"language":"python","notebookMetadata":{"pythonIndentUnit":2},"notebookName":"PDP-ICE-tutorial-new","notebookOrigID":2416290700869370,"widgets":{}},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0}
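
The notebook's statement that a PDP is the average of the lines of an ICE plot can be checked without Spark. The following is a minimal plain-Python sketch under stated assumptions: `ice_curves`, `partial_dependence`, `toy_predict`, and the two toy rows are hypothetical names and data made up for illustration, and none of it is part of SynapseML's `ICETransformer`.

```python
from statistics import mean


def ice_curves(predict, rows, feature, grid):
    """Return one curve per row: predictions with `feature` forced to each grid value."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            variant = dict(row)       # keep the complement features x_C fixed
            variant[feature] = value  # overwrite the feature of interest x_S
            curve.append(predict(variant))
        curves.append(curve)
    return curves


def partial_dependence(curves):
    """PDP estimate: average the ICE curves point-wise over all rows."""
    return [mean(points) for points in zip(*curves)]


# Hypothetical stand-in for a black-box model (illustration only).
def toy_predict(row):
    return 0.01 * row["age"] + 0.005 * row["hours"]


# Hypothetical toy data and grid (illustration only).
rows = [{"age": 30, "hours": 40}, {"age": 50, "hours": 20}]
grid = [20, 40, 60]

curves = ice_curves(toy_predict, rows, "age", grid)
pdp = partial_dependence(curves)
```

Each entry of `curves` corresponds to one black line in the overlay plot, and `pdp` corresponds to the red line. The cost matches the notebook's remark: every grid value requires one pass over the rows.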
Collaborator

nit: Perhaps pump this through a JSON formatter so the diff lines going forward will be nice

Contributor Author

Can't find how to do it :( But you can hit "View file" and it displays the file beautifully. I fixed the rest of the comments.
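
For the formatting nit, one possible approach is a few lines of Python using only the standard `json` module. This is a sketch under assumptions, not the project's tooling: the helper names `format_notebook_text` and `format_notebook_file` are hypothetical, and `indent=1` is just one reasonable choice for a diff-friendly layout.

```python
import json


def format_notebook_text(raw, indent=1):
    """Re-serialize single-line notebook JSON with one key or item per line."""
    notebook = json.loads(raw)
    # ensure_ascii=False keeps characters such as typographic quotes readable.
    return json.dumps(notebook, indent=indent, ensure_ascii=False) + "\n"


def format_notebook_file(path, indent=1):
    """Rewrite the notebook at `path` in place (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as f:
        formatted = format_notebook_text(f.read(), indent)
    with open(path, "w", encoding="utf-8") as f:
        f.write(formatted)
```

Because the content is parsed and re-serialized, the cells are unchanged; only the line layout differs, so later diffs touch individual lines instead of the whole file.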

@mhamilton723
Collaborator

mhamilton723 commented Dec 23, 2021

pdp = ICETransformer(model=model, targetCol="probability", kind="average", targetClasses=[1]).\
  setCategoricalFeatures(categorical_features).\
  setNumericFeatures(numeric_features).setNumSamples(50)

^ Might want to use either setters or init args for consistency here and elsewhere

@mhamilton723
Collaborator

plt.title("")

Might not need this line if you don't want titles

@mhamilton723
Collaborator

col_name = 'age_dependence'
age_dep = get_pandas_df_from_column(output, col_name)

Could remove some duplication with
age_dep = get_pandas_df_from_column(output, 'age_dependence')

@mhamilton723 mhamilton723 requested a review from memoryz December 23, 2021 04:57
@codecov-commenter

codecov-commenter commented Dec 23, 2021

Codecov Report

Merging #1318 (e83ab9a) into master (906b408) will increase coverage by 0.08%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1318      +/-   ##
==========================================
+ Coverage   84.69%   84.77%   +0.08%     
==========================================
  Files         284      287       +3     
  Lines       13916    14231     +315     
  Branches      673      732      +59     
==========================================
+ Hits        11786    12065     +279     
- Misses       2130     2166      +36     
Impacted Files Coverage Δ
...rosoft/azure/synapse/ml/train/TrainRegressor.scala 86.53% <0.00%> (-0.97%) ⬇️
...osoft/azure/synapse/ml/train/TrainClassifier.scala 82.57% <0.00%> (-0.76%) ⬇️
...oft/azure/synapse/ml/core/schema/SparkSchema.scala 82.60% <0.00%> (-0.73%) ⬇️
...oft/azure/synapse/ml/io/http/HTTPTransformer.scala 93.47% <0.00%> (-0.14%) ⬇️
...ala/org/apache/spark/ml/param/DataFrameParam.scala 70.83% <0.00%> (ø)
...osoft/azure/synapse/ml/core/contracts/Params.scala 95.65% <0.00%> (ø)
...azure/synapse/ml/geospatial/AzureMapsHelpers.scala
.../azure/synapse/ml/geospatial/AzureMapsSearch.scala
...rosoft/azure/synapse/ml/geospatial/Geocoders.scala 95.65% <0.00%> (ø)
...synapse/ml/cognitive/TextAnalyticsSDKSchemas.scala 81.19% <0.00%> (ø)
... and 9 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8b64094...e83ab9a. Read the comment docs.

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@memoryz
Contributor

memoryz commented Jan 13, 2022

@ezherdeva please reformat the notebook in JupyterLab so it's easier to review and comment on.

@memoryz
Contributor

memoryz commented Jan 13, 2022

Also, please move the notebook to the notebooks/features/responsible_ai folder.

@ezherdeva
Contributor Author

ezherdeva commented Jan 14, 2022

I've changed everything according to the comments. I also reformatted the JSON, so it's easy to comment on now. Please have a look, @memoryz.

@ezherdeva
Contributor Author

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 1318 in repo microsoft/SynapseML

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 mhamilton723 merged commit 059732a into microsoft:master Jan 14, 2022

4 participants