From df31a328fa554864ff25378edfdcfb2be1686edf Mon Sep 17 00:00:00 2001
From: Eric Pham-Hung <ephamhung@ephamhung-mlt.client.nvidia.com>
Date: Mon, 20 Oct 2025 13:14:17 -0700
Subject: [PATCH 1/4] adding extrinsic eval notebook

Signed-off-by: Eric Pham-Hung <ephamhung@ephamhung-mlt.client.nvidia.com>
---
 ..._synthesizer_101_with_extrinsic_eval.ipynb | 990 ++++++++++++++++++
 1 file changed, 990 insertions(+)
 create mode 100644 nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101_with_extrinsic_eval.ipynb
diff --git a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101_with_extrinsic_eval.ipynb b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101_with_extrinsic_eval.ipynb
new file mode 100644
index 000000000..8ea636697
--- /dev/null
+++ b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101_with_extrinsic_eval.ipynb
@@ -0,0 +1,990 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "630e3e17",
+   "metadata": {},
+   "source": [
+    "# 🎛️ NeMo Safe Synthesizer 101: The Basics\n",
+    "\n",
+    "> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.\n",
+    "\n",
+    "<br> \n",
+    "\n",
+    "In this notebook, we demonstrate how to create a synthetic version of a tabular dataset using the NeMo Microservices Python SDK. The notebook should take about 20 minutes to run.\n",
+    "\n",
+    "After completing this notebook, you'll be able to:\n",
+    "- Use the NeMo Microservices SDK to interact with Safe Synthesizer\n",
+    "- Create novel synthetic data that follows the statistical properties of your input dataset\n",
+    "- Access an evaluation report on synthetic data quality and privacy\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8be84f5d",
+   "metadata": {},
+   "source": [
+    "#### 💾 Install dependencies\n",
+    "\n",
+    "**IMPORTANT** 👉 Ensure you have a NeMo Microservices Platform deployment available. Follow the quickstart or Helm chart instructions in your environment's setup guide. You may need to restart your kernel after installing dependencies.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "9f5d6f5a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/home/shadeform/GenerativeAIExamples/nemo/NeMo-Safe-Synthesizer/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+      "  from .autonotebook import tqdm as notebook_tqdm\n"
+     ]
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "from nemo_microservices import NeMoMicroservices\n",
+    "from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder\n",
+    "\n",
+    "import logging\n",
+    "logging.basicConfig(level=logging.WARNING)\n",
+    "logging.getLogger(\"httpx\").setLevel(logging.WARNING)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "53bb2807",
+   "metadata": {},
+   "source": [
+    "### ⚙️ Initialize the NeMo Safe Synthesizer Client\n",
+    "\n",
+    "- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.\n",
+    "- `http://localhost:8080` is the default url for the client's `base_url` in the quickstart.\n",
+    "- If using a managed or remote deployment, ensure correct base URLs and tokens.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "8c15ab93",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client = NeMoMicroservices(\n",
+    "    base_url=\"http://localhost:8080\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "74d72ef7",
+   "metadata": {},
+   "source": [
+    "NeMo DataStore is launched as one of the services, and we'll use it to manage our storage. so we'll set the following:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "ab037a3a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "datastore_config = {\n",
+    "    \"endpoint\": \"http://localhost:3000/v1/hf\",\n",
+    "    \"token\": \"\",\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2d66c819",
+   "metadata": {},
+   "source": [
+    "## 📥 Load input data\n",
+    "\n",
+    "Safe Synthesizer learns the patterns and correlations in your input dataset to produce synthetic data with similar properties. For this tutorial, we will use a small public sample dataset. Replace it with your own data if desired.\n",
+    "\n",
+    "The sample dataset used here is a set of women's clothing reviews, including age, product category, rating, and review text. Some of the reviews contain Personally Identifiable Information (PII), such as height, weight, age, and location."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "daa955b6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/home/shadeform/GenerativeAIExamples/nemo/NeMo-Safe-Synthesizer/.venv/bin/python: No module named pip\n",
+      "/bin/bash: line 1: uv: command not found\n",
+      "Note: you may need to restart the kernel to use updated packages.\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pip install kagglehub || uv pip install kagglehub"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "7204f213",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Downloading from https://www.kaggle.com/api/v1/datasets/download/nicapotato/womens-ecommerce-clothing-reviews?dataset_version_number=1...\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 2.79M/2.79M [00:00<00:00, 70.2MB/s]"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Extracting files...\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Clothing ID</th>\n",
+       "      <th>Age</th>\n",
+       "      <th>Title</th>\n",
+       "      <th>Review Text</th>\n",
+       "      <th>Rating</th>\n",
+       "      <th>Recommended IND</th>\n",
+       "      <th>Positive Feedback Count</th>\n",
+       "      <th>Division Name</th>\n",
+       "      <th>Department Name</th>\n",
+       "      <th>Class Name</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>767</td>\n",
+       "      <td>33</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>Absolutely wonderful - silky and sexy and comf...</td>\n",
+       "      <td>4</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>Initmates</td>\n",
+       "      <td>Intimate</td>\n",
+       "      <td>Intimates</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>1080</td>\n",
+       "      <td>34</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>Love this dress!  it's sooo pretty.  i happene...</td>\n",
+       "      <td>5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>4</td>\n",
+       "      <td>General</td>\n",
+       "      <td>Dresses</td>\n",
+       "      <td>Dresses</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>1077</td>\n",
+       "      <td>60</td>\n",
+       "      <td>Some major design flaws</td>\n",
+       "      <td>I had such high hopes for this dress and reall...</td>\n",
+       "      <td>3</td>\n",
+       "      <td>0</td>\n",
+       "      <td>0</td>\n",
+       "      <td>General</td>\n",
+       "      <td>Dresses</td>\n",
+       "      <td>Dresses</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>1049</td>\n",
+       "      <td>50</td>\n",
+       "      <td>My favorite buy!</td>\n",
+       "      <td>I love, love, love this jumpsuit. it's fun, fl...</td>\n",
+       "      <td>5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>General Petite</td>\n",
+       "      <td>Bottoms</td>\n",
+       "      <td>Pants</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>847</td>\n",
+       "      <td>47</td>\n",
+       "      <td>Flattering shirt</td>\n",
+       "      <td>This shirt is very flattering to all due to th...</td>\n",
+       "      <td>5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>6</td>\n",
+       "      <td>General</td>\n",
+       "      <td>Tops</td>\n",
+       "      <td>Blouses</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   Clothing ID  Age                    Title  \\\n",
+       "0          767   33                      NaN   \n",
+       "1         1080   34                      NaN   \n",
+       "2         1077   60  Some major design flaws   \n",
+       "3         1049   50         My favorite buy!   \n",
+       "4          847   47         Flattering shirt   \n",
+       "\n",
+       "                                         Review Text  Rating  Recommended IND  \\\n",
+       "0  Absolutely wonderful - silky and sexy and comf...       4                1   \n",
+       "1  Love this dress!  it's sooo pretty.  i happene...       5                1   \n",
+       "2  I had such high hopes for this dress and reall...       3                0   \n",
+       "3  I love, love, love this jumpsuit. it's fun, fl...       5                1   \n",
+       "4  This shirt is very flattering to all due to th...       5                1   \n",
+       "\n",
+       "   Positive Feedback Count   Division Name Department Name Class Name  \n",
+       "0                        0       Initmates        Intimate  Intimates  \n",
+       "1                        4         General         Dresses    Dresses  \n",
+       "2                        0         General         Dresses    Dresses  \n",
+       "3                        0  General Petite         Bottoms      Pants  \n",
+       "4                        6         General            Tops    Blouses  "
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import kagglehub\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Download latest version\n",
+    "path = kagglehub.dataset_download(\"nicapotato/womens-ecommerce-clothing-reviews\")\n",
+    "df = pd.read_csv(f\"{path}/Womens Clothing E-Commerce Reviews.csv\", index_col=0)\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "87d72c68",
+   "metadata": {},
+   "source": [
+    "## 🏗️ Create a Safe Synthesizer job\n",
+    "\n",
+    "The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.\n",
+    "\n",
+    "The following code creates and submits a job:\n",
+    "- `SafeSynthesizerBuilder(client)`: initialize with the NeMo Microservices client.\n",
+    "- `.from_data_source(df)`: set the input data source.\n",
+    "- `.with_datastore(datastore_config)`: configure model artifact storage.\n",
+    "- `.with_replace_pii()`: enable automatic replacement of PII.\n",
+    "- `.synthesize()`: train and generate synthetic data.\n",
+    "- `.create_job()`: submit the job to the platform.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "85d9de56",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "job = (\n",
+    "    SafeSynthesizerBuilder(client)\n",
+    "    .from_data_source(df)\n",
+    "    .with_datastore(datastore_config)\n",
+    "    .with_replace_pii()\n",
+    "    .synthesize()\n",
+    "    .create_job()\n",
+    ")\n",
+    "\n",
+    "print(f\"job_id = {job.job_id}\")\n",
+    "job.wait_for_completion()\n",
+    "\n",
+    "print(f\"Job finished with status {job.fetch_status()}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fa2eacb2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# If your notebook shuts down, it's okay, your job is still running on the microservices platform.\n",
+    "# You can get the same job object and interact with it again by uncommenting the following code\n",
+    "# snippet, and modifying it with the job id from the previous cell output.\n",
+    "\n",
+    "# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob\n",
+    "# job = SafeSynthesizerJob(job_id=\"<job id>\", client=client)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "285d4a9d",
+   "metadata": {},
+   "source": [
+    "## 👀 View synthetic data\n",
+    "\n",
+    "After the job completes, fetch the generated synthetic dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "7f25574a",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Clothing ID</th>\n",
+       "      <th>Age</th>\n",
+       "      <th>Title</th>\n",
+       "      <th>Review Text</th>\n",
+       "      <th>Rating</th>\n",
+       "      <th>Recommended IND</th>\n",
+       "      <th>Positive Feedback Count</th>\n",
+       "      <th>Division Name</th>\n",
+       "      <th>Department Name</th>\n",
+       "      <th>Class Name</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>670</td>\n",
+       "      <td>39</td>\n",
+       "      <td>Beautiful in person!</td>\n",
+       "      <td>This top is just beautiful in person! the colo...</td>\n",
+       "      <td>5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>General</td>\n",
+       "      <td>Tops</td>\n",
+       "      <td>Knits</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>186</td>\n",
+       "      <td>49</td>\n",
+       "      <td>Pretty; not flattering</td>\n",
+       "      <td>I love the color of this shirt, and it's very ...</td>\n",
+       "      <td>4</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>General</td>\n",
+       "      <td>Tops</td>\n",
+       "      <td>Blouses</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>382</td>\n",
+       "      <td>46</td>\n",
+       "      <td>Fabulous wool crazy sweater</td>\n",
+       "      <td>This sweater is gorgeous.  the fabric goes wit...</td>\n",
+       "      <td>5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>2</td>\n",
+       "      <td>General</td>\n",
+       "      <td>Jackets</td>\n",
+       "      <td>Jackets</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>186</td>\n",
+       "      <td>38</td>\n",
+       "      <td>Pretty fall top</td>\n",
+       "      <td>I really liked this, all things considered. i ...</td>\n",
+       "      <td>4</td>\n",
+       "      <td>1</td>\n",
+       "      <td>5</td>\n",
+       "      <td>General</td>\n",
+       "      <td>Tops</td>\n",
+       "      <td>Blouses</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>619</td>\n",
+       "      <td>37</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>I ordered the pink color. it's a very pretty t...</td>\n",
+       "      <td>5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>1</td>\n",
+       "      <td>General</td>\n",
+       "      <td>Tops</td>\n",
+       "      <td>Knits</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>995</th>\n",
+       "      <td>228</td>\n",
+       "      <td>56</td>\n",
+       "      <td>White pajamas my favorite pair since when i wa...</td>\n",
+       "      <td>Wish they had more colors. i ordered two pair ...</td>\n",
+       "      <td>5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>General Petite</td>\n",
+       "      <td>Bottoms</td>\n",
+       "      <td>Jackets</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>996</th>\n",
+       "      <td>247</td>\n",
+       "      <td>55</td>\n",
+       "      <td>Great casual shirt for any occasion!</td>\n",
+       "      <td>I love this shirt!! not only is it a great cas...</td>\n",
+       "      <td>5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>2</td>\n",
+       "      <td>General</td>\n",
+       "      <td>Tops</td>\n",
+       "      <td>Knits</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>997</th>\n",
+       "      <td>902</td>\n",
+       "      <td>64</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>General Petite</td>\n",
+       "      <td>Dresses</td>\n",
+       "      <td>Dresses</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>998</th>\n",
+       "      <td>6610</td>\n",
+       "      <td>47</td>\n",
+       "      <td>Amazing color and pattern</td>\n",
+       "      <td>I love the fit of this skirt. i usually wear a...</td>\n",
+       "      <td>5</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>General</td>\n",
+       "      <td>Bottoms</td>\n",
+       "      <td>Skirts</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>999</th>\n",
+       "      <td>392</td>\n",
+       "      <td>42</td>\n",
+       "      <td>Not happy with the finished product</td>\n",
+       "      <td>Great color and style but the cut is not exact...</td>\n",
+       "      <td>4</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>General</td>\n",
+       "      <td>Tops</td>\n",
+       "      <td>Blouses</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>1000 rows × 10 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "     Clothing ID  Age                                              Title  \\\n",
+       "0            670   39                               Beautiful in person!   \n",
+       "1            186   49                             Pretty; not flattering   \n",
+       "2            382   46                        Fabulous wool crazy sweater   \n",
+       "3            186   38                                    Pretty fall top   \n",
+       "4            619   37                                                NaN   \n",
+       "..           ...  ...                                                ...   \n",
+       "995          228   56  White pajamas my favorite pair since when i wa...   \n",
+       "996          247   55               Great casual shirt for any occasion!   \n",
+       "997          902   64                                                NaN   \n",
+       "998         6610   47                          Amazing color and pattern   \n",
+       "999          392   42                Not happy with the finished product   \n",
+       "\n",
+       "                                           Review Text  Rating  \\\n",
+       "0    This top is just beautiful in person! the colo...       5   \n",
+       "1    I love the color of this shirt, and it's very ...       4   \n",
+       "2    This sweater is gorgeous.  the fabric goes wit...       5   \n",
+       "3    I really liked this, all things considered. i ...       4   \n",
+       "4    I ordered the pink color. it's a very pretty t...       5   \n",
+       "..                                                 ...     ...   \n",
+       "995  Wish they had more colors. i ordered two pair ...       5   \n",
+       "996  I love this shirt!! not only is it a great cas...       5   \n",
+       "997                                                NaN       5   \n",
+       "998  I love the fit of this skirt. i usually wear a...       5   \n",
+       "999  Great color and style but the cut is not exact...       4   \n",
+       "\n",
+       "     Recommended IND  Positive Feedback Count   Division Name Department Name  \\\n",
+       "0                  1                        0         General            Tops   \n",
+       "1                  1                        0         General            Tops   \n",
+       "2                  1                        2         General         Jackets   \n",
+       "3                  1                        5         General            Tops   \n",
+       "4                  1                        1         General            Tops   \n",
+       "..               ...                      ...             ...             ...   \n",
+       "995                1                        0  General Petite         Bottoms   \n",
+       "996                1                        2         General            Tops   \n",
+       "997                1                        0  General Petite         Dresses   \n",
+       "998                1                        0         General         Bottoms   \n",
+       "999                1                        0         General            Tops   \n",
+       "\n",
+       "    Class Name  \n",
+       "0        Knits  \n",
+       "1      Blouses  \n",
+       "2      Jackets  \n",
+       "3      Blouses  \n",
+       "4        Knits  \n",
+       "..         ...  \n",
+       "995    Jackets  \n",
+       "996      Knits  \n",
+       "997    Dresses  \n",
+       "998     Skirts  \n",
+       "999    Blouses  \n",
+       "\n",
+       "[1000 rows x 10 columns]"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Fetch the synthetic data created by the job\n",
+    "synthetic_df = job.fetch_data()\n",
+    "synthetic_df\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2b25f152",
+   "metadata": {},
+   "source": [
+    "## 📊 View evaluation report\n",
+    "\n",
+    "An evaluation comparing the synthetic data to the input data is performed automatically. You can:\n",
+    "\n",
+    "- **Inspect key scores**: overall synthetic data quality and privacy.\n",
+    "- **Download the full HTML report**: includes charts and detailed metrics.\n",
+    "- **Display the report inline**: useful when viewing in notebook environments.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "7b691127",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Synthetic data quality score (0-10, higher is better): 8.9\n",
+      "Data privacy score (0-10, higher is better): 8.5\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Print selected information from the job summary\n",
+    "summary = job.fetch_summary()\n",
+    "print(\n",
+    "    f\"Synthetic data quality score (0-10, higher is better): {summary.synthetic_data_quality_score}\"\n",
+    ")\n",
+    "print(f\"Data privacy score (0-10, higher is better): {summary.data_privacy_score}\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "39e62ea9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Download the full evaluation report to your local machine\n",
+    "job.save_report(\"evaluation_report.html\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "45f7e22b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Fetch and display the full evaluation report inline\n",
+    "# job.display_report_in_notebook()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd1e4925-3620-4b31-bc17-16f74d10fbb5",
+   "metadata": {},
+   "source": [
+    "## 🧪 Extrinsic Evaluation \n",
+    "\n",
+    "This section details the **extrinsic evaluation** process, where the quality of the synthetic data is assessed based on how well a model trained on it performs on a real-world task. This comparison is critical for validating the synthetic data's utility.\n",
+    "\n",
+    "- **Train Benchmark Model**: A model is trained on a small, fixed subset of the **original data** to establish a performance baseline.\n",
+    "- **Train Synthetic Model**: A second model, using the same structure, is trained on the **entire synthetic dataset**.\n",
+    "- **Compare Performance**: Both models are evaluated against the same **fixed holdout test set** ($\\mathbf{X_{test}, y_{test}}$).\n",
+    "- **Inspect Key Metrics**: The comparison focuses on key metrics like **ROC AUC** and **F1-Score** to determine if the synthetic model performs comparably to the benchmark."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "id": "37b6df30-6627-4a40-8604-e905ada571b7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
+    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
+    "from sklearn.compose import ColumnTransformer\n",
+    "from sklearn.pipeline import Pipeline\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.metrics import classification_report, accuracy_score, roc_auc_score\n",
+    "from sklearn.base import clone\n",
+    "\n",
+    "# --- 1. Define Features (X) and Target (y) ---\n",
+    "\n",
+    "# Separate features (X) from the binary target variable (y).\n",
+    "X = df.drop('Recommended IND', axis=1)\n",
+    "y = df['Recommended IND']\n",
+    "\n",
+    "# Fill any missing text values with an empty string to prevent TfidfVectorizer errors.\n",
+    "X['Review Text'] = X['Review Text'].fillna('')\n",
+    "X['Title'] = X['Title'].fillna('')\n",
+    "\n",
+    "# --- 2. Fixed-Size Non-Overlapping Data Split (1000 Train, 500 Test) ---\n",
+    "\n",
+    "# Sample 1000 random indices for the training set.\n",
+    "train_indices = X.sample(n=1000, random_state=1).index\n",
+    "X_train = X.loc[train_indices]\n",
+    "y_train = y.loc[train_indices]\n",
+    "\n",
+    "# Create a temporary DataFrame containing only the remaining, unused rows.\n",
+    "remaining_X = X.drop(train_indices)\n",
+    "\n",
+    "# Sample 500 random indices from the remaining data for the holdout test set.\n",
+    "test_indices = remaining_X.sample(n=500, random_state=1).index\n",
+    "X_test = X.loc[test_indices]\n",
+    "y_test = y.loc[test_indices]\n",
+    "\n",
+    "# --- 3. Define Feature Types for Preprocessing ---\n",
+    "\n",
+    "# List the columns for each type of transformation.\n",
+    "text_features = ['Review Text']\n",
+    "numerical_features = ['Age', 'Rating', 'Positive Feedback Count']\n",
+    "categorical_features = ['Division Name', 'Department Name', 'Class Name']\n",
+    "\n",
+    "# --- 4. Define Preprocessing Steps ---\n",
+    "\n",
+    "# Text: Convert text to numerical vectors (max 5000 terms).\n",
+    "text_transformer = TfidfVectorizer(stop_words='english', max_features=5000)\n",
+    "# Numerical: Standardize numerical features (mean=0, std=1).\n",
+    "numerical_transformer = StandardScaler()\n",
+    "# Categorical: One-hot encode categories, ignoring unseen categories in the test set.\n",
+    "categorical_transformer = OneHotEncoder(handle_unknown='ignore')\n",
+    "\n",
+    "# --- 5. Create Column Transformer (The Preprocessor) ---\n",
+    "\n",
+    "# Apply different transformations to the respective column groups.\n",
+    "preprocessor = ColumnTransformer(\n",
+    "    transformers=[\n",
+    "        # TfidfVectorizer only takes one column input.\n",
+    "        ('text', text_transformer, text_features[0]),\n",
+    "        ('num', numerical_transformer, numerical_features),\n",
+    "        ('cat', categorical_transformer, categorical_features)\n",
+    "    ],\n",
+    "    # Exclude non-feature columns (like 'Clothing ID' and 'Title') from the model input.\n",
+    "    remainder='drop'\n",
+    ")\n",
+    "\n",
+    "# --- 6. Define the Classifier and Full Pipeline ---\n",
+    "\n",
+    "# Choose the classification algorithm (Logistic Regression for recommendation prediction).\n",
+    "model = LogisticRegression(solver='liblinear', random_state=42)\n",
+    "\n",
+    "# Build the final Pipeline: Preprocessing followed by Classification.\n",
+    "full_pipeline = Pipeline(steps=[\n",
+    "    ('preprocessor', preprocessor),\n",
+    "    ('classifier', model)\n",
+    "])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "id": "ee747c80-d42f-4ec5-b27b-2b2462436b92",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "--- Training Benchmark Model on Original Data (1000 rows) ---\n",
+      "Benchmark training and evaluation complete.\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Assuming the full_pipeline structure (preprocessor + classifier) has been defined.\n",
+    "\n",
+    "# --- 1. TRAIN BENCHMARK MODEL ON ORIGINAL DATA ---\n",
+    "\n",
+    "# Assign the pipeline to a specific name for the original model.\n",
+    "# NOTE: In a real comparison, you should clone the pipeline here to ensure the original structure is used without modification.\n",
+    "original_pipeline = full_pipeline \n",
+    "print(\"\\n--- Training Benchmark Model on Original Data (1000 rows) ---\")\n",
+    "# Train the pipeline using the fixed, 1000-row original training set.\n",
+    "original_pipeline.fit(X_train, y_train)\n",
+    "\n",
+    "# --- 2. EVALUATE BENCHMARK MODEL ON FIXED TEST SET ---\n",
+    "\n",
+    "# Predict class labels on the 500-row fixed holdout test set.\n",
+    "y_pred_original = original_pipeline.predict(X_test)\n",
+    "# Predict probabilities and extract the probability for the positive class (Recommended=1), needed for ROC AUC.\n",
+    "y_prob_original = original_pipeline.predict_proba(X_test)[:, 1]\n",
+    "\n",
+    "# --- 3. STORE BENCHMARK RESULTS ---\n",
+    "\n",
+    "# Initialize a dictionary to store all model comparison results.\n",
+    "results = {}\n",
+    "results['Original'] = {\n",
+    "    # Calculate and store standard metrics using the fixed test set results.\n",
+    "    'Accuracy': accuracy_score(y_test, y_pred_original),\n",
+    "    'ROC AUC': roc_auc_score(y_test, y_prob_original),\n",
+    "    # Store the full classification report as a dictionary for detailed metric access (Precision, Recall, F1).\n",
+    "    'Classification Report': classification_report(y_test, y_pred_original, output_dict=True)\n",
+    "}\n",
+    "print(\"Benchmark training and evaluation complete.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "id": "cf3f1d59-8c46-4d84-b813-a4adf88a3422",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "--- Training Model on Synthetic Data ---\n",
+      "Synthetic training and evaluation complete.\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Prepare the synthetic data features and target\n",
+    "# Extract features (X_synthetic) and fill missing text values.\n",
+    "X_synthetic = synthetic_df.drop('Recommended IND', axis=1).fillna({'Review Text': '', 'Title': ''})\n",
+    "# Extract the target variable (y_synthetic).\n",
+    "y_synthetic = synthetic_df['Recommended IND']\n",
+    "\n",
+    "# Clone the original pipeline structure (preprocessor and classifier) to ensure a clean, unfitted start.\n",
+    "synthetic_pipeline = clone(full_pipeline) \n",
+    "\n",
+    "# ----------------- TRAIN ON SYNTHETIC DATA -----------------\n",
+    "print(\"\\n--- Training Model on Synthetic Data ---\")\n",
+    "# Fit the pipeline using the entire synthetic dataset.\n",
+    "synthetic_pipeline.fit(X_synthetic, y_synthetic)\n",
+    "\n",
+    "# --- 2. EVALUATE SYNTHETIC MODEL ON FIXED TEST SET ---\n",
+    "\n",
+    "# Make predictions using the synthetic-trained model against the fixed 500-row holdout set (X_test).\n",
+    "y_pred_synthetic = synthetic_pipeline.predict(X_test)\n",
+    "# Extract the probability for the positive class (Recommended=1) for ROC AUC calculation.\n",
+    "y_prob_synthetic = synthetic_pipeline.predict_proba(X_test)[:, 1]\n",
+    "\n",
+    "# --- 3. STORE SYNTHETIC RESULTS ---\n",
+    "\n",
+    "# Store the performance metrics in the results dictionary for side-by-side comparison.\n",
+    "results['Synthetic'] = {\n",
+    "    'Accuracy': accuracy_score(y_test, y_pred_synthetic),\n",
+    "    'ROC AUC': roc_auc_score(y_test, y_prob_synthetic),\n",
+    "    # Store the detailed classification report as a dictionary.\n",
+    "    'Classification Report': classification_report(y_test, y_pred_synthetic, output_dict=True)\n",
+    "}\n",
+    "print(\"Synthetic training and evaluation complete.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "id": "d83e681e-aac2-44d0-83cb-1d93002a725d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "============================================================\n",
+      "             SIDE-BY-SIDE MODEL COMPARISON\n",
+      "             (Tested on 500-Row Holdout Set)\n",
+      "============================================================\n",
+      "|                     |   Original (Benchmark) |   Synthetic |\n",
+      "|:--------------------|-----------------------:|------------:|\n",
+      "| Train Size          |              1000.0000 |   1000.0000 |\n",
+      "| Accuracy            |                 0.9480 |      0.9380 |\n",
+      "| ROC AUC Score       |                 0.9776 |      0.9821 |\n",
+      "| Precision (Class 1) |                 0.9640 |      0.9505 |\n",
+      "| Recall (Class 1)    |                 0.9734 |      0.9758 |\n",
+      "\n",
+      "============================================================\n",
+      "Key Finding:\n",
+      "The Synthetic Model performs AS WELL OR BETTER than the Original Benchmark.\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"\\n\" + \"=\"*60)\n",
+    "print(\"             SIDE-BY-SIDE MODEL COMPARISON\")\n",
+    "print(\"             (Tested on 500-Row Holdout Set)\")\n",
+    "print(\"=\"*60)\n",
+    "\n",
+    "# --- 1. PREPARE COMPARISON DATA ---\n",
+    "\n",
+    "# Compile key performance metrics from both the Original and Synthetic models into a single dictionary.\n",
+    "summary_data = {\n",
+    "    'Model': ['Original (Benchmark)', 'Synthetic'],\n",
+    "    # Record the size of the training data used for each model.\n",
+    "    'Train Size': [len(X_train), len(X_synthetic)],\n",
+    "    # Gather Accuracy and ROC AUC score for overall performance.\n",
+    "    'Accuracy': [results['Original']['Accuracy'], results['Synthetic']['Accuracy']],\n",
+    "    'ROC AUC Score': [results['Original']['ROC AUC'], results['Synthetic']['ROC AUC']],\n",
+    "    # Include key classification metrics (Precision and Recall) for the positive class (Recommended=1).\n",
+    "    'Precision (Class 1)': [results['Original']['Classification Report']['1']['precision'], results['Synthetic']['Classification Report']['1']['precision']],\n",
+    "    'Recall (Class 1)': [results['Original']['Classification Report']['1']['recall'], results['Synthetic']['Classification Report']['1']['recall']],\n",
+    "}\n",
+    "\n",
+    "# --- 2. DISPLAY SUMMARY TABLE ---\n",
+    "\n",
+    "# Convert the summary dictionary to a pandas DataFrame for structured display.\n",
+    "summary_df = pd.DataFrame(summary_data).set_index('Model').T\n",
+    "# Label the row index of the transposed DataFrame as 'Metric'.\n",
+    "summary_df.columns.name = 'Metric'\n",
+    "\n",
+    "# Print the comparison table in Markdown format for clean terminal output, formatting metrics to 4 decimal places.\n",
+    "print(summary_df.to_markdown(floatfmt=\".4f\"))\n",
+    "\n",
+    "print(\"\\n\" + \"=\"*60)\n",
+    "\n",
+    "# --- 3. INTERPRETATION ---\n",
+    "\n",
+    "# Provide a simple, text-based conclusion based on the most robust metric (ROC AUC).\n",
+    "print(\"Key Finding:\")\n",
+    "# Compare ROC AUC scores to determine if the synthetic data generalized as well as the original data.\n",
+    "if results['Synthetic']['ROC AUC'] >= results['Original']['ROC AUC']:\n",
+    "    print(\"The Synthetic Model performs AS WELL OR BETTER than the Original Benchmark.\")\n",
+    "else:\n",
+    "    print(\"The Synthetic Model's performance is slightly lower than the Original Benchmark.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c9b55961-ddac-4d91-aa4d-9646fb72c7be",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "My Virtual Env",
+   "language": "python",
+   "name": "myenv"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.14"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

From 81d8174f56e2fe91da38cb60170a020c65fb89d7 Mon Sep 17 00:00:00 2001
From: Eric Pham-Hung <ephamhung@nvidia.com>
Date: Tue, 4 Nov 2025 16:48:33 -0800
Subject: [PATCH 2/4] updated traintest split and comments

Signed-off-by: Eric Pham-Hung <ephamhung@nvidia.com>
---
 ..._synthesizer_101_with_extrinsic_eval.ipynb | 693 +++---------------
 1 file changed, 104 insertions(+), 589 deletions(-)

diff --git a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101_with_extrinsic_eval.ipynb b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101_with_extrinsic_eval.ipynb
index 8ea636697..af6ca1d8a 100644
--- a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101_with_extrinsic_eval.ipynb
+++ b/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101_with_extrinsic_eval.ipynb
@@ -31,19 +31,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": null,
    "id": "9f5d6f5a",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/home/shadeform/GenerativeAIExamples/nemo/NeMo-Safe-Synthesizer/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
-      "  from .autonotebook import tqdm as notebook_tqdm\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "import pandas as pd\n",
     "from nemo_microservices import NeMoMicroservices\n",
@@ -68,7 +59,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": null,
    "id": "8c15ab93",
    "metadata": {},
    "outputs": [],
@@ -88,14 +79,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
    "id": "ab037a3a",
    "metadata": {},
    "outputs": [],
    "source": [
     "datastore_config = {\n",
     "    \"endpoint\": \"http://localhost:3000/v1/hf\",\n",
-    "    \"token\": \"\",\n",
+    "    \"token\": \"placeholder\",\n",
     "}"
    ]
   },
@@ -113,197 +104,52 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": null,
    "id": "daa955b6",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "/home/shadeform/GenerativeAIExamples/nemo/NeMo-Safe-Synthesizer/.venv/bin/python: No module named pip\n",
-      "/bin/bash: line 1: uv: command not found\n",
-      "Note: you may need to restart the kernel to use updated packages.\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
-    "%pip install kagglehub || uv pip install kagglehub"
+    "# %uv pip install kagglehub, scikit-learn, tabulate"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
    "id": "7204f213",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Downloading from https://www.kaggle.com/api/v1/datasets/download/nicapotato/womens-ecommerce-clothing-reviews?dataset_version_number=1...\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "100%|██████████| 2.79M/2.79M [00:00<00:00, 70.2MB/s]"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Extracting files...\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "\n"
-     ]
-    },
-    {
-     "data": {
-      "text/html": [
-       "<div>\n",
-       "<style scoped>\n",
-       "    .dataframe tbody tr th:only-of-type {\n",
-       "        vertical-align: middle;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe tbody tr th {\n",
-       "        vertical-align: top;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe thead th {\n",
-       "        text-align: right;\n",
-       "    }\n",
-       "</style>\n",
-       "<table border=\"1\" class=\"dataframe\">\n",
-       "  <thead>\n",
-       "    <tr style=\"text-align: right;\">\n",
-       "      <th></th>\n",
-       "      <th>Clothing ID</th>\n",
-       "      <th>Age</th>\n",
-       "      <th>Title</th>\n",
-       "      <th>Review Text</th>\n",
-       "      <th>Rating</th>\n",
-       "      <th>Recommended IND</th>\n",
-       "      <th>Positive Feedback Count</th>\n",
-       "      <th>Division Name</th>\n",
-       "      <th>Department Name</th>\n",
-       "      <th>Class Name</th>\n",
-       "    </tr>\n",
-       "  </thead>\n",
-       "  <tbody>\n",
-       "    <tr>\n",
-       "      <th>0</th>\n",
-       "      <td>767</td>\n",
-       "      <td>33</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>Absolutely wonderful - silky and sexy and comf...</td>\n",
-       "      <td>4</td>\n",
-       "      <td>1</td>\n",
-       "      <td>0</td>\n",
-       "      <td>Initmates</td>\n",
-       "      <td>Intimate</td>\n",
-       "      <td>Intimates</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>1</th>\n",
-       "      <td>1080</td>\n",
-       "      <td>34</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>Love this dress!  it's sooo pretty.  i happene...</td>\n",
-       "      <td>5</td>\n",
-       "      <td>1</td>\n",
-       "      <td>4</td>\n",
-       "      <td>General</td>\n",
-       "      <td>Dresses</td>\n",
-       "      <td>Dresses</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>2</th>\n",
-       "      <td>1077</td>\n",
-       "      <td>60</td>\n",
-       "      <td>Some major design flaws</td>\n",
-       "      <td>I had such high hopes for this dress and reall...</td>\n",
-       "      <td>3</td>\n",
-       "      <td>0</td>\n",
-       "      <td>0</td>\n",
-       "      <td>General</td>\n",
-       "      <td>Dresses</td>\n",
-       "      <td>Dresses</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>3</th>\n",
-       "      <td>1049</td>\n",
-       "      <td>50</td>\n",
-       "      <td>My favorite buy!</td>\n",
-       "      <td>I love, love, love this jumpsuit. it's fun, fl...</td>\n",
-       "      <td>5</td>\n",
-       "      <td>1</td>\n",
-       "      <td>0</td>\n",
-       "      <td>General Petite</td>\n",
-       "      <td>Bottoms</td>\n",
-       "      <td>Pants</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>4</th>\n",
-       "      <td>847</td>\n",
-       "      <td>47</td>\n",
-       "      <td>Flattering shirt</td>\n",
-       "      <td>This shirt is very flattering to all due to th...</td>\n",
-       "      <td>5</td>\n",
-       "      <td>1</td>\n",
-       "      <td>6</td>\n",
-       "      <td>General</td>\n",
-       "      <td>Tops</td>\n",
-       "      <td>Blouses</td>\n",
-       "    </tr>\n",
-       "  </tbody>\n",
-       "</table>\n",
-       "</div>"
-      ],
-      "text/plain": [
-       "   Clothing ID  Age                    Title  \\\n",
-       "0          767   33                      NaN   \n",
-       "1         1080   34                      NaN   \n",
-       "2         1077   60  Some major design flaws   \n",
-       "3         1049   50         My favorite buy!   \n",
-       "4          847   47         Flattering shirt   \n",
-       "\n",
-       "                                         Review Text  Rating  Recommended IND  \\\n",
-       "0  Absolutely wonderful - silky and sexy and comf...       4                1   \n",
-       "1  Love this dress!  it's sooo pretty.  i happene...       5                1   \n",
-       "2  I had such high hopes for this dress and reall...       3                0   \n",
-       "3  I love, love, love this jumpsuit. it's fun, fl...       5                1   \n",
-       "4  This shirt is very flattering to all due to th...       5                1   \n",
-       "\n",
-       "   Positive Feedback Count   Division Name Department Name Class Name  \n",
-       "0                        0       Initmates        Intimate  Intimates  \n",
-       "1                        4         General         Dresses    Dresses  \n",
-       "2                        0         General         Dresses    Dresses  \n",
-       "3                        0  General Petite         Bottoms      Pants  \n",
-       "4                        6         General            Tops    Blouses  "
-      ]
-     },
-     "execution_count": 5,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
+   "outputs": [],
    "source": [
     "import kagglehub\n",
     "import pandas as pd\n",
     "\n",
     "# Download latest version\n",
     "path = kagglehub.dataset_download(\"nicapotato/womens-ecommerce-clothing-reviews\")\n",
-    "df = pd.read_csv(f\"{path}/Womens Clothing E-Commerce Reviews.csv\", index_col=0)\n",
-    "df.head()"
+    "raw_df = pd.read_csv(f\"{path}/Womens Clothing E-Commerce Reviews.csv\", index_col=0)\n",
+    "raw_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c6c331b7",
+   "metadata": {},
+   "source": [
+    "We create a holdout dataset that will only be used for evaluating the end classifier"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "162876c3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "df, test_df = train_test_split(raw_df, test_size=0.2, random_state=42)\n",
+    "\n",
+    "print(f\"Original df length: {len(raw_df)}\")\n",
+    "print(f\"Training df length: {len(df)}\")\n",
+    "print(f\"Testing df length:  {len(test_df)}\")"
    ]
   },
   {
@@ -337,6 +183,7 @@
     "    .with_datastore(datastore_config)\n",
     "    .with_replace_pii()\n",
     "    .synthesize()\n",
+    "    .with_generate(num_records=15000)\n",
     "    .create_job()\n",
     ")\n",
     "\n",
@@ -373,253 +220,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": null,
    "id": "7f25574a",
    "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       "<div>\n",
-       "<style scoped>\n",
-       "    .dataframe tbody tr th:only-of-type {\n",
-       "        vertical-align: middle;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe tbody tr th {\n",
-       "        vertical-align: top;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe thead th {\n",
-       "        text-align: right;\n",
-       "    }\n",
-       "</style>\n",
-       "<table border=\"1\" class=\"dataframe\">\n",
-       "  <thead>\n",
-       "    <tr style=\"text-align: right;\">\n",
-       "      <th></th>\n",
-       "      <th>Clothing ID</th>\n",
-       "      <th>Age</th>\n",
-       "      <th>Title</th>\n",
-       "      <th>Review Text</th>\n",
-       "      <th>Rating</th>\n",
-       "      <th>Recommended IND</th>\n",
-       "      <th>Positive Feedback Count</th>\n",
-       "      <th>Division Name</th>\n",
-       "      <th>Department Name</th>\n",
-       "      <th>Class Name</th>\n",
-       "    </tr>\n",
-       "  </thead>\n",
-       "  <tbody>\n",
-       "    <tr>\n",
-       "      <th>0</th>\n",
-       "      <td>670</td>\n",
-       "      <td>39</td>\n",
-       "      <td>Beautiful in person!</td>\n",
-       "      <td>This top is just beautiful in person! the colo...</td>\n",
-       "      <td>5</td>\n",
-       "      <td>1</td>\n",
-       "      <td>0</td>\n",
-       "      <td>General</td>\n",
-       "      <td>Tops</td>\n",
-       "      <td>Knits</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>1</th>\n",
-       "      <td>186</td>\n",
-       "      <td>49</td>\n",
-       "      <td>Pretty; not flattering</td>\n",
-       "      <td>I love the color of this shirt, and it's very ...</td>\n",
-       "      <td>4</td>\n",
-       "      <td>1</td>\n",
-       "      <td>0</td>\n",
-       "      <td>General</td>\n",
-       "      <td>Tops</td>\n",
-       "      <td>Blouses</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>2</th>\n",
-       "      <td>382</td>\n",
-       "      <td>46</td>\n",
-       "      <td>Fabulous wool crazy sweater</td>\n",
-       "      <td>This sweater is gorgeous.  the fabric goes wit...</td>\n",
-       "      <td>5</td>\n",
-       "      <td>1</td>\n",
-       "      <td>2</td>\n",
-       "      <td>General</td>\n",
-       "      <td>Jackets</td>\n",
-       "      <td>Jackets</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>3</th>\n",
-       "      <td>186</td>\n",
-       "      <td>38</td>\n",
-       "      <td>Pretty fall top</td>\n",
-       "      <td>I really liked this, all things considered. i ...</td>\n",
-       "      <td>4</td>\n",
-       "      <td>1</td>\n",
-       "      <td>5</td>\n",
-       "      <td>General</td>\n",
-       "      <td>Tops</td>\n",
-       "      <td>Blouses</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>4</th>\n",
-       "      <td>619</td>\n",
-       "      <td>37</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>I ordered the pink color. it's a very pretty t...</td>\n",
-       "      <td>5</td>\n",
-       "      <td>1</td>\n",
-       "      <td>1</td>\n",
-       "      <td>General</td>\n",
-       "      <td>Tops</td>\n",
-       "      <td>Knits</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>...</th>\n",
-       "      <td>...</td>\n",
-       "      <td>...</td>\n",
-       "      <td>...</td>\n",
-       "      <td>...</td>\n",
-       "      <td>...</td>\n",
-       "      <td>...</td>\n",
-       "      <td>...</td>\n",
-       "      <td>...</td>\n",
-       "      <td>...</td>\n",
-       "      <td>...</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>995</th>\n",
-       "      <td>228</td>\n",
-       "      <td>56</td>\n",
-       "      <td>White pajamas my favorite pair since when i wa...</td>\n",
-       "      <td>Wish they had more colors. i ordered two pair ...</td>\n",
-       "      <td>5</td>\n",
-       "      <td>1</td>\n",
-       "      <td>0</td>\n",
-       "      <td>General Petite</td>\n",
-       "      <td>Bottoms</td>\n",
-       "      <td>Jackets</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>996</th>\n",
-       "      <td>247</td>\n",
-       "      <td>55</td>\n",
-       "      <td>Great casual shirt for any occasion!</td>\n",
-       "      <td>I love this shirt!! not only is it a great cas...</td>\n",
-       "      <td>5</td>\n",
-       "      <td>1</td>\n",
-       "      <td>2</td>\n",
-       "      <td>General</td>\n",
-       "      <td>Tops</td>\n",
-       "      <td>Knits</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>997</th>\n",
-       "      <td>902</td>\n",
-       "      <td>64</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>5</td>\n",
-       "      <td>1</td>\n",
-       "      <td>0</td>\n",
-       "      <td>General Petite</td>\n",
-       "      <td>Dresses</td>\n",
-       "      <td>Dresses</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>998</th>\n",
-       "      <td>6610</td>\n",
-       "      <td>47</td>\n",
-       "      <td>Amazing color and pattern</td>\n",
-       "      <td>I love the fit of this skirt. i usually wear a...</td>\n",
-       "      <td>5</td>\n",
-       "      <td>1</td>\n",
-       "      <td>0</td>\n",
-       "      <td>General</td>\n",
-       "      <td>Bottoms</td>\n",
-       "      <td>Skirts</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>999</th>\n",
-       "      <td>392</td>\n",
-       "      <td>42</td>\n",
-       "      <td>Not happy with the finished product</td>\n",
-       "      <td>Great color and style but the cut is not exact...</td>\n",
-       "      <td>4</td>\n",
-       "      <td>1</td>\n",
-       "      <td>0</td>\n",
-       "      <td>General</td>\n",
-       "      <td>Tops</td>\n",
-       "      <td>Blouses</td>\n",
-       "    </tr>\n",
-       "  </tbody>\n",
-       "</table>\n",
-       "<p>1000 rows × 10 columns</p>\n",
-       "</div>"
-      ],
-      "text/plain": [
-       "     Clothing ID  Age                                              Title  \\\n",
-       "0            670   39                               Beautiful in person!   \n",
-       "1            186   49                             Pretty; not flattering   \n",
-       "2            382   46                        Fabulous wool crazy sweater   \n",
-       "3            186   38                                    Pretty fall top   \n",
-       "4            619   37                                                NaN   \n",
-       "..           ...  ...                                                ...   \n",
-       "995          228   56  White pajamas my favorite pair since when i wa...   \n",
-       "996          247   55               Great casual shirt for any occasion!   \n",
-       "997          902   64                                                NaN   \n",
-       "998         6610   47                          Amazing color and pattern   \n",
-       "999          392   42                Not happy with the finished product   \n",
-       "\n",
-       "                                           Review Text  Rating  \\\n",
-       "0    This top is just beautiful in person! the colo...       5   \n",
-       "1    I love the color of this shirt, and it's very ...       4   \n",
-       "2    This sweater is gorgeous.  the fabric goes wit...       5   \n",
-       "3    I really liked this, all things considered. i ...       4   \n",
-       "4    I ordered the pink color. it's a very pretty t...       5   \n",
-       "..                                                 ...     ...   \n",
-       "995  Wish they had more colors. i ordered two pair ...       5   \n",
-       "996  I love this shirt!! not only is it a great cas...       5   \n",
-       "997                                                NaN       5   \n",
-       "998  I love the fit of this skirt. i usually wear a...       5   \n",
-       "999  Great color and style but the cut is not exact...       4   \n",
-       "\n",
-       "     Recommended IND  Positive Feedback Count   Division Name Department Name  \\\n",
-       "0                  1                        0         General            Tops   \n",
-       "1                  1                        0         General            Tops   \n",
-       "2                  1                        2         General         Jackets   \n",
-       "3                  1                        5         General            Tops   \n",
-       "4                  1                        1         General            Tops   \n",
-       "..               ...                      ...             ...             ...   \n",
-       "995                1                        0  General Petite         Bottoms   \n",
-       "996                1                        2         General            Tops   \n",
-       "997                1                        0  General Petite         Dresses   \n",
-       "998                1                        0         General         Bottoms   \n",
-       "999                1                        0         General            Tops   \n",
-       "\n",
-       "    Class Name  \n",
-       "0        Knits  \n",
-       "1      Blouses  \n",
-       "2      Jackets  \n",
-       "3      Blouses  \n",
-       "4        Knits  \n",
-       "..         ...  \n",
-       "995    Jackets  \n",
-       "996      Knits  \n",
-       "997    Dresses  \n",
-       "998     Skirts  \n",
-       "999    Blouses  \n",
-       "\n",
-       "[1000 rows x 10 columns]"
-      ]
-     },
-     "execution_count": 7,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
+   "outputs": [],
    "source": [
     "# Fetch the synthetic data created by the job\n",
     "synthetic_df = job.fetch_data()\n",
@@ -642,19 +246,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": null,
    "id": "7b691127",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Synthetic data quality score (0-10, higher is better): 8.9\n",
-      "Data privacy score (0-10, higher is better): 8.5\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "# Print selected information from the job summary\n",
     "summary = job.fetch_summary()\n",
@@ -666,7 +261,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": null,
    "id": "39e62ea9",
    "metadata": {},
    "outputs": [],
@@ -677,7 +272,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": null,
    "id": "45f7e22b",
    "metadata": {},
    "outputs": [],
@@ -703,13 +298,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 47,
+   "execution_count": null,
    "id": "37b6df30-6627-4a40-8604-e905ada571b7",
    "metadata": {},
    "outputs": [],
    "source": [
+    "# This script defines a scikit-learn pipeline for a classification task.\n",
     "import numpy as np\n",
-    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.model_selection import train_test_split \n",
     "from sklearn.feature_extraction.text import TfidfVectorizer\n",
     "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
     "from sklearn.compose import ColumnTransformer\n",
@@ -718,243 +314,162 @@
     "from sklearn.metrics import classification_report, accuracy_score, roc_auc_score\n",
     "from sklearn.base import clone\n",
     "\n",
-    "# --- 1. Define Features (X) and Target (y) ---\n",
-    "\n",
-    "# Separate features (X) from the binary target variable (y).\n",
-    "X = df.drop('Recommended IND', axis=1)\n",
-    "y = df['Recommended IND']\n",
-    "\n",
-    "# Fill any missing text values with an empty string to prevent TfidfVectorizer errors.\n",
-    "X['Review Text'] = X['Review Text'].fillna('')\n",
-    "X['Title'] = X['Title'].fillna('')\n",
-    "\n",
-    "# --- 2. Fixed-Size Non-Overlapping Data Split (1000 Train, 500 Test) ---\n",
+    "X_train = df.drop('Recommended IND', axis=1)\n",
+    "y_train = df['Recommended IND']\n",
     "\n",
-    "# Sample 1000 random indices for the training set.\n",
-    "train_indices = X.sample(n=1000, random_state=1).index\n",
-    "X_train = X.loc[train_indices]\n",
-    "y_train = y.loc[train_indices]\n",
+    "X_train['Review Text'] = X_train['Review Text'].fillna('')\n",
+    "X_train['Title'] = X_train['Title'].fillna('')\n",
     "\n",
-    "# Create a temporary DataFrame containing only the remaining, unused rows.\n",
-    "remaining_X = X.drop(train_indices)\n",
+    "X_test = test_df.drop('Recommended IND', axis=1)\n",
+    "y_test = test_df['Recommended IND']\n",
     "\n",
-    "# Sample 500 random indices from the remaining data for the holdout test set.\n",
-    "test_indices = remaining_X.sample(n=500, random_state=1).index\n",
-    "X_test = X.loc[test_indices]\n",
-    "y_test = y.loc[test_indices]\n",
+    "X_test['Review Text'] = X_test['Review Text'].fillna('')\n",
+    "X_test['Title'] = X_test['Title'].fillna('')\n",
     "\n",
-    "# --- 3. Define Feature Types for Preprocessing ---\n",
-    "\n",
-    "# List the columns for each type of transformation.\n",
     "text_features = ['Review Text']\n",
     "numerical_features = ['Age', 'Rating', 'Positive Feedback Count']\n",
     "categorical_features = ['Division Name', 'Department Name', 'Class Name']\n",
     "\n",
-    "# --- 4. Define Preprocessing Steps ---\n",
-    "\n",
-    "# Text: Convert text to numerical vectors (max 5000 terms).\n",
     "text_transformer = TfidfVectorizer(stop_words='english', max_features=5000)\n",
-    "# Numerical: Standardize numerical features (mean=0, std=1).\n",
     "numerical_transformer = StandardScaler()\n",
-    "# Categorical: One-hot encode categories, ignoring unseen categories in the test set.\n",
-    "categorical_transformer = OneHotEncoder(handle_unknown='ignore')\n",
-    "\n",
-    "# --- 5. Create Column Transformer (The Preprocessor) ---\n",
+    "categorical_transformer = OneHotEncoder(handle_unknown='ignore') \n",
     "\n",
-    "# Apply different transformations to the respective column groups.\n",
     "preprocessor = ColumnTransformer(\n",
     "    transformers=[\n",
-    "        # TfidfVectorizer only takes one column input.\n",
-    "        ('text', text_transformer, text_features[0]),\n",
+    "        ('text', text_transformer, text_features[0]), \n",
     "        ('num', numerical_transformer, numerical_features),\n",
     "        ('cat', categorical_transformer, categorical_features)\n",
     "    ],\n",
-    "    # Exclude non-feature columns (like 'Clothing ID' and 'Title') from the model input.\n",
-    "    remainder='drop'\n",
+    "    remainder='drop' \n",
     ")\n",
     "\n",
-    "# --- 6. Define the Classifier and Full Pipeline ---\n",
-    "\n",
-    "# Choose the classification algorithm (Logistic Regression for recommendation prediction).\n",
     "model = LogisticRegression(solver='liblinear', random_state=42)\n",
     "\n",
-    "# Build the final Pipeline: Preprocessing followed by Classification.\n",
     "full_pipeline = Pipeline(steps=[\n",
     "    ('preprocessor', preprocessor),\n",
     "    ('classifier', model)\n",
-    "])"
+    "])\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 49,
+   "execution_count": null,
    "id": "ee747c80-d42f-4ec5-b27b-2b2462436b92",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\n",
-      "--- Training Benchmark Model on Original Data (1000 rows) ---\n",
-      "Benchmark training and evaluation complete.\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
-    "# Assuming the full_pipeline structure (preprocessor + classifier) has been defined.\n",
-    "\n",
-    "# --- 1. TRAIN BENCHMARK MODEL ON ORIGINAL DATA ---\n",
+    "# Train and evaluate a benchmark model pipeline, storing its performance metrics.\n",
+    "from sklearn.metrics import classification_report, accuracy_score, roc_auc_score\n",
     "\n",
-    "# Assign the pipeline to a specific name for the original model.\n",
-    "# NOTE: In a real comparison, you should clone the pipeline here to ensure the original structure is used without modification.\n",
     "original_pipeline = full_pipeline \n",
     "print(\"\\n--- Training Benchmark Model on Original Data (1000 rows) ---\")\n",
-    "# Train the pipeline using the fixed, 1000-row original training set.\n",
     "original_pipeline.fit(X_train, y_train)\n",
     "\n",
-    "# --- 2. EVALUATE BENCHMARK MODEL ON FIXED TEST SET ---\n",
-    "\n",
-    "# Predict class labels on the 500-row fixed holdout test set.\n",
     "y_pred_original = original_pipeline.predict(X_test)\n",
-    "# Predict probabilities and extract the probability for the positive class (Recommended=1), needed for ROC AUC.\n",
     "y_prob_original = original_pipeline.predict_proba(X_test)[:, 1]\n",
     "\n",
-    "# --- 3. STORE BENCHMARK RESULTS ---\n",
-    "\n",
-    "# Initialize a dictionary to store all model comparison results.\n",
     "results = {}\n",
     "results['Original'] = {\n",
-    "    # Calculate and store standard metrics using the fixed test set results.\n",
     "    'Accuracy': accuracy_score(y_test, y_pred_original),\n",
     "    'ROC AUC': roc_auc_score(y_test, y_prob_original),\n",
-    "    # Store the full classification report as a dictionary for detailed metric access (Precision, Recall, F1).\n",
     "    'Classification Report': classification_report(y_test, y_pred_original, output_dict=True)\n",
     "}\n",
-    "print(\"Benchmark training and evaluation complete.\")"
+    "print(\"Benchmark training and evaluation complete.\")\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 50,
+   "execution_count": null,
    "id": "cf3f1d59-8c46-4d84-b813-a4adf88a3422",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\n",
-      "--- Training Model on Synthetic Data ---\n",
-      "Synthetic training and evaluation complete.\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
-    "# Prepare the synthetic data features and target\n",
-    "# Extract features (X_synthetic) and fill missing text values.\n",
+    "# Train a new model pipeline on synthetic data and evaluates it against the test set.\n",
+    "from sklearn.base import clone\n",
+    "from sklearn.metrics import classification_report, accuracy_score, roc_auc_score\n",
+    "\n",
     "X_synthetic = synthetic_df.drop('Recommended IND', axis=1).fillna({'Review Text': '', 'Title': ''})\n",
-    "# Extract the target variable (y_synthetic).\n",
     "y_synthetic = synthetic_df['Recommended IND']\n",
     "\n",
-    "# Clone the original pipeline structure (preprocessor and classifier) to ensure a clean, unfitted start.\n",
     "synthetic_pipeline = clone(full_pipeline) \n",
     "\n",
-    "# ----------------- TRAIN ON SYNTHETIC DATA -----------------\n",
     "print(\"\\n--- Training Model on Synthetic Data ---\")\n",
-    "# Fit the pipeline using the entire synthetic dataset.\n",
     "synthetic_pipeline.fit(X_synthetic, y_synthetic)\n",
     "\n",
-    "# --- 2. EVALUATE SYNTHETIC MODEL ON FIXED TEST SET ---\n",
-    "\n",
-    "# Make predictions using the synthetic-trained model against the fixed 500-row holdout set (X_test).\n",
     "y_pred_synthetic = synthetic_pipeline.predict(X_test)\n",
-    "# Extract the probability for the positive class (Recommended=1) for ROC AUC calculation.\n",
     "y_prob_synthetic = synthetic_pipeline.predict_proba(X_test)[:, 1]\n",
     "\n",
-    "# --- 3. STORE SYNTHETIC RESULTS ---\n",
-    "\n",
-    "# Store the performance metrics in the results dictionary for side-by-side comparison.\n",
     "results['Synthetic'] = {\n",
     "    'Accuracy': accuracy_score(y_test, y_pred_synthetic),\n",
     "    'ROC AUC': roc_auc_score(y_test, y_prob_synthetic),\n",
-    "    # Store the detailed classification report as a dictionary.\n",
     "    'Classification Report': classification_report(y_test, y_pred_synthetic, output_dict=True)\n",
     "}\n",
-    "print(\"Synthetic training and evaluation complete.\")"
+    "print(\"Synthetic training and evaluation complete.\")\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 51,
+   "execution_count": null,
    "id": "d83e681e-aac2-44d0-83cb-1d93002a725d",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "\n",
-      "============================================================\n",
-      "             SIDE-BY-SIDE MODEL COMPARISON\n",
-      "             (Tested on 500-Row Holdout Set)\n",
-      "============================================================\n",
-      "|                     |   Original (Benchmark) |   Synthetic |\n",
-      "|:--------------------|-----------------------:|------------:|\n",
-      "| Train Size          |              1000.0000 |   1000.0000 |\n",
-      "| Accuracy            |                 0.9480 |      0.9380 |\n",
-      "| ROC AUC Score       |                 0.9776 |      0.9821 |\n",
-      "| Precision (Class 1) |                 0.9640 |      0.9505 |\n",
-      "| Recall (Class 1)    |                 0.9734 |      0.9758 |\n",
-      "\n",
-      "============================================================\n",
-      "Key Finding:\n",
-      "The Synthetic Model performs AS WELL OR BETTER than the Original Benchmark.\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
+    "# Compare the performance of the original and synthetic models and prints a summary.\n",
+    "import pandas as pd\n",
+    "\n",
     "print(\"\\n\" + \"=\"*60)\n",
     "print(\"             SIDE-BY-SIDE MODEL COMPARISON\")\n",
-    "print(\"             (Tested on 500-Row Holdout Set)\")\n",
+    "print(f\"             (Tested on {len(test_df)}-Row Holdout Set)\")\n",
     "print(\"=\"*60)\n",
     "\n",
-    "# --- 1. PREPARE COMPARISON DATA ---\n",
-    "\n",
-    "# Compile key performance metrics from both the Original and Synthetic models into a single dictionary.\n",
     "summary_data = {\n",
     "    'Model': ['Original (Benchmark)', 'Synthetic'],\n",
-    "    # Record the size of the training data used for each model.\n",
     "    'Train Size': [len(X_train), len(X_synthetic)],\n",
-    "    # Gather Accuracy and ROC AUC score for overall performance.\n",
     "    'Accuracy': [results['Original']['Accuracy'], results['Synthetic']['Accuracy']],\n",
     "    'ROC AUC Score': [results['Original']['ROC AUC'], results['Synthetic']['ROC AUC']],\n",
-    "    # Include key classification metrics (Precision and Recall) for the positive class (Recommended=1).\n",
     "    'Precision (Class 1)': [results['Original']['Classification Report']['1']['precision'], results['Synthetic']['Classification Report']['1']['precision']],\n",
     "    'Recall (Class 1)': [results['Original']['Classification Report']['1']['recall'], results['Synthetic']['Classification Report']['1']['recall']],\n",
     "}\n",
     "\n",
-    "# --- 2. DISPLAY SUMMARY TABLE ---\n",
-    "\n",
-    "# Convert the summary dictionary to a pandas DataFrame for structured display.\n",
     "summary_df = pd.DataFrame(summary_data).set_index('Model').T\n",
-    "# Label the row index of the transposed DataFrame as 'Metric'.\n",
     "summary_df.columns.name = 'Metric'\n",
     "\n",
-    "# Print the comparison table in Markdown format for clean terminal output, formatting metrics to 4 decimal places.\n",
     "print(summary_df.to_markdown(floatfmt=\".4f\"))\n",
     "\n",
     "print(\"\\n\" + \"=\"*60)\n",
     "\n",
-    "# --- 3. INTERPRETATION ---\n",
-    "\n",
-    "# Provide a simple, text-based conclusion based on the most robust metric (ROC AUC).\n",
     "print(\"Key Finding:\")\n",
-    "# Compare ROC AUC scores to determine if the synthetic data generalized as well as the original data.\n",
     "if results['Synthetic']['ROC AUC'] >= results['Original']['ROC AUC']:\n",
     "    print(\"The Synthetic Model performs AS WELL OR BETTER than the Original Benchmark.\")\n",
     "else:\n",
-    "    print(\"The Synthetic Model's performance is slightly lower than the Original Benchmark.\")"
+    "    print(\"The Synthetic Model's performance is slightly lower than the Original Benchmark.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "169f443d",
+   "metadata": {},
+   "source": [
+    "Your end result should look similar to this:\n",
+    "\n",
+    "============================================================\n",
+    "\n",
+    "             SIDE-BY-SIDE MODEL COMPARISON\n",
+    "             (Tested on 4698-Row Holdout Set)\n",
+    "============================================================\n",
+    "|                     |   Original (Benchmark) |   Synthetic |\n",
+    "|:--------------------|-----------------------:|------------:|\n",
+    "| Train Size          |                 18,788 |      15,000 |\n",
+    "| Accuracy            |                 0.9404 |      0.9278 |\n",
+    "| ROC AUC Score       |                 0.9782 |      0.9762 |\n",
+    "| Precision (Class 1) |                 0.9626 |      0.9423 |\n",
+    "| Recall (Class 1)    |                 0.9646 |      0.9714 |\n",
+    "\n",
+    "============================================================\n",
+    "\n",
+    "Key Finding:\n",
+    "\n",
+    "The Synthetic Model's performance is slightly lower than the Original Benchmark.\n"
    ]
   },
   {

From 25cc972bbbeb971531df35d1e920c7820891502e Mon Sep 17 00:00:00 2001
From: Eric Pham-Hung <ephamhung@nvidia.com>
Date: Tue, 4 Nov 2025 16:51:39 -0800
Subject: [PATCH 3/4] moved to advanced

Signed-off-by: Eric Pham-Hung <ephamhung@nvidia.com>
---
 .../safe_synthesizer_101_with_extrinsic_eval.ipynb   | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)
 rename nemo/NeMo-Safe-Synthesizer/{intro => advanced}/safe_synthesizer_101_with_extrinsic_eval.ipynb (97%)

diff --git a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101_with_extrinsic_eval.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/safe_synthesizer_101_with_extrinsic_eval.ipynb
similarity index 97%
rename from nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101_with_extrinsic_eval.ipynb
rename to nemo/NeMo-Safe-Synthesizer/advanced/safe_synthesizer_101_with_extrinsic_eval.ipynb
index af6ca1d8a..d8f73c74c 100644
--- a/nemo/NeMo-Safe-Synthesizer/intro/safe_synthesizer_101_with_extrinsic_eval.ipynb
+++ b/nemo/NeMo-Safe-Synthesizer/advanced/safe_synthesizer_101_with_extrinsic_eval.ipynb
@@ -452,11 +452,6 @@
    "source": [
     "Your end result should look similar to this:\n",
     "\n",
-    "============================================================\n",
-    "\n",
-    "             SIDE-BY-SIDE MODEL COMPARISON\n",
-    "             (Tested on 4698-Row Holdout Set)\n",
-    "============================================================\n",
     "|                     |   Original (Benchmark) |   Synthetic |\n",
     "|:--------------------|-----------------------:|------------:|\n",
     "| Train Size          |                 18,788 |      15,000 |\n",
@@ -464,12 +459,7 @@
     "| ROC AUC Score       |                 0.9782 |      0.9762 |\n",
     "| Precision (Class 1) |                 0.9626 |      0.9423 |\n",
     "| Recall (Class 1)    |                 0.9646 |      0.9714 |\n",
-    "\n",
-    "============================================================\n",
-    "\n",
-    "Key Finding:\n",
-    "\n",
-    "The Synthetic Model's performance is slightly lower than the Original Benchmark.\n"
+    "\n"
    ]
   },
   {

From 37fcd1200c3cb2b862d5c1da3eeb19d6076d240b Mon Sep 17 00:00:00 2001
From: Eric Pham-Hung <ephamhung@nvidia.com>
Date: Tue, 4 Nov 2025 16:58:53 -0800
Subject: [PATCH 4/4] updating intro

---
 ..._eval.ipynb => extrinsic_evaluation.ipynb} | 21 +++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)
 rename nemo/NeMo-Safe-Synthesizer/advanced/{safe_synthesizer_101_with_extrinsic_eval.ipynb => extrinsic_evaluation.ipynb} (92%)

diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/safe_synthesizer_101_with_extrinsic_eval.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/extrinsic_evaluation.ipynb
similarity index 92%
rename from nemo/NeMo-Safe-Synthesizer/advanced/safe_synthesizer_101_with_extrinsic_eval.ipynb
rename to nemo/NeMo-Safe-Synthesizer/advanced/extrinsic_evaluation.ipynb
index d8f73c74c..83da15e59 100644
--- a/nemo/NeMo-Safe-Synthesizer/advanced/safe_synthesizer_101_with_extrinsic_eval.ipynb
+++ b/nemo/NeMo-Safe-Synthesizer/advanced/extrinsic_evaluation.ipynb
@@ -5,18 +5,27 @@
    "id": "630e3e17",
    "metadata": {},
    "source": [
-    "# 🎛️ NeMo Safe Synthesizer 101: The Basics\n",
+    "# 🎛️ NeMo Safe Synthesizer 101: Extrinsic Evaluation\n",
     "\n",
     "> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.\n",
     "\n",
     "<br> \n",
     "\n",
-    "In this notebook, we demonstrate how to create a synthetic version of a tabular dataset using the NeMo Microservices Python SDK. The notebook should take about 20 minutes to run.\n",
+    "In this notebook, we build off the foundational concepts from the *NeMo Safe Synthesizer 101: Data Generation* notebook. While the first notebook focused on *how* to generate synthetic data, this one focuses on **how to measure its quality and utility** for real-world applications.\n",
     "\n",
-    "After completing this notebook, you'll be able to:\n",
-    "- Use the NeMo Microservices SDK to interact with Safe Synthesizer\n",
-    "- Create novel synthetic data that follows the statistical properties of your input dataset\n",
-    "- Access an evaluation report on synthetic data quality and privacy\n"
+    "We'll do this using a common method called **extrinsic evaluation**, which involves testing the synthetic data's performance on a downstream machine learning task.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 🎯 What is Extrinsic Evaluation?\n",
+    "\n",
+    "Extrinsic evaluation measures the **utility** of synthetic data by using it to train a model for a specific task. This contrasts with *intrinsic* evaluation, which might only measure the statistical similarity between the synthetic and real data.\n",
+    "\n",
+    "In this notebook, we'll use a **simple classification task** as our benchmark. The core idea is to answer the question:\n",
+    "\n",
+    "> \"Can a model trained **only** on our *synthetic data* achieve comparable performance to a model trained on the *real data*?\"\n",
+    "\n",
+    "If the answer is yes, it's a strong signal that our synthetic data has successfully captured the important patterns, relationships, and statistical properties of the original dataset. This is the \"Train-on-Synthetic, Test-on-Real\" approach."
    ]
   },
   {

	Clothing ID	Age	Title	Review Text	Rating	Recommended IND	Positive Feedback Count	Division Name	Department Name	Class Name
0	767	33	NaN	Absolutely wonderful - silky and sexy and comf...	4	1	0	Initmates	Intimate	Intimates
1	1080	34	NaN	Love this dress! it's sooo pretty. i happene...	5	1	4	General	Dresses	Dresses
2	1077	60	Some major design flaws	I had such high hopes for this dress and reall...	3	0	0	General	Dresses	Dresses
3	1049	50	My favorite buy!	I love, love, love this jumpsuit. it's fun, fl...	5	1	0	General Petite	Bottoms	Pants
4	847	47	Flattering shirt	This shirt is very flattering to all due to th...	5	1	6	General	Tops	Blouses
	Clothing ID	Age	Title	Review Text	Rating	Recommended IND	Positive Feedback Count	Division Name	Department Name	Class Name
0	670	39	Beautiful in person!	This top is just beautiful in person! the colo...	5	1	0	General	Tops	Knits
1	186	49	Pretty; not flattering	I love the color of this shirt, and it's very ...	4	1	0	General	Tops	Blouses
2	382	46	Fabulous wool crazy sweater	This sweater is gorgeous. the fabric goes wit...	5	1	2	General	Jackets	Jackets
3	186	38	Pretty fall top	I really liked this, all things considered. i ...	4	1	5	General	Tops	Blouses
4	619	37	NaN	I ordered the pink color. it's a very pretty t...	5	1	1	General	Tops	Knits
...	...	...	...	...	...	...	...	...	...	...
995	228	56	White pajamas my favorite pair since when i wa...	Wish they had more colors. i ordered two pair ...	5	1	0	General Petite	Bottoms	Jackets
996	247	55	Great casual shirt for any occasion!	I love this shirt!! not only is it a great cas...	5	1	2	General	Tops	Knits
997	902	64	NaN	NaN	5	1	0	General Petite	Dresses	Dresses
998	6610	47	Amazing color and pattern	I love the fit of this skirt. i usually wear a...	5	1	0	General	Bottoms	Skirts
999	392	42	Not happy with the finished product	Great color and style but the cut is not exact...	4	1	0	General	Tops	Blouses