diff --git a/nemo/NeMo-Safe-Synthesizer/advanced/extrinsic_evaluation.ipynb b/nemo/NeMo-Safe-Synthesizer/advanced/extrinsic_evaluation.ipynb
new file mode 100644
index 00000000..83da15e5
--- /dev/null
+++ b/nemo/NeMo-Safe-Synthesizer/advanced/extrinsic_evaluation.ipynb
@@ -0,0 +1,551 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "630e3e17",
+ "metadata": {},
+ "source": [
+ "# 🏋️ NeMo Safe Synthesizer 101: Extrinsic Evaluation\n",
+ "\n",
+ "> ⚠️ **Warning**: NeMo Safe Synthesizer is in Early Access and not recommended for production use.\n",
+ "\n",
+ "\n",
+ "In this notebook, we build on the foundational concepts from the *NeMo Safe Synthesizer 101: Data Generation* notebook. While the first notebook focused on *how* to generate synthetic data, this one focuses on **how to measure its quality and utility** for real-world applications.\n",
+ "\n",
+ "We'll do this using a common method called **extrinsic evaluation**, which involves testing the synthetic data's performance on a downstream machine learning task.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "## 🎯 What is Extrinsic Evaluation?\n",
+ "\n",
+ "Extrinsic evaluation measures the **utility** of synthetic data by using it to train a model for a specific task. This contrasts with *intrinsic* evaluation, which might only measure the statistical similarity between the synthetic and real data.\n",
+ "\n",
+ "In this notebook, we'll use a **simple classification task** as our benchmark. The core idea is to answer the question:\n",
+ "\n",
+ "> \"Can a model trained **only** on our *synthetic data* achieve comparable performance to a model trained on the *real data*?\"\n",
+ "\n",
+ "If the answer is yes, it's a strong signal that our synthetic data has successfully captured the important patterns, relationships, and statistical properties of the original dataset. This is the \"Train-on-Synthetic, Test-on-Real\" approach."
+ ]
+ },
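+ {
+ "cell_type": "markdown",
+ "id": "a3f91b20",
+ "metadata": {},
+ "source": [
+ "To make the protocol concrete, here is a minimal sketch of the \"Train-on-Synthetic, Test-on-Real\" loop, where `train_model` and `score` stand in for whatever estimator and metric you choose (the rest of this notebook uses a scikit-learn pipeline and ROC AUC):\n",
+ "\n",
+ "```python\n",
+ "def tstr_gap(train_model, score, real_train, synthetic_train, real_test):\n",
+ "    \"\"\"Train-on-Synthetic, Test-on-Real: both models share one real test set.\"\"\"\n",
+ "    baseline = score(train_model(real_train), real_test)\n",
+ "    candidate = score(train_model(synthetic_train), real_test)\n",
+ "    return baseline - candidate  # small gap => high-utility synthetic data\n",
+ "```"
+ ]
+ },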
+ {
+ "cell_type": "markdown",
+ "id": "8be84f5d",
+ "metadata": {},
+ "source": [
+ "#### 💾 Install dependencies\n",
+ "\n",
+ "**IMPORTANT** 👉 Ensure you have a NeMo Microservices Platform deployment available. Follow the quickstart or Helm chart instructions in your environment's setup guide. You may need to restart your kernel after installing dependencies.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9f5d6f5a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "from nemo_microservices import NeMoMicroservices\n",
+ "from nemo_microservices.beta.safe_synthesizer.builder import SafeSynthesizerBuilder\n",
+ "\n",
+ "import logging\n",
+ "logging.basicConfig(level=logging.WARNING)\n",
+ "logging.getLogger(\"httpx\").setLevel(logging.WARNING)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "53bb2807",
+ "metadata": {},
+ "source": [
+ "### ⚙️ Initialize the NeMo Safe Synthesizer Client\n",
+ "\n",
+ "- The Python SDK provides a wrapper around the NeMo Microservices Platform APIs.\n",
+ "- `http://localhost:8080` is the default `base_url` for the client in the quickstart deployment.\n",
+ "- If you're using a managed or remote deployment, make sure the base URL and any auth tokens are correct.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8c15ab93",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "client = NeMoMicroservices(\n",
+ " base_url=\"http://localhost:8080\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "74d72ef7",
+ "metadata": {},
+ "source": [
+ "NeMo DataStore is launched as one of the services, and we'll use it to manage our storage, so we set the following configuration:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ab037a3a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "datastore_config = {\n",
+ " \"endpoint\": \"http://localhost:3000/v1/hf\",\n",
+ " \"token\": \"placeholder\",\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2d66c819",
+ "metadata": {},
+ "source": [
+ "## 📥 Load input data\n",
+ "\n",
+ "Safe Synthesizer learns the patterns and correlations in your input dataset to produce synthetic data with similar properties. For this tutorial, we will use a small public sample dataset. Replace it with your own data if desired.\n",
+ "\n",
+ "The sample dataset used here is a set of women's clothing reviews, including age, product category, rating, and review text. Some of the reviews contain Personally Identifiable Information (PII), such as height, weight, age, and location."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "daa955b6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# !uv pip install kagglehub scikit-learn tabulate"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7204f213",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import kagglehub\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Download latest version\n",
+ "path = kagglehub.dataset_download(\"nicapotato/womens-ecommerce-clothing-reviews\")\n",
+ "raw_df = pd.read_csv(f\"{path}/Womens Clothing E-Commerce Reviews.csv\", index_col=0)\n",
+ "raw_df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c6c331b7",
+ "metadata": {},
+ "source": [
+ "We create a holdout dataset that will only be used for evaluating the final classifiers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "162876c3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "\n",
+ "df, test_df = train_test_split(raw_df, test_size=0.2, random_state=42)\n",
+ "\n",
+ "print(f\"Original df length: {len(raw_df)}\")\n",
+ "print(f\"Training df length: {len(df)}\")\n",
+ "print(f\"Testing df length: {len(test_df)}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "87d72c68",
+ "metadata": {},
+ "source": [
+ "## 🏗️ Create a Safe Synthesizer job\n",
+ "\n",
+ "The `SafeSynthesizerBuilder` provides a fluent interface to configure and submit jobs.\n",
+ "\n",
+ "The following code creates and submits a job:\n",
+ "- `SafeSynthesizerBuilder(client)`: initialize with the NeMo Microservices client.\n",
+ "- `.from_data_source(df)`: set the input data source.\n",
+ "- `.with_datastore(datastore_config)`: configure model artifact storage.\n",
+ "- `.with_replace_pii()`: enable automatic replacement of PII.\n",
+ "- `.synthesize()`: train and generate synthetic data.\n",
+ "- `.with_generate(num_records=15000)`: request 15,000 synthetic records.\n",
+ "- `.create_job()`: submit the job to the platform.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "85d9de56",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "job = (\n",
+ " SafeSynthesizerBuilder(client)\n",
+ " .from_data_source(df)\n",
+ " .with_datastore(datastore_config)\n",
+ " .with_replace_pii()\n",
+ " .synthesize()\n",
+ " .with_generate(num_records=15000)\n",
+ " .create_job()\n",
+ ")\n",
+ "\n",
+ "print(f\"job_id = {job.job_id}\")\n",
+ "job.wait_for_completion()\n",
+ "\n",
+ "print(f\"Job finished with status {job.fetch_status()}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fa2eacb2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# If your notebook shuts down, that's okay: the job keeps running on the microservices platform.\n",
+ "# You can reattach to the same job by uncommenting the following snippet and filling in the\n",
+ "# job ID from the previous cell's output.\n",
+ "\n",
+ "# from nemo_microservices.beta.safe_synthesizer.sdk.job import SafeSynthesizerJob\n",
+ "# job = SafeSynthesizerJob(job_id=\"\", client=client)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "285d4a9d",
+ "metadata": {},
+ "source": [
+ "## 👀 View synthetic data\n",
+ "\n",
+ "After the job completes, fetch the generated synthetic dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7f25574a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fetch the synthetic data created by the job\n",
+ "synthetic_df = job.fetch_data()\n",
+ "synthetic_df\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2b25f152",
+ "metadata": {},
+ "source": [
+ "## 📊 View evaluation report\n",
+ "\n",
+ "An evaluation comparing the synthetic data to the input data is performed automatically. You can:\n",
+ "\n",
+ "- **Inspect key scores**: overall synthetic data quality and privacy.\n",
+ "- **Download the full HTML report**: includes charts and detailed metrics.\n",
+ "- **Display the report inline**: useful when viewing in notebook environments.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7b691127",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Print selected information from the job summary\n",
+ "summary = job.fetch_summary()\n",
+ "print(\n",
+ " f\"Synthetic data quality score (0-10, higher is better): {summary.synthetic_data_quality_score}\"\n",
+ ")\n",
+ "print(f\"Data privacy score (0-10, higher is better): {summary.data_privacy_score}\")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "39e62ea9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Download the full evaluation report to your local machine\n",
+ "job.save_report(\"evaluation_report.html\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "45f7e22b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fetch and display the full evaluation report inline\n",
+ "# job.display_report_in_notebook()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dd1e4925-3620-4b31-bc17-16f74d10fbb5",
+ "metadata": {},
+ "source": [
+ "## 🧪 Extrinsic Evaluation\n",
+ "\n",
+ "This section details the **extrinsic evaluation** process, where the quality of the synthetic data is assessed based on how well a model trained on it performs on a real-world task. This comparison is critical for validating the synthetic data's utility.\n",
+ "\n",
+ "- **Train Benchmark Model**: A model is trained on the **original training split** to establish a performance baseline.\n",
+ "- **Train Synthetic Model**: A second model, using the same structure, is trained on the **entire synthetic dataset**.\n",
+ "- **Compare Performance**: Both models are evaluated against the same **fixed holdout test set** ($X_{\\text{test}}$, $y_{\\text{test}}$).\n",
+ "- **Inspect Key Metrics**: The comparison focuses on key metrics like **ROC AUC** and **F1-Score** to determine if the synthetic model performs comparably to the benchmark."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "37b6df30-6627-4a40-8604-e905ada571b7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# This cell defines the scikit-learn pipeline used for the classification task.\n",
+ "from sklearn.feature_extraction.text import TfidfVectorizer\n",
+ "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
+ "from sklearn.compose import ColumnTransformer\n",
+ "from sklearn.pipeline import Pipeline\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "\n",
+ "X_train = df.drop('Recommended IND', axis=1)\n",
+ "y_train = df['Recommended IND']\n",
+ "\n",
+ "X_train['Review Text'] = X_train['Review Text'].fillna('')\n",
+ "X_train['Title'] = X_train['Title'].fillna('')\n",
+ "\n",
+ "X_test = test_df.drop('Recommended IND', axis=1)\n",
+ "y_test = test_df['Recommended IND']\n",
+ "\n",
+ "X_test['Review Text'] = X_test['Review Text'].fillna('')\n",
+ "X_test['Title'] = X_test['Title'].fillna('')\n",
+ "\n",
+ "text_features = ['Review Text']\n",
+ "numerical_features = ['Age', 'Rating', 'Positive Feedback Count']\n",
+ "categorical_features = ['Division Name', 'Department Name', 'Class Name']\n",
+ "\n",
+ "text_transformer = TfidfVectorizer(stop_words='english', max_features=5000)\n",
+ "numerical_transformer = StandardScaler()\n",
+ "categorical_transformer = OneHotEncoder(handle_unknown='ignore') \n",
+ "\n",
+ "preprocessor = ColumnTransformer(\n",
+ " transformers=[\n",
+ " ('text', text_transformer, text_features[0]), \n",
+ " ('num', numerical_transformer, numerical_features),\n",
+ " ('cat', categorical_transformer, categorical_features)\n",
+ " ],\n",
+ " remainder='drop' \n",
+ ")\n",
+ "\n",
+ "model = LogisticRegression(solver='liblinear', random_state=42)\n",
+ "\n",
+ "full_pipeline = Pipeline(steps=[\n",
+ " ('preprocessor', preprocessor),\n",
+ " ('classifier', model)\n",
+ "])\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ee747c80-d42f-4ec5-b27b-2b2462436b92",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Train and evaluate a benchmark model pipeline, storing its performance metrics.\n",
+ "from sklearn.metrics import classification_report, accuracy_score, roc_auc_score\n",
+ "\n",
+ "original_pipeline = full_pipeline \n",
+ "print(f\"\\n--- Training Benchmark Model on Original Data ({len(X_train)} rows) ---\")\n",
+ "original_pipeline.fit(X_train, y_train)\n",
+ "\n",
+ "y_pred_original = original_pipeline.predict(X_test)\n",
+ "y_prob_original = original_pipeline.predict_proba(X_test)[:, 1]\n",
+ "\n",
+ "results = {}\n",
+ "results['Original'] = {\n",
+ " 'Accuracy': accuracy_score(y_test, y_pred_original),\n",
+ " 'ROC AUC': roc_auc_score(y_test, y_prob_original),\n",
+ " 'Classification Report': classification_report(y_test, y_pred_original, output_dict=True)\n",
+ "}\n",
+ "print(\"Benchmark training and evaluation complete.\")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cf3f1d59-8c46-4d84-b813-a4adf88a3422",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Train a new model pipeline on synthetic data and evaluate it against the same test set.\n",
+ "from sklearn.base import clone\n",
+ "from sklearn.metrics import classification_report, accuracy_score, roc_auc_score\n",
+ "\n",
+ "X_synthetic = synthetic_df.drop('Recommended IND', axis=1).fillna({'Review Text': '', 'Title': ''})\n",
+ "y_synthetic = synthetic_df['Recommended IND']\n",
+ "\n",
+ "synthetic_pipeline = clone(full_pipeline) \n",
+ "\n",
+ "print(\"\\n--- Training Model on Synthetic Data ---\")\n",
+ "synthetic_pipeline.fit(X_synthetic, y_synthetic)\n",
+ "\n",
+ "y_pred_synthetic = synthetic_pipeline.predict(X_test)\n",
+ "y_prob_synthetic = synthetic_pipeline.predict_proba(X_test)[:, 1]\n",
+ "\n",
+ "results['Synthetic'] = {\n",
+ " 'Accuracy': accuracy_score(y_test, y_pred_synthetic),\n",
+ " 'ROC AUC': roc_auc_score(y_test, y_prob_synthetic),\n",
+ " 'Classification Report': classification_report(y_test, y_pred_synthetic, output_dict=True)\n",
+ "}\n",
+ "print(\"Synthetic training and evaluation complete.\")\n"
+ ]
+ },
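+ {
+ "cell_type": "markdown",
+ "id": "f2b94c07",
+ "metadata": {},
+ "source": [
+ "Both models are now trained and scored. The step list above also calls out **F1-Score**; it isn't shown in the side-by-side table below, but it is already available in the stored classification reports, so we can read it out directly (class `1` = recommended):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a9c36d54",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# F1 for the positive class, read from the saved classification reports.\n",
+ "for name in [\"Original\", \"Synthetic\"]:\n",
+ "    f1 = results[name][\"Classification Report\"][\"1\"][\"f1-score\"]\n",
+ "    print(f\"{name} F1 (Class 1): {f1:.4f}\")"
+ ]
+ },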
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d83e681e-aac2-44d0-83cb-1d93002a725d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Compare the performance of the original and synthetic models and print a summary.\n",
+ "import pandas as pd\n",
+ "\n",
+ "print(\"\\n\" + \"=\"*60)\n",
+ "print(\" SIDE-BY-SIDE MODEL COMPARISON\")\n",
+ "print(f\" (Tested on {len(test_df)}-Row Holdout Set)\")\n",
+ "print(\"=\"*60)\n",
+ "\n",
+ "summary_data = {\n",
+ " 'Model': ['Original (Benchmark)', 'Synthetic'],\n",
+ " 'Train Size': [len(X_train), len(X_synthetic)],\n",
+ " 'Accuracy': [results['Original']['Accuracy'], results['Synthetic']['Accuracy']],\n",
+ " 'ROC AUC Score': [results['Original']['ROC AUC'], results['Synthetic']['ROC AUC']],\n",
+ " 'Precision (Class 1)': [results['Original']['Classification Report']['1']['precision'], results['Synthetic']['Classification Report']['1']['precision']],\n",
+ " 'Recall (Class 1)': [results['Original']['Classification Report']['1']['recall'], results['Synthetic']['Classification Report']['1']['recall']],\n",
+ "}\n",
+ "\n",
+ "summary_df = pd.DataFrame(summary_data).set_index('Model').T\n",
+ "summary_df.columns.name = 'Metric'\n",
+ "\n",
+ "print(summary_df.to_markdown(floatfmt=\".4f\"))\n",
+ "\n",
+ "print(\"\\n\" + \"=\"*60)\n",
+ "\n",
+ "print(\"Key Finding:\")\n",
+ "if results['Synthetic']['ROC AUC'] >= results['Original']['ROC AUC']:\n",
+ " print(\"The Synthetic Model performs AS WELL OR BETTER than the Original Benchmark.\")\n",
+ "else:\n",
+ "    print(\"The Synthetic Model's performance is lower than the Original Benchmark.\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "169f443d",
+ "metadata": {},
+ "source": [
+ "Your end result should look similar to this:\n",
+ "\n",
+ "| | Original (Benchmark) | Synthetic |\n",
+ "|:--------------------|-----------------------:|------------:|\n",
+ "| Train Size | 18,788 | 15,000 |\n",
+ "| Accuracy | 0.9404 | 0.9278 |\n",
+ "| ROC AUC Score | 0.9782 | 0.9762 |\n",
+ "| Precision (Class 1) | 0.9626 | 0.9423 |\n",
+ "| Recall (Class 1) | 0.9646 | 0.9714 |\n",
+ "\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "My Virtual Env",
+ "language": "python",
+ "name": "myenv"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.14"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}