docs: add dataset example #38

Merged · 6 commits · Nov 21, 2022
5 changes: 2 additions & 3 deletions .gitignore
```diff
@@ -4,6 +4,5 @@ node_modules
 dist
 *__pycache__*
 **/.ipynb_checkpoints/
-
-# PyTest cache
-**/.pytest_cache
+examples/fixtures/*
+**/.pytest_cache
```
201 changes: 201 additions & 0 deletions examples/dataset.ipynb
@@ -0,0 +1,201 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0364599c",
"metadata": {},
"source": [
"# Phoenix Dataset Object\n",
"\n",
"This short tutorial demonstrates how to use the 🔥🐦 Phoenix `Dataset` object.\n",
"\n",
"This object is currently composed of a dataframe and a schema. Data can be loaded from:\n",
"* a pandas DataFrame directly\n",
"* local files: CSV and HDF5"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "28f8890a",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from phoenix.datasets import Dataset, Schema, EmbeddingColumnNames"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02c7d1e2",
"metadata": {},
"outputs": [],
"source": [
"test_filename = \"NLP_sentiment_classification_language_drift\"\n",
"\n",
"df1 = pd.read_csv(f\"./fixtures/{test_filename}.csv\")\n",
"df1.head()"
]
},
{
"cell_type": "markdown",
"id": "cc50b386",
"metadata": {},
"source": [
"Define the schema the same way you would in our SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae788250",
"metadata": {},
"outputs": [],
"source": [
"features = [\n",
" 'reviewer_age',\n",
" 'reviewer_gender',\n",
" 'product_category',\n",
" 'language',\n",
"]\n",
"\n",
"embedding_features = {\n",
" \"embedding_feature\": EmbeddingColumnNames(\n",
" vector_column_name=\"text_vector\", # Will be the name of the embedding feature in the app\n",
" data_column_name=\"text\",\n",
" ),\n",
"}\n",
"\n",
"# Define a Schema() object for Arize to pick up data from the correct columns for logging\n",
"schema = Schema(\n",
" prediction_id_column_name=\"prediction_id\",\n",
" timestamp_column_name=\"prediction_ts\",\n",
" prediction_label_column_name=\"pred_label\",\n",
" actual_label_column_name=\"label\",\n",
" feature_column_names=features,\n",
" embedding_feature_column_names=embedding_features\n",
")\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "d35bed3c",
"metadata": {},
"source": [
"You are ready to define a `Dataset`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af7450a5",
"metadata": {},
"outputs": [],
"source": [
"# Defined directly from dataframe\n",
"dataset1 = Dataset(df1, schema)\n",
"dataset2 = Dataset.from_dataframe(df1, schema)\n",
"# Defined from csv\n",
"dataset3 = Dataset.from_csv(f\"./fixtures/{test_filename}.csv\", schema=schema)\n",
"# Defined from hdf5\n",
"dataset4 = Dataset.from_hdf(f\"./fixtures/{test_filename}.hdf5\", schema=schema, key=\"training\")"
]
},
{
"cell_type": "markdown",
"id": "4ce754b1",
"metadata": {},
"source": [
"The following is an issue we need to investigate. All four datasets compare as equal, which seems fine at first glance. However, when loading from a CSV file, the embeddings are read as strings (an issue to fix has been filed), so the following condition should not be True:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ca50f81",
"metadata": {},
"outputs": [],
"source": [
"dataset1 == dataset2 == dataset3 == dataset4"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a9f0f11",
"metadata": {},
"outputs": [],
"source": [
"df2 = df1.copy()\n",
"df2.rename(\n",
" columns={\n",
" \"prediction_ts\":\"timestamp\",\n",
" \"label\":\"actual_label\"\n",
" },\n",
" inplace=True\n",
")\n",
"df2.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5ec5454",
"metadata": {},
"outputs": [],
"source": [
"# Define a Schema() object for Arize to pick up data from the correct columns for logging\n",
"schema = Schema(\n",
" prediction_id_column_name=\"prediction_id\",\n",
" timestamp_column_name=\"timestamp\",\n",
" prediction_label_column_name=\"pred_label\",\n",
" actual_label_column_name=\"actual_label\",\n",
" feature_column_names=features,\n",
" embedding_feature_column_names=embedding_features\n",
")\n",
"dataset5 = Dataset(df2, schema)"
]
},
{
"cell_type": "markdown",
"id": "f74d293f",
"metadata": {},
"source": [
"This is another issue: here we have different dataframes with different schemas, yet the Dataset objects still compare as equal?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8128eda1",
"metadata": {},
"outputs": [],
"source": [
"dataset1 == dataset5"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
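The CSV string-embedding issue the notebook describes can be reproduced with pandas alone, independent of Phoenix. A minimal sketch (the column names are illustrative, not from the Phoenix API): a list-valued embedding column survives in memory, but a CSV round trip serializes it to its string repr, so the reloaded frame no longer holds comparable vector values.

```python
import io

import pandas as pd

# A tiny frame with a list-valued "text_vector" column, standing in for an embedding.
df = pd.DataFrame({"text": ["hello"], "text_vector": [[0.1, 0.2, 0.3]]})

# Round-trip through CSV: pandas writes the list as its string repr
# and read_csv reads it back as a plain string.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_csv = pd.read_csv(buf)

print(type(df.loc[0, "text_vector"]).__name__)      # list
print(type(df_csv.loc[0, "text_vector"]).__name__)  # str
```

This is why two `Dataset` objects built from the same data, one in memory and one via CSV, should not compare as equal: the embedding columns hold values of different types.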