docs: add dataset example #38

Merged · 6 commits · Nov 21, 2022
5 changes: 2 additions & 3 deletions .gitignore
```diff
@@ -4,6 +4,5 @@ node_modules
 dist
 *__pycache__*
 **/.ipynb_checkpoints/
-
-# PyTest cache
-**/.pytest_cache
+examples/fixtures/*
+**/.pytest_cache
```
201 changes: 201 additions & 0 deletions examples/dataset.ipynb
@@ -0,0 +1,201 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0364599c",
"metadata": {},
"source": [
"# Phoenix Dataset Object\n",
"\n",
"This short tutorial demonstrates how to use the 🔥🐦 Phoenix `Dataset` object.\n",
"\n",
"This object is currently composed of a dataframe and a schema. Data can be loaded from:\n",
"* a pandas DataFrame directly\n",
"* local files: CSV and HDF5"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "28f8890a",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from phoenix.datasets import Dataset, Schema, EmbeddingColumnNames"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02c7d1e2",
"metadata": {},
"outputs": [],
"source": [
"test_filename = \"NLP_sentiment_classification_language_drift\"\n",
"\n",
"df1 = pd.read_csv(f\"./fixtures/{test_filename}.csv\")\n",
"df1.head()"
]
},
{
"cell_type": "markdown",
"id": "cc50b386",
"metadata": {},
"source": [
"Define the schema the same way you would in our SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae788250",
"metadata": {},
"outputs": [],
"source": [
"features = [\n",
" 'reviewer_age',\n",
" 'reviewer_gender',\n",
" 'product_category',\n",
" 'language',\n",
"]\n",
"\n",
"embedding_features = {\n",
" \"embedding_feature\": EmbeddingColumnNames(\n",
" vector_column_name=\"text_vector\", # Will be the name of the embedding feature in the app\n",
" data_column_name=\"text\",\n",
" ),\n",
"}\n",
"\n",
"# Define a Schema() object for Arize to pick up data from the correct columns for logging\n",
"schema = Schema(\n",
" prediction_id_column_name=\"prediction_id\",\n",
" timestamp_column_name=\"prediction_ts\",\n",
" prediction_label_column_name=\"pred_label\",\n",
" actual_label_column_name=\"label\",\n",
" feature_column_names=features,\n",
" embedding_feature_column_names=embedding_features\n",
")\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "d35bed3c",
"metadata": {},
"source": [
"You are ready to define a `Dataset`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af7450a5",
"metadata": {},
"outputs": [],
"source": [
"# Defined directly from dataframe\n",
"dataset1 = Dataset(df1, schema)\n",
"dataset2 = Dataset.from_dataframe(df1, schema)\n",
"# Defined from csv\n",
"dataset3 = Dataset.from_csv(f\"./fixtures/{test_filename}.csv\", schema=schema)\n",
"# Defined from hdf5\n",
"dataset4 = Dataset.from_hdf(f\"./fixtures/{test_filename}.hdf5\", schema=schema, key=\"training\")"
]
},
{
"cell_type": "markdown",
"id": "4ce754b1",
"metadata": {},
"source": [
"The following is an issue we need to investigate. All four datasets compare as equal, which seems fine at first glance. However, when loading from a CSV file, the embeddings are read as strings (an issue to fix has been filed), so the following condition should not be True:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ca50f81",
"metadata": {},
"outputs": [],
"source": [
"dataset1 == dataset2 == dataset3 == dataset4"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a9f0f11",
"metadata": {},
"outputs": [],
"source": [
"df2 = df1.copy()\n",
"df2.rename(\n",
" columns={\n",
" \"prediction_ts\":\"timestamp\",\n",
" \"label\":\"actual_label\"\n",
" },\n",
" inplace=True\n",
")\n",
"df2.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5ec5454",
"metadata": {},
"outputs": [],
"source": [
"# Define a Schema() object for Arize to pick up data from the correct columns for logging\n",
"schema = Schema(\n",
" prediction_id_column_name=\"prediction_id\",\n",
" timestamp_column_name=\"timestamp\",\n",
" prediction_label_column_name=\"pred_label\",\n",
" actual_label_column_name=\"actual_label\",\n",
" feature_column_names=features,\n",
" embedding_feature_column_names=embedding_features\n",
")\n",
"dataset5 = Dataset(df2, schema)"
]
},
{
"cell_type": "markdown",
"id": "f74d293f",
"metadata": {},
"source": [
"This is another issue: here we have different dataframes with different schemas, yet the Dataset objects still compare as equal?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8128eda1",
"metadata": {},
"outputs": [],
"source": [
"dataset1 == dataset5"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
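The CSV string-embedding issue the notebook describes can be reproduced with pandas alone, independent of Phoenix. A minimal sketch (the column names are illustrative, not from the Phoenix API): a list-valued embedding column survives in memory, but a CSV round trip serializes it to its string repr, so the reloaded frame no longer holds comparable vector values.

```python
import io

import pandas as pd

# A tiny frame with a list-valued "text_vector" column, standing in for an embedding.
df = pd.DataFrame({"text": ["hello"], "text_vector": [[0.1, 0.2, 0.3]]})

# Round-trip through CSV: pandas writes the list as its string repr
# and read_csv reads it back as a plain string.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_csv = pd.read_csv(buf)

print(type(df.loc[0, "text_vector"]).__name__)      # list
print(type(df_csv.loc[0, "text_vector"]).__name__)  # str
```

This is why two `Dataset` objects built from the same data, one in memory and one via CSV, should not compare as equal: the embedding columns hold values of different types.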