Merge pull request #389 from Arize-ai/docs
docs: sync 3-16-2023
Showing 12 changed files with 290 additions and 80 deletions.
# Open Inference

## Overview

The open inference data format is designed to provide an open, interoperable data format for model inference files. Our goal is for modern ML systems, such as model servers and ML Observability platforms, to interface with each other using a common data format.

The goal is to define a specification for production inference logs that can be used on top of many file formats, including Parquet, Avro, CSV, and JSON. It will also support future formats such as Lance.

<figure><img src="https://lh5.googleusercontent.com/tI-YqWsq9RbkrlhfDzbyVHMj45wPvtvyoEiwfkL-s6UM0eR2Xb2AN6YqR4jWaRrFZdBlgrsw_cJsMWbQBlfFMNELWSmYcM2b0B2WFYNVGhH_iS9I1R7QNzd_7XIpTUrcl8DP0lKIi1fEl--Ad81FZb4" alt=""><figcaption><p><strong>Inference Table in Inference Store</strong></p></figcaption></figure>

An inference store is a common approach to storing model inferences, typically in a data lake or data warehouse.
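
As a quick sketch of that idea (the tiny table and column names below are illustrative, not part of the specification), the same inference records can be serialized to several of these formats with standard pandas tooling:

```python
import pandas as pd

# A toy inference table; the column names are illustrative only.
inferences = pd.DataFrame(
    {
        "prediction_id": ["p1", "p2"],
        "prediction_label": ["approve", "deny"],
        "actual_label": ["approve", "approve"],
    }
)

# The same logical records can be written to multiple file formats.
inferences.to_parquet("inferences.parquet")   # requires pyarrow or fastparquet
inferences.to_csv("inferences.csv", index=False)
inferences.to_json("inferences.json", orient="records")
```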

### Model Types Covered

**NLP**

* Text Generative - Prompt and Response
* Text Classification
* NER Span Categorization

**Tabular:**

* Regression
* Classification
* Classification + Score
* Multi-Classification
* Ranking
* Multi-Output/Label
* Time Series Forecasting

**CV**

* Classification
* Bounding Box
* Segmentation
### Inferences Overview  | ||
|
||
In an inference store the prediction ID is a unique identifier for a model prediction event. The prediction ID defines the inputs to the model, model outputs, latently linked ground truth (actuals), meta data (tags) and model internals (embeddings and/or SHAP).  | ||
|
||
In this section we will review a flat (non nested structure) prediction event, the following sections will cover how to handle nested structures. | ||
|
||
<figure><img src="https://lh6.googleusercontent.com/tMg0bwe276Z7d-SGuieLBP5J4NPqyZCPSsc1vfSl_KCG_soLufPJ98syu4ijsFoZfa2BMejjtarmVS3G9UbJvwHhvHzBsJL_xGNV6HrOi3ai4XbKQiaP89MrroPiuKeAVjM-9soaJb4_mbDkuhVfMwI" alt=""><figcaption><p>Prediction Inference Event Data</p></figcaption></figure> | ||

<figure><img src="https://lh6.googleusercontent.com/F1ll2Q1FfyaNmgQbTlWHIr-BrhRO9dvEbPhVC8Sqeq9bxcrh6ZCw-NfdTtsbdeis2wm_opRxblKdbNA_MYkzkian6qWBRme-JM1JbgUuBFPcSCrZHkGnfQrhdzioG7JlgIsY8TV0uERbqDjHPsMLl7w" alt=""><figcaption><p>LLM Inference Data</p></figcaption></figure>

A prediction event can represent a prompt/response pair for LLMs, where the conversation ID maintains the thread of the conversation.
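
A sketch of what such prompt/response events might look like, with a conversation ID tying two turns together (column names are hypothetical, not part of the specification):

```python
import pandas as pd

# Two LLM prompt/response events that belong to the same conversation thread.
llm_events = pd.DataFrame(
    [
        {
            "prediction_id": "a1",
            "conversation_id": "conv-42",
            "prompt": "Summarize this support ticket ...",
            "response": "The customer reports a login failure ...",
            "prediction_timestamp": "2023-03-16T12:00:00Z",
        },
        {
            "prediction_id": "a2",
            "conversation_id": "conv-42",  # same conversation thread
            "prompt": "What should the agent do next?",
            "response": "Escalate to tier-2 support ...",
            "prediction_timestamp": "2023-03-16T12:01:30Z",
        },
    ]
)
```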

<figure><img src="https://lh4.googleusercontent.com/PudqLcuRTJnYhXXS0hZj7XCA14TDWv-3WkUKyoSa0drmFZcG0KZo0WPQ8ieavNz-RuHUzvva-Ei1Z4D2kL0ROpkYkf5TlkNZr-sZBo1n5YmcBtEd2qGcxkDhW6dHahwirRaeOuatZzlaN6aa7Ym2jDI" alt=""><figcaption><p>Core Model Inference Data</p></figcaption></figure>

The core components of an inference event are:

* Model input (features/prompt)
* Model output (prediction/response)
* Ground truth (actuals or latent actuals)
* Model ID
* Model Version
* Environment
* Conversation ID

Additional data that may be included:

* Metadata
* SHAP values
* Embeddings
* Raw links to data
* Bounding boxes

The fundamental storage unit in an inference store is an inference event. These events are stored in groups that are logically separated by model ID, model version, and environment.

<figure><img src="https://lh5.googleusercontent.com/V_xkGRjd6sa54rbJjtIrtp8pj-T89ZR-ev2TS4Ri0Mbz80V95sqORa482oCohD-fVtzI2qftoer75BBgPyLPDLaP9n4d6458Ahzo55sRfDJv8VwpqrcflYiVjyKQ-8d9Ja6lV91-fSkuuCEwnBy0-Bs" alt=""><figcaption><p>Model Data and Version</p></figcaption></figure>

Environment describes where the model is running. For example, we use the environments training, validation/test, and production to describe the different places a model runs.

The production environment is commonly a streaming-like environment. It is streaming in the sense that a production dataset has no beginning or end: data can be added to it continuously. In most production use cases, data is added in small mini-batches or in real time, event by event.

The training and validation environments are commonly used to send data in batches. These batches define a group of data for analysis purposes. It is common in validation/test and training for the timestamp to be optional.

**Note**: historical backtesting comparisons on time series data can require non-runtime timestamp settings for training and validation.

The model ID is a unique, human-readable identifier for a model within a workspace; it completely separates the model data between logical instances.

The model version is a logical separator for metrics and analysis, used to look at different builds of a model. A model version can capture common changes such as weight updates and feature additions.
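
As a sketch of how that logical separation might be reflected on disk (the layout and column names below are illustrative, not part of the specification), inference events could be written as Parquet partitioned by model ID, model version, and environment:

```python
import pandas as pd

# Illustrative inference events for one model logged in two environments.
events = pd.DataFrame(
    {
        "prediction_id": ["p1", "p2", "p3"],
        "model_id": ["credit-risk"] * 3,
        "model_version": ["v3"] * 3,
        "environment": ["production", "production", "training"],
        "prediction_label": ["approve", "deny", "approve"],
    }
)

# Partitioning by these keys keeps each logical group of events separate
# (requires the pyarrow engine).
events.to_parquet(
    "inference_store/",
    partition_cols=["model_id", "model_version", "environment"],
)
```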

### Ground Truth

Unlike infrastructure observability, the inference store needs some mutability: there needs to be some way in which ground truth is added or updated for a prediction event.

Ground truth is required in the data in order to analyze performance metrics such as precision, recall, AUC, log loss, and accuracy.

Latent ground truth data may need to be “joined” to a prediction ID to enable performance visualization. Phoenix requires ground truth to be pre-joined to the prediction data. In an ML Observability system such as Arize, the joining of ground truth is typically done by the system itself.
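
A minimal sketch of that pre-joining step, assuming predictions and latent actuals arrive as separate tables keyed by prediction ID (the column names are hypothetical):

```python
import pandas as pd

# Predictions logged at serving time (no ground truth yet).
predictions = pd.DataFrame(
    {"prediction_id": ["p1", "p2"], "prediction_label": ["approve", "deny"]}
)

# Latent ground truth that arrives later, keyed by the same prediction ID.
actuals = pd.DataFrame(
    {"prediction_id": ["p1", "p2"], "actual_label": ["approve", "approve"]}
)

# Left join so every prediction is kept even if its actual has not arrived yet.
joined = predictions.merge(actuals, on="prediction_id", how="left")
```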

<figure><img src="../.gitbook/assets/Screenshot 2023-03-15 at 4.31.32 PM.png" alt=""><figcaption><p>Latent Ground Truth</p></figcaption></figure>

The image above shows a common use case in ML Observability in which latent ground truth is received by a system and linked back to the original prediction based on a prediction ID.

<figure><img src="../.gitbook/assets/Screenshot 2023-03-15 at 4.33.25 PM.png" alt=""><figcaption><p>Latent MetaData (Tags)</p></figcaption></figure>

In addition to ground truth, latent metadata also needs to be linked to a prediction ID. Latent metadata can be critical for analyzing model results using additional data tags linked to the original prediction ID.

Examples of Metadata (Tags):

* Loan default amount
* Loan status
* Revenue from conversion or click
* Server region

### Nested Predictions (Flattening Hierarchy)

Nested predictions occur for image bounding boxes, NLP NER spans, and image segmentation.

The picture above shows how a nested set of detections can occur for a single image in the prediction body, with bounding boxes within the image itself.

A model may have multiple inputs, with different embeddings and images for each, generating a prediction class. An example might be an insurance claim event with multiple images and a single prediction estimate for the claim.

The above prediction shows hierarchical data. The current version of Phoenix is designed to ingest a flat structure, so teams will need to flatten the above hierarchy. An example of flattening is below.

<figure><img src="https://lh4.googleusercontent.com/eBFKL2_rOcikfJC47gzwCz6XZ0ktNDU9ILc5A6Owj7_5hLH9ep_yDOkFF-rNMePozy04ERj6RkB-RpclWf3BZLv_ZQ0SGPk4jpBIGW-BzoX8Yc6n2tR6C1bHiZqBeA_hCbV8AbCUR4-rbL8-v2YSCJY" alt=""><figcaption><p>Hierarchical Data Flattened</p></figcaption></figure>

The example above shows an exploded representation of the hierarchical data. \<todo fix, once team reviews approach internally>
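
A minimal sketch of one way to flatten such a nested detection event, assuming one output row per bounding box with the parent prediction ID repeated on every row (the column names are hypothetical):

```python
import pandas as pd

# A nested detection event: one image with several bounding boxes.
nested_event = {
    "prediction_id": "img-001",
    "image_url": "s3://bucket/claims/img-001.png",
    "detections": [
        {"bbox": [10, 20, 110, 220], "label": "dent", "score": 0.91},
        {"bbox": [300, 40, 420, 180], "label": "scratch", "score": 0.72},
    ],
}

# Exploded (flattened) representation: one row per detection, with the parent
# prediction ID and image URL repeated so rows can be re-grouped later.
flattened = pd.DataFrame(
    [
        {
            "prediction_id": nested_event["prediction_id"],
            "image_url": nested_event["image_url"],
            "bbox": det["bbox"],
            "label": det["label"],
            "score": det["score"],
        }
        for det in nested_event["detections"]
    ]
)
```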

### Examples: Supported Schemas

#### NLP - LLM Generative/Summarization/Translation

#### NLP - Classification

#### Regression

#### Classification

<figure><img src="../.gitbook/assets/image.png" alt=""><figcaption></figcaption></figure>

#### Classification + Score

#### Ranking

#### CV - Classification

---
description: How to fly with Phoenix
---

# Install and Import Phoenix

In your Jupyter or Colab environment, run

```
pip install arize-phoenix
```

to install Phoenix and its dependencies. Once installed, import Phoenix in your notebook with

```python
import phoenix as px
```

{% hint style="info" %}
Phoenix is supported on Python ≥3.8, <3.11.
{% endhint %}