
🗺️ Persistence #1689

Closed

mikeldking opened this issue Oct 31, 2023 · 7 comments
@mikeldking (Contributor)
mikeldking commented Oct 31, 2023

As a user of Phoenix, I would like a persistent backend - notably a way to

  • Resume Phoenix on a previously collected set of data
  • Keep track of evaluation results

Spikes

  • Server
  • UI
  • Metrics / Observability
  • Infra
  • Remote Session management
  • Performance
  • Notebook-Side Persistence
  • Docs
  • Breaking Changes
  • Testing

Open Questions

  • Storage of embeddings
  • Controlling the SQLite version explicitly
  • Is trace retention / management needed?
@github-project-automation github-project-automation bot moved this to 📘 Todo in phoenix Oct 31, 2023
@mikeldking mikeldking self-assigned this Oct 31, 2023
@mikeldking mikeldking changed the title 🗺️ data persistence 🗺️ Trace and Evals Persistence Oct 31, 2023
@aazizisoufiane

I think that, at least for the second need (keeping track of evaluation results), you can see here: https://docs.arize.com/phoenix/integrations/llamaindex#traces

@axiomofjoy (Contributor)

axiomofjoy commented Dec 22, 2023

I am investigating using Parquet as the file format. Here's a snippet to add custom metadata to a Parquet file:

"""
Snippet to write custom metadata to a single parquet file.

NB: "Pyarrow maps the file-wide metadata to a field in the table's schema
named metadata. Regrettably there is not (yet) documentation on this."

From https://stackoverflow.com/questions/52122674/how-to-write-parquet-metadata-with-pyarrow
"""

import json

import pandas as pd
import pyarrow
from pyarrow import parquet

dataframe = pd.DataFrame(
    {
        "field0": [1, 2, 3],
        "eval0": ["a", "b", "c"],
    }
)
OPENINFERENCE_METADATA_KEY = b"openinference"
openinference_metadata = {
    "version": "v0",
    "evaluation_ids": ["eval0", "eval1"],
}

original_table = pyarrow.Table.from_pandas(dataframe)
print("Metadata:")
print("=========")
print(original_table.schema.metadata)
print()

updated_write_table = original_table.replace_schema_metadata(
    {
        OPENINFERENCE_METADATA_KEY: json.dumps(openinference_metadata),
        **original_table.schema.metadata,
    }
)
parquet.write_table(updated_write_table, "test.parquet")
updated_read_table = parquet.read_table("test.parquet")
print("Metadata:")
print("=========")
print(updated_read_table.schema.metadata)
print()

updated_metadata = updated_read_table.schema.metadata
updated_metadata.pop(OPENINFERENCE_METADATA_KEY)
assert updated_metadata == original_table.schema.metadata

@axiomofjoy (Contributor)

axiomofjoy commented Dec 22, 2023

Notes on Parquet and PyArrow:

  • Large "row groups" (on the order of a GB) are recommended for fast analytical queries. Many of our datasets will be far smaller than that, so frequently writing small Parquet files may have performance consequences at query time.
  • Parquet files are immutable. As far as I can tell, there is no notion of updating just a file's metadata in place.
  • It's possible to augment the metadata of individual Parquet files (see above). Another pattern, used by Spark and Dask, is to write a separate _common_metadata file describing all the Parquet files in an Arrow dataset (a single metadata file for multiple Parquet files).
  • Arrow supports directory partitioning. It looks straightforward to partition on, for example, date (see the sketch below).
  • Arrow also provides nice file system interfaces to the various cloud storage providers.
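
A minimal sketch of date-based directory partitioning with pyarrow.dataset (the "traces" directory and "date" column are purely illustrative, not anything we actually use):

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_pandas(
    pd.DataFrame(
        {
            "trace_id": ["t0", "t1", "t2"],
            "date": ["2023-12-21", "2023-12-22", "2023-12-22"],
        }
    )
)

# Write one sub-directory per date, e.g. traces/date=2023-12-22/part-0.parquet
ds.write_dataset(
    table,
    "traces",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive"),
    existing_data_behavior="overwrite_or_ignore",
)

# Read the dataset back, pruning partitions via a filter on the partition key
dataset = ds.dataset("traces", format="parquet", partitioning="hive")
print(dataset.to_table(filter=ds.field("date") == "2023-12-22").to_pandas())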

@mikeldking mikeldking moved this to Todo in phoenix roadmap Dec 28, 2023
@mikeldking mikeldking changed the title 🗺️ Trace and Evals Persistence 🗺️ Remote Phoenix support and persistence Jan 26, 2024
@RogerHYang RogerHYang moved this from 📘 Todo to 👨‍💻 In progress in phoenix Mar 14, 2024
@stdweird
@axiomofjoy what kind of backends will you target? I see some code related to file backends, but why not SQL databases (given .to_sql for dataframes, and probably other/better methods), or the native JSON support in most solutions? Given Phoenix's coupling with RAG, most people will already have a vector DB that should work.

@axiomofjoy (Contributor)

@mikeldking Can you provide an update for @stdweird?

@mikeldking (Contributor, Author)

@stdweird - good point. We see some limitations with SQL backends, so we are currently benchmarking different backends. In general we will probably expose a storage interface so you can choose your own storage mechanism, but for now we are keeping the backend pretty lean and figuring out the interface as we go (a rough sketch of what such an interface could look like is below).
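
Purely as an illustration (not Phoenix's actual API), a pluggable storage interface could look roughly like this, with one implementation per backend (local files, SQL, etc.):

from typing import Iterable, Protocol

import pandas as pd


class TraceStore(Protocol):
    """Hypothetical storage interface; names and signatures are placeholders."""

    def write_spans(self, spans: pd.DataFrame) -> None:
        """Persist a batch of spans."""
        ...

    def write_evaluations(self, evaluations: pd.DataFrame) -> None:
        """Persist evaluation results keyed by span or trace id."""
        ...

    def load_spans(self) -> Iterable[pd.DataFrame]:
        """Stream previously collected spans back into a session."""
        ...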

@mikeldking mikeldking changed the title 🗺️ Remote Phoenix support and persistence 🗺️ Persistence Mar 29, 2024
@mikeldking (Contributor, Author)

🥳

@github-project-automation github-project-automation bot moved this from 👨‍💻 In progress to ✅ Done in phoenix May 13, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in phoenix roadmap May 13, 2024