
🗺️ Persistence #1689

Closed

mikeldking opened this issue Oct 31, 2023 · 7 comments
@mikeldking (Contributor)
mikeldking commented Oct 31, 2023

As a user of Phoenix, I would like a persistent backend - notably a way to

  • Resume Phoenix on a previously collected set of data
  • Keep track of evaluation results

Spikes

  • Server
  • UI
  • Metrics / Observability
  • Infra
  • Remote Session management
  • Performance
  • Notebook-Side Persistence
  • Docs
  • Breaking Changes
  • Testing

Open Questions

  • Storage of embeddings
  • Controlling the SQLite version explicitly
  • Is trace retention / management needed?
@github-project-automation github-project-automation bot moved this to 📘 Todo in phoenix Oct 31, 2023
@mikeldking mikeldking self-assigned this Oct 31, 2023
@mikeldking mikeldking changed the title 🗺️ data persistence 🗺️ Trace and Evals Persistence Oct 31, 2023
@aazizisoufiane

I think that, at least for the second need (keeping track of evaluation results), you can see here: https://docs.arize.com/phoenix/integrations/llamaindex#traces

@axiomofjoy (Contributor)

axiomofjoy commented Dec 22, 2023

I am investigating using Parquet as the file format. Here's a snippet to add custom metadata to a Parquet file:

"""
Snippet to write custom metadata to a single parquet file.

NB: "Pyarrow maps the file-wide metadata to a field in the table's schema
named metadata. Regrettably there is not (yet) documentation on this."

From https://stackoverflow.com/questions/52122674/how-to-write-parquet-metadata-with-pyarrow
"""

import json

import pandas as pd
import pyarrow
from pyarrow import parquet

dataframe = pd.DataFrame(
    {
        "field0": [1, 2, 3],
        "eval0": ["a", "b", "c"],
    }
)
OPENINFERENCE_METADATA_KEY = b"openinference"
openinference_metadata = {
    "version": "v0",
    "evaluation_ids": ["eval0", "eval1"],
}

original_table = pyarrow.Table.from_pandas(dataframe)
print("Metadata:")
print("=========")
print(original_table.schema.metadata)
print()

updated_write_table = original_table.replace_schema_metadata(
    {
        OPENINFERENCE_METADATA_KEY: json.dumps(openinference_metadata),
        **original_table.schema.metadata,
    }
)
parquet.write_table(updated_write_table, "test.parquet")
updated_read_table = parquet.read_table("test.parquet")
print("Metadata:")
print("=========")
print(updated_read_table.schema.metadata)
print()

updated_metadata = updated_read_table.schema.metadata
updated_metadata.pop(OPENINFERENCE_METADATA_KEY)
assert updated_metadata == original_table.schema.metadata

@axiomofjoy (Contributor)

axiomofjoy commented Dec 22, 2023

Notes on Parquet and PyArrow:

  • Large "row groups" (on the order of a GB) are recommended for fast analytical queries. Many of our datasets will be far smaller than that, so frequently writing small Parquet files may have performance consequences at query time.
  • Parquet files are immutable. As far as I can tell, there is no notion of updating just a file's metadata in place.
  • It's possible to augment the metadata of individual Parquet files (see above). Another pattern, used by Spark and Dask, is to write a separate _common_metadata file describing all the Parquet files in an Arrow dataset (a single metadata file for multiple Parquet files).
  • Arrow supports directory partitioning. It looks straightforward to partition on, for example, date (see the sketch below).
  • Arrow also provides nice file system interfaces to the various cloud storage providers.
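
A minimal sketch of date-based directory partitioning with pyarrow.dataset (the "traces" directory and "date" column are purely illustrative, not anything we actually use):

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_pandas(
    pd.DataFrame(
        {
            "trace_id": ["t0", "t1", "t2"],
            "date": ["2023-12-21", "2023-12-22", "2023-12-22"],
        }
    )
)

# Write one sub-directory per date, e.g. traces/date=2023-12-22/part-0.parquet
ds.write_dataset(
    table,
    "traces",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive"),
    existing_data_behavior="overwrite_or_ignore",
)

# Read the dataset back, pruning partitions via a filter on the partition key
dataset = ds.dataset("traces", format="parquet", partitioning="hive")
print(dataset.to_table(filter=ds.field("date") == "2023-12-22").to_pandas())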

@mikeldking mikeldking moved this to Todo in phoenix roadmap Dec 28, 2023
@mikeldking mikeldking changed the title 🗺️ Trace and Evals Persistence 🗺️ Remote Phoenix support and persistence Jan 26, 2024
@RogerHYang RogerHYang moved this from 📘 Todo to 👨‍💻 In progress in phoenix Mar 14, 2024
@stdweird
@axiomofjoy what kind of backends will you target? I see some code related to file backends, but why not SQL databases (given .to_sql for dataframes, and probably other/better methods), or the native JSON support in most solutions? Given Phoenix's coupling with RAG, most people will already have a vector DB that should work.

@axiomofjoy (Contributor)

@mikeldking Can you provide an update for @stdweird?

@mikeldking (Contributor, Author)

@stdweird - good point. We see some limitations with SQL backends, so we are currently benchmarking different backends. In general we will probably expose a storage interface so you can choose your own storage mechanism, but for now we are keeping the backend pretty lean and figuring out the interface as we go (a rough sketch of what such an interface could look like is below).
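
Purely as an illustration (not Phoenix's actual API), a pluggable storage interface could look roughly like this, with one implementation per backend (local files, SQL, etc.):

from typing import Iterable, Protocol

import pandas as pd


class TraceStore(Protocol):
    """Hypothetical storage interface; names and signatures are placeholders."""

    def write_spans(self, spans: pd.DataFrame) -> None:
        """Persist a batch of spans."""
        ...

    def write_evaluations(self, evaluations: pd.DataFrame) -> None:
        """Persist evaluation results keyed by span or trace id."""
        ...

    def load_spans(self) -> Iterable[pd.DataFrame]:
        """Stream previously collected spans back into a session."""
        ...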

@mikeldking mikeldking changed the title 🗺️ Remote Phoenix support and persistence 🗺️ Persistence Mar 29, 2024
@mikeldking (Contributor, Author)

🥳

@github-project-automation github-project-automation bot moved this from 👨‍💻 In progress to ✅ Done in phoenix May 13, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in phoenix roadmap May 13, 2024