feat: export to parquet #424

Merged: 12 commits from export-mvp into main on Mar 28, 2023
Conversation

@RogerHYang (Contributor) commented Mar 23, 2023

resolves #417
resolves #432

Clean-ups:

  1. Replaced os.path with pathlib (see the sketch after this list).
  2. Moved Event-related functions to Events.py.
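
A minimal sketch of the item-1 substitution; the paths here are hypothetical, not taken from the PR:

# Before: composing paths with os.path
import os
export_dir = os.path.join("/tmp", "phoenix", "exports")
os.makedirs(export_dir, exist_ok=True)
file_path = os.path.join(export_dir, "events.parquet")

# After: the pathlib equivalent
from pathlib import Path
export_dir = Path("/tmp") / "phoenix" / "exports"
export_dir.mkdir(parents=True, exist_ok=True)  # replaces os.makedirs
file_path = export_dir / "events.parquet"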

app/schema.graphql (outdated review thread, resolved)
src/phoenix/datasets/dataset.py (outdated review thread, resolved)
src/phoenix/server/api/types/ExportResonse.py (outdated review thread, resolved)
Comment on lines 115 to 135
def export_events_as_parquet_file(
    self,
    rows: Mapping[DatasetType, Iterable[int]],
    parquet_file: BinaryIO,
) -> None:
    """
    Given row numbers, exports dataframe subset into parquet file.
    Duplicate rows are removed.

    Parameters
    ----------
    rows: Mapping[DatasetType, Iterable[int]]
        mapping of dataset type to list of row numbers
    parquet_file: file handle
        output parquet file handle
    """
    pd.concat(
        dataset.export_events(rows.get(dataset_type, ()))
        for dataset_type, dataset in self.__datasets.items()
        if dataset is not None
    ).to_parquet(parquet_file, index=False)
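
For orientation, a caller might exercise this method roughly as follows. This is a hedged sketch: `model` (the object holding the datasets), the `DatasetType.PRIMARY` member, and the row numbers are assumptions about the surrounding Phoenix code, not taken from this PR.

from io import BytesIO

buffer = BytesIO()
model.export_events_as_parquet_file(
    {DatasetType.PRIMARY: [0, 3, 7]},  # hypothetical row numbers
    buffer,
)
buffer.seek(0)
# buffer now holds parquet bytes, readable via pandas.read_parquet(buffer)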
Contributor:

It might be important to encode the dataset type into the parquet file itself so that, if the user is exporting a cluster that contains both dataset types, they can distinguish the rows. I posed a similar question on the main platform ticket, and I think that makes sense.

Context: https://github.com/Arize-ai/arize/issues/19710

Contributor Author:

Thanks for the heads-up. Will revisit in a future PR.
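
For illustration, a minimal sketch of what the reviewer's suggestion could look like, assuming a tagging column is added to each subset before concatenation; the column name and the enum `.name` access are assumptions, not code from this PR.

import pandas as pd

def export_events_with_type(datasets, rows, parquet_file):
    # Tag each exported subset with its originating dataset type so that
    # rows from different datasets stay distinguishable after concatenation.
    frames = []
    for dataset_type, dataset in datasets.items():
        if dataset is None:
            continue
        frame = dataset.export_events(rows.get(dataset_type, ()))
        frame["dataset_type"] = dataset_type.name  # hypothetical tagging column
        frames.append(frame)
    pd.concat(frames).to_parquet(parquet_file, index=False)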

src/phoenix/datasets/dataset.py (outdated review thread, resolved)
Comment on lines 88 to 90
self.__original_column_indices = [
    dataframe.columns.get_loc(column_name) for column_name in original_column_names
]
Contributor:

How is this information passed back to the server runtime?

Contributor Author:

Good call-out. Currently this does nothing, given that datasets are initialized (and validated) twice. Will remove it from this PR and revisit in the future.
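
For reference, pandas' Index.get_loc returns the integer position of a label, so the quoted snippet records where each original column sat in the dataframe. A toy illustration with made-up column names, not code from the PR:

import pandas as pd

df = pd.DataFrame(columns=["prediction_id", "timestamp", "feature_a"])
positions = [df.columns.get_loc(name) for name in ["timestamp", "feature_a"]]
print(positions)  # [1, 2]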

@RogerHYang RogerHYang merged commit 1c4e4c8 into main Mar 28, 2023
@RogerHYang RogerHYang deleted the export-mvp branch March 28, 2023 20:33
Development

Successfully merging this pull request may close these issues.

[gql] export cluster mutation
file-system persistence of clusters