feat: export to parquet #424
src/phoenix/core/model.py
Outdated
def export_events_as_parquet_file(
    self,
    rows: Mapping[DatasetType, Iterable[int]],
    parquet_file: BinaryIO,
) -> None:
    """
    Given row numbers, exports dataframe subset into parquet file.
    Duplicate rows are removed.

    Parameters
    ----------
    rows: Mapping[DatasetType, Iterable[int]]
        mapping of dataset type to list of row numbers
    parquet_file: file handle
        output parquet file handle
    """
    pd.concat(
        dataset.export_events(rows.get(dataset_type, ()))
        for dataset_type, dataset in self.__datasets.items()
        if dataset is not None
    ).to_parquet(parquet_file, index=False)
It might be important to encode the dataset type into the parquet file itself so that if the user is exporting a cluster that has both, they can distinguish them. I posed a similar question on the main platform ticket and I think that makes sense.
Context: https://github.com/Arize-ai/arize/issues/19710
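One possible shape for this, as a rough sketch only: tag each exported frame with its dataset type before concatenating. The `dataset_type` column name, the `frames` dict, and the helper `concat_with_dataset_type` are all made up for illustration and are not part of this PR.

```python
import pandas as pd

# Hypothetical stand-ins for what dataset.export_events(...) might return
# for each dataset type; the real frames would come from the model's datasets.
frames = {
    "primary": pd.DataFrame({"prediction": [0.1, 0.9]}),
    "reference": pd.DataFrame({"prediction": [0.4]}),
}


def concat_with_dataset_type(frames):
    """Concatenate frames, recording each row's origin in a `dataset_type` column."""
    return pd.concat(
        # assign() broadcasts the scalar name to every row of that frame
        [df.assign(dataset_type=name) for name, df in frames.items()],
        ignore_index=True,
    )


combined = concat_with_dataset_type(frames)
# Writing `combined` with to_parquet(parquet_file, index=False) would then
# carry the extra column into the exported file.
```

With a column like this in place, someone exporting a cluster that spans both datasets could filter on it after reading the file back.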
Thanks for the heads-up. Will revisit in a future PR.
src/phoenix/datasets/dataset.py
Outdated
self.__original_column_indices = [
    dataframe.columns.get_loc(column_name) for column_name in original_column_names
]
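For reference, a small sketch of what this lookup yields. It assumes the indices would later be used for positional selection with `iloc`; that use is an assumption, and the toy dataframe and column names are invented for illustration.

```python
import pandas as pd

# Toy dataframe: two "original" columns plus one added during processing.
dataframe = pd.DataFrame({"a": [1, 2], "b": [3, 4], "derived": [5, 6]})
original_column_names = ["a", "b"]

# Index.get_loc returns the integer position of a label when labels are unique.
original_column_indices = [
    dataframe.columns.get_loc(column_name) for column_name in original_column_names
]

# Positional selection recovers just the original columns.
original_only = dataframe.iloc[:, original_column_indices]
```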
How is this information passed back to the server runtime?
Good callout. Currently this does nothing, given that datasets are initialized (and validated) twice. Will remove it from this PR and revisit in the future.
resolves #417
resolves #432
clean-ups:
os.path with pathlib
Events.py