Extend data explorer for document-based data #568
Things to improve

1) Enable exploring datasets prior to loading. This can be done easily if the dataset is stored in Parquet format, since Dask can interface with different storage backends through fsspec (see the first sketch after this list).
2) Visualize an overview of a pipeline run with its subsets and fields, and visualize the evolution of a chosen pipeline and its schema.
3) Ability to export the dataset to different formats for user-desired visualization. Structured data: Parquet -> pandas, CSV.
4) Better view of documents.
5) Comparing different pipeline runs. Some runners already provide this functionality, so let's keep it simple; the goal is to understand how different parameters might be affecting the data processing. Data points can be linked together by ID (see the second sketch after this list).
6) Better understanding of components.
7) Other.
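A minimal sketch of item 1 (plus the export mentioned in item 3), assuming the dataset is stored as Parquet in a bucket reachable through fsspec; the gs:// path, storage options, and file names below are placeholders, not the explorer's actual layout:

```python
# Sketch only: explore a Parquet dataset lazily with Dask over fsspec-backed
# storage (needs the matching filesystem package, e.g. gcsfs for gs:// URLs).
import dask.dataframe as dd

# Placeholder path and credentials; the real location depends on the pipeline run.
ddf = dd.read_parquet(
    "gs://my-bucket/my-pipeline/my-run/text/*.parquet",
    storage_options={"token": "anon"},
)

print(ddf.columns)  # inspect the schema without loading the data
print(ddf.head())   # materialise only the first partition

# Item 3: export a small selection to pandas / CSV for custom visualization.
sample = ddf.head(1_000)  # pandas DataFrame
sample.to_csv("sample.csv", index=False)
```

Because Dask only reads the partitions it actually needs, this kind of exploration also works for datasets that are too large to load at once.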
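And a minimal sketch of item 5, assuming each run keeps a stable "id" column that data points can be joined on; the run paths and column names are hypothetical:

```python
# Sketch only: line up the output of two hypothetical runs on a shared "id"
# column to see how a parameter change affected the processed data.
import dask.dataframe as dd

run_a = dd.read_parquet("gs://my-bucket/my-pipeline/run-a/text/*.parquet")
run_b = dd.read_parquet("gs://my-bucket/my-pipeline/run-b/text/*.parquet")

comparison = run_a.merge(run_b, on="id", suffixes=("_run_a", "_run_b"))
print(comparison[["id", "text_run_a", "text_run_b"]].head())
```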
1) Not convinced that this is within scope. Datasets prior to loading can have any format. In any case, I don't think it's high priority.
2) I think this makes sense as an overview page indeed, maybe even for navigation. We should think about how it fits in the explorer. Note that we should still add this to the pipeline SDK for iterative development in the future as well.
3) Not completely sure what this means, but I would keep as much as possible inside the data explorer.
4) This is the most important one for now, and we should go further: we should be able to show formatted documents.
5) This could be a powerful one. We should think about:
6) Not sure if this fits in the data explorer. It also doesn't have access to this content at this moment.

These are important as well, and based on this link it seems like it is well supported with dataframes in Streamlit. This would be useful to e.g. check all chunks of a document by filtering on the

We should really tackle this with the RAG validation use case in mind, so I would focus on the following ranked priority-wise:
Inspecting the data can be difficult since long strings are stretched horizontally in the explorer's table. The table view should adjust automatically to the available data; a possible approach is sketched below.
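A minimal sketch of what that could look like with Streamlit's column configuration, assuming a recent Streamlit version; the "chunks.parquet" source and "text" column name are placeholders:

```python
# Sketch only: cap the rendered width of long text columns instead of letting
# them stretch the whole table. File and column names are placeholders.
import pandas as pd
import streamlit as st

df = pd.read_parquet("chunks.parquet")

st.dataframe(
    df,
    use_container_width=True,
    column_config={
        "text": st.column_config.TextColumn("text", width="medium"),
    },
)
```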
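Related to the Streamlit dataframe point in the comment above, here is a minimal sketch of checking all chunks of one document by filtering, with "document_id" and "text" as hypothetical column names:

```python
# Sketch only: let the user pick a document and show only its chunks.
import pandas as pd
import streamlit as st

chunks = pd.read_parquet("chunks.parquet")  # placeholder source

doc_id = st.selectbox("Document", sorted(chunks["document_id"].unique()))
selected = chunks[chunks["document_id"] == doc_id]

st.dataframe(selected[["document_id", "text"]], use_container_width=True)
```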