Extend data explorer for document-based data #568

Closed
8 of 9 tasks
Tracked by #563
PhilippeMoussalli opened this issue Oct 30, 2023 · 2 comments
PhilippeMoussalli commented Oct 30, 2023

Inspecting the data can be difficult since long strings are stretched horizontally in the explorer's table. The table view should adjust automatically based on the available data.

PhilippeMoussalli commented Nov 14, 2023

Things to improve

1) Enable exploring datasets prior to loading

This can be done easily if the dataset is stored in Parquet format, since Dask can interface with different storage backends through fsspec.

2) Visualize overview of pipeline run with subsets and fields

Visualize the evolution of a chosen pipeline and its schema.


3) Ability to export the dataset to different formats for user-desired visualization

**Structured data**: Parquet -> pandas, CSV
**Unstructured data**: mainly images for now; integrate the existing script to export the dataset locally.

4) Better view of documents

  • Enable hovering over a field to display the full content


  • Ability to export or visualize in a separate tab (to be investigated)

5) Comparing different pipeline runs

Some runners already provide this functionality, so let's keep it simple; the goal is to understand how different parameters might affect the data processing. Data points can be linked together by ID.

  • Arguments (link arguments to certain outcome)
  • Images side by side (outcome of different processing parameters on image processing, e.g. segmentation)
  • Captions side by side (outcome of different processing parameters on image processing, e.g. captioning)

6) Better understanding of components

  • Display component spec or script for selected component

7) Other

  • Filter
  • Search
  • Toggle

@RobbeSneyders RobbeSneyders changed the title Enable better visualization of explorer data Extend data explorer for document-based data Nov 14, 2023
@RobbeSneyders

1) Enable exploring datasets prior to loading

Not convinced that this is within scope. Datasets prior to loading can have any format. In any case, I don't think it's high priority.

2) Visualize overview of pipeline run with subsets and fields

I think this makes sense as an overview page indeed. Maybe even for navigation. We should think about how it fits in the explorer.

Note that we should still add this to the pipeline SDK for iterative development in the future as well.

3) Ability to export the dataset to different format for user desired visualization

Not completely sure what this means, but I would keep as much as possible inside the data explorer.

4) Better view of documents

This is the most important one for now, and we should go further. We should be able to show formatted documents.

  • Properly show basic formatting like newlines etc.
  • Show html documents
  • Show pdf documents?
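As a minimal sketch of the plain-text and HTML cases, one could render each document cell through a small helper. The `render_document` function and the Streamlit wiring mentioned in its docstring are assumptions for illustration, not existing explorer code:

```python
import html

def render_document(text: str, fmt: str = "text") -> str:
    """Return an HTML snippet for a document cell (hypothetical helper).

    In a Streamlit explorer the result could be passed to
    st.markdown(snippet, unsafe_allow_html=True), or to
    streamlit.components.v1.html for full HTML documents.
    """
    if fmt == "html":
        # Already HTML: render as-is, trusting the pipeline output.
        return text
    # Plain text: escape special characters and preserve newlines via <pre>.
    return f"<pre>{html.escape(text)}</pre>"

print(render_document("line 1\nline <2>"))
```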

5) Comparing different pipeline runs

This could be a powerful one. We should think about:

  • How to show this, because there's a third dimension (row, column, run)
  • How to merge the two runs based on id
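One possible approach to the merge is a plain pandas merge on the id with per-run suffixes, which also flattens the (row, column, run) cube into side-by-side columns. The run names and the `caption` column below are made up for illustration:

```python
import pandas as pd

# Hypothetical outputs of two pipeline runs, keyed by the same document id.
run_a = pd.DataFrame({"id": [1, 2], "caption": ["a cat", "a dog"]})
run_b = pd.DataFrame({"id": [1, 2], "caption": ["a black cat", "a brown dog"]})

# Merge on id; suffixes turn the third dimension (run) into
# side-by-side columns, one per run.
compare = run_a.merge(run_b, on="id", suffixes=("_run_a", "_run_b"))
print(compare.columns.tolist())  # ['id', 'caption_run_a', 'caption_run_b']
```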

6) Better understanding of components

Not sure if this fits in the data explorer. It also doesn't have access to this content at the moment.

7) Filter, search, sort

These are important as well, and based on this link it seems to be well supported with dataframes in Streamlit. This would be useful, e.g., to check all chunks of a document by filtering on the document_id and sorting on the chunk_id.
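The chunk example above can be sketched with plain pandas operations (the column names follow the sentence above; the data itself is made up). A frame like `doc1` below is what one would then hand to Streamlit's `st.dataframe` for display:

```python
import pandas as pd

# Hypothetical chunked-document subset.
chunks = pd.DataFrame({
    "document_id": ["doc1", "doc2", "doc1", "doc1"],
    "chunk_id": [2, 0, 0, 1],
    "text": ["c", "x", "a", "b"],
})

# Filter on document_id, then sort on chunk_id to read the chunks in order.
doc1 = chunks[chunks["document_id"] == "doc1"].sort_values("chunk_id")
print(doc1["text"].tolist())  # ['a', 'b', 'c']
```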


We should really tackle this with the RAG validation use case in mind, so I would focus on the following ranked priority-wise:

  1. Better view of documents
  2. Filter, search, sort
  3. Comparing different pipeline runs

@RobbeSneyders RobbeSneyders moved this from Breakdown to Ready for development in Fondant development Nov 22, 2023
@PhilippeMoussalli PhilippeMoussalli moved this from Ready for development to In Progress in Fondant development Nov 22, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Fondant development Dec 18, 2023