Extend data explorer for document-based data #568

Closed
8 of 9 tasks
Tracked by #563
PhilippeMoussalli opened this issue Oct 30, 2023 · 2 comments
PhilippeMoussalli commented Oct 30, 2023

Inspecting the data can be difficult since long strings are stretched horizontally in the explorer's table. The table view should adjust automatically based on the available data.

PhilippeMoussalli commented Nov 14, 2023

Things to improve

1) Enable exploring datasets prior to loading

This can be done easily if the dataset is stored in Parquet format, since Dask can interface with different storage backends through fsspec.

2) Visualize overview of pipeline run with subsets and fields

Visualize the evolution of a chosen pipeline and its schema.


3) Ability to export the dataset to different formats for user-desired visualization

**Structured data**: Parquet -> pandas, CSV
**Unstructured data**: mainly images for now; integrate the existing script to export the dataset locally.

4) Better view of documents

  • Enable hovering over a field to display the full content


  • Ability to export or visualize in a separate tab (to be investigated)

5) Comparing different pipeline runs

Some runners already provide this functionality, so let's keep it simple; the goal is to understand how different parameters might affect the data processing. Data points can be linked together by ID.

  • Arguments (link arguments to certain outcome)
  • Images side by side (outcome of different processing parameters on image processing, e.g. segmentation)
  • Captions side by side (outcome of different processing parameters on image processing, e.g. captioning)

6) Better understanding of components

  • Display component spec or script for selected component

7) Other

  • Filter
  • Search
  • Toggle

@RobbeSneyders RobbeSneyders changed the title Enable better visualization of explorer data Extend data explorer for document-based data Nov 14, 2023
@RobbeSneyders

1) Enable exploring datasets prior to loading

Not convinced that this is within scope. Datasets prior to loading can have any format. In any case, I don't think it's high priority.

2) Visualize overview of pipeline run with subsets and fields

I think this makes sense as an overview page indeed. Maybe even for navigation. We should think about how it fits in the explorer.

Note that we should still add this to the pipeline SDK for iterative development in the future as well.

3) Ability to export the dataset to different format for user desired visualization

Not completely sure what this means, but I would keep as much as possible inside the data explorer.

4) Better view of documents

This is the most important one for now, and we should go further. We should be able to show formatted documents.

  • Properly show basic formatting like newlines etc.
  • Show html documents
  • Show pdf documents?
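As a minimal sketch of the plain-text and HTML cases, one could render each document cell through a small helper. The `render_document` function and the Streamlit wiring mentioned in its docstring are assumptions for illustration, not existing explorer code:

```python
import html

def render_document(text: str, fmt: str = "text") -> str:
    """Return an HTML snippet for a document cell (hypothetical helper).

    In a Streamlit explorer the result could be passed to
    st.markdown(snippet, unsafe_allow_html=True), or to
    streamlit.components.v1.html for full HTML documents.
    """
    if fmt == "html":
        # Already HTML: render as-is, trusting the pipeline output.
        return text
    # Plain text: escape special characters and preserve newlines via <pre>.
    return f"<pre>{html.escape(text)}</pre>"

print(render_document("line 1\nline <2>"))
```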

5) Comparing different pipeline runs

This could be a powerful one. We should think about:

  • How to show this, because there's a third dimension (row, column, run)
  • How to merge the two runs based on id
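One possible approach to the merge is a plain pandas merge on the id with per-run suffixes, which also flattens the (row, column, run) cube into side-by-side columns. The run names and the `caption` column below are made up for illustration:

```python
import pandas as pd

# Hypothetical outputs of two pipeline runs, keyed by the same document id.
run_a = pd.DataFrame({"id": [1, 2], "caption": ["a cat", "a dog"]})
run_b = pd.DataFrame({"id": [1, 2], "caption": ["a black cat", "a brown dog"]})

# Merge on id; suffixes turn the third dimension (run) into
# side-by-side columns, one per run.
compare = run_a.merge(run_b, on="id", suffixes=("_run_a", "_run_b"))
print(compare.columns.tolist())  # ['id', 'caption_run_a', 'caption_run_b']
```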

6) Better understanding of components

Not sure if this fits in the data explorer. It also doesn't have access to this content at the moment.

7) Filter, search, sort

These are important as well, and based on this link it seems to be well supported with dataframes in Streamlit. This would be useful, e.g., to check all chunks of a document by filtering on the document_id and sorting on the chunk_id.
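The chunk example above can be sketched with plain pandas operations (the column names follow the sentence above; the data itself is made up). A frame like `doc1` below is what one would then hand to Streamlit's `st.dataframe` for display:

```python
import pandas as pd

# Hypothetical chunked-document subset.
chunks = pd.DataFrame({
    "document_id": ["doc1", "doc2", "doc1", "doc1"],
    "chunk_id": [2, 0, 0, 1],
    "text": ["c", "x", "a", "b"],
})

# Filter on document_id, then sort on chunk_id to read the chunks in order.
doc1 = chunks[chunks["document_id"] == "doc1"].sort_values("chunk_id")
print(doc1["text"].tolist())  # ['a', 'b', 'c']
```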


We should really tackle this with the RAG validation use case in mind, so I would focus on the following ranked priority-wise:

  1. Better view of documents
  2. Filter, search, sort
  3. Comparing different pipeline runs

@RobbeSneyders RobbeSneyders moved this from Breakdown to Ready for development in Fondant development Nov 22, 2023
@PhilippeMoussalli PhilippeMoussalli moved this from Ready for development to In Progress in Fondant development Nov 22, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Fondant development Dec 18, 2023