Consolidation of datachain examples with unstructured #353

tibor-mach · 2024-08-25T14:39:54Z

That one is a bit different as it summarises the text. But otherwise it is rather similar in what it does, so I guess we could simply add one more column to the example with embeddings where we have the text summary for the entire article.

We have this example with unstructured which shows text summarisation and then this example which chunkifies text and creates embeddings. Otherwise they are very similar.

I would merge the two, deleting the example from the datachain repo and adding article text summary to the output in this example in datachain-examples.

@mattseddon @dberenbaum (you seem to have worked on the summarisation example) do you agree?

cc @shcheklein

shcheklein · 2024-08-25T23:42:02Z

> I would merge the two, deleting the example from the datachain repo

I think it's better to keep it in the datachain repo if possible. It's not a Jupyter notebook, and we already have tests for it. Atm datachain-examples doesn't look stable tbh: it doesn't have a good structure, linters, etc.

This example has become one of the basic ones we show to users.

tibor-mach · 2024-08-26T12:33:08Z

Well, in that case I would just keep both of them as they are... or keep all examples in the datachain repo as before. I don't see a clear rule by which to keep examples in one repo or the other at the moment.

shcheklein · 2024-08-26T17:35:26Z

> I don't see a clear rule by which to keep examples in one repo or the other at the moment.

I think we wanted to migrate notebooks initially? that was pretty much the rule

if it's a single script that can run (on a subset of data) sufficiently fast and represents a high-level use case / example - I think we can keep it in the main repo

> Well, in that case I would just keep both of them as they are.

could you clarify this a bit? what would be the reason to have / maintain two of them?

tibor-mach · 2024-08-26T17:58:42Z

> if it's a single script that can run (on a subset of data) sufficiently fast and represents a high-level use case / example - I think we can keep it in the main repo

Ok that makes sense. Then I take that back - it does make sense to only keep one script. But I would then take the one from datachain-examples and add its content to the one in datachain which does summarisation (so it will also include embeddings and all the stuff from the blogpost).

We will still have the issue of maintaining the notebooks there which often have similar code (with just more text around it). But this will at least reduce the amount of duplication, even if it does not eliminate it completely.

tibor-mach · 2024-08-27T10:33:26Z

I played around with the examples a bit and I am not very happy with any version which combines both the summarisation and chunking/embeddings in a single script. I thought they were more or less demonstrating the same thing, but I no longer think so.

The examples work on different levels of granularity (w.r.t. the document) and they use different datachain methods as well. The example from @dberenbaum works on the level of an individual file: it uses .map to create a table where each row represents a file and a summary of its content.
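To make the one-row-per-file shape concrete, here is a minimal plain-Python sketch. It deliberately does not use the DataChain API: `FileSummary`, `summarize`, and `map_files` are hypothetical names invented for illustration, and the "summariser" is just a stand-in that keeps the first few words.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FileSummary:
    """One output row: a file and the summary of its content."""
    path: str
    summary: str


def summarize(text: str, max_words: int = 5) -> str:
    # Hypothetical stand-in for a real summariser: keep the first few words.
    return " ".join(text.split()[:max_words])


def map_files(files: Dict[str, str]) -> List[FileSummary]:
    # Map-style step: exactly one output row per input file.
    return [FileSummary(path, summarize(text)) for path, text in files.items()]


files = {
    "a.txt": "DataChain examples show how to process unstructured text files",
    "b.txt": "Short note",
}
rows = map_files(files)
```

The point of the sketch is only the cardinality: two files in, two rows out.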

The example with embeddings uses .gen to create many rows, where each row represents one chunk of a partitioned document. It doesn't make much sense to summarise chunks, and while the whole-document summary could be copy-pasted to each row generated from that document, I think that is unnecessary duplication and kind of goes against the idea that we do not copy anything extra in DataChain.
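The many-rows-per-document shape can be sketched in plain Python as well. Again, this is not the DataChain API: `Chunk`, `embed`, and `gen_chunks` are hypothetical names, the chunking is a naive fixed-size word split, and the "embedding" is a toy two-number vector rather than a real model output.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class Chunk:
    """One output row: a single chunk of a partitioned document."""
    path: str
    index: int
    text: str
    embedding: List[float]


def embed(text: str) -> List[float]:
    # Toy stand-in for an embedding model: [character count, word count].
    return [float(len(text)), float(len(text.split()))]


def gen_chunks(path: str, text: str, words_per_chunk: int = 3) -> Iterator[Chunk]:
    # Generator-style step: yields zero or more rows per input file.
    words = text.split()
    for start in range(0, len(words), words_per_chunk):
        chunk_text = " ".join(words[start:start + words_per_chunk])
        yield Chunk(path, start // words_per_chunk, chunk_text, embed(chunk_text))


files: Dict[str, str] = {
    "a.txt": "one two three four five six seven",
    "b.txt": "eight nine",
}
rows = [chunk for path, text in files.items() for chunk in gen_chunks(path, text)]
```

Here two files produce four rows (three chunks for `a.txt`, one for `b.txt`). Attaching a document-level summary to every `Chunk` would store the same string three times for `a.txt`, which is the duplication described above.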

Alternatively, it could be kept as a single longer script with multiple steps but then it becomes harder to read than each of the two separate examples and I don't think it would reduce maintenance much anyway.

Both examples use unstructured and work with text, but otherwise they show different things. So I would just move the script from datachain-examples to datachain and then make the tests better so that it is more stable.

shcheklein · 2024-08-27T19:03:23Z

sounds good @tibor-mach !

shcheklein assigned tibor-mach and shcheklein Aug 25, 2024

tibor-mach mentioned this issue Aug 27, 2024

added embeddings/gen example #362

Merged

shcheklein added the documentation label (Improvements or additions to documentation) Aug 31, 2024

mattseddon closed this as completed Sep 18, 2024
