Consolidation of datachain examples with unstructured #353

Closed
tibor-mach opened this issue Aug 25, 2024 · 6 comments

@tibor-mach
Contributor

That one is a bit different as it summarises the text. But otherwise it is rather similar in what it does, so I guess we could simply add one more column to the example with embeddings where we have the text summary for the entire article.

We have this example with unstructured which shows text summarisation and then this example which chunkifies text and creates embeddings. Otherwise they are very similar.

I would merge the two, deleting the example from the datachain repo and adding article text summary to the output in this example in datachain-examples.

@mattseddon @dberenbaum (you seem to have worked on the summarisation example) do you agree?

cc @shcheklein

@shcheklein
Member

> I would merge the two, deleting the example from the datachain repo

I think it's better to keep it in the datachain repo if possible. It's not a Jupyter notebook, and we already have tests for it. Atm datachain-examples doesn't look stable tbh: it doesn't have a good structure, linters, etc.

This example has become one of the basic ones we show to users.

@tibor-mach
Contributor Author

Well, in that case I would just keep both of them as they are... or keep all examples in the datachain repo as before. I don't see a clear rule by which to keep examples in one repo or the other at the moment.

@shcheklein
Member

> I don't see a clear rule by which to keep examples in one repo or the other at the moment.

I think we wanted to migrate notebooks initially? that was pretty much the rule

if it's a single script that can run (on a subset of data) sufficiently fast and represents a high-level use case / example - I think we can keep it in the main repo

> Well, in that case I would just keep both of them as they are.

could you clarify this a bit? what would be the reason to have / maintain two of them?

@tibor-mach
Contributor Author

> if it's a single script that can run (on a subset of data) sufficiently fast and represents a high-level use case / example - I think we can keep it in the main repo

Ok that makes sense. Then I take that back - it does make sense to only keep one script. But I would then take the one from datachain-examples and add its content to the one in datachain which does summarisation (so it will also include embeddings and all the stuff from the blogpost).

We will still have the issue of maintaining the notebooks there which often have similar code (with just more text around it). But this will at least reduce the amount of duplication, even if it does not eliminate it completely.

@tibor-mach
Contributor Author

tibor-mach commented Aug 27, 2024

I played around with the examples a bit and I am not very happy with any version which combines both the summarisation and chunking/embeddings in a single script. I thought they were more or less demonstrating the same thing, but I no longer think so.

The examples work at different levels of granularity (w.r.t. the document) and they use different datachain methods as well. The example from @dberenbaum works at the level of an individual file and uses .map to create a table where each row represents a file and a summary of its content.

The example with embeddings uses .gen to create many rows, where each row represents one chunk of a partitioned document. It doesn't make much sense to summarise chunks, and while the whole document summary could be copy-pasted to each row generated from that document, I think that is unnecessary duplication and kind of goes against the idea that we do not copy anything extra in DataChain.
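
To make the difference in granularity concrete, here is a minimal sketch in plain Python. It does not use the DataChain or unstructured APIs: `DOCS`, `summarize` and `chunk` are hypothetical stand-ins for the files, the summarisation model and the partitioning step, and the two list comprehensions only imitate the row shapes that a map-style and a gen-style step would produce.

```python
from typing import Iterator

# Hypothetical stand-ins for file contents (not data from either example).
DOCS = {
    "a.txt": "DataChain processes files. It stores results as rows.",
    "b.txt": "Unstructured partitions documents. Chunks get embeddings. Rows are generated.",
}


def summarize(text: str) -> str:
    """Placeholder summary: keep only the first sentence (a real example would call a model)."""
    return text.split(". ")[0].rstrip(".") + "."


def chunk(text: str) -> Iterator[str]:
    """Placeholder chunker: yield one chunk per sentence."""
    for sentence in text.split(". "):
        yield sentence.rstrip(".") + "."


# map-style: exactly one output row per input file (file name, summary).
summary_rows = [(name, summarize(text)) for name, text in DOCS.items()]

# gen-style: a variable number of output rows per input file (file name, chunk index, chunk).
chunk_rows = [
    (name, index, piece)
    for name, text in DOCS.items()
    for index, piece in enumerate(chunk(text))
]

print(len(summary_rows), "summary rows;", len(chunk_rows), "chunk rows")
```

Attaching the per-file summary to every row of `chunk_rows` would repeat the same string once per chunk, which is the duplication mentioned above.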

Alternatively, it could be kept as a single longer script with multiple steps but then it becomes harder to read than each of the two separate examples and I don't think it would reduce maintenance much anyway.

Both examples use unstructured and work with text, but otherwise they show different things. So I would just move the script from datachain-examples to datachain and then make the tests better so that it is more stable.

@shcheklein
Member

sounds good @tibor-mach !

@shcheklein shcheklein added the documentation Improvements or additions to documentation label Aug 31, 2024