Consolidation of datachain examples with unstructured #353
Comments
I think it's better to keep it in the datachain repo if possible. It's not a Jupyter notebook, and we already have tests for this. Atm datachain-examples doesn't look stable tbh; it doesn't have a good structure, linters, etc. This example has become one of the basic ones we show to users.
Well, in that case I would just keep both of them as they are, or keep all examples in the datachain repo as before. I don't see a clear rule by which to keep examples in one repo or the other at the moment.
I think we wanted to migrate notebooks initially? That was pretty much the rule. If it's a single script that can run (on a subset of data) sufficiently fast and represents a high-level use case / example, I think we can keep it in the main repo.
Could you clarify this a bit? What would be the reason to have / maintain two of them?
Ok, that makes sense. Then I take that back: it does make sense to only keep one script. But I would then take the one from We will still have the issue of maintaining the notebooks there which often have similar code (with just more text around it). But this will at least reduce the amount of duplication, even if it does not eliminate it completely.
I played around with the examples a bit and I am not very happy with any version which combines both the summarisation and chunking/embeddings in a single script. I thought they were more or less demonstrating the same thing, but I no longer think so. The examples work on a different level of granularity (w.r.t. the document) and they use different The example with embeddings uses Alternatively, it could be kept as a single longer script with multiple steps but then it becomes harder to read than each of the two separate examples and I don't think it would reduce maintenance much anyway. Both examples use |
Sounds good, @tibor-mach!
That one is a bit different as it summarises the text. But otherwise it is rather similar in what it does, so I guess we could simply add one more column to the example with embeddings where we have the text summary for the entire article.
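To make the "one more column" idea concrete, here is a minimal plain-Python sketch of what the combined output could look like: one row per chunk, with the article-level summary repeated on every row. It deliberately does not use the datachain or unstructured APIs; `ChunkRow`, `chunk_text`, `toy_embedding` and `toy_summary` are hypothetical stand-ins for the real chunking, embedding and summarisation steps, and the file name is made up.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ChunkRow:
    """Hypothetical output row: one per chunk, plus an article-level summary."""
    source: str
    chunk_index: int
    chunk_text: str
    embedding: list[float]
    article_summary: str  # the proposed extra column, same for every chunk


def chunk_text(text: str, size: int) -> list[str]:
    # Split the text into chunks of at most `size` whitespace-separated words.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def toy_embedding(chunk: str) -> list[float]:
    # Placeholder for a real embedding model: [word count, character count].
    return [float(len(chunk.split())), float(len(chunk))]


def toy_summary(text: str) -> str:
    # Placeholder for a real summarisation model: keep the first sentence.
    return text.split(".")[0].strip() + "."


def build_rows(source: str, text: str, size: int = 5) -> list[ChunkRow]:
    # Summarise once per article, then attach the summary to every chunk row.
    summary = toy_summary(text)
    return [
        ChunkRow(source, i, chunk, toy_embedding(chunk), summary)
        for i, chunk in enumerate(chunk_text(text, size))
    ]


rows = build_rows(
    "article.pdf",  # made-up file name
    "DataChain processes unstructured data. "
    "It can chunk documents and embed each chunk for search.",
)
for row in rows:
    print(row.chunk_index, row.chunk_text, "|", row.article_summary)
```

The summary is computed once per document and then duplicated across that document's chunk rows, which is the granularity mismatch between the two examples: summarisation works per article, embeddings work per chunk.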
We have this example with unstructured which shows text summarisation and then this example which chunkifies text and creates embeddings. Otherwise they are very similar. I would merge the two, deleting the example from the datachain repo and adding article text summary to the output in this example in datachain-examples.
@mattseddon @dberenbaum (you seem to have worked on the summarisation example) do you agree?
cc @shcheklein