pdf processing blogpost #2

Merged
merged 2 commits into main from pdf_processing
Aug 13, 2024

Conversation

tibor-mach
Contributor

Here's my notebook on RAG evaluation with DataChain (llm_rag_evaluation.ipynb), as well as a shortened version to be used as a blog post (that one is in llm/pdf-processing).

Collaborator

@mnrozhkov mnrozhkov left a comment

IMO it makes sense to move llm_rag_evaluation.ipynb, requirements.txt, and sample.pdf into a separate folder, e.g. llm_rag_evaluation.

@@ -0,0 +1,3 @@
unstructured[pdf,embed-huggingface]
Contributor

@mattseddon mattseddon Aug 13, 2024

Just FYI: the current version of unstructured[pdf] (0.15.1) does not work with the latest version of nltk, which it uses for PDF processing. See iterative/datachain#277 for more details. The issue should be resolved in the next unstructured release.
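
To see which versions of the two packages are actually installed before running the notebook, something like the following could be used. This is only a minimal sketch: `installed_versions` is a hypothetical helper name, not part of unstructured, nltk, or DataChain, and it only reports versions; it does not decide whether a given combination is affected.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("unstructured", "nltk")):
    """Return a dict mapping each package name to its installed version.

    Hypothetical helper for illustration; a package that is not
    installed maps to None instead of raising.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print one line per package, e.g. to compare against the versions
    # discussed in iterative/datachain#277.
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If unstructured reports 0.15.1, the incompatibility described above may apply; see the linked issue for details.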

@tibor-mach tibor-mach merged commit cdb030a into main Aug 13, 2024
@tibor-mach tibor-mach deleted the pdf_processing branch August 13, 2024 12:08
Member

@shcheklein shcheklein left a comment

Thanks @tibor-mach! Let's start preparing the blog post and a demo video for this, please.
