
Add JsonFileParser to FileStrategy #1195

Merged: 20 commits merged into Azure-Samples:main on Feb 2, 2024
Conversation

@codebytes (Contributor) commented on Jan 28, 2024:

This pull request refactors the document parsing and splitting logic in scripts/prepdocs.py and related files, introducing a more flexible and extensible architecture for handling different document types and splitting strategies. The most significant changes are the introduction of a new FileProcessor class, changes to how file strategies are set up, and new parsers and splitters for different file types and splitting strategies.

Refactoring:

  • scripts/prepdocs.py: Refactored the setup for file strategies to use a dictionary of FileProcessor instances for different file types. This allows for easy addition of new file types and their corresponding processing logic.
  • scripts/prepdocslib/fileprocessor.py: Introduced a new FileProcessor class that encapsulates a parser and a splitter. This makes the code more modular and easier to extend.
  • scripts/prepdocslib/filestrategy.py: Refactored the FileStrategy class to use the new FileProcessor instances instead of individual parsers and splitters. This simplifies the code and makes it more flexible.
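As a rough sketch of the dictionary-of-processors setup described above (the parser and splitter classes here are illustrative stand-ins, not the PR's actual definitions in scripts/prepdocslib):

```python
from dataclasses import dataclass

# Hypothetical minimal stand-ins for the real parser/splitter base classes.
class Parser: ...
class TextSplitter: ...
class JsonParser(Parser): ...
class SimpleTextSplitter(TextSplitter): ...

@dataclass(frozen=True)
class FileProcessor:
    """Pairs a parser with a splitter for one file type."""
    parser: Parser
    splitter: TextSplitter

# Extension -> processor lookup: supporting a new file type is one new entry,
# rather than another constructor parameter on FileStrategy.
file_processors: dict[str, FileProcessor] = {
    ".json": FileProcessor(JsonParser(), SimpleTextSplitter()),
}
```

A later comment in this thread mentions the FileProcessor ended up as a frozen dataclass (swapped in for a namedtuple), which is why the sketch uses one here.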

New parsers and splitters:

New and modified tests:

Other changes:

if isinstance(data, list):
    for i, obj in enumerate(data):
        page_text = json.dumps(obj)
        offset += len(page_text)
codebytes (Contributor Author):

I need to improve this as offset is wrong for large files.

Collaborator:

Is this comment still the case?

Collaborator:

I just pushed a change to the offset calculation with some test asserts. I'm not sure if I'm missing a more complex issue with offset, let me know if you were thinking of another issue around offset.

@mattgotteiner (Collaborator):
One of the strongest PR descriptions I've seen here. Thank you for being so descriptive.

@mattgotteiner (Collaborator) commented:

So the content of the JSON file is put as the page chunk text? Can you please share an example of what the text looks like for some example array and object?

        offset += len(page_text)
        yield Page(i, offset, page_text)
elif isinstance(data, dict):
    yield Page(1, 0, json.dumps(data))
Collaborator:

Interesting - you are using a 0-based index because of enumerate above but a 1-based index here for objects?

codebytes (Contributor Author):

I merged and rewrote some code; I'll double-check things. The likely fix is to make them consistent: 0-based for an array of objects, 1-based for pages (because docs start on page 1).
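The offset and indexing questions above can be sketched in one place. This is a hedged illustration, not the PR's final code: the function name `pages_from_json` and the minimal `Page` stand-in are invented here, and offsets are accumulated after each yield so every page starts where the previous page's text ended.

```python
import json
from dataclasses import dataclass
from typing import Iterator

# Minimal stand-in for the library's Page type (index, character offset, text).
@dataclass
class Page:
    page_num: int
    offset: int
    text: str

def pages_from_json(data) -> Iterator[Page]:
    """Yield one Page per top-level JSON value, with a running character offset."""
    if isinstance(data, list):
        offset = 0
        for i, obj in enumerate(data):
            page_text = json.dumps(obj)
            # Yield first, then advance: the page's offset is where it begins.
            yield Page(i, offset, page_text)
            offset += len(page_text)
    elif isinstance(data, dict):
        # A single object is one page; 0-based here for consistency with enumerate,
        # though the thread also discusses 1-based numbering for document pages.
        yield Page(0, 0, json.dumps(data))
```

The key fix relative to the snippet under review is ordering: incrementing `offset` before yielding would report each page as starting where it ends.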

@codebytes (Contributor Author), quoting the above:

> One of the strongest PR descriptions I've seen here. Thank you for being so descriptive.

Thank you GitHub Copilot :)

@mattgotteiner (Collaborator), quoting the above:

> One of the strongest PR descriptions I've seen here. Thank you for being so descriptive.
> Thank you GitHub Copilot :)

I'll have to try it out!

pdf_parser: PdfParser,
text_splitter: TextSplitter,
pdf_text_splitter: PdfTextSplitter,
Collaborator:

Now that there are more parsers and splitters, do you have thoughts about how to avoid having to pass in the classes? If we also had parsers for HTML, CSV, etc., this would mean more parameters to init and more attributes. Ideally we could avoid that while still allowing the flexibility of using different parsers.

codebytes (Contributor Author):

Yeah, good call. Doing a refactor.

@@ -0,0 +1,34 @@
from scripts.prepdocslib.page import Page
Collaborator:

FYI, for the tests, you can run pytest --cov to see coverage stats for your new files


class SimpleTextSplitter(TextSplitter):
    """
    Class that splits pages into smaller chunks. This is required because embedding models may not be able to analyze an entire page at once
Collaborator:

Please adjust the docstrings to be distinct across these two classes. Also, I don't think the original docstring is 100% correct, as we also chunk to reduce the context sent to the LLM; that's the primary purpose in my mind. Perhaps you can mention both.

@pamelafox (Collaborator) commented:

Overall question: does this still work fine for the thought process and citation tab?

@codebytes (Contributor Author), quoting the above:

> Overall question: does this still work fine for the thought process and citation tab?

[screenshot attached]

@@ -0,0 +1,14 @@
{
Collaborator:

I love the sample data but I'm worried about muddying up the sample index with unrelated data. I suppose we already do that for the GPT4V examples, and it hasn't practically caused an issue, so maybe that's fine? @mattgotteiner Thoughts?

Might be helpful to think about whether we're going to end up adding example HTML, docx, pptx, etc. Probably not as it slows ingestion time and overall deployment time. That's another reason to not add it.

If we do keep it, I'd suggest renaming the folder to JSON_Examples since that's more their purpose in the repository.

Collaborator:

I just tested this. Parsing those JSON files takes less than 30 seconds, so maybe we just keep multiple example files. It's the PDFs and the call to Doc Intel that add significantly to the time.

@pamelafox (Collaborator) commented:

I've tested this locally, both with Doc Intel and Local PDF parser. I've also confirmed that @codebytes added test coverage for all new lines of code (thank you!).

@pamelafox (Collaborator):

The new mypy issue is mine, from swapping out the namedtuple with frozen dataclass, at the suggestion of other Pythonistas. I'll check it out.

@pamelafox pamelafox merged commit 270d869 into Azure-Samples:main Feb 2, 2024
10 checks passed
@pamelafox (Collaborator):

@codebytes Merged! Thanks so much for the PR. Please send a follow-up if the offset calculation still needs improvement.


3 participants