Add JsonFileParser to FileStrategy #1195
Conversation
scripts/prepdocslib/jsonparser.py
Outdated
if isinstance(data, list):
    for i, obj in enumerate(data):
        page_text = json.dumps(obj)
        offset += len(page_text)
I need to improve this as offset is wrong for large files.
Is this comment still the case?
I just pushed a change to the offset calculation with some test asserts. I'm not sure if I'm missing a more complex issue with offset, let me know if you were thinking of another issue around offset.
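For reference, one way to keep the offsets correct is to record each page's starting position before advancing the counter, so the first page starts at offset 0. A minimal sketch, simplified to plain tuples rather than the repository's Page class (names and shapes here are illustrative, not the PR's exact code):

```python
import json
from typing import Any, Iterator, Tuple

def iter_json_pages(data: Any) -> Iterator[Tuple[int, int, str]]:
    """Yield (page_number, offset, text) for a parsed JSON document.

    Each element of a top-level array becomes its own page; offset is the
    character position where that page's text begins in the concatenated
    output.
    """
    if isinstance(data, list):
        offset = 0
        for i, obj in enumerate(data):
            page_text = json.dumps(obj)
            # Yield the offset *before* advancing it, so page 0 starts at 0
            # and each later page starts where the previous one ended.
            yield i, offset, page_text
            offset += len(page_text)
    elif isinstance(data, dict):
        # A single top-level object is one page at offset 0.
        yield 0, 0, json.dumps(data)
```

Yielding before incrementing is the key detail: incrementing first would report each page as starting where it actually ends.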
One of the strongest PR descriptions I've seen here. Thank you for being so descriptive.
So the content of the JSON file is put as the page chunk text? Can you please share an example of what the text looks like for some example array and object?
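To illustrate what the chunk text could look like, assuming each array element is serialized with json.dumps as the snippet above suggests (the sample data below is made up for illustration):

```python
import json

data = [
    {"id": 1, "title": "Provision resources"},
    {"id": 2, "title": "Deploy the app"},
]

# Each array element becomes one page's text, serialized as compact JSON:
for obj in data:
    print(json.dumps(obj))
# {"id": 1, "title": "Provision resources"}
# {"id": 2, "title": "Deploy the app"}
```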
scripts/prepdocslib/jsonparser.py
Outdated
        offset += len(page_text)
        yield Page(i, offset, page_text)
elif isinstance(data, dict):
    yield Page(1, 0, json.dumps(data))
Interesting - you are using a 0-based index because of enumerate above, but a 1-based index here for objects?
I merged and rewrote some code; I'll double-check things. The likely fix is to make them consistent: 0-based for arrays of objects and 1-based for pages (because documents start on page 1).
Thank you GitHub Copilot :)
I'll have to try it out!
scripts/prepdocslib/filestrategy.py
Outdated
pdf_parser: PdfParser,
text_splitter: TextSplitter,
pdf_text_splitter: PdfTextSplitter,
Now that there are more parsers and splitters, I was wondering if you have thoughts about how to avoid having to pass in the classes? Imagine we also had parsers for HTML, CSV, etc.; this would mean more parameters to init and more attributes. Ideally we could avoid that while still allowing the flexibility of using different parsers.
Yeah, good call. Doing a refactor.
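One common way to avoid the growing parameter list is to map file extensions to small processor objects, so adding a file type means adding a dictionary entry rather than a constructor parameter. A sketch under assumed names (the stand-in Parser/Splitter types are hypothetical, not the repository's real classes):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-ins for the real Parser/TextSplitter classes.
Parser = Callable[[str], List[str]]
Splitter = Callable[[List[str]], List[str]]

@dataclass(frozen=True)
class FileProcessor:
    """Pairs one parser with one splitter for a single file type."""
    parser: Parser
    splitter: Splitter

def build_processors(
    pdf_parser: Parser,
    json_parser: Parser,
    sentence_splitter: Splitter,
    simple_splitter: Splitter,
) -> Dict[str, FileProcessor]:
    # New file types (HTML, CSV, ...) only require a new entry here,
    # not new __init__ parameters on the strategy class.
    return {
        ".pdf": FileProcessor(pdf_parser, sentence_splitter),
        ".json": FileProcessor(json_parser, simple_splitter),
    }
```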
@@ -0,0 +1,34 @@
from scripts.prepdocslib.page import Page
FYI, for the tests, you can run pytest --cov to see coverage stats for your new files.
scripts/prepdocslib/textsplitter.py
Outdated
class SimpleTextSplitter(TextSplitter):
    """
    Class that splits pages into smaller chunks. This is required because embedding models may not be able to analyze an entire page at once
Please adjust the docstrings to be distinct across these two classes. Also, I don't think the original docstring is 100% correct, as we also chunk to reduce the context sent to the LLM. That's the primary purpose in my mind. Perhaps you can mention both.
Overall question: does this still work fine for the thought process and citation tab?
Co-authored-by: Pamela Fox <pamela.fox@gmail.com>
data/DevOps_Examples/2189.json
Outdated
@@ -0,0 +1,14 @@
{
I love the sample data but I'm worried about muddying up the sample index with unrelated data. I suppose we already do that for the GPT4V examples, and it hasn't practically caused an issue, so maybe that's fine? @mattgotteiner Thoughts?
Might be helpful to think about whether we're going to end up adding example HTML, docx, pptx, etc. Probably not as it slows ingestion time and overall deployment time. That's another reason to not add it.
If we do keep it, I'd suggest renaming the folder to JSON_Examples, since that's more their purpose in the repository.
I just tested this. Parsing those JSON files takes less than 30 seconds, so maybe we just keep multiple example files. It's the PDFs and the call to Doc Intel that add significantly to the time.
I've tested this locally, both with Doc Intel and Local PDF parser. I've also confirmed that @codebytes added test coverage for all new lines of code (thank you!).
The new mypy issue is mine, from swapping out the namedtuple for a frozen dataclass at the suggestion of other Pythonistas. I'll check it out.
@codebytes Merged! Thanks so much for the PR. Please send a follow-up if the offset calculation still needs improvement.
This pull request includes changes to scripts/prepdocs.py and related files to refactor the document parsing and splitting logic. The changes introduce a more flexible and extensible architecture for handling different types of documents and splitting strategies. The most significant changes include the introduction of a new FileProcessor class, changes to how file strategies are set up, and the addition of new parsers and splitters for different file types and splitting strategies.

Refactoring:
- scripts/prepdocs.py: Refactored the setup for file strategies to use a dictionary of FileProcessor instances for different file types. This allows for easy addition of new file types and their corresponding processing logic.
- scripts/prepdocslib/fileprocessor.py: Introduced a new FileProcessor class that encapsulates a parser and a splitter. This makes the code more modular and easier to extend.
- scripts/prepdocslib/filestrategy.py: Refactored the FileStrategy class to use the new FileProcessor instances instead of individual parsers and splitters. This simplifies the code and makes it more flexible.

New parsers and splitters:
- scripts/prepdocslib/jsonparser.py: Added a new JsonParser for parsing JSON files.
- scripts/prepdocslib/textsplitter.py: Introduced new SentenceTextSplitter and SimpleTextSplitter classes for different text splitting strategies. This allows for easy addition of new splitting strategies.
- scripts/prepdocslib/pdfparser.py: Refactored the existing PDF parser classes to extend from the new Parser abstract base class. This makes the code more consistent and easier to extend.

New and modified tests:
- tests/test_jsonparser.py: Added new tests for the JsonParser class.
- tests/test_prepdocslib_jsontextsplitter.py: Added new tests for the SimpleTextSplitter class.
- tests/test_prepdocslib_pdftextsplitter.py: Modified existing tests to use the new SentenceTextSplitter class.

Other changes:
- scripts/prepdocslib/page.py: Introduced new Page and SplitPage classes to represent a page and a split section of a page, respectively.
- scripts/prepdocslib/parser.py: Introduced a new Parser abstract base class for parsers. This makes the code more consistent and easier to extend.
- scripts/prepdocslib/listfilestrategy.py: Added a new file_extension method to the LocalListFileStrategy class to get the file extension of the file being processed.