
Add JsonFileParser to FileStrategy #1195

Merged: 20 commits merged into Azure-Samples:main on Feb 2, 2024
Conversation

@codebytes (Contributor) commented on Jan 28, 2024:

This pull request refactors the document parsing and splitting logic in scripts/prepdocs.py and related files, introducing a more flexible and extensible architecture for handling different document types and splitting strategies. The most significant changes are the introduction of a new FileProcessor class, changes to how file strategies are set up, and new parsers and splitters for different file types and splitting strategies.

Refactoring:

  • scripts/prepdocs.py: Refactored the setup for file strategies to use a dictionary of FileProcessor instances for different file types. This allows for easy addition of new file types and their corresponding processing logic.
  • scripts/prepdocslib/fileprocessor.py: Introduced a new FileProcessor class that encapsulates a parser and a splitter. This makes the code more modular and easier to extend.
  • scripts/prepdocslib/filestrategy.py: Refactored the FileStrategy class to use the new FileProcessor instances instead of individual parsers and splitters. This simplifies the code and makes it more flexible.
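As a rough sketch of the dictionary-of-processors setup described above (the parser and splitter classes here are illustrative stand-ins, not the PR's actual definitions in scripts/prepdocslib):

```python
from dataclasses import dataclass

# Hypothetical minimal stand-ins for the real parser/splitter base classes.
class Parser: ...
class TextSplitter: ...
class JsonParser(Parser): ...
class SimpleTextSplitter(TextSplitter): ...

@dataclass(frozen=True)
class FileProcessor:
    """Pairs a parser with a splitter for one file type."""
    parser: Parser
    splitter: TextSplitter

# Extension -> processor lookup: supporting a new file type is one new entry,
# rather than another constructor parameter on FileStrategy.
file_processors: dict[str, FileProcessor] = {
    ".json": FileProcessor(JsonParser(), SimpleTextSplitter()),
}
```

A later comment in this thread mentions the FileProcessor ended up as a frozen dataclass (swapped in for a namedtuple), which is why the sketch uses one here.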

New parsers and splitters:

New and modified tests:

Other changes:

if isinstance(data, list):
    for i, obj in enumerate(data):
        page_text = json.dumps(obj)
        offset += len(page_text)
codebytes (Contributor Author):

I need to improve this as offset is wrong for large files.

Collaborator:

Is this comment still the case?

Collaborator:

I just pushed a change to the offset calculation with some test asserts. I'm not sure if I'm missing a more complex issue with offset, let me know if you were thinking of another issue around offset.

@mattgotteiner (Collaborator):
One of the strongest PR descriptions I've seen here. Thank you for being so descriptive.

@mattgotteiner (Collaborator) commented:

So the content of the JSON file is put as the page chunk text? Can you please share an example of what the text looks like for some example array and object?

        offset += len(page_text)
        yield Page(i, offset, page_text)
elif isinstance(data, dict):
    yield Page(1, 0, json.dumps(data))
Collaborator:

Interesting - you are using a 0-based index because of enumerate above but a 1-based index here for objects?

codebytes (Contributor Author):

I merged and rewrote some code; I'll double-check things. The likely fix is to make them consistent: 0-based for an array of objects, 1-based for pages (because docs start on page 1).
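The offset and indexing questions above can be sketched in one place. This is a hedged illustration, not the PR's final code: the function name `pages_from_json` and the minimal `Page` stand-in are invented here, and offsets are accumulated after each yield so every page starts where the previous page's text ended.

```python
import json
from dataclasses import dataclass
from typing import Iterator

# Minimal stand-in for the library's Page type (index, character offset, text).
@dataclass
class Page:
    page_num: int
    offset: int
    text: str

def pages_from_json(data) -> Iterator[Page]:
    """Yield one Page per top-level JSON value, with a running character offset."""
    if isinstance(data, list):
        offset = 0
        for i, obj in enumerate(data):
            page_text = json.dumps(obj)
            # Yield first, then advance: the page's offset is where it begins.
            yield Page(i, offset, page_text)
            offset += len(page_text)
    elif isinstance(data, dict):
        # A single object is one page; 0-based here for consistency with enumerate,
        # though the thread also discusses 1-based numbering for document pages.
        yield Page(0, 0, json.dumps(data))
```

The key fix relative to the snippet under review is ordering: incrementing `offset` before yielding would report each page as starting where it ends.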

@codebytes (Contributor Author), quoting the above:

> One of the strongest PR descriptions I've seen here. Thank you for being so descriptive.

Thank you GitHub Copilot :)

@mattgotteiner (Collaborator), quoting the above:

> One of the strongest PR descriptions I've seen here. Thank you for being so descriptive.
> Thank you GitHub Copilot :)

I'll have to try it out!

pdf_parser: PdfParser,
text_splitter: TextSplitter,
pdf_text_splitter: PdfTextSplitter,
Collaborator:

Now that there are more parsers and splitters, do you have thoughts about how to avoid having to pass in the classes? If we also had parsers for HTML, CSV, etc., this would mean more parameters to init and more attributes. Ideally we could avoid that while still allowing the flexibility of using different parsers.

codebytes (Contributor Author):

Yeah, good call. Doing a refactor.

@@ -0,0 +1,34 @@
from scripts.prepdocslib.page import Page
Collaborator:

FYI, for the tests, you can run pytest --cov to see coverage stats for your new files


class SimpleTextSplitter(TextSplitter):
    """
    Class that splits pages into smaller chunks. This is required because embedding models may not be able to analyze an entire page at once
Collaborator:

Please adjust the docstrings to be distinct across these two classes. Also, I don't think the original docstring is 100% correct, as we also chunk to reduce the context sent to the LLM; that's the primary purpose in my mind. Perhaps you can mention both.

@pamelafox (Collaborator) commented:

Overall question: does this still work fine for the thought process and citation tab?

@codebytes (Contributor Author), quoting the above:

> Overall question: does this still work fine for the thought process and citation tab?

[screenshot attached]

@@ -0,0 +1,14 @@
{
Collaborator:

I love the sample data but I'm worried about muddying up the sample index with unrelated data. I suppose we already do that for the GPT4V examples, and it hasn't practically caused an issue, so maybe that's fine? @mattgotteiner Thoughts?

Might be helpful to think about whether we're going to end up adding example HTML, docx, pptx, etc. Probably not as it slows ingestion time and overall deployment time. That's another reason to not add it.

If we do keep it, I'd suggest renaming the folder to JSON_Examples since that's more their purpose in the repository.

Collaborator:

I just tested this. Parsing those JSON files takes less than 30 seconds, so maybe we just keep multiple example files. It's the PDFs and the call to Doc Intel that add significantly to the time.

@pamelafox (Collaborator) commented:

I've tested this locally, both with Doc Intel and Local PDF parser. I've also confirmed that @codebytes added test coverage for all new lines of code (thank you!).

@pamelafox (Collaborator):

The new mypy issue is mine, from swapping out the namedtuple with frozen dataclass, at the suggestion of other Pythonistas. I'll check it out.

@pamelafox pamelafox merged commit 270d869 into Azure-Samples:main Feb 2, 2024
10 checks passed
@pamelafox (Collaborator):

@codebytes Merged! Thanks so much for the PR. Please send a follow-up if the offset calculation still needs improvement.


3 participants