fix: Tika converter not yielding page break tags (`\f`) #8082

lambda-science · 2024-07-25T11:20:30Z

Related Issues

fixes TikaDocumentConverter does not split content by page #7949

Proposed Changes:

Fix TikaDocumentConverter output missing the \f page-break character after parsing. The converter now uses Tika's HTML parsing mode and then converts the HTML to text, using the old Haystack 1.x integration as a template.
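To illustrate the idea, here is a minimal, self-contained sketch of turning page-delimited XHTML into text joined by \f. It is not the code in this PR: `PageTextParser` and `xhtml_to_text` are hypothetical names, and it assumes that Tika's XHTML output wraps each page in a `<div class="page">` element with no nested `<div>` inside it.

```python
from html.parser import HTMLParser


class PageTextParser(HTMLParser):
    """Hypothetical helper: collects the text of each <div class="page">."""

    def __init__(self):
        super().__init__()
        self.in_page = False  # True while inside a page div
        self.current = ""     # text accumulated for the current page
        self.pages = []       # finished pages, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "div" and ("class", "page") in attrs:
            self.in_page = True

    def handle_endtag(self, tag):
        # Assumption: page divs contain no nested divs
        if tag == "div" and self.in_page:
            self.in_page = False
            self.pages.append(self.current.strip())
            self.current = ""

    def handle_data(self, data):
        if self.in_page:
            self.current += data


def xhtml_to_text(xhtml: str) -> str:
    """Join the text of all pages with the form-feed character."""
    parser = PageTextParser()
    parser.feed(xhtml)
    parser.close()
    return "\f".join(parser.pages)


sample = (
    "<html><body>"
    '<div class="page"><p>First page</p></div>'
    '<div class="page"><p>Second page</p></div>'
    "</body></html>"
)
text = xhtml_to_text(sample)
```

With this approach, any text outside a page div is dropped, so input that is not page-delimited XHTML produces an empty string.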

How did you test it?

Tested by using it as a custom component in my pipeline and confirming that it worked. I have not written proper tests yet, and I'm not sure how to test this change because it is supposed to be transparent for users. Maybe by parsing a two-page PDF and verifying that there is at least one \f in the content?

Notes for the reviewer

To be tested.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

lambda-science · 2024-07-25T11:26:43Z

Tests are failing, but I'm not sure I understand the logic of this specific test, nor why it fails like this:

=================================== FAILURES ===================================
______________________ TestTikaDocumentConverter.test_run ______________________

self = <converters.test_tika_doc_converter.TestTikaDocumentConverter object at 0x7f8fcf8c7970>
mock_tika_parser = <MagicMock name='from_buffer' id='140255133704784'>

    @patch("haystack.components.converters.tika.tika_parser.from_buffer")
    def test_run(self, mock_tika_parser):
        mock_tika_parser.return_value = {"content": "Content of mock source"}
    
        component = TikaDocumentConverter()
        source = ByteStream(data=b"placeholder data")
        documents = component.run(sources=[source])["documents"]
    
        assert len(documents) == 1
>       assert documents[0].content == "Content of mock source"
E       AssertionError: assert '' == 'Content of mock source'
E         
E         - Content of mock source

test/components/converters/test_tika_doc_converter.py:22: AssertionError

haystack/components/converters/tika.py

vblagoje · 2024-07-25T14:05:42Z

The failure is due to the change to the from_buffer call and the use of TikaXHTMLParser right after it. You can step through with a debugger and resolve the mystery...

lambda-science · 2024-07-26T07:46:57Z

I fixed the test by making the mocked Tika parser return XML instead of plain text, so that the parsing can kick in and return plain text as expected. What do you think?
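A rough, self-contained sketch of why the mock's return value matters. It does not import Haystack or Tika: `convert` and `_PageCollector` are hypothetical stand-ins for the converter and its XHTML parser, and the `xmlContent=True` keyword and the `<div class="page">` markup are assumptions about how the parser is called and what it returns.

```python
from html.parser import HTMLParser
from unittest.mock import MagicMock


class _PageCollector(HTMLParser):
    """Hypothetical stand-in: keeps only text found inside <div class="page">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # > 0 while inside a page div
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "page") in attrs:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data


def convert(from_buffer, data):
    # Stand-in for the converter: request XHTML, then keep only page text.
    xhtml = from_buffer(data, xmlContent=True)["content"]
    collector = _PageCollector()
    collector.feed(xhtml)
    collector.close()
    return collector.text.strip()


# Old mock: plain text, so the XHTML parser finds no page div.
plain_mock = MagicMock(return_value={"content": "Content of mock source"})
# New mock: XHTML, so the parser can extract the page text.
xhtml_mock = MagicMock(
    return_value={"content": '<div class="page"><p>Content of mock source</p></div>'}
)

plain_result = convert(plain_mock, b"placeholder data")
xhtml_result = convert(xhtml_mock, b"placeholder data")
```

In this sketch the plain-text mock yields an empty string, which mirrors the `assert '' == 'Content of mock source'` failure above, while the XHTML mock yields the expected content.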

coveralls · 2024-07-26T07:53:27Z

Pull Request Test Coverage Report for Build 10115535355

Details

0 of 0 changed or added relevant lines in 0 files are covered.
4 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.01%) to 90.059%

Files with Coverage Reduction	New Missed Lines	%
components/converters/tika.py	4	92.73%

Totals
Change from base Build 10114001839:	0.01%
Covered Lines:	6813
Relevant Lines:	7565

💛 - Coveralls

vblagoje · 2024-07-26T09:03:33Z

Yes, that should be it. Let's ask @anakin87 for one last look as well, since he's more familiar with this codebase.

anakin87

I felt free to push some refinements.

I'll merge when the tests pass.

Thanks @lambda-science!

lambda-science added 2 commits July 25, 2024 12:49

Fix TikaConverter not having \f page tag by using HTML mode of parsin…

e8f8d8d

…g and then parsing the HTML to text using the old Haystack 1.X integration as template.

Add Reno

de6fb7e

lambda-science requested review from a team as code owners July 25, 2024 11:20

lambda-science requested review from dfokina and vblagoje and removed request for a team July 25, 2024 11:20

lambda-science mentioned this pull request Jul 25, 2024

TikaDocumentConverter does not split content by page #7949

Closed

github-actions bot added the type:documentation Improvements on the docs label Jul 25, 2024

vblagoje reviewed Jul 25, 2024

haystack/components/converters/tika.py

Fix test by making Mock Tika return XML (before parsing)

fe8b0eb

github-actions bot added the topic:tests label Jul 26, 2024

anakin87 added 2 commits July 26, 2024 18:22

Merge branch 'main' into tikaconv

74b5854

refinements and test

9ba25b5

anakin87 approved these changes Jul 26, 2024

anakin87 merged commit 1c53aae into deepset-ai:main Jul 26, 2024
17 checks passed

lambda-science deleted the fix/TikaConverter branch August 13, 2024 14:24
