Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Tika converter not yielding page break tags (\f) #8082

Merged
merged 5 commits into from
Jul 26, 2024

Conversation

lambda-science
Copy link
Contributor

Related Issues

Proposed Changes:

Fix TikaConverter not having \f page tag in the documents after parsing by using HTML mode of parsing and then parsing the HTML to text using the old Haystack 1.X integration as template.

How did you test it?

Tested by using it as a custom component in my pipeline and seeing that it worked. I was lazy to write proper tests, I'm not sure how to test this modification because it's supposed to be transparent for users ? Or maybe by parsing a two page PDF and verifying that there is at least one \f in the content ?

Notes for the reviewer

To be test

Checklist

…g and then parsing the HTML to text using the old Haystack 1.X integration as template.
@lambda-science lambda-science requested review from a team as code owners July 25, 2024 11:20
@lambda-science lambda-science requested review from dfokina and vblagoje and removed request for a team July 25, 2024 11:20
@github-actions github-actions bot added the type:documentation Improvements on the docs label Jul 25, 2024
@lambda-science
Copy link
Contributor Author

Tests are failling but I'm not sure I understand the test logic of this specific text, nor why it fails like this

=================================== FAILURES ===================================
______________________ TestTikaDocumentConverter.test_run ______________________

self = <converters.test_tika_doc_converter.TestTikaDocumentConverter object at 0x7f8fcf8c7970>
mock_tika_parser = <MagicMock name='from_buffer' id='140255133704784'>

    @patch("haystack.components.converters.tika.tika_parser.from_buffer")
    def test_run(self, mock_tika_parser):
        mock_tika_parser.return_value = {"content": "Content of mock source"}
    
        component = TikaDocumentConverter()
        source = ByteStream(data=b"placeholder data")
        documents = component.run(sources=[source])["documents"]
    
        assert len(documents) == 1
>       assert documents[0].content == "Content of mock source"
E       AssertionError: assert '' == 'Content of mock source'
E         
E         - Content of mock source

test/components/converters/test_tika_doc_converter.py:22: AssertionError

@vblagoje
Copy link
Member

Tests are failling but I'm not sure I understand the test logic of this specific text, nor why it fails like this

=================================== FAILURES ===================================
______________________ TestTikaDocumentConverter.test_run ______________________

self = <converters.test_tika_doc_converter.TestTikaDocumentConverter object at 0x7f8fcf8c7970>
mock_tika_parser = <MagicMock name='from_buffer' id='140255133704784'>

    @patch("haystack.components.converters.tika.tika_parser.from_buffer")
    def test_run(self, mock_tika_parser):
        mock_tika_parser.return_value = {"content": "Content of mock source"}
    
        component = TikaDocumentConverter()
        source = ByteStream(data=b"placeholder data")
        documents = component.run(sources=[source])["documents"]
    
        assert len(documents) == 1
>       assert documents[0].content == "Content of mock source"
E       AssertionError: assert '' == 'Content of mock source'
E         
E         - Content of mock source

test/components/converters/test_tika_doc_converter.py:22: AssertionError

The failure is due to change of the from_buffer and use of TikaXHTMLParser just after. You can step trough with debugger and resolve the mystery...

@lambda-science
Copy link
Contributor Author

Tests are failling but I'm not sure I understand the test logic of this specific text, nor why it fails like this

=================================== FAILURES ===================================
______________________ TestTikaDocumentConverter.test_run ______________________

self = <converters.test_tika_doc_converter.TestTikaDocumentConverter object at 0x7f8fcf8c7970>
mock_tika_parser = <MagicMock name='from_buffer' id='140255133704784'>

    @patch("haystack.components.converters.tika.tika_parser.from_buffer")
    def test_run(self, mock_tika_parser):
        mock_tika_parser.return_value = {"content": "Content of mock source"}
    
        component = TikaDocumentConverter()
        source = ByteStream(data=b"placeholder data")
        documents = component.run(sources=[source])["documents"]
    
        assert len(documents) == 1
>       assert documents[0].content == "Content of mock source"
E       AssertionError: assert '' == 'Content of mock source'
E         
E         - Content of mock source

test/components/converters/test_tika_doc_converter.py:22: AssertionError

The failure is due to change of the from_buffer and use of TikaXHTMLParser just after. You can step trough with debugger and resolve the mystery...

Fix the test by making the mock converter return XML instead of plain text. So the parsing can kick-in and return plain text as expected. What do you think ?

@coveralls
Copy link
Collaborator

coveralls commented Jul 26, 2024

Pull Request Test Coverage Report for Build 10115535355

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 4 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.01%) to 90.059%

Files with Coverage Reduction New Missed Lines %
components/converters/tika.py 4 92.73%
Totals Coverage Status
Change from base Build 10114001839: 0.01%
Covered Lines: 6813
Relevant Lines: 7565

💛 - Coveralls

@vblagoje
Copy link
Member

Yes. that should be it. Lets ask @anakin87 for one last look as well - he's more familiar with this codebase

Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt free to push some refinements.

I'll merge when the tests pass.

Thanks @lambda-science!

@anakin87 anakin87 merged commit 1c53aae into deepset-ai:main Jul 26, 2024
17 checks passed
@lambda-science lambda-science deleted the fix/TikaConverter branch August 13, 2024 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TikaDocumentConverter does not split content by page
4 participants