-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Tika converter not yielding page break tags (\f
)
#8082
Conversation
…g and then parsing the HTML to text using the old Haystack 1.X integration as template.
Tests are failling but I'm not sure I understand the test logic of this specific text, nor why it fails like this =================================== FAILURES ===================================
______________________ TestTikaDocumentConverter.test_run ______________________
self = <converters.test_tika_doc_converter.TestTikaDocumentConverter object at 0x7f8fcf8c7970>
mock_tika_parser = <MagicMock name='from_buffer' id='140255133704784'>
@patch("haystack.components.converters.tika.tika_parser.from_buffer")
def test_run(self, mock_tika_parser):
mock_tika_parser.return_value = {"content": "Content of mock source"}
component = TikaDocumentConverter()
source = ByteStream(data=b"placeholder data")
documents = component.run(sources=[source])["documents"]
assert len(documents) == 1
> assert documents[0].content == "Content of mock source"
E AssertionError: assert '' == 'Content of mock source'
E
E - Content of mock source
test/components/converters/test_tika_doc_converter.py:22: AssertionError |
The failure is due to change of the from_buffer and use of TikaXHTMLParser just after. You can step trough with debugger and resolve the mystery... |
Fix the test by making the mock converter return XML instead of plain text. So the parsing can kick-in and return plain text as expected. What do you think ? |
Pull Request Test Coverage Report for Build 10115535355Details
💛 - Coveralls |
Yes. that should be it. Lets ask @anakin87 for one last look as well - he's more familiar with this codebase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related Issues
Proposed Changes:
Fix TikaConverter not having \f page tag in the documents after parsing by using HTML mode of parsing and then parsing the HTML to text using the old Haystack 1.X integration as template.
How did you test it?
Tested by using it as a custom component in my pipeline and seeing that it worked. I was lazy to write proper tests, I'm not sure how to test this modification because it's supposed to be transparent for users ? Or maybe by parsing a two page PDF and verifying that there is at least one \f in the content ?
Notes for the reviewer
To be test
Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.