-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: DocumentCleaner: keep the \f in text #8078
fix: DocumentCleaner: keep the \f in text #8078
Conversation
Hi @lambda-science Thank you for opening this PR! Could you please add a release note and also a test case for the fixed behavior? For the release note, here are the instructions: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes |
Pull Request Test Coverage Report for Build 10282053324Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
@julian-risch Added Reno and modified the tests. Ready for review I guess |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vblagoje could you please also have a look at this PR? Code changes look good to me. I have one suggestion for simplicfication.
Other than that, I agree it was a bug that we removed the form feed character "\f" so far. We're not changing expected behavior. For example, we described here in the docs that "Pages in the text need to be separated by form feed character "\f"" https://docs.haystack.deepset.ai/docs/documentcleaner#overview
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lambda-science let's go with second version of _remove_empty_lines
Julian suggested. Would you please update this PR?
PR Updated as suggested :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be gtg now @julian-risch , please give it a quick final look as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 👍
Related Issues
Proposed Changes:
Split on "\f" before performing cleaning and join in the end.
How did you test it?
End to end testing using it as a custom component in my pipeline. "\f" are not removed anymore and page number detection from Splitter is correct.
Notes for the reviewer
Could be optimized probably.
Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.