fix: DocumentCleaner: keep the \f in text #8078

Merged: 4 commits into deepset-ai:main on Aug 7, 2024

Conversation

lambda-science
Contributor

Related Issues

Proposed Changes:

Split the text on "\f" before cleaning, clean each page separately, and rejoin the pages with "\f" at the end.
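A minimal sketch of the split, clean, rejoin idea described above. This is not the actual DocumentCleaner code: `clean_page` and `clean_keeping_form_feeds` are hypothetical names, and the cleaning step (collapsing repeated spaces and dropping empty lines) is only an assumed stand-in for the real cleaning options.

```python
import re


def clean_page(page: str) -> str:
    # Hypothetical cleaning step: collapse runs of spaces/tabs, trim each line,
    # and drop lines that end up empty.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in page.split("\n")]
    return "\n".join(line for line in lines if line)


def clean_keeping_form_feeds(text: str) -> str:
    # Split on the form feed first so the cleaning step never sees it,
    # then put the separators back when rejoining the pages.
    pages = text.split("\f")
    return "\f".join(clean_page(page) for page in pages)


text = "Page  one\n\n\nstill page one\fPage   two\n\n"
cleaned = clean_keeping_form_feeds(text)
print(repr(cleaned))
```

Because the separators are only removed temporarily, the number of "\f" characters in the output always equals the number in the input, whatever the per-page cleaning does.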

How did you test it?

End-to-end testing, using the patched cleaner as a custom component in my pipeline. "\f" characters are no longer removed, and page number detection in the Splitter is correct.

Notes for the reviewer

Could probably be optimized.

Checklist

@julian-risch
Member

Hi @lambda-science Thank you for opening this PR! Could you please add a release note and also a test case for the fixed behavior? For the release note, here are the instructions: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes

@coveralls
Collaborator

coveralls commented Jul 25, 2024

Pull Request Test Coverage Report for Build 10282053324

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 120 unchanged lines in 28 files lost coverage.
  • Overall coverage increased (+0.006%) to 90.149%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| components/builders/answer_builder.py | 1 | 98.28% |
| components/builders/chat_prompt_builder.py | 1 | 98.46% |
| components/converters/output_adapter.py | 1 | 98.28% |
| components/preprocessors/document_splitter.py | 1 | 98.96% |
| components/converters/pypdf.py | 2 | 90.0% |
| components/converters/txt.py | 2 | 90.0% |
| components/embedders/sentence_transformers_document_embedder.py | 2 | 96.49% |
| components/embedders/sentence_transformers_text_embedder.py | 2 | 95.92% |
| components/generators/chat/hugging_face_api.py | 2 | 97.75% |
| components/preprocessors/document_cleaner.py | 2 | 98.08% |

| Totals | |
|---|---|
| Change from base Build 10082446389 | 0.006% |
| Covered Lines | 6918 |
| Relevant Lines | 7674 |

💛 - Coveralls

@lambda-science lambda-science requested a review from a team as a code owner July 25, 2024 11:17
@lambda-science lambda-science requested review from dfokina and removed request for a team July 25, 2024 11:17
@lambda-science
Contributor Author

lambda-science commented Jul 25, 2024

> Hi @lambda-science Thank you for opening this PR! Could you please add a release note and also a test case for the fixed behavior? For the release note, here are the instructions: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes

@julian-risch Added Reno and modified the tests. Ready for review I guess

Member

@julian-risch julian-risch left a comment

@vblagoje could you please also have a look at this PR? Code changes look good to me. I have one suggestion for simplification.
Other than that, I agree it was a bug that we have been removing the form feed character "\f" until now. We're not changing expected behavior. For example, the docs state that "Pages in the text need to be separated by form feed character "\f"": https://docs.haystack.deepset.ai/docs/documentcleaner#overview
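To illustrate why losing the separator matters, here is a rough sketch of page number detection by counting form feeds. It assumes, as the linked issue title suggests, that the page number is derived from the "\f" characters preceding a position in the text; `page_number_at` is a hypothetical helper, not the Splitter's actual implementation.

```python
def page_number_at(text: str, offset: int) -> int:
    # 1-based page number of the character at `offset`, assuming pages are
    # separated by the form feed character.
    return text.count("\f", 0, offset) + 1


text = "first page\fsecond page\fthird page"
print(page_number_at(text, text.index("third")))

# If a cleaning step has already stripped the form feeds, every position
# appears to be on page 1.
stripped = text.replace("\f", " ")
print(page_number_at(stripped, stripped.index("third")))
```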

Review comment on haystack/components/preprocessors/document_cleaner.py (outdated, resolved)
Member

@vblagoje vblagoje left a comment

@lambda-science let's go with second version of _remove_empty_lines Julian suggested. Would you please update this PR?

@github-actions github-actions bot added the type:documentation Improvements on the docs label Aug 7, 2024
@lambda-science
Contributor Author

> @lambda-science let's go with second version of _remove_empty_lines Julian suggested. Would you please update this PR?

PR Updated as suggested :)

@vblagoje vblagoje self-requested a review August 7, 2024 12:33
Member

@vblagoje vblagoje left a comment

Should be gtg now @julian-risch , please give it a quick final look as well

Member

@julian-risch julian-risch left a comment

LGTM! 👍

@julian-risch julian-risch changed the title Fix DocumentCleaner: keep the \f in text fix: DocumentCleaner: keep the \f in text Aug 7, 2024
@julian-risch julian-risch merged commit 5851701 into deepset-ai:main Aug 7, 2024
17 checks passed
@lambda-science lambda-science deleted the fix/Document_Cleaner branch August 7, 2024 12:56
Labels
topic:tests type:documentation Improvements on the docs
Development

Successfully merging this pull request may close these issues.

Bug: The DocumentCleaner is deleting the \f used to calculate the page_number of documents
4 participants