fix: DocumentCleaner: keep the \f in text #8078

lambda-science · 2024-07-25T09:54:20Z

Related Issues

fixes Bug: The DocumentCleaner is deleting the \f used to calculate the page_number of documents #8053

Proposed Changes:

Split on "\f" before performing cleaning and join in the end.
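A minimal sketch of the split-then-join idea described above, not the code from this PR. The helper `clean_page` is a hypothetical stand-in for the cleaner's real steps. It collapses whitespace with `\s+`, which also matches "\f", so applying it to the whole text at once would drop the page separators:

```python
import re


def clean_page(page: str) -> str:
    """Hypothetical per-page cleaner: collapse whitespace runs and strip both ends."""
    # Note: \s matches "\f" too, so this must only ever see a single page.
    return re.sub(r"\s+", " ", page).strip()


def clean_preserving_form_feeds(text: str) -> str:
    """Split on form feeds, clean each page separately, then join the pages again."""
    pages = text.split("\f")
    return "\f".join(clean_page(page) for page in pages)


raw = "Page  one \n\f  Page two\f\fPage   four"
cleaned = clean_preserving_form_feeds(raw)
# The number of "\f" separators is unchanged, including the empty third page.
assert cleaned.count("\f") == raw.count("\f")
```

Because each page is cleaned in isolation, no cleaning step can touch a separator, at the cost of running every step once per page.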

How did you test it?

End-to-end testing, using it as a custom component in my pipeline. "\f" characters are no longer removed, and page number detection from the Splitter is correct.

Notes for the reviewer

Could probably be optimized.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

julian-risch · 2024-07-25T09:57:29Z

Hi @lambda-science Thank you for opening this PR! Could you please add a release note and also a test case for the fixed behavior? For the release note, here are the instructions: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes

coveralls · 2024-07-25T10:01:49Z

Pull Request Test Coverage Report for Build 10282053324

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
120 unchanged lines in 28 files lost coverage.
Overall coverage increased (+0.006%) to 90.149%

Files with Coverage Reduction	New Missed Lines	%
components/builders/answer_builder.py	1	98.28%
components/builders/chat_prompt_builder.py	1	98.46%
components/converters/output_adapter.py	1	98.28%
components/preprocessors/document_splitter.py	1	98.96%
components/converters/pypdf.py	2	90.0%
components/converters/txt.py	2	90.0%
components/embedders/sentence_transformers_document_embedder.py	2	96.49%
components/embedders/sentence_transformers_text_embedder.py	2	95.92%
components/generators/chat/hugging_face_api.py	2	97.75%
components/preprocessors/document_cleaner.py	2	98.08%

Totals
Change from base Build 10082446389:	0.006%
Covered Lines:	6918
Relevant Lines:	7674

💛 - Coveralls

lambda-science · 2024-07-25T11:18:00Z

@julian-risch Added a Reno release note and modified the tests. Ready for review, I guess.

julian-risch

@vblagoje could you please also have a look at this PR? Code changes look good to me. I have one suggestion for simplification.
Other than that, I agree it was a bug that we have been removing the form feed character "\f" so far. We're not changing expected behavior. For example, we describe in the docs that "Pages in the text need to be separated by form feed character "\f"": https://docs.haystack.deepset.ai/docs/documentcleaner#overview
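To illustrate why the separator matters, here is a minimal sketch of page counting based on "\f". It is an assumption for illustration only, not the DocumentSplitter's actual implementation, and `page_number_at` is a hypothetical helper name:

```python
import re


def page_number_at(text: str, char_offset: int) -> int:
    """Return the 1-based page containing the character at char_offset."""
    # Every "\f" before the offset marks the end of one earlier page.
    return text.count("\f", 0, char_offset) + 1


text = "first\fsecond\fthird"
assert page_number_at(text, text.index("third")) == 3

# A cleaner that collapses all whitespace also removes "\f",
# so every position afterwards appears to be on page 1.
collapsed = re.sub(r"\s+", " ", text)
assert page_number_at(collapsed, collapsed.index("third")) == 1
```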

haystack/components/preprocessors/document_cleaner.py

vblagoje

@lambda-science let's go with the second version of _remove_empty_lines that Julian suggested. Would you please update this PR?

lambda-science · 2024-08-07T10:38:53Z

PR Updated as suggested :)

vblagoje

Should be gtg now, @julian-risch. Please give it a quick final look as well.

julian-risch

LGTM! 👍

Keep the \f in Document Cleaner

9f008fb

lambda-science requested a review from a team as a code owner July 25, 2024 09:54

lambda-science requested review from julian-risch and removed request for a team July 25, 2024 09:54

lambda-science mentioned this pull request Jul 25, 2024

Bug: The DocumentCleaner is deleting the \f used to calculate the page_number of documents #8053

Closed

1 task

lambda-science added 2 commits July 25, 2024 12:53

Add Reno

4a5d534

Add Test

643afcd

lambda-science requested a review from a team as a code owner July 25, 2024 11:17

lambda-science requested review from dfokina and removed request for a team July 25, 2024 11:17

github-actions bot added the topic:tests label Jul 25, 2024

julian-risch reviewed Aug 7, 2024

haystack/components/preprocessors/document_cleaner.py (outdated, resolved)

julian-risch requested a review from vblagoje August 7, 2024 07:41

vblagoje requested changes Aug 7, 2024

Simplified _remove_empty_lines() code

e25e88d

github-actions bot added the type:documentation Improvements on the docs label Aug 7, 2024

vblagoje self-requested a review August 7, 2024 12:33

vblagoje approved these changes Aug 7, 2024

julian-risch approved these changes Aug 7, 2024

julian-risch changed the title ~~Fix DocumentCleaner: keep the \f in text~~ fix: DocumentCleaner: keep the \f in text Aug 7, 2024

julian-risch merged commit 5851701 into deepset-ai:main Aug 7, 2024
17 checks passed

lambda-science deleted the fix/Document_Cleaner branch August 7, 2024 12:56
