feat: Port NLTKDocumentSplitter from dC to Haystack #8350
Conversation
Pull Request Test Coverage Report for Build 10792807724
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
These are just my notes to help the reviewer orient more easily
:param keep_white_spaces: If True, the tokenizer will keep white spaces between sentences.
:returns: nltk sentence tokenizer.
"""
try:
As it is now should be fine; punkt_tab is the recommended folder to load and already works for us.
OK, keeping it as is.
return True

# next sentence starts with a bracket or we return False
return re.search(r"^\s*[\(\[]", text[next_start:next_end]) is not None
All of these linters and checkers didn't allow me to keep this LOC as it was in the original code. Please double-check, @sjrl.
What was the original line of code?
It was:
if re.search(r"^\s*[\(\[]", text[next_start:next_end]) is not None:
    return True
return False
But then the linter didn't allow me to have such code :-)
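A small standalone sketch of the bracket check discussed above (the helper name and example spans are hypothetical; only the regex line comes from the diff). It shows that the simplified single-expression form the linter asked for behaves the same as the original if/return version:

```python
import re


def next_starts_with_bracket(text: str, next_start: int, next_end: int) -> bool:
    # True if the next sentence span opens with '(' or '[' after optional whitespace.
    # Linter-friendly single expression, equivalent to:
    #   if re.search(...) is not None: return True
    #   return False
    return re.search(r"^\s*[\(\[]", text[next_start:next_end]) is not None


text = "First sentence. (A parenthetical follow-up.)"
# The span starting at index 16 covers the second sentence, which opens with a bracket.
print(next_starts_with_bracket(text, 16, len(text)))  # True
print(next_starts_with_bracket(text, 0, 15))          # False
```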
@@ -103,6 +103,8 @@ extra-dependencies = [
  "python-pptx",  # PPTXToDocument
  "python-docx",  # DocxToDocument
  "nltk",  # NLTKDocumentSplitter
nltk is treated as an optional dependency, but we need to add it here for the tests.
@davidsbatista you won the lottery here, but let's allow @sjrl a first pass to make sure all the pieces were migrated properly 🙏
Hey @vblagoje, broad question: would it be better to fold this functionality into the existing DocumentSplitter instead of creating a new component?
Force-pushed e46c3a8 to a113d56
Force-pushed to properly credit @sjrl for all the work.
I'm afraid of unintended side effects for the existing users of DocumentSplitter, @sjrl. Perhaps we can keep it as is for now and carefully merge it for the next release, I'd say. Wdyt? Wdyt @julian-risch?
@davidsbatista I converted a few more methods to static; they seem to be really tied to
@sjrl please have another look. I spoke to @julian-risch, and he also agreed that we integrate NLTKDocumentSplitter now and later investigate options to perhaps merge NLTKDocumentSplitter and DocumentSplitter.
Running the test coverage locally, it seems there are a few edge cases in
Do you think it's worth extending the tests for these edge cases?
Sure @davidsbatista, let's increase coverage and see about compiling those expressions 🙏
Ah, pre-integration checks say we need to add a new documentation page for this component. Not yet ready for integration, @davidsbatista @sjrl.
What prevents us from integrating this PR, @davidsbatista and @sjrl?
To be complete, maybe just the docs, but I wouldn't hold up the merging because of that.
LGTM! Thanks @sjrl for this 👍🏽
@vblagoje I'm doing one last quick look over now!
:param language: The language to read the abbreviations for.
:returns: List of abbreviations.
"""
abbreviations_file = Path(__file__).parent.parent / f"data/abbreviations/{language}.txt"
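A minimal standalone sketch of the abbreviations loader shown in the diff. Assumptions: one abbreviation per line in `abbreviations/<language>.txt`, and the data directory is passed in explicitly (the real code derives it from `__file__`); the function name and the missing-file fallback are hypothetical:

```python
import tempfile
from pathlib import Path


def read_abbreviations(language: str, data_dir: Path) -> list:
    # Load sentence-splitting abbreviations, one per line, for the given language.
    abbreviations_file = data_dir / "abbreviations" / f"{language}.txt"
    if not abbreviations_file.exists():
        # The review above notes the language files were initially missing;
        # fall back to no extra abbreviations rather than raising.
        return []
    return abbreviations_file.read_text(encoding="utf-8").splitlines()


# Demo with a temporary directory standing in for the package's data/ folder.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "abbreviations").mkdir()
    (data_dir / "abbreviations" / "en.txt").write_text("e.g.\ni.e.\nDr.\n", encoding="utf-8")
    en = read_abbreviations("en", data_dir)
    de = read_abbreviations("de", data_dir)

print(en)  # ['e.g.', 'i.e.', 'Dr.']
print(de)  # []
```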
Hey @vblagoje I noticed that we didn't add these files, could we do that?
Ah, no problem, will do. Thanks @davidsbatista and @sjrl 🙏
Why:
Introduces a new document splitter component utilizing NLTK for enhanced text processing.

What:
split_by, split_length, split_overlap, respect_sentence_boundary, language, use_split_rules, and extend_abbreviations for fine-tuning the document splitting process.

How can it be used:

How did you test it:
Various split_by configurations, handling different languages, and respecting sentence boundaries.

Notes for the reviewer:
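To illustrate how split_length and split_overlap interact when splitting by sentence, here is a stdlib-only sketch of the windowing idea (this is not the component's actual code; the function name and the step logic are illustrative assumptions):

```python
def window_sentences(sentences: list, split_length: int, split_overlap: int) -> list:
    # Group sentences into chunks of `split_length`, re-using the last
    # `split_overlap` sentences of each chunk at the start of the next one.
    step = split_length - split_overlap
    if step <= 0:
        raise ValueError("split_overlap must be smaller than split_length")
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + split_length]))
        if start + split_length >= len(sentences):
            break  # last window already covers the tail
    return chunks


sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(window_sentences(sentences, split_length=2, split_overlap=1))
# ['S1. S2.', 'S2. S3.', 'S3. S4.', 'S4. S5.']
```

With split_overlap=0 the windows are disjoint; a nonzero overlap trades extra chunks for shared context at chunk boundaries.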