feat: Port NLTKDocumentSplitter from dC to Haystack #8350
Conversation
Pull Request Test Coverage Report for Build 10792807724
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
These are just my notes to help the reviewer orient more easily
:param keep_white_spaces: If True, the tokenizer will keep white spaces between sentences.
:returns: nltk sentence tokenizer.
"""
try:
As it is now should be fine; punkt_tab is the recommended folder to load and already works for us.
OK, keeping it as is.
return True

# next sentence starts with a bracket or we return False
return re.search(r"^\s*[\(\[]", text[next_start:next_end]) is not None
All of these linters and checkers didn't allow me to keep this LOC as it was in the original code. Please double-check, @sjrl.
What was the original line of code?
It was:
if re.search(r"^\s*[\(\[]", text[next_start:next_end]) is not None:
    return True
return False
But then the linter didn't allow me to have such code :-)
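A small standalone sketch of the bracket check discussed above (the helper name and example spans are hypothetical; only the regex line comes from the diff). It shows that the simplified single-expression form the linter asked for behaves the same as the original if/return version:

```python
import re


def next_starts_with_bracket(text: str, next_start: int, next_end: int) -> bool:
    # True if the next sentence span opens with '(' or '[' after optional whitespace.
    # Linter-friendly single expression, equivalent to:
    #   if re.search(...) is not None: return True
    #   return False
    return re.search(r"^\s*[\(\[]", text[next_start:next_end]) is not None


text = "First sentence. (A parenthetical follow-up.)"
# The span starting at index 16 covers the second sentence, which opens with a bracket.
print(next_starts_with_bracket(text, 16, len(text)))  # True
print(next_starts_with_bracket(text, 0, 15))          # False
```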
@@ -103,6 +103,8 @@ extra-dependencies = [
  "python-pptx",  # PPTXToDocument
  "python-docx",  # DocxToDocument
  "nltk",  # NLTKDocumentSplitter
nltk is treated as an optional dependency, but we need to add it here for the tests.
@davidsbatista you won the lottery here, but let's allow @sjrl a first pass to make sure all the pieces were migrated properly 🙏
Hey @vblagoje, broad question: would it be better to fold this functionality into the existing DocumentSplitter instead of creating a new component?
Force-pushed e46c3a8 to a113d56
Force-pushed to properly credit @sjrl for all the work.
I'm afraid of unintended side effects for the existing users of DocumentSplitter, @sjrl. Perhaps we can keep it as is for now and carefully merge it for the next release, I'd say. Wdyt? Wdyt @julian-risch?
@davidsbatista I converted a few more methods to static; they seem to be really tied to
@sjrl please have another look. I spoke to @julian-risch, and he also agreed that we integrate NLTKDocumentSplitter now and later investigate options to perhaps merge NLTKDocumentSplitter and DocumentSplitter.
Running the test coverage locally, it seems there are a few edge cases in
Do you think it's worth extending the tests for these edge cases?
Sure @davidsbatista, let's increase coverage and see about compiling those expressions 🙏
Ah, pre-integration checks say we need to add a new documentation page for this component. Not yet ready for integration, @davidsbatista @sjrl.
What prevents us from integrating this PR, @davidsbatista and @sjrl?
To be complete, maybe just the docs, but I wouldn't hold up the merging because of that.
LGTM! Thanks @sjrl for this 👍🏽
@vblagoje I'm doing one last quick look over now!
:param language: The language to read the abbreviations for.
:returns: List of abbreviations.
"""
abbreviations_file = Path(__file__).parent.parent / f"data/abbreviations/{language}.txt"
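A minimal standalone sketch of the abbreviations loader shown in the diff. Assumptions: one abbreviation per line in `abbreviations/<language>.txt`, and the data directory is passed in explicitly (the real code derives it from `__file__`); the function name and the missing-file fallback are hypothetical:

```python
import tempfile
from pathlib import Path


def read_abbreviations(language: str, data_dir: Path) -> list:
    # Load sentence-splitting abbreviations, one per line, for the given language.
    abbreviations_file = data_dir / "abbreviations" / f"{language}.txt"
    if not abbreviations_file.exists():
        # The review above notes the language files were initially missing;
        # fall back to no extra abbreviations rather than raising.
        return []
    return abbreviations_file.read_text(encoding="utf-8").splitlines()


# Demo with a temporary directory standing in for the package's data/ folder.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "abbreviations").mkdir()
    (data_dir / "abbreviations" / "en.txt").write_text("e.g.\ni.e.\nDr.\n", encoding="utf-8")
    en = read_abbreviations("en", data_dir)
    de = read_abbreviations("de", data_dir)

print(en)  # ['e.g.', 'i.e.', 'Dr.']
print(de)  # []
```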
Hey @vblagoje I noticed that we didn't add these files, could we do that?
Ah, no problem, will do. Thanks @davidsbatista and @sjrl 🙏
Why:
Introduces a new document splitter component utilizing NLTK for enhanced text processing.

What:
split_by, split_length, split_overlap, respect_sentence_boundary, language, use_split_rules, and extend_abbreviations for fine-tuning the document splitting process.

How can it be used:

How did you test it:
Various split_by configurations, handling different languages, and respecting sentence boundaries.

Notes for the reviewer:
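To illustrate how split_length and split_overlap interact when splitting by sentence, here is a stdlib-only sketch of the windowing idea (this is not the component's actual code; the function name and the step logic are illustrative assumptions):

```python
def window_sentences(sentences: list, split_length: int, split_overlap: int) -> list:
    # Group sentences into chunks of `split_length`, re-using the last
    # `split_overlap` sentences of each chunk at the start of the next one.
    step = split_length - split_overlap
    if step <= 0:
        raise ValueError("split_overlap must be smaller than split_length")
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + split_length]))
        if start + split_length >= len(sentences):
            break  # last window already covers the tail
    return chunks


sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(window_sentences(sentences, split_length=2, split_overlap=1))
# ['S1. S2.', 'S2. S3.', 'S3. S4.', 'S4. S5.']
```

With split_overlap=0 the windows are disjoint; a nonzero overlap trades extra chunks for shared context at chunk boundaries.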