Integrate new version of pySBD #114
Comments
A fix is pending in commit 4fae978.
@andrewhead Seems like you were already tracking the pySBD repo. Thanks!
@nipunsadvilkar Sure am! Looks like all is working as expected for now :-) There's one test case in our app that we've had to find a workaround for, which you might be interested in knowing about: when there are multiple consecutive periods, the segmenter tends to stop segmenting everything after the second. Here's a minimal example:

import pysbd
seg = pysbd.Segmenter()
seg.segment("Sentence. .. Next sentence. Next next sentence.")
# Output (lumps the last two sentences together):
# ['Sentence. ', '.. Next sentence. Next next sentence.']

I currently handle this by de-duplicating periods in the strings before they're passed in to the segmenter. Though I do wonder if this is indicative of some special-case behavior of the segmenter that I should write a more robust workaround for. Anyways, thanks for the tool! The segmenter is great, and I'm so appreciative that you fixed the issues with the char_span option in the most recent release. Keep up the good work :-D
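For readers wondering what the de-duplication step mentioned above might look like, here is a minimal sketch. It is not the workaround actually used in the app: the function name `collapse_periods` and the choice to also merge periods separated only by whitespace are assumptions made for illustration.

```python
import re

# Hypothetical helper (not from the project): collapses a run of periods,
# optionally separated by whitespace, into a single period.
_PERIOD_RUN = re.compile(r"\.(?:\s*\.)+")


def collapse_periods(text: str) -> str:
    """Return `text` with each run of repeated periods reduced to one."""
    return _PERIOD_RUN.sub(".", text)


cleaned = collapse_periods("Sentence. .. Next sentence. Next next sentence.")
print(cleaned)
```

The cleaned string can then be passed to `seg.segment(...)` in place of the raw text. Two caveats: this also flattens legitimate ellipses ("Wait... what?" becomes "Wait. what?"), and because it changes the string length, any character offsets from the `char_span` option would refer to the cleaned text rather than the original.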
Thank you for the kind words 😃 Ahh yes, this consecutive-periods behavior seems to be some weird bug in pySBD. Thanks for pointing it out! Will look into it. Cheers!
In issue #65, workarounds were added to make sentence splitting more accurate in light of some known issues in the pysbd sentence splitter.
In nipunsadvilkar/pySBD#63, it seems that these issues were fixed. I suspect that the fixes in pysbd are more robust than our workarounds.
To fix this issue, update the pysbd version in the requirements.txt file, and remove the workarounds from our code.