Integrate new version of pySBD #114

Closed
2 tasks done
andrewhead opened this issue Jun 15, 2020 · 4 comments
Labels
pipeline (Data processing pipeline), sentences (An issue or task related to sentences)

Comments

@andrewhead
Contributor

andrewhead commented Jun 15, 2020

In issue #65, workarounds were added to make sentence splitting more accurate in light of some known issues in the pysbd sentence splitter.

In nipunsadvilkar/pySBD#63, it seems that these issues were fixed. I suspect that the fixes in pysbd are more robust than our workarounds.

To fix this issue:

  • Test that pysbd works correctly on the test cases we developed when creating the workarounds
  • If so, update the version of pysbd in the requirements.txt file, and remove the workarounds from our code.
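The first task could be sketched as a small offset check. This is only an illustration: `Span` and `misaligned_spans` are hypothetical names, not part of this project or of pysbd, and the sample text is made up rather than taken from the test cases developed for issue #65. It assumes the segmenter results expose `sent`, `start`, and `end`, as pysbd's `TextSpan` objects do when `char_span=True`.

```python
from typing import Iterable, List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a segmenter result with character offsets."""
    sent: str
    start: int
    end: int


def misaligned_spans(text: str, spans: Iterable[Span]) -> List[Span]:
    """Return the spans whose offsets do not reproduce their sentence text."""
    return [s for s in spans if text[s.start:s.end] != s.sent]


# Made-up sample input; real checks would use the project's own test cases.
text = "First sentence. Second sentence."
spans = [Span("First sentence. ", 0, 16), Span("Second sentence.", 16, 32)]
print(misaligned_spans(text, spans))
```

With pysbd installed, the spans would presumably come from `pysbd.Segmenter(language="en", clean=False, char_span=True).segment(text)` instead of being written by hand.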
@andrewhead added the pipeline and sentences labels Jun 15, 2020
@andrewhead
Contributor Author

A fix is pending in commit 4fae978

@nipunsadvilkar

@andrewhead Seems like you were already tracking the pySBD repo. Thanks!
I was going to comment today that a new release with the fixes has been pushed. Try it out and let me know if you come across any edge cases.

@andrewhead
Contributor Author

andrewhead commented Aug 13, 2020

@nipunsadvilkar Sure am! Looks like all is working as expected for now :-)

There's one test case in our app that we've had to find a workaround for, which you might be interested in knowing about: when there are multiple consecutive periods, the segmenter tends to stop segmenting everything after the second period. Here's a minimal example:

import pysbd
seg = pysbd.Segmenter()
seg.segment("Sentence. .. Next sentence. Next next sentence.")
# Output (lumps the last two sentences together):
# ['Sentence. ', '.. Next sentence. Next next sentence.']

I currently handle this by de-duplicating periods in the strings before they're passed in to the segmenter. Though I do wonder if this is indicative of some special-case behavior of the segmenter that I should write a more robust workaround for.
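For illustration, the de-duplication step might look roughly like the sketch below. This is not the code used in the app: `collapse_periods` is a hypothetical helper, and treating periods separated only by whitespace as one run is an assumption made so that the example string above is covered.

```python
import re

# Hypothetical helper, not taken from the project's code.
# Matches a period followed by one or more further periods,
# each optionally preceded by whitespace (e.g. ". .." or "...").
_PERIOD_RUN = re.compile(r"\.(?:\s*\.)+")


def collapse_periods(text: str) -> str:
    """Replace each run of consecutive periods with a single period."""
    return _PERIOD_RUN.sub(".", text)


cleaned = collapse_periods("Sentence. .. Next sentence. Next next sentence.")
print(cleaned)
# The cleaned string is what would then be passed to the segmenter.
```

Two caveats with this approach: it also collapses ellipses ("..." becomes "."), and it shortens the string, so any `char_span` offsets returned by the segmenter refer to the cleaned text rather than the original.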

Anyways, thanks for the tool! The segmenter is great, and I'm so appreciative that you fixed the issues with the char_span option in the most recent release. Keep up the good work :-D

@nipunsadvilkar

Thank you for the kind words 😃

Ahh yes, this consecutive-periods behavior looks like a weird bug in pySBD. Thanks for pointing it out! I will look into it.

Cheers!
