Integrate new version of pySBD #114

andrewhead · 2020-06-15T15:58:08Z

In issue #65, workarounds were added to make sentence splitting more accurate in light of some known issues in the pysbd sentence splitter.

In nipunsadvilkar/pySBD#63, it seems that these issues were fixed. I suspect that the fixes in pysbd are more robust than our workarounds.

To fix this issue:

1. Test that pysbd works correctly on the test cases we developed when creating the workarounds.
2. If so, update the version of pysbd in the requirements.txt file, and remove the workarounds from our code.
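
The first step above could be sketched as a small regression check. This is only an illustration: `check_segmenter` and the `cases` list are hypothetical names, not code from this repository, and the real test cases from #65 would take the place of the placeholder shown in the comment. The helper accepts any callable that maps a string to a list of sentences, so it can be pointed at `pysbd.Segmenter(language="en", clean=False).segment` once the new version is installed.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical helper (not from the repository): runs a segmenter over
# (text, expected_sentences) pairs and reports every mismatch.
def check_segmenter(
    segment: Callable[[str], List[str]],
    cases: Sequence[Tuple[str, List[str]]],
) -> List[Tuple[str, List[str]]]:
    failures = []
    for text, expected in cases:
        # Strip whitespace so that trailing spaces kept by the segmenter
        # do not count as differences.
        actual = [sentence.strip() for sentence in segment(text)]
        if actual != expected:
            failures.append((text, actual))
    return failures

# Usage sketch, assuming pysbd is installed:
#   import pysbd
#   seg = pysbd.Segmenter(language="en", clean=False)
#   cases = [("First one. Second one.", ["First one.", "Second one."])]  # placeholder
#   assert check_segmenter(seg.segment, cases) == []
```

An empty result would suggest the workarounds are no longer needed for those inputs; any entries in the list show the text and the segmentation that differed.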


andrewhead · 2020-08-11T19:56:18Z

A fix is pending in commit 4fae978

nipunsadvilkar · 2020-08-12T07:18:23Z

@andrewhead Seems like you were already tracking the pySBD repo. Thanks!
I was gonna comment today that a new release with fixes has been pushed. Try it out and let me know if you come across any edge cases.

andrewhead · 2020-08-13T21:19:54Z

@nipunsadvilkar Sure am! Looks like all is working as expected for now :-)

There's one test case in our app that we've had to find a workaround for, which you might be interested in knowing about: when there are multiple consecutive periods, the segmenter tends to stop segmenting everything after the second. Here's a minimal example:

import pysbd
seg = pysbd.Segmenter()
seg.segment("Sentence. .. Next sentence. Next next sentence.")
# Output (lumps the last two sentences together):
# ['Sentence. ', '.. Next sentence. Next next sentence.']

I currently handle this by de-duplicating periods in the strings before they're passed in to the segmenter. Though I do wonder if this is indicative of some special-case behavior of the segmenter that I should write a more robust workaround for.
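
For reference, a minimal sketch of what such a de-duplication step might look like. The function name `collapse_periods` is hypothetical, and treating periods separated only by whitespace as one run is an assumption made for this example rather than a description of the exact code in the app.

```python
import re

# A run of periods, optionally separated by whitespace: ". ..", "...", ". ."
_PERIOD_RUN = re.compile(r"\.(?:\s*\.)+")

def collapse_periods(text: str) -> str:
    """Replace each run of consecutive periods with a single period."""
    return _PERIOD_RUN.sub(".", text)

cleaned = collapse_periods("Sentence. .. Next sentence. Next next sentence.")
print(cleaned)  # Sentence. Next sentence. Next next sentence.
# The cleaned string would then be passed to seg.segment(...).
```

One caveat with this approach: it also collapses genuine ellipses, and it shifts character offsets, so any spans returned by the segmenter refer to the cleaned string rather than the original.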

Anyways, thanks for the tool! The segmenter is great, and I'm so appreciative that you fixed the issues with the char_span option in the most recent release. Keep up the good work :-D

nipunsadvilkar · 2020-08-14T09:58:22Z

Thank you for the kind words 😃

Ahh yes, this consecutive-periods behavior seems to be some weird bug in pySBD. Thanks for pointing it out! Will look into it.

Cheers!


andrewhead added the pipeline and sentences labels Jun 15, 2020

andrewhead closed this as completed Aug 13, 2020

nipunsadvilkar added a commit to nipunsadvilkar/pySBD that referenced this issue Sep 11, 2020

🐛 Fix consecutive period bug

f0e71a5

Reference allenai/scholarphi#114

nipunsadvilkar mentioned this issue Sep 11, 2020

✨ Better handling consecutive periods and reserved special symbols nipunsadvilkar/pySBD#78

Merged
