Different segmentation with Spacy and when using pySBD directly #55
Hey @nmstoker, I am having similar issues and I think I discovered what is going on, though I haven't had a chance to find the root cause in pysbd. So if you run your example directly:

```python
import pysbd

fake_note = "She turned to him, \"This is great.\" She held the book out to show him."
seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
print(seg.segment(fake_note))
```

it returns:

```
[TextSpan(sent='She turned to him, "This is great."', start=0, end=35), TextSpan(sent='She held the book out to show him.', start=35, end=69)]
```

If we replicate the spaCy pipeline code:

```python
import spacy

def test(doc):
    sents_char_spans = seg.segment(doc.text)
    print(sents_char_spans)
    char_spans = [doc.char_span(sent_span.start, sent_span.end)
                  for sent_span in sents_char_spans]
    print(char_spans)
    start_token_ids = [span[0].idx for span in char_spans
                       if span is not None]
    for token in doc:
        token.is_sent_start = (True if token.idx
                               in start_token_ids else False)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe(test, first=True)
doc = nlp(fake_note)
print([s.text for s in doc.sents])
```

we can see that:

```
# sents_char_spans
[TextSpan(sent='She turned to him, "This is great."', start=0, end=35), TextSpan(sent='She held the book out to show him.', start=35, end=69)]
# char_spans
[She turned to him, "This is great.", None]
```

So if you run the following you only get one sentence:

```python
nlp = spacy.blank("en")
nlp.add_pipe(test, first=True)
doc = nlp(fake_note)
print([s.text for s in doc.sents])
```
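A quick way to see what is going wrong with that second span is to check pysbd's reported offsets against the raw string directly. This is a minimal plain-Python sanity check (no spaCy required), using the offsets from the output above:

```python
fake_note = "She turned to him, \"This is great.\" She held the book out to show him."

# Slice with the offsets pysbd reported for the second sentence (35, 69):
print(repr(fake_note[35:69]))  # ' She held the book out to show him'
# -> leading space included, final '.' missing

# The second sentence actually starts at 36 and ends at 70 (exclusive):
print(fake_note.index("She held"))  # 36
print(repr(fake_note[36:70]))       # 'She held the book out to show him.'
```

My reading of this: `doc.char_span(35, 69)` returns None because offset 35 is the space between the two sentences, and spaCy's `Doc.char_span` only returns a `Span` when both boundaries line up exactly with token boundaries.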
If you look at the character indices of the tokens without using pysbd, things get thrown off once they hit the `\"`:

```python
nlp = spacy.blank("en")
doc = nlp(fake_note)
print([(token.text, token.idx) for token in doc])
```
I'm seeing similar issues when you have a series of special characters, such as this example:

If you remove the `?` series, the problem goes away.
That's interesting. Great work digging into it, @jenojp; you got further than I did!
@nmstoker Thanks for noticing and reporting the issue. The issue is known to me, and as @jenojp illustrated with an example, he's right: the problem is in matching pysbd's character offset indices with spaCy's. I have been wanting to resolve this but haven't found much time; I will see if I can do something about it in the near future. That said, it would be great if anyone could come up with a solution. That would be a very welcome contribution. Thanks again for pointing it out.
@nipunsadvilkar I'll keep you posted if I can get some free time to look into it more. This is a really promising project!
@jenojp Have a look at the new issue which I just created. The solution might work to get proper segmentation both with and without spaCy.
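One possible workaround, sketched here in plain Python, is to snap each pysbd start offset forward to the nearest real token start, so that sentence starts are never looked up at a whitespace offset. The function name is hypothetical (not pysbd's or spaCy's API), and the toy whitespace tokenizer only stands in for spaCy; with a real `Doc` you would collect `token.idx` for each token instead:

```python
import bisect

def snap_to_token_starts(starts, token_starts):
    """Map each candidate sentence-start offset to the first actual
    token start at or after it (hypothetical helper, for illustration)."""
    snapped = []
    for s in starts:
        i = bisect.bisect_left(token_starts, s)
        if i < len(token_starts):
            snapped.append(token_starts[i])
    return snapped

text = 'She turned to him, "This is great." She held the book out to show him.'

# Toy token starts: offset of each whitespace-separated chunk.
token_starts = []
pos = 0
for chunk in text.split(' '):
    token_starts.append(pos)
    pos += len(chunk) + 1

# pysbd reported the second sentence starting at 35 (the space);
# snapping moves it to 36, where a token actually begins.
print(snap_to_token_starts([0, 35], token_starts))  # [0, 36]
```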
Fixed #63
Firstly, thank you for this project. I was lucky to find it and it is really useful.

I seem to have found a case where the segmentation behaves differently when run within the spaCy pipeline versus when run using pySBD directly. I stumbled on it with my own text, where a sentence following a quoted sentence was being lumped together with it. I looked through the Golden Rules and found this wasn't expected, and then noticed that even with the text from one of your tests it acts differently in spaCy.
To reproduce, run these two bits of code:

The second way gives the desired output (based on the rules, at least).