Different segmentation with Spacy and when using pySBD directly #55
Hey @nmstoker, I am having similar issues and I think I discovered what is going on, though I haven't had a chance to find the root cause in pysbd. So if you run your example directly:

```python
import pysbd

fake_note = "She turned to him, \"This is great.\" She held the book out to show him."
seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
print(seg.segment(fake_note))
```

it returns:

```
[TextSpan(sent='She turned to him, "This is great."', start=0, end=35), TextSpan(sent='She held the book out to show him.', start=35, end=69)]
```

If we replicate the spaCy pipeline code:

```python
import spacy

def test(doc):
    sents_char_spans = seg.segment(doc.text)
    print(sents_char_spans)
    char_spans = [doc.char_span(sent_span.start, sent_span.end)
                  for sent_span in sents_char_spans]
    print(char_spans)
    start_token_ids = [span[0].idx for span in char_spans
                       if span is not None]
    for token in doc:
        token.is_sent_start = (True if token.idx
                               in start_token_ids else False)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe(test, first=True)
doc = nlp(fake_note)
print([s.text for s in doc.sents])
```

we can see that:

```
# sents_char_spans
[TextSpan(sent='She turned to him, "This is great."', start=0, end=35), TextSpan(sent='She held the book out to show him.', start=35, end=69)]
# char_spans
[She turned to him, "This is great.", None]
```

So if you run the following you only get one sentence:

```python
nlp = spacy.blank("en")
nlp.add_pipe(test, first=True)
doc = nlp(fake_note)
print([s.text for s in doc.sents])
```
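A quick way to see what is going wrong with that second span is to check pysbd's reported offsets against the raw string directly. This is a minimal plain-Python sanity check (no spaCy required), using the offsets from the output above:

```python
fake_note = "She turned to him, \"This is great.\" She held the book out to show him."

# Slice with the offsets pysbd reported for the second sentence (35, 69):
print(repr(fake_note[35:69]))  # ' She held the book out to show him'
# -> leading space included, final '.' missing

# The second sentence actually starts at 36 and ends at 70 (exclusive):
print(fake_note.index("She held"))  # 36
print(repr(fake_note[36:70]))       # 'She held the book out to show him.'
```

My reading of this: `doc.char_span(35, 69)` returns None because offset 35 is the space between the two sentences, and spaCy's `Doc.char_span` only returns a `Span` when both boundaries line up exactly with token boundaries.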
If you look at the character indices of the tokens without using pysbd, things get thrown off once they hit the `\"`:

```python
nlp = spacy.blank("en")
doc = nlp(fake_note)
print([(token.text, token.idx) for token in doc])
```
I'm seeing similar issues when you have a series of special characters, such as this example:

If you remove the `?` series, the problem goes away.
That's interesting. Great work digging into it, @jenojp; you got further than I did!
@nmstoker Thanks for noticing and reporting the issue. The issue is known to me, and as @jenojp illustrated with an example, he's right: the problem is in matching pysbd's character offset indices with spaCy's. I have been wanting to resolve this but haven't found much time; I will see if I can do something about it in the near future. That said, it would be great if anyone could come up with a solution. That would be a very welcome contribution. Thanks again for pointing it out.
@nipunsadvilkar I'll keep you posted if I can get some free time to look into it more. This is a really promising project!
@jenojp Have a look at the new issue which I just created. The solution might work to get proper segmentation both with and without spaCy.
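One possible workaround, sketched here in plain Python, is to snap each pysbd start offset forward to the nearest real token start, so that sentence starts are never looked up at a whitespace offset. The function name is hypothetical (not pysbd's or spaCy's API), and the toy whitespace tokenizer only stands in for spaCy; with a real `Doc` you would collect `token.idx` for each token instead:

```python
import bisect

def snap_to_token_starts(starts, token_starts):
    """Map each candidate sentence-start offset to the first actual
    token start at or after it (hypothetical helper, for illustration)."""
    snapped = []
    for s in starts:
        i = bisect.bisect_left(token_starts, s)
        if i < len(token_starts):
            snapped.append(token_starts[i])
    return snapped

text = 'She turned to him, "This is great." She held the book out to show him.'

# Toy token starts: offset of each whitespace-separated chunk.
token_starts = []
pos = 0
for chunk in text.split(' '):
    token_starts.append(pos)
    pos += len(chunk) + 1

# pysbd reported the second sentence starting at 35 (the space);
# snapping moves it to 36, where a token actually begins.
print(snap_to_token_starts([0, 35], token_starts))  # [0, 36]
```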
Fixed #63
Firstly, thank you for this project. I was lucky to find it and it is really useful.

I seem to have found a case where the segmentation behaves differently when run within the spaCy pipeline versus when run using pySBD directly. I stumbled on it with my own text, where a sentence following a quoted sentence was being lumped together with it. I looked through the Golden Rules and found this wasn't expected, and then noticed that even with the text from one of your tests it acts differently in spaCy.
To reproduce, run these two bits of code:

The second way gives the desired output (based on the rules, at least).