Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ πŸ’« sent char_span through with spaCy & regex & ♻️ Refactoring for more languages support #63

Merged
merged 26 commits into from
Jun 9, 2020

Conversation

nipunsadvilkar
Copy link
Owner

@nipunsadvilkar nipunsadvilkar commented May 26, 2020

πŸ›Old Approach

TLDR - Sentences(string) were right βœ… Sentence char spans were wrong ❌

Earlier pysbd v0.2.3 depended on creating Doc object through regex search - SENTENCE_BOUNDARY_REGEX - on a preprocessed text. The preprocessing is an essential part of pysbd rules which adds Unicode symbols to decide when to/not to break the sentence. Due to the addition of Unicode symbols, char_spans found by regex were not proper in contrast to original text. Another caveat was - spaCy doesn't provide Doc object if there is leading/trailing whitespace and returns None which used to lead to wrong char spans.

✨New Approach

Use Token.is_sent_start attribute over getting an Doc object to get correct char_spans

Rely on pysbd to give proper sentences (strings) out of the original text -> use regex to find those sentence within original text & keep start char indices of sentence matches > iterate over each Token of Doc object and set Token.is_sent_start True (if in sent_start_token_idx) else False.

In preliminary testing, new approach seems to be resolving issues raised by developers.

CHANGELOG

@nipunsadvilkar nipunsadvilkar changed the title ✨ πŸ’« sent char_span through with spaCy & regex ✨ πŸ’« sent char_span through with spaCy & regex May 29, 2020
@codecov-commenter
Copy link

codecov-commenter commented Jun 6, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@f7c640f). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master      #63   +/-   ##
=========================================
  Coverage          ?   97.45%           
=========================================
  Files             ?       20           
  Lines             ?      748           
  Branches          ?        0           
=========================================
  Hits              ?      729           
  Misses            ?       19           
  Partials          ?        0           
Flag Coverage Ξ”
#unittests 97.45% <0.00%> (?)

Continue to review full report at Codecov.

Legend - Click here to learn more
Ξ” = absolute <relative> (impact), ΓΈ = not affected, ? = missing data
Powered by Codecov. Last update f7c640f...0ab7a6f. Read the comment docs.

@nipunsadvilkar nipunsadvilkar changed the title ✨ πŸ’« sent char_span through with spaCy & regex ✨ πŸ’« sent char_span through with spaCy & regex & ♻️ Refactoring for more languages support Jun 9, 2020
@nipunsadvilkar nipunsadvilkar merged commit e0cdada into master Jun 9, 2020
@nipunsadvilkar nipunsadvilkar deleted the npn-char-span-fix branch June 9, 2020 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants