Spacy Integration - "detect an empty sentence" #53
Hello! I don't think there's a workaround yet. Feel free to make a pull request and I'll try to have a look at it :)
Actually, using the en_core_web_lg model instead of the sm model mitigates the problem to a great extent (it simply parses the sentences better). It may not be worth a pull request just yet, but I had to fix the same issue during training by pre-processing the sentences.
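For the training-time pre-processing mentioned above, a minimal sketch of dropping empty or whitespace-only sentences before they reach the tokenizer. The function name `drop_empty_sentences` and the assumption that sentences arrive as plain strings are illustrative, not from the project:

```python
def drop_empty_sentences(sentences):
    """Remove sentences that are empty or contain only whitespace.

    `sentences` is assumed to be an iterable of strings; adapt as
    needed if your pipeline yields token lists instead.
    """
    return [s for s in sentences if s.strip()]


# Illustrative input where sentence splitting left empty entries behind:
raw = ["The first clause holds.", "   ", "", "The second clause holds."]
clean = drop_empty_sentences(raw)
```

A filter like this keeps the downstream tokenizer from ever seeing a zero-token sentence, regardless of which spaCy model produced the split.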
Hi, since this keeps happening, I have a one-line fix for the spaCy integration code that resolves this issue fairly nicely. I am not sure how to make a pull request, but in the "spacy_integration.py" file I would propose the following change:
I have updated the code and run a substantial amount of text through it. I did not, however, create any unit tests, but I will be happy to do so if you would consider this addition. It could be merged into the list comprehension above, but I wanted to show the logic clearly here.
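For context, here is a sketch of the kind of guard being described; this is an illustration, not the actual patch from the thread. The helper name `non_empty_sents`, and the assumption that sentences are objects exposing a `.text` attribute (as spaCy `Span` objects from the sentencizer do), are mine:

```python
from collections import namedtuple


def non_empty_sents(doc_sents):
    """Yield only sentences whose text is non-empty after stripping.

    Works with any objects exposing a `.text` attribute, e.g. the
    spaCy `Span` objects produced by `doc.sents`.
    """
    for sent in doc_sents:
        if sent.text.strip():
            yield sent


# Stand-in for spaCy Span objects, so the sketch runs without spaCy:
Span = namedtuple("Span", "text")
sents = [Span("Hello."), Span("  "), Span("World.")]
kept = [s.text for s in non_empty_sents(sents)]
```

In the integration itself this could fold into the existing list comprehension, e.g. `[[t.text for t in sent] for sent in non_empty_sents(doc.sents)]`, so empty sentences never reach the SpanMarkerTokenizer.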
Yes, I have run into this issue. I ended up dropping the spaCy support altogether and now parse the sentences with my own parser that is targeted at my data. This resolved my issue with the empty sentences and let me get better sentence capture. Contracts and other legal documents do not really parse well with spaCy anyway.
Using your spaCy integration, the sentencizer will sometimes produce an empty sentence (using "en_core_web_sm"). This leads to the SpanMarkerTokenizer throwing an exception. I am not sure how active this project is any more, but this seems like an easy fix. Is there a workaround already for this? Would you like the code updated to include one (I might be able to do this fix)?