Add CRFCut sentence segmentation #337
Hello @cstorm125! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-12-20 17:27:11 UTC
Collapse two for loops into one
@bact The model is 5 MB, so I agree we can bundle it with the library, batteries included. Is everyone okay with having some model files in the library? @artificiala @wannaphongcom
For reference on model size, see #298. I think 5 MB is OK.
- Add a few words to the STARTERS and ENDERS lists
- Change the word lists to sets for faster membership tests
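To illustrate the list-to-set change described in the commit above, here is a minimal sketch. The words and the helper `is_starter` are hypothetical and chosen only for illustration; they are not the actual contents of STARTERS in this PR.

```python
# Hypothetical cue words; the real STARTERS list in the PR is not reproduced here.
STARTERS_LIST = ["แต่", "และ", "ดังนั้น", "เพราะ"]

# Same words stored as a set: membership is a hash lookup (O(1) on average)
# instead of a linear scan over the list (O(n)).
STARTERS = set(STARTERS_LIST)


def is_starter(word: str) -> bool:
    """Return True if `word` is one of the (hypothetical) sentence starters."""
    return word in STARTERS
```

The behavior of `word in STARTERS` is the same for a list and a set; only the lookup cost differs, which matters when the test runs once per token.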
I have added a few words to STARTERS and ENDERS. This may require retraining and re-uploading the model.
Apart from that, I think we're good to go.
Please also update the table here: #298. Thanks.
Great work! Another step towards a full pipeline.
💯
TODO: Next step is to convert https://github.com/vistec-AI/ted_crawler/blob/master/sentenceseg_ted.ipynb to a commandline script and maybe put it in |
CRFCut: Thai sentence segmentation with a conditional random field (CRF). The default model is trained on the TED dataset.
See the development notebooks at https://github.com/vistec-AI/ted_crawler.
POS features are not used because the available POS tagging is unreliable.
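As a rough sketch of what word-level CRF features without POS tags could look like, the snippet below builds one feature dict per token from the surrounding words and from starter/ender cue words. The feature names, the window size, the cue words, and the functions `token_features` and `sentence_features` are all assumptions made for illustration; they are not taken from the CRFCut implementation.

```python
from typing import Dict, List

# Hypothetical cue words, for illustration only.
STARTERS = {"แต่", "และ", "ดังนั้น"}
ENDERS = {"ครับ", "ค่ะ", "นะ"}


def token_features(tokens: List[str], i: int, window: int = 1) -> Dict[str, object]:
    """Build features for token i from words within `window` positions of it.

    No POS features are used; only surface words and cue-word membership.
    """
    feats: Dict[str, object] = {"bias": 1.0}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            word = tokens[j]
            feats[f"{offset}:word"] = word
            feats[f"{offset}:is_starter"] = word in STARTERS
            feats[f"{offset}:is_ender"] = word in ENDERS
            feats[f"{offset}:is_space"] = word.isspace()
        else:
            # Mark positions that fall outside the token sequence.
            feats[f"{offset}:pad"] = True
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Return one feature dict per token in the sequence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A list of feature dicts per token sequence is the input shape that CRF libraries such as sklearn-crfsuite accept; the labels (for example, whether a token ends a sentence) would be supplied separately during training.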