Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CRFCut sentence segmentation #337

Merged
merged 33 commits into from
Dec 20, 2019
Merged

Add CRFCut sentence segmentation #337

merged 33 commits into from
Dec 20, 2019

Conversation

cstorm125
Copy link
Member

@cstorm125 cstorm125 commented Dec 17, 2019

CRFCut -- Thai sentence segmentation with conditional random field, default trained on TED dataset

See development notebooks at https://github.com/vistec-AI/ted_crawler;
POS features are not used due to unreliable POS tagging available

@pep8speaks
Copy link

pep8speaks commented Dec 17, 2019

Hello @cstorm125! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-12-20 17:27:11 UTC

@lalital lalital mentioned this pull request Dec 17, 2019
@bact bact changed the title crfcut and tests crfcut (sentence segmentation) and tests Dec 17, 2019
@bact bact added the enhancement enhance functionalities label Dec 17, 2019
@bact bact added this to the 2.2 milestone Dec 17, 2019
@bact
Copy link
Member

bact commented Dec 19, 2019

  1. python-crfsuite should be included in setup.py's install_requires (and maybe in requirements.txt as well, see next point)
  • The code currently passed the test only because in the test the python-crfsuite was installed as dependencies of attacut - but in real settings, user may not install the optional attacut.
  1. How big is the model? If it's not that big I would suggest to include it in the module, so we will less dependent on the network during runtime.
  • This will make it possible to have crfcut included as a pythainlp core, as a default engine for sentence segmentation.
  • If this is really the case, python-crfsuite should be in requirements.txt too.
  • The model file should be included in setup.py's package_data
  • Once the model included, _download() function should be removed.

@cstorm125
Copy link
Member Author

@bact The model is 5MB so I agree we can include it as battery. Is everyone okay with having some model files in the library? @artificiala @wannaphongcom

@bact
Copy link
Member

bact commented Dec 19, 2019

For reference on model size, see #298

I think 5 MB is ok.

@bact bact requested review from lalital and wannaphong December 19, 2019 11:49
@bact bact self-assigned this Dec 19, 2019
pythainlp/tokenize/__init__.py Outdated Show resolved Hide resolved
pythainlp/tokenize/__init__.py Outdated Show resolved Hide resolved
Copy link
Member

@bact bact left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have add few words to STARTERS and ENDERS. May require retrain and reupload of the model.
Apart of that, I think we're good to go.

Please also kindly update the table here: #298 thx

Great work! Another step towards full pipeline.

@cstorm125 cstorm125 merged commit 7bf2365 into PyThaiNLP:dev Dec 20, 2019
@bact bact changed the title crfcut (sentence segmentation) and tests Add CRFCut sentence segmentation Dec 20, 2019
@bact bact mentioned this pull request Dec 20, 2019
@wannaphong
Copy link
Member

💯

@bact
Copy link
Member

bact commented Dec 20, 2019

TODO: Next step is to convert https://github.com/vistec-AI/ted_crawler/blob/master/sentenceseg_ted.ipynb to a commandline script and maybe put it in bin/ directory. So people can train their own model. Follow up on #344.

@bact bact mentioned this pull request May 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement enhance functionalities
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants