Add CRFCut sentence segmentation #337

cstorm125 · 2019-12-17T07:51:56Z

CRFCut -- Thai sentence segmentation with a conditional random field (CRF); the default model is trained on the TED dataset

ORCHID - space-correct accuracy 87%, vs 95% for the state of the art (Zhou et al., 2016; https://www.aclweb.org/anthology/C16-1031.pdf)
TED dataset - space-correct accuracy 82%

See the development notebooks at https://github.com/vistec-AI/ted_crawler.
POS features are not used because the available POS taggers are unreliable.
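
For readers unfamiliar with the metric quoted above, here is a minimal sketch of how a space-correct accuracy could be computed. It assumes the metric is the fraction of space tokens whose label (sentence boundary or not) is predicted correctly. That definition, the function name `space_correct_accuracy`, and the toy data are assumptions for illustration, not taken from the PR or its notebooks.

```python
from typing import Sequence


def space_correct_accuracy(
    tokens: Sequence[str],
    gold: Sequence[bool],
    predicted: Sequence[bool],
) -> float:
    """Fraction of space tokens whose boundary label is predicted correctly.

    gold[i] and predicted[i] are True when token i is a sentence boundary.
    Only whitespace tokens are scored (assumed definition of the metric).
    """
    if not (len(tokens) == len(gold) == len(predicted)):
        raise ValueError("tokens, gold and predicted must have equal length")
    space_positions = [i for i, tok in enumerate(tokens) if tok.isspace()]
    if not space_positions:
        raise ValueError("no space tokens to score")
    correct = sum(1 for i in space_positions if gold[i] == predicted[i])
    return correct / len(space_positions)


# Toy example, not real evaluation data.
tokens = ["ฉัน", "กิน", " ", "ข้าว", " ", "แล้ว", " "]
gold = [False, False, False, False, True, False, True]
predicted = [False, False, True, False, True, False, True]
score = space_correct_accuracy(tokens, gold, predicted)  # 2 of 3 spaces correct
```

Non-space tokens are ignored on purpose: in Thai text a sentence boundary can only fall on a space, so scoring every token would inflate the number.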

pep8speaks · 2019-12-17T07:52:01Z

Hello @cstorm125! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-12-20 17:27:11 UTC


bact · 2019-12-19T01:30:57Z

python-crfsuite should be included in setup.py's install_requires (and maybe in requirements.txt as well; see the next point).

The code currently passes the tests only because python-crfsuite is installed in the test environment as a dependency of attacut. In real settings, users may not install the optional attacut.

How big is the model? If it's not that big, I would suggest including it in the module, so we are less dependent on the network at runtime.

This would make it possible to include crfcut in the pythainlp core, as the default engine for sentence segmentation.
If this is really the case, python-crfsuite should be in requirements.txt too.
The model file should be included in setup.py's package_data.
Once the model is included, the _download() function should be removed.
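
The packaging changes requested above could look roughly like the sketch below. It is not the project's actual setup.py: the `corpus/` subdirectory, the `SETUP_KWARGS` name, and the `bundled_model_path` helper are hypothetical. Only `python-crfsuite`, `install_requires`, `package_data`, and the model file name `sentenceseg-ted.model` come from this thread.

```python
import os

# Arguments that setup.py would pass on, e.g. setuptools.setup(**SETUP_KWARGS).
SETUP_KWARGS = {
    "name": "pythainlp",
    # python-crfsuite becomes a core dependency instead of arriving via attacut.
    "install_requires": ["python-crfsuite"],
    "package_data": {
        # Hypothetical location of the bundled model inside the package.
        "pythainlp": ["corpus/sentenceseg-ted.model"],
    },
}


def bundled_model_path(package_dir: str,
                       filename: str = "sentenceseg-ted.model") -> str:
    """Resolve the model shipped with the package; no download is needed."""
    return os.path.join(package_dir, "corpus", filename)


# Inside the package, package_dir would typically be os.path.dirname(__file__).
example_path = bundled_model_path("pythainlp")
```

With the file listed in package_data and resolved relative to the installed package, a separate download step is no longer needed at runtime.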

cstorm125 · 2019-12-19T05:04:49Z

@bact The model is 5 MB, so I agree we can include it as a battery. Is everyone okay with having some model files in the library? @artificiala @wannaphongcom

bact · 2019-12-19T11:21:55Z

For reference on model size, see #298

I think 5 MB is ok.


bact

I have added a few words to STARTERS and ENDERS. This may require retraining and re-uploading the model.
Apart from that, I think we're good to go.

Please also update the table in #298. Thanks.

Great work! Another step towards full pipeline.
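
To illustrate why changing STARTERS and ENDERS can require retraining, here is a minimal sketch of per-token features that depend on those word sets. The feature names, the window size, and the placeholder words are assumptions for illustration; this is not the actual CRFCut feature template.

```python
from typing import Dict, List

# Placeholder word sets; the real STARTERS and ENDERS are defined in the PR.
# Sets give constant-time membership tests, unlike lists.
STARTERS = frozenset({"ผม", "ฉัน", "แต่"})
ENDERS = frozenset({"ครับ", "ค่ะ", "นะ"})


def token_features(tokens: List[str], i: int, window: int = 1) -> Dict[str, object]:
    """Features for token i: neighbouring tokens plus starter/ender flags."""
    features: Dict[str, object] = {"bias": 1.0}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):  # skip positions outside the sequence
            tok = tokens[j]
            features[f"tok[{offset}]"] = tok
            features[f"starter[{offset}]"] = tok in STARTERS
            features[f"ender[{offset}]"] = tok in ENDERS
    return features


# Features for the space token at index 2.
tokens = ["ไป", "ครับ", " ", "ผม", "มา"]
feats = token_features(tokens, 2)
```

Because the flags are computed from the word sets, a model trained with the old sets sees different feature values once words are added, which is why the model file may need to be regenerated.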

wannaphong · 2019-12-20T17:45:03Z

💯

bact · 2019-12-20T18:05:36Z

TODO: The next step is to convert https://github.com/vistec-AI/ted_crawler/blob/master/sentenceseg_ted.ipynb to a command-line script and maybe put it in the bin/ directory, so people can train their own models. Follow up on #344.

crfcut and tests

5d7a807

lalital mentioned this pull request Dec 17, 2019

PR #337 #339

Closed

bact changed the title ~~crfcut and tests~~ crfcut (sentence segmentation) and tests Dec 17, 2019

bact added the enhancement enhance functionalities label Dec 17, 2019

bact added this to the 2.2 milestone Dec 17, 2019

bact and others added 16 commits December 17, 2019 12:30

Update test_tokenize.py

39ecffc

Update __init__.py

f2a6eec

refactor crfcut.py

31b0cbd

collapse two for loop to one

Update crfcut.py

9bb60bf

Update .travis.yml

51dea29

Update .travis.yml

497acb5

Update .travis.yml

3b4cb69

Close wordnet test

0a2bc32

Update test_corpus.py

fcdf6c6

Update .travis.yml

517bd3f

Update test_corpus.py

ffbf0d3

Update test_corpus.py

8967256

Add pytest-xdist

16d31fd

coverage run --concurrency=multiprocessing

9d56d62

Update .travis.yml

86cafd3

Merge branch 'dev' into dev

208b88e

bact requested review from lalital and wannaphong December 19, 2019 11:49

bact self-assigned this Dec 19, 2019

wannaphong reviewed Dec 19, 2019

pythainlp/tokenize/__init__.py (two review threads, outdated and resolved)

bact and others added 5 commits December 19, 2019 14:52

Update __init__.py

612f46c

Update test_tokenize.py

767db4e

add python-crfsuite to requirements.txt and setup.py

809c5dd

change crfcut to battery-included

2c42a19

import os in crfcut.py

76baf1f

cstorm125 requested review from bact and wannaphong December 20, 2019 03:45

This was referenced Dec 20, 2019

Sentence tokenizer for Thai - การตัดประโยคที่ไม่ได้คั่นด้วย whitespace #73

Closed

Elementary discourse unit segmentation #225

Closed

bact added 8 commits December 20, 2019 10:29

Update __init__.py

330e517

Add starter and ender words, change list to set

e2092a7

- add few words to the STARTERS and ENDERS lists
- change word list to set, faster membership test

uncommeted wordnet tests

260c3f0

Format test_tokenize.py

870b283

Sort requirements.txt

8370de8

Uncomment wordnet

0ce8c50

Sort requirements in setup.py

d2e4c64

Add sentenceseg-ted.model to package_data

18516cb

bact requested changes Dec 20, 2019


bact and others added 3 commits December 20, 2019 11:01

Fix requirements

cf7749e

Merge branch 'dev' into dev

e7e5b52

update sentenceseg-ted.model with new starters and enders

f18b54e

cstorm125 merged commit 7bf2365 into PyThaiNLP:dev Dec 20, 2019

bact changed the title ~~crfcut (sentence segmentation) and tests~~ Add CRFCut sentence segmentation Dec 20, 2019

bact mentioned this pull request Dec 20, 2019

PyThaiNLP 2.2 change log #330

Closed

bact mentioned this pull request May 14, 2020

Add crfcut v2 model #380

Merged
