WeSearch_DescriptiveStatistics
JonathonRead edited this page Mar 12, 2014
We begin by reproducing the methodology of Baldwin et al. (2013) on the WDC, with the following exceptions:
- Tokenisation uses REPP, and punctuation is removed from tokens.
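The punctuation-removal step could be sketched roughly as follows. This is only an illustration, not the code used for the WDC: it assumes the tokens have already been produced (REPP itself is not invoked here), it reads "punctuation removed from tokens" as stripping punctuation characters from each token and dropping any token left empty, and the function name `strip_punctuation` is hypothetical.

```python
import unicodedata


def strip_punctuation(tokens):
    """Strip punctuation characters from each token (hypothetical helper).

    Assumption: "punctuation" means any character whose Unicode general
    category starts with "P". Tokens that become empty are dropped.
    """
    cleaned = []
    for token in tokens:
        kept = "".join(
            ch for ch in token
            if not unicodedata.category(ch).startswith("P")
        )
        if kept:
            cleaned.append(kept)
    return cleaned


if __name__ == "__main__":
    # Tokens as they might come out of a tokeniser.
    print(strip_punctuation(["Hello", ",", "world", "!", "don't"]))
```

Note that under this definition currency and mathematical symbols (Unicode categories `Sc` and `Sm`) are kept; whether that matches the original processing is not stated on this page.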
According to langid.py (Lui and Baldwin, 2012), 100% of the WDC is English, which is reassuring.
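A check of this kind might look like the sketch below. It assumes the corpus is available as an iterable of text strings and that the `langid` package is installed; `language_proportions` is a hypothetical helper, and the granularity actually used (per document, per sentence, and so on) is not stated here. `langid.classify` returns a `(language, score)` pair, of which only the language code is used.

```python
from collections import Counter


def language_proportions(documents, classify=None):
    """Return the share of documents assigned to each language code.

    `classify` is any callable mapping text to a (language, score) pair.
    By default langid.classify is used, imported lazily so that a
    different classifier can be substituted.
    """
    if classify is None:
        import langid  # assumption: langid.py is installed
        classify = langid.classify

    counts = Counter(classify(doc)[0] for doc in documents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}
```

For example, `language_proportions(docs).get("en", 0.0)` would give the fraction of `docs` identified as English.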
Baldwin, T., Cook, P., Lui, M., MacKinlay, A., and Wang, L. (2013). "How Noisy Social Media Text, How Diffrnt Social Media Sources?" in Proceedings of the International Joint Conference on Natural Language Processing, pp. 356-364.
Lui, M. and Baldwin, T. (2012). "langid.py: An off-the-shelf language identification tool" in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics Demo Session, pp. 25-30.