SowaLabs/Tweebo
Dependency parser for tweets ... that we don't know how to use. Taken from: http://www.cs.cmu.edu/~ark/TweetNLP
========================================================================================
TweeboParser -- A Dependency Parser for Tweets. Version 1.0
Written and maintained by Lingpeng Kong (lingpenk [at] cs.cmu.edu).
========================================================================================

This file is part of TweeboParser, a project started at the computational linguistics
research group, ARK (http://www.ark.cs.cmu.edu/), at Carnegie Mellon University.

This package contains the implementation of the dependency parser for English tweets
described in:

[1] A Dependency Parser for Tweets. Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta,
    Archna Bhatia, Chris Dyer, and Noah A. Smith. In Proceedings of the Conference on
    Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar,
    October 2014.

TweeboParser is based on TurboParser (http://www.ark.cs.cmu.edu/TurboParser/). The
preprocessing steps depend on the Twitter Part-of-Speech Tagger
(http://www.ark.cs.cmu.edu/TweetNLP/). This package is self-contained: it includes all
the required software.

========================================================================================

TweeboParser is free software; you can redistribute it and/or modify it under the terms
of the GNU Lesser General Public License as published by the Free Software Foundation;
either version 2 of the License, or (at your option) any later version.

TweeboParser is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place,
Suite 330, Boston, MA 02111-1307 USA.

========================================================================================
Contents
========================================================================================

1. TweeboParser
   1.1 Compiling
   1.2 Example of usage
   1.3 Directory structure
2. Tweebank

========================================================================================
1. TweeboParser
========================================================================================

========================================================================================
1.1 Compiling
========================================================================================

To compile the code, first unpack the downloaded archive:

> tar -zxvf TweeboParser-1.0.tar.gz
> cd TweeboParser-1.0

Next, run the following command:

> ./install.sh

This installs TweeboParser and all of its dependencies.

========================================================================================
1.2 Example of usage
========================================================================================

To run TweeboParser on raw text input with one sentence per line (e.g. on
sample_input.txt):

> ./run.sh sample_input.txt

The run.sh script runs the whole TweeboParser pipeline, including twokenization, POS
tagging, and appending Brown clustering and PTB features. The output file,
"sample_input.txt.predict", is written to the same directory as "sample_input.txt" and
contains the parse trees in CoNLL format (http://ilk.uvt.nl/conll/#dataformat).
(HEAD < 0 means the word is not included in the tree.)
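As an illustration only (not part of the TweeboParser distribution), a minimal Python
sketch for reading the predicted output; the column indices assume the standard CoNLL-X
layout (0=ID, 1=FORM, ..., 6=HEAD, 7=DEPREL), and read_conll is a hypothetical helper
name. Check the actual output columns before relying on this.

    # Hypothetical helper: read a CoNLL-format prediction file and skip tokens
    # whose HEAD is < 0, i.e. tokens that are not part of the parse tree.
    def read_conll(path):
        sentences, current = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:                      # blank line ends a sentence
                    if current:
                        sentences.append(current)
                        current = []
                    continue
                cols = line.split("\t")
                head = int(cols[6])               # HEAD column (CoNLL-X layout assumed)
                if head < 0:                      # word not included in the tree
                    continue
                current.append((int(cols[0]), cols[1], head, cols[7]))
        if current:
            sentences.append(current)
        return sentences

    for sent in read_conll("sample_input.txt.predict"):
        for token_id, form, head, deprel in sent:
            print(token_id, form, "->", head, deprel)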
========================================================================================
1.3 Directory structure
========================================================================================

ark-tweet-nlp      ---- The Twitter POS Tagger (http://www.ark.cs.cmu.edu/TweetNLP/).
pretrained_models  ---- Tagging, token selection, and Brown clusters obtained from
                        Owoputi et al. (2012), plus pre-trained parsing models for the
                        PTB and for tweets.
scripts            ---- Supporting scripts.
TBParser           ---- TweeboParser, which is based on TurboParser version 2.1.0. The
                        source code of TweeboParser is in TBParser/src.
token_selection    ---- The token selection tool, implemented in Python.
Tweebank           ---- The Tweebank data release.
working_dir        ---- The working space for the parser. The temporary files generated
                        by TweeboParser while parsing a sentence are put here, so do not
                        remove or rename this directory.
run.sh             ---- The bash script that runs the parser on raw input (see sec. 1.2).
install.sh         ---- The bash script that installs everything (see sec. 1.1).

========================================================================================
2. Tweebank
========================================================================================

Most of TWEEBANK was built in a day by two dozen annotators, most of whom had only
cursory training in the annotation scheme.

Train_Test_Splited ---- The CoNLL-format train and test split used in the paper
                        ("Train" and "Test-New" in the paper).
Raw_Data           ---- The JSON-format file containing all the annotations we have.
                        (Note that some of them are underspecified; we only use the
                        ones with full annotation.) The CoNLL-format file contains all
                        the fully annotated data we have; the MWEs are pre-processed by
                        a first-order TurboParser trained on the Penn Treebank. (See
                        the paper for more details.)

========================================================================================
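For a quick sanity check of the CoNLL-format split files, a small sketch along the same
lines; the file path below is a placeholder, not the actual file name shipped in the
release, so substitute whatever you find under Tweebank/Train_Test_Splited.

    # Sketch only: count sentences and tokens in a CoNLL-format file, where
    # sentences are separated by blank lines and each non-blank line is one token.
    def conll_stats(path):
        sentences, tokens, in_sentence = 0, 0, False
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    tokens += 1
                    in_sentence = True
                elif in_sentence:                 # blank line closes a sentence
                    sentences += 1
                    in_sentence = False
        if in_sentence:                           # file may not end with a blank line
            sentences += 1
        return sentences, tokens

    # Placeholder path -- replace with the real file name from the release.
    print(conll_stats("Tweebank/Train_Test_Splited/train.conll"))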