Transition-based dependency parser with state embeddings computed by LSTM RNNs
Please see the CL article for a full description of the parser.
For the EMNLP character-based model, check out the "char-based" branch and follow the instructions in that branch's readme file.
There is an easier-to-use version here, provided by duncanka, that can be used on data without oracle transitions.
There is also a version here that incorporates morphological features and can run with or without character-based embeddings (option -S).
- A C++ compiler supporting the C++11 language standard
- Boost libraries
- Eigen (newer versions strongly recommended)
- CMake
- gcc (only tested with gcc version 5.3.0, may be incompatible with earlier versions)
mkdir build
cd build
cmake .. -DEIGEN3_INCLUDE_DIR=/path/to/eigen
make -j2
Given a training.conll file and a development.conll file formatted according to the CoNLL data format, train a parsing model with the LSTM parser by typing the following at the command-line prompt:
java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c training.conll > trainingOracle.txt
java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c development.conll > devOracle.txt
parser/lstm-parse -T trainingOracle.txt -d devOracle.txt --hidden_dim 100 --lstm_input_dim 100 -w sskip.100.vectors --pretrained_dim 100 --rel_dim 20 --action_dim 20 -t -P
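Before generating the oracles, it can be worth checking that the input files are well formed. The sketch below is not part of the parser: check_conll is a hypothetical helper, and it assumes the 10-column, tab-separated CoNLL-X layout with a blank line between sentences. It prints the sentence and token counts and returns a non-zero status if any token line has the wrong number of columns.

```shell
#!/bin/sh
# check_conll FILE: hypothetical helper, not shipped with the parser.
# Assumes CoNLL-X layout: 10 tab-separated columns per token line,
# one blank line between sentences.
check_conll() {
    awk -F'\t' '
        NF == 0 {                       # blank line closes the current sentence
            if (in_sent) { sents++; in_sent = 0 }
            next
        }
        NF != 10 {                      # report malformed token lines
            printf "line %d: expected 10 columns, found %d\n", NR, NF
            bad = 1
        }
        { tokens++; in_sent = 1 }
        END {
            if (in_sent) sents++        # file may lack a trailing blank line
            printf "%d sentences, %d tokens\n", sents, tokens
            exit bad + 0
        }' "$1"
}
```

After sourcing the function, `check_conll training.conll && check_conll development.conll` runs the check on both input files.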
Link to the word vectors that we used in the ACL 2015 paper for English: sskip.100.vectors.
Note-1: you can also run the parser without pretrained word embeddings by removing the -w option for both training and parsing.
Note-2: training should be stopped when the development result no longer improves substantially, normally after 5500 iterations.
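One way to make "does not substantially improve" concrete is a patience rule. The sketch below is only an illustration under assumptions: stop_point is a hypothetical helper, the input is a plain file with one development score per line in evaluation order (extracting those scores from the training output is left to the reader), and the default thresholds (a gain of 0.1 and a patience of 3) are arbitrary choices, not values recommended by the authors.

```shell
#!/bin/sh
# stop_point FILE [MIN_GAIN] [PATIENCE]: hypothetical helper.
# FILE holds one development score per line, in evaluation order.
# Prints the line number of the best score once PATIENCE consecutive
# evaluations fail to beat it by more than MIN_GAIN; if that never
# happens, prints the line number of the best score seen so far.
stop_point() {
    awk -v gain="${2:-0.1}" -v patience="${3:-3}" '
        NR == 1 || $1 > best + gain {   # a substantial improvement
            best = $1; best_at = NR; stale = 0
            next
        }
        {
            stale++
            if (stale >= patience) { print best_at; stopped = 1; exit }
        }
        END { if (!stopped && NR > 0) print best_at }' "$1"
}
```

For example, `stop_point dev_scores.txt 0.05 5` would require a gain of more than 0.05 and tolerate five evaluations without one; dev_scores.txt is a placeholder name.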
Note-3: after each iteration, the parser reports results that include punctuation symbols, whereas the ACL-15 paper reports results excluding them (as is common practice for those data sets). You can use the eval.pl script from the CoNLL-X Shared Task to get the correct numbers.
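To illustrate what excluding punctuation changes, here is a rough sketch of an attachment scorer. It is not a replacement for eval.pl and may not match its numbers: score_no_punct is a hypothetical helper, it assumes the gold and system files are token-aligned, 10-column, tab-separated CoNLL-X files, and it uses a simplified, English-oriented definition of punctuation (a gold FORM with no ASCII letter or digit).

```shell
#!/bin/sh
# score_no_punct GOLD SYS: hypothetical helper, not a replacement for eval.pl.
# Assumes both files are token-aligned, 10-column, tab-separated CoNLL-X files
# (FORM = column 2, HEAD = column 7, DEPREL = column 8).
# Assumption: a token is punctuation when its gold FORM has no ASCII letter or digit.
# Prints "UAS LAS" as percentages over the remaining tokens.
score_no_punct() {
    paste "$1" "$2" | awk -F'\t' '
        NF < 20 { next }                 # blank separator lines paste to a lone tab
        $2 !~ /[A-Za-z0-9]/ { next }     # skip punctuation tokens
        {
            total++
            if ($7 == $17) {             # head matches
                uas++
                if ($8 == $18) las++     # head and label both match
            }
        }
        END { if (total) printf "%.2f %.2f\n", 100 * uas / total, 100 * las / total }'
}
```

Usage would look like `score_no_punct test.conll parsed.conll`, where parsed.conll is a placeholder name for the parser's output.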
Given a test.conll file formatted according to the CoNLL data format, generate its oracle and parse it with the trained model:
java -jar ParserOracleArcStdWithSwap.jar -t -1 -l 1 -c test.conll > testOracle.txt
parser/lstm-parse -T trainingOracle.txt -d testOracle.txt --hidden_dim 100 --lstm_input_dim 100 -w sskip.100.vectors --pretrained_dim 100 --rel_dim 20 --action_dim 20 -P -m parser_pos_2_32_100_20_100_12_20-pidXXXX.params
The model file (the .params file, identified by its name/id) is stored in the directory where the parser was trained. The parser outputs a CoNLL file with the parsing result.
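Because the pidXXXX part of the model file name changes with every training run, a small helper can save looking it up by hand. This is a sketch under assumptions: latest_model is a hypothetical function, and it assumes that model files end in .params (as in the example file name above) and that the most recently modified one is the one you want.

```shell
#!/bin/sh
# latest_model [DIR]: hypothetical helper.
# Prints the most recently modified *.params file in DIR
# (default: current directory), or nothing if there is none.
# Assumes file names contain no newlines.
latest_model() {
    ls -t "${1:-.}"/*.params 2>/dev/null | head -n 1
}
```

The result can then be passed to the -m option of the parsing command, for example `-m "$(latest_model)"`.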
TODO
If you make use of this software, please cite the following:
@inproceedings{dyer:2015acl,
author={Chris Dyer and Miguel Ballesteros and Wang Ling and Austin Matthews and Noah A. Smith},
title={Transition-based Dependency Parsing with Stack Long Short-Term Memory},
booktitle={Proc. ACL},
year=2015,
}
This software is released under the terms of the Apache License, Version 2.0.
For questions and usage issues, please contact cdyer@cs.cmu.edu and miguel.ballesteros@upf.edu.