Transition-based shallow semantic parser for causal relations, with state embeddings computed by LSTM RNNs. Based on the LSTM syntactic parser. This system was documented in a 2018 EMNLP paper.
Note: the instructions below assume that you have a dataset annotated in the same brat format as BECAUSE.
- A C++ compiler supporting the C++11 language standard
- Boost libraries
- Eigen (newer versions strongly recommended)
- CMake
- gcc (only tested with gcc version 5.3.0 and above; may be incompatible with earlier versions)
- Googletest library, if you're going to compile in debug mode (which is necessary for unit tests)
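As a quick sanity check before building, you can verify that the command-line build tools are on your `PATH`. This only covers the tools themselves; the Boost, Eigen, and Googletest libraries must be checked separately (e.g. via your package manager):

```shell
# Check that the command-line build tools are available; library
# dependencies (Boost, Eigen, Googletest) are not covered here.
for tool in g++ cmake make git; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```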
```bash
git clone --recursive https://github.com/duncanka/lstm-causality-tagger.git  # --recursive pulls in the lstm-parser submodule
cd lstm-causality-tagger
mkdir build
cd build
cmake .. -DEIGEN3_INCLUDE_DIR=/path/to/eigen -DCMAKE_BUILD_TYPE=RelWithDebInfo
make -j2  # or your desired number of compilation threads
```
To get DeepCx set up, you will first need to perform steps 1-7 from the Causeway README to set up the Causeway causal language tagger. This is necessary only for running the transition oracle (apologies for the overkill). Skip step 5 if you already have the BECAUSE corpus or an equivalent set up.

The instructions below assume that you have Causeway set up at `$CAUSEWAY_DIR` and that your working directory is the root `lstm-causality-tagger` directory. You should also download the pretrained LSTM syntactic parser model to the `lstm-parser` subdirectory.
To train a causal language tagging model:

- Run the transition oracle on your training data:

  ```bash
  export PYTHONPATH=$CAUSEWAY_DIR/NLPypline/src:$CAUSEWAY_DIR/src
  scripts/transition_oracle.py $PATH_TO_TRAINING_DATA
  ```

  This will output `.trans` files alongside the `.txt` and `.ann` files in `$PATH_TO_TRAINING_DATA`.

- Run the binary in training mode:

  ```bash
  build/lstm-causality/lstm-causality-tagger --cnn-mem 800 --train --training-data $PATH_TO_TRAINING_DATA
  ```

  This will create a `models` directory within your working directory with the model file inside it (the name will be something like `tagger_10_2_48_48_32_20_32_8_72_0_new-conn__pid5797.params`). Training will stop automatically once a given number of epochs has passed without substantial improvement. (You can adjust this behavior with the `--epochs-cutoff`, `--recent-improvements-cutoff`, and `--recent-improvements-epsilon` flags.) You can also stop it with Ctrl+C.
DeepCx outputs causal language annotations in the BECAUSE brat format. It appends to (not overwrites!) any `.ann` files in the input directory.
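Since the tagger appends to existing `.ann` files, you may want to snapshot them before tagging so the pre-tagging annotations can be restored. A minimal sketch (the `.bak` suffix is just a convention, and `$PATH_TO_TEST_DATA` is assumed to point at your data directory):

```shell
# Fall back to the current directory if the variable is unset,
# so this sketch is runnable standalone.
: "${PATH_TO_TEST_DATA:=.}"

# Copy every .ann file under the data directory to a .bak sibling.
find "$PATH_TO_TEST_DATA" -name '*.ann' -exec cp {} {}.bak \;
```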
To tag new data with a trained model located at `$MODEL_FILEPATH`:

- Create blank `.ann` files for the transition oracle to read (and clear any existing annotations):

  ```bash
  scripts/clear-anns.sh $PATH_TO_TEST_DATA
  ```

  The script will ask you to confirm that you do want to clear existing files.

- Run the transition oracle on your blank `.ann` files to transform the text corpus into the transition format that DeepCx can ingest:

  ```bash
  export PYTHONPATH=$CAUSEWAY_DIR/NLPypline/src:$CAUSEWAY_DIR/src
  scripts/transition_oracle.py $PATH_TO_TEST_DATA
  ```

- Run the binary in test mode:

  ```bash
  build/lstm-causality/lstm-causality-tagger --cnn-mem 800 --test --test-data $PATH_TO_TEST_DATA --model $MODEL_FILEPATH
  ```
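The three tagging steps above can be bundled into a small wrapper script. This is a hypothetical convenience sketch, not part of the repository; it assumes the same `$CAUSEWAY_DIR` variable as above, and `clear-anns.sh` will still prompt for confirmation:

```shell
# Write a hypothetical wrapper chaining the three tagging steps.
# (Written to a file rather than executed here, since running it
# requires a built tagger and a Causeway checkout.)
cat > tag-new-data.sh <<'EOF'
#!/usr/bin/env bash
# Usage: ./tag-new-data.sh <path-to-test-data> <model-filepath>
set -euo pipefail
DATA_DIR=$1
MODEL=$2
scripts/clear-anns.sh "$DATA_DIR"         # blanks existing .ann files (asks for confirmation)
export PYTHONPATH=$CAUSEWAY_DIR/NLPypline/src:$CAUSEWAY_DIR/src
scripts/transition_oracle.py "$DATA_DIR"  # produces .trans files
build/lstm-causality/lstm-causality-tagger --cnn-mem 800 --test \
  --test-data "$DATA_DIR" --model "$MODEL"
EOF
chmod +x tag-new-data.sh
echo "wrote tag-new-data.sh"
```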
To evaluate a trained model located at `$MODEL_FILEPATH`:

- Run the transition oracle as for training to produce `.trans` files.
- Run the tagger in evaluation mode:

  ```bash
  build/lstm-causality/lstm-causality-tagger --cnn-mem 800 --evaluate --test-data $PATH_TO_TEST_DATA --model $MODEL_FILEPATH
  ```

  In this mode, it will not output results to the `.ann` files unless you specifically request it with the `--write-results` option. Note that doing so will append to the annotations from which the gold transitions were generated, which you probably do not want to do!
The EMNLP results were produced with cross-validation, which you can accomplish by adding the `--folds` option:

```bash
build/lstm-causality/lstm-causality-tagger --cnn-mem 800 --train --training-data $BECAUSE_DIR --folds 20
```
Note that cross-validation will keep overwriting the model for each fold.
TODO
This software is released under the terms of the Apache License, Version 2.0.
For questions and usage issues, please contact jdunietz@cs.cmu.edu.