This is the implementation of our paper *Improving Constituency Parsing with Span Attention*, published at Findings of EMNLP 2020.

Please contact us at yhtian@uw.edu if you have any questions.

## Citation

If you use or extend our work, please cite our paper at Findings of EMNLP 2020:
```
@inproceedings{tian-etal-2020-improving,
    title = "Improving Constituency Parsing with Span Attention",
    author = "Tian, Yuanhe and Song, Yan and Xia, Fei and Zhang, Tong",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    pages = "1691--1703",
}
```
## Requirements

Our code works with the following environment:

- `python 3.6`
- `pytorch 1.1`

Install the Python dependencies by running:

```shell
pip install -r requirements.txt
```
`EVALB` and `EVALB_SPMRL` contain the code to evaluate the parsing results for English and other languages, respectively. Before running evaluation, you need to go to the `EVALB` (for English) or `EVALB_SPMRL` (for other languages) directory and run `make`.
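For example, assuming you are in the repository root, both evaluators can be built as follows (this is a sketch of the step described above; the Makefiles ship with the repo):

```shell
# Build the English evaluator (EVALB) and the multilingual one (EVALB_SPMRL).
# Paths assume the current directory is the repository root.
cd EVALB && make && cd ..
cd EVALB_SPMRL && make && cd ..
```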
## Downloading BERT, ZEN, XLNet, and Our Pre-trained Models

In our paper, we use BERT, ZEN, and XLNet as the encoder.

For BERT, please download the pre-trained BERT models from Google and convert them from the TensorFlow version to the PyTorch version.

- For Arabic, we use BERT-Base, Multilingual Cased.
- For Chinese, we use BERT-Base, Chinese.
- For English, we use BERT-Large, Cased and BERT-Large, Uncased.
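One way to do the TensorFlow-to-PyTorch conversion is the Hugging Face `transformers` conversion CLI (an assumption on our part — any BERT checkpoint converter works; the paths below are placeholders for the files in Google's checkpoint archive):

```shell
# Convert a Google TF BERT checkpoint to a PyTorch weights file.
# All three paths are placeholders; point them at the unpacked checkpoint.
transformers-cli convert --model_type bert \
  --tf_checkpoint /path/to/bert_model.ckpt \
  --config /path/to/bert_config.json \
  --pytorch_dump_output /path/to/pytorch_model.bin
```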
For ZEN, you can download the pre-trained model from here.
For XLNet, you can download the pre-trained model from here.
For our pre-trained models, you can download them from Baidu Wangpan (passcode: 2o1n) or Google Drive.
## Run on a Small Dataset

To train a model on a small dataset, run:

```shell
./run.sh
```
## Datasets

We use datasets in three languages: Arabic, Chinese, and English.

- Arabic: we use ATB2.0 parts 1-3 (LDC2003T06, LDC2004T02, and LDC2005T20).
- Chinese: we use CTB5 (LDC2005T01).
- English: we use PTB (LDC99T42).
To preprocess the data, please go to the `data_processing` directory and follow the instructions there to process the data. You need to obtain the official datasets yourself before running our code.

Ideally, all data will appear in the `./data` directory. The data with gold POS tags are located in folders whose names match the dataset names (i.e., ATB, CTB, and PTB); the data with predicted POS tags are located in folders whose names have a "_POS" suffix (i.e., ATB_POS, CTB_POS, and PTB_POS).
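As a quick sanity check after preprocessing, you can list which of the expected data folders are present (a sketch; the folder names follow the layout described above):

```shell
# Print which of the expected data folders exist under ./data.
# Folder names follow the layout described in this README.
check_data_dirs() {
  for d in ATB CTB PTB ATB_POS CTB_POS PTB_POS; do
    if [ -d "data/$d" ]; then
      echo "found: data/$d"
    else
      echo "missing: data/$d"
    fi
  done
}
check_data_dirs
```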
## Training and Testing

You can find the command lines to train and test models on a specific dataset in `run.sh`.
## To-do List

- Regular maintenance.

You can leave comments in the Issues section if you want us to implement any functions.

You can check our updates at `updates.md`.