Part of the code for our paper "Improving AMR Parsing with Sequence-to-Sequence Pre-training" (EMNLP 2020).
We release part of our code and our current best pre-trained model, which can be used to predict AMR graphs for arbitrary sentences.
-
PyTorch 1.1.0
Models are trained with PyTorch 1.1.0 and may not load directly in lower versions (such as 0.4.0), even if the code itself runs; a quick version check is sketched after the download link below.
-
AllenNLP
We use AllenNLP to tokenize the source sentences for AMR Parsing.
-
subword-nmt
We employ byte pair encoding (BPE) to segment all tokens into subwords. See https://github.com/rsennrich/subword-nmt/ for more details.
-
Post-processing tool
See https://github.com/RikVN/AMR for more details.
-
Pre-trained model
We provide our current best model, PTM-MT(WMT14B)-SemPar(WMT14M), which greatly advances the state of the art with 81.4 Smatch on AMR 2.0.
Download here:
Link: https://pan.baidu.com/s/1bdIKXBtlSldC-IPMxkG04A
Extraction code: SUDA
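As noted in the requirements above, the released models were saved with PyTorch 1.1.0. Below is a minimal sketch for checking your environment before loading the model; the version comparison is only a convenience and not part of the released code.
import torch
# The released models were saved with PyTorch 1.1.0; lower versions (e.g. 0.4.0)
# may fail to load them even though the code itself runs.
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
if (major, minor) < (1, 1):
    print("Warning: PyTorch >= 1.1.0 is recommended, found", torch.__version__)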
Assume that the file named "sent" contains the sentences to be parsed.
See the AllenNLP documentation for more details.
Here is a Python demo for tokenization.
from allennlp.data.tokenizers import WordTokenizer
# AllenNLP's WordTokenizer (provided by the 0.x releases)
tokenizer = WordTokenizer()
sent = "Has history given us too many lessons?, 530, 412, 64"
tokenized_sent = " ".join(str(tok) for tok in tokenizer.tokenize(sent.strip()))
print(tokenized_sent)
# OUTPUT:
# 'Has history given us too many lessons ? , 530 , 412 , 64'
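The demo above tokenizes a single sentence. Since the BPE step below reads a tokenized file, here is a minimal sketch that tokenizes the whole "sent" file and writes "sent.tok"; the file names follow the conventions used in this README, and the loop itself is just an illustration.
from allennlp.data.tokenizers import WordTokenizer

tokenizer = WordTokenizer()
# Tokenize every line of "sent" and write the result to "sent.tok",
# the file expected by the BPE step below.
with open("sent", encoding="utf-8") as fin, open("sent.tok", "w", encoding="utf-8") as fout:
    for line in fin:
        tokens = tokenizer.tokenize(line.strip())
        fout.write(" ".join(str(tok) for tok in tokens) + "\n")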
Next, we employ BPE to segment the word sequence into a subword sequence.
Here is a demo for BPE.
# $subword_nmt is the path to the BPE scripts, e.g.
# subword_nmt=XXX/subword-nmt/subword_nmt/
python3 $subword_nmt/apply_bpe.py -c bpe.codes < sent.tok > sent.tok.bpe
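Alternatively, if subword-nmt is installed as a Python package (pip install subword-nmt), BPE can also be applied from Python; this is a rough sketch under that assumption, using the same bpe.codes file.
from subword_nmt.apply_bpe import BPE

# Load the BPE merge operations shipped with the model (bpe.codes).
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)

with open("sent.tok", encoding="utf-8") as fin, open("sent.tok.bpe", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))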
Now we can use the subword sequence and the pre-trained model to generate the AMR graph.
Here is a command demo for decoding.
# $GPU_ID is the index of the GPU to use
# $amr_parser_model is the path to the pre-trained model, here PTM-MT(WMT14B)-SemPar(WMT14M), e.g.
# GPU_ID=0
# amr_parser_model=ptm_mt_en2deB_sem_enM.pt
CUDA_VISIBLE_DEVICES=$GPU_ID python3 codes/translate.py -model $amr_parser_model -beam_size 5 -src sent.tok.bpe -output sent.amr.bpe -task_type task2 -decode_extra_length 1000 -minimal_relative_prob 0.01 -gpu 0
Decoding produces the AMR sequence in BPE form; remove the BPE symbol "@@" with the following command.
sed -r 's/(@@ )|(@@ ?$)//g' sent.amr.bpe > sent.amr
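If sed is not available, the same BPE-symbol removal can be done with a short Python equivalent of the command above (just a convenience snippet, not part of the released code).
import re

# Mirror of: sed -r 's/(@@ )|(@@ ?$)//g'
with open("sent.amr.bpe", encoding="utf-8") as fin, open("sent.amr", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(re.sub(r"(@@ )|(@@ ?$)", "", line.rstrip("\n")) + "\n")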
We now have the linearized AMR sequence. To recover the full AMR graph, post-processing is needed.
See "Pre- and post-processing scripts for neural sequence-to-sequence AMR parsing" (https://github.com/RikVN/AMR) for more details.
Here is a command demo for post-processing.
python2 postprocess_AMRs.py -f sent.amr -s sent
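For convenience, the steps above (BPE, decoding, BPE removal, post-processing) can also be chained from one Python script. This is only a sketch that wraps the commands shown in this README; all paths (subword-nmt, codes/translate.py, postprocess_AMRs.py, the model file) are assumed to be local.
import subprocess

GPU_ID = "0"
MODEL = "ptm_mt_en2deB_sem_enM.pt"        # pre-trained model (assumed local path)
SUBWORD_NMT = "subword-nmt/subword_nmt"   # assumed path to the subword-nmt scripts

commands = [
    # BPE segmentation of the tokenized sentences
    f"python3 {SUBWORD_NMT}/apply_bpe.py -c bpe.codes < sent.tok > sent.tok.bpe",
    # Decoding with the pre-trained model
    f"CUDA_VISIBLE_DEVICES={GPU_ID} python3 codes/translate.py -model {MODEL} "
    "-beam_size 5 -src sent.tok.bpe -output sent.amr.bpe -task_type task2 "
    "-decode_extra_length 1000 -minimal_relative_prob 0.01 -gpu 0",
    # Remove BPE symbols
    "sed -r 's/(@@ )|(@@ ?$)//g' sent.amr.bpe > sent.amr",
    # Recover the full AMR graph
    "python2 postprocess_AMRs.py -f sent.amr -s sent",
]
for cmd in commands:
    subprocess.run(cmd, shell=True, check=True)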
We adopted some modules and code from AllenNLP, OpenNMT-py, subword-nmt, and RikVN/AMR. Thanks to these open-source projects!
If you find our paper or parser useful, please cite:
@misc{xu2020improving,
    title={Improving AMR Parsing with Sequence-to-Sequence Pre-training},
    author={Dongqin Xu and Junhui Li and Muhua Zhu and Min Zhang and Guodong Zhou},
    year={2020},
    eprint={2010.01771},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}