This is the repository for our ACL 2019 paper Unsupervised PCFG Induction with Normalizing Flow.
Required packages:
- Python 3.6+
- pytorch 1.1+
- gensim
- nltk
- numpy
- scipy
- bidict
The newest versions of these packages should also work.
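A quick way to confirm the environment before running anything is to check that each package can be found. The snippet below is a minimal sketch, not part of this repository; the function name `missing_packages` is hypothetical, and the only assumption about the packages is their import names (PyTorch is imported as `torch`).

```python
import importlib.util
import sys

# Import names for the packages listed above; PyTorch is imported as "torch".
REQUIRED = ["torch", "gensim", "nltk", "numpy", "scipy", "bidict"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 6):
        print("Python 3.6+ is required, found", sys.version.split()[0])
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

This only checks that the packages are installed; it does not check their versions.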
- You will need a linetrees file, which is a one-sentence-per-line file of bracketed trees without the ROOT node. The `make_linetoks.py` and `make_ints_file.py` scripts in the `utils` folder use this file to first generate a `linetoks` file (one sentence per line, with just the terminals), and then the `dict` and `ints` files, where the terminals are replaced with indices.
- `embed_with_multilingual_elmo.py` in `utils` requires ElmoForManyLang. You can also get pretrained Elmo models from there. Use the script to generate the Elmo embeddings for the dataset.
- A config file is needed. A sample config file is provided in the `config` folder, along with the necessary text files in `uyghur_data`. The options are explained in the config file.
- The running command is `python dimi-trainer.py config/yourconfig.ini`. If the Elmo embeddings have been generated for the provided Uyghur file, issuing `python dimi-trainer.py config/uyghur.ini` should start the model immediately. A GPU is required to run the system.
- Results are written to the provided output folder. The main diagnostic file is `running_status.txt`, which includes a range of measurements of grammar quality. The `*.vittrees.gz` files are gzipped files of Viterbi trees for the dataset.
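To make the file formats in the first step concrete, here is a minimal sketch of how a linetrees line relates to its linetoks and ints counterparts. It is not the code in `make_linetoks.py` or `make_ints_file.py`. The helper names are hypothetical, and the indexing scheme (starting at 1, in order of first appearance) is an assumption, so the real scripts may number words differently.

```python
import re

# A terminal is the token directly before a closing bracket, e.g. "dog" in "(NN dog)".
_TERMINAL = re.compile(r"([^\s()]+)\s*\)")


def tree_to_tokens(tree_line):
    """Return the terminals of one bracketed tree, left to right."""
    return _TERMINAL.findall(tree_line)


def build_ints(token_lines):
    """Map each distinct word to an integer and rewrite sentences as indices.

    Assumption: indices start at 1 in order of first appearance.
    """
    word2int = {}
    int_lines = []
    for tokens in token_lines:
        int_lines.append([word2int.setdefault(w, len(word2int) + 1) for w in tokens])
    return word2int, int_lines


# Two example linetrees lines (no ROOT node), invented for illustration.
linetrees = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT the) (NN cat)) (VP (VBD slept)))",
]
linetoks = [tree_to_tokens(t) for t in linetrees]
word2int, ints = build_ints(linetoks)

print(linetoks[0])  # ['the', 'dog', 'barked']
print(ints)         # [[1, 2, 3], [1, 4, 5]]
```

For real data, use the scripts in `utils`, since the trainer expects the files exactly as they produce them.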
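The gzipped Viterbi tree files from the last step can be inspected with Python's standard `gzip` module. The sketch below assumes one bracketed tree per line, which this README does not state. The function name `read_vittrees` and the demo file contents are hypothetical, used only so that the example runs on its own.

```python
import gzip
import os
import tempfile


def read_vittrees(path):
    """Yield the non-empty lines of a gzipped tree file as UTF-8 text."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


# Demo on a small file written here; the tree labels are made up.
# In practice, pass the path of a *.vittrees.gz file from the output folder.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vittrees.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("(1 (2 the) (3 dog))\n(1 (2 a) (3 cat))\n")

trees = list(read_vittrees(demo_path))
print(len(trees), "trees; first:", trees[0])
```

Reading lazily with a generator avoids loading a large tree file into memory at once.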