
PolyLM - Neural Word Sense Model

Repository for the EACL 2021 paper PolyLM: Learning about Polysemy through Language Modeling by Alan Ansell, Felipe Bravo-Marquez and Bernhard Pfahringer.

Setup

The recommended environment is Python >= 3.7 with TensorFlow 1.15, although earlier versions of TensorFlow 1 may also work.

git clone https://github.com/AlanAnsell/PolyLM.git
cd PolyLM
conda create -n polylm python=3.7 tensorflow-gpu=1.15
conda activate polylm
pip install -r requirements.txt

Stanford CoreNLP's part-of-speech tagger v3.9.2 is currently also required to perform lemmatization when evaluating on WSI.

Pretrained Models

The code has been rewritten since the paper was published to achieve much better performance, and the old checkpoints are no longer compatible. However, this has enabled us to train a larger version of the model, PolyLM-LARGE, which achieves state-of-the-art performance on both WSI datasets used for evaluation and is available below. We are also training versions of the SMALL and BASE models which are compatible with the new code (the BASE model is now available; SMALL is coming soon).

Model                                     Embedding size   Num. params   SemEval-2010 (FScore / VMeasure / AVG)   SemEval-2013 (FBC / FNMI / AVG)
PolyLM-LARGE                              384              90M           67.5 / 43.6 / 54.3                       66.7 / 23.7 / 39.7
PolyLM-BASE                               256              54M           65.8 / 40.5 / 51.6                       64.8 / 23.0 / 38.6
PolyLM-SMALL                              128              24M           65.6 / 35.7 / 48.4                       64.5 / 18.5 / 34.5
Amrami and Goldberg (2019) - BERT-LARGE   1024             340M          71.3 / 40.4 / 53.6                       64.0 / 21.4 / 37.0

It may be most convenient to use the download scripts provided in the models folder; e.g. to download PolyLM-LARGE, run:

cd models
./download-lemmatized-large.sh

Training

The training corpus should be stored in a text file where each line is a single training example. Each line should contain at most the maximum sequence length minus two tokens, to leave room for the beginning- and end-of-sequence tokens.
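As a rough illustration, the sketch below splits a raw text file into lines of at most max_seq_len - 2 whitespace tokens. The file names and the use of simple whitespace tokenization are assumptions for illustration, not part of PolyLM's own preprocessing.

# Illustrative only: split a raw corpus into lines of at most (max_seq_len - 2) whitespace tokens.
MAX_SEQ_LEN = 128
MAX_TOKENS_PER_LINE = MAX_SEQ_LEN - 2  # leave room for the beginning/end of sequence tokens

with open('raw_text.txt') as fin, open('corpus.txt', 'w') as fout:
    for line in fin:
        tokens = line.split()
        for start in range(0, len(tokens), MAX_TOKENS_PER_LINE):
            chunk = tokens[start:start + MAX_TOKENS_PER_LINE]
            fout.write(' '.join(chunk) + '\n')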

The first step is to create a vocabulary file for the corpus:

from data import Vocabulary

# Build the vocabulary by counting token occurrences in the corpus.
vocab = Vocabulary('/path/to/corpus/corpus.txt', min_occurrences=500, build=True)
# Write the vocabulary to disk so it can be passed to train.py via --vocab_path.
vocab.write_vocab_file('/path/to/corpus/vocab.txt')

The min_occurrences parameter controls which tokens are included in the vocabulary. Tokens which appear fewer than min_occurrences times in the corpus will be replaced with an <UNK> token during training.
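If you are unsure which threshold to use, a quick frequency count over the corpus shows how many distinct tokens a given min_occurrences value would keep in the vocabulary. This is an illustrative sketch only, not part of the PolyLM codebase.

from collections import Counter

# Count whitespace tokens in the training corpus (illustrative only).
counts = Counter()
with open('/path/to/corpus/corpus.txt') as f:
    for line in f:
        counts.update(line.split())

# Report how many tokens would survive each candidate threshold.
for threshold in (100, 500, 1000):
    kept = sum(1 for c in counts.values() if c >= threshold)
    print('min_occurrences=%d would keep %d tokens' % (threshold, kept))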

The following command can be used to train a model with the same parameters as PolyLM-BASE:

python train.py --model_dir=/where/to/save/model \
                --corpus_path=/path/to/corpus/corpus.txt \
                --vocab_path=/path/to/corpus/vocab.txt \
                --embedding_size=256 \
                --bert_intermediate_size=1024 \
                --n_disambiguation_layers=4 \
                --n_prediction_layers=12 \
                --max_senses_per_word=8 \
                --min_occurrences_for_vocab=500 \
                --min_occurrences_for_polysemy=20000 \
                --max_seq_len=128 \
                --gpus=0 \
                --batch_size=32 \
                --n_batches=6000000 \
                --dl_warmup_steps=2000000 \
                --ml_warmup_steps=1000000 \
                --dl_r=1.5 \
                --ml_coeff=0.1 \
                --learning_rate=0.00003 \
                --print_every=100 \
                --save_every=10000

Notes:

  • Tokens appearing at least min_occurrences_for_polysemy times in the corpus will be allocated max_senses_per_word senses; all other tokens will be allocated a single sense. If you wish to allocate a custom number of senses per token, you may create a file where each line is of the form <token> <n_senses> and pass it as the value of the command-line parameter --n_senses_file (see the sketch after this list). Note that the --max_senses_per_word parameter must still be supplied.
  • To train on multiple GPUs, pass them as a comma-separated string, e.g. --gpus=0,1. Note that --batch_size is per-GPU, so to train with a total batch size of 32 on 2 GPUs, you should set --batch_size=16.
  • You may optionally supply a --test_words argument to specify a number of words for which to print nearest-neighbour senses during training. For instance, setting --test_words="bank rock bar" and --test_every=1000 would cause nearest-neighbour senses for all senses of the words "bank", "rock" and "bar" to be printed every 1,000 batches. This can be useful for checking that training is proceeding as intended.
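As a rough illustration of the <token> <n_senses> format expected by --n_senses_file, the sketch below writes such a file from corpus frequency counts, giving frequent tokens the maximum number of senses and all other tokens a single sense. The thresholds and file paths are assumptions for illustration only.

from collections import Counter

MAX_SENSES = 8                         # matches --max_senses_per_word above
MIN_OCCURRENCES_FOR_POLYSEMY = 20000   # matches --min_occurrences_for_polysemy above

# Count token frequencies in the training corpus (illustrative only).
counts = Counter()
with open('/path/to/corpus/corpus.txt') as f:
    for line in f:
        counts.update(line.split())

# Write one "<token> <n_senses>" entry per line.
with open('/path/to/corpus/n_senses.txt', 'w') as out:
    for token, count in counts.items():
        n_senses = MAX_SENSES if count >= MIN_OCCURRENCES_FOR_POLYSEMY else 1
        out.write('%s %d\n' % (token, n_senses))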

Word Sense Induction

First download the SemEval 2010 and 2013 WSI datasets:

cd data
./download-wsi.sh
cd ..

Download NLTK's WordNet data:

python -c "import nltk; nltk.download('wordnet')"

PolyLM evaluation can be performed as follows:

./wsi.sh data/wsi/SemEval-2010 SemEval-2010 /path/to/model --gpus 0 --pos_tagger_root /path/to/stanford/pos/tagger
./wsi.sh data/wsi/SemEval-2013 SemEval-2013 /path/to/model --gpus 0 --pos_tagger_root /path/to/stanford/pos/tagger

Note that inference is currently only supported on a single GPU, but it is generally very fast.
