Update to transformers 2.3.0 & Add ALBERT (#990)
* fix roberta tokenization error

* update transformers

* update alignment func

* trim input_module

* update lm head

* update albert special tokens

* input_module_to_pretokenized -> transformer_input_module_to_tokenizer_id

* update ccg alignment

* fix wic retokenize

* update wic docstring, remove unnecessary condition

* refactor record task to avoid tokenization problem

Co-authored-by: Sam Bowman <bowman@nyu.edu>
2 people authored and pyeres committed Jan 28, 2020
1 parent 900e9e8 commit 4a9b058
Showing 27 changed files with 395 additions and 379 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -10,7 +10,7 @@ A few things you might want to know about `jiant`:
- `jiant` is configuration-driven. You can run an enormous variety of experiments by simply writing configuration files. Of course, if you need to add any major new features, you can also easily edit or extend the code.
- `jiant` contains implementations of strong baselines for the [GLUE](https://gluebenchmark.com) and [SuperGLUE](https://super.gluebenchmark.com/) benchmarks, and it's the recommended starting point for work on these benchmarks.
- `jiant` was developed at [the 2018 JSALT Workshop](https://www.clsp.jhu.edu/workshops/18-workshop/) by [the General-Purpose Sentence Representation Learning](https://jsalt18-sentence-repl.github.io/) team and is maintained by [the NYU Machine Learning for Language Lab](https://wp.nyu.edu/ml2/people/), with help from [many outside collaborators](https://github.com/nyu-mll/jiant/graphs/contributors) (especially Google AI Language's [Ian Tenney](https://ai.google/research/people/IanTenney)).
- `jiant` is built on [PyTorch](https://pytorch.org). It also uses many components from [AllenNLP](https://github.com/allenai/allennlp) and the HuggingFace PyTorch [implementations](https://github.com/huggingface/pytorch-transformers) of GPT, BERT, and XLNet.
- `jiant` is built on [PyTorch](https://pytorch.org). It also uses many components from [AllenNLP](https://github.com/allenai/allennlp) and the HuggingFace Transformers [implementations](https://github.com/huggingface/transformers) for GPT, BERT, and other transformer models.
- The name `jiant` doesn't mean much. The 'j' stands for JSALT. That's all the acronym we have.

## Getting Started
7 changes: 3 additions & 4 deletions environment.yml
@@ -30,7 +30,7 @@ dependencies:
# for --remote_log functionality
- google-cloud-logging==1.11.0

# for some tokenizers in pytorch-transformers
# for some tokenizers in huggingface transformers
- spacy==2.1
- ftfy

@@ -39,9 +39,8 @@ dependencies:
- sacremoses

# Warning: jiant currently depends on *both* pytorch_pretrained_bert > 0.6 _and_
# pytorch_transformers > 1.0. These are the same package, though the name changed between
# transformers > 2.3.0. These are the same package, though the name changed between
# these two versions. AllenNLP requires 0.6 to support the BertAdam optimizer, and jiant
# directly requires 1.0 to support XLNet and WWM-BERT.
# This AllenNLP issue is relevant: https://github.com/allenai/allennlp/issues/3067
- sacremoses
- pytorch-transformers==1.2.0
- transformers==2.3.0
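The warning above describes two closely related packages living side by side (pytorch_pretrained_bert pulled in via AllenNLP, transformers pinned directly). As a quick sanity check after rebuilding the conda environment — a sketch for local use, not part of this commit — the installed versions can be listed by their PyPI distribution names:

```python
# Sanity-check sketch (not part of the commit): confirm that both packages
# described in the warning above resolve in the rebuilt environment.
import pkg_resources

for dist in ("pytorch-pretrained-bert", "transformers"):
    print(dist, pkg_resources.get_distribution(dist).version)
# Expected: pytorch-pretrained-bert 0.6.x (via AllenNLP) and transformers 2.3.0.
```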
2 changes: 1 addition & 1 deletion gcp/config/jiant_paths.sh
@@ -13,7 +13,7 @@ export JIANT_PROJECT_PREFIX="$HOME/exp"
# pre-downloaded ELMo models
export ELMO_SRC_DIR="/nfs/jiant/share/elmo"
# cache for BERT etc. models
export PYTORCH_PRETRAINED_BERT_CACHE="/nfs/jiant/share/pytorch_transformers_cache"
export HUGGINGFACE_TRANSFORMERS_CACHE="/nfs/jiant/share/transformers_cache"
# word embeddings
export WORD_EMBS_FILE="/nfs/jiant/share/wiki-news-300d-1M.vec"
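The renamed cache variable above is a jiant-level setting; transformers keeps its own default cache unless a directory is passed explicitly. The sketch below shows one way the exported variable could be forwarded via the `cache_dir` argument — an illustration assuming transformers 2.3.0, not a claim about how jiant wires this internally:

```python
# Sketch (not part of the commit): forward the jiant-level cache variable to
# transformers by passing it as cache_dir when loading a model.
import os

from transformers import AutoModel, AutoTokenizer

cache_dir = os.environ.get("HUGGINGFACE_TRANSFORMERS_CACHE")  # e.g. /nfs/jiant/share/transformers_cache
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2", cache_dir=cache_dir)
model = AutoModel.from_pretrained("albert-base-v2", cache_dir=cache_dir)
```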

2 changes: 1 addition & 1 deletion gcp/kubernetes/templates/jiant_env.libsonnet
@@ -21,7 +21,7 @@
# Path to ELMO cache.
elmo_src_dir: "/nfs/jiant/share/elmo",
# Path to BERT etc. model cache; should be writable by Kubernetes workers.
pytorch_transformers_cache_path: "/nfs/jiant/share/pytorch_transformers_cache",
transformers_cache_path: "/nfs/jiant/share/transformers_cache",
# Path to default word embeddings file
word_embs_file: "/nfs/jiant/share/wiki-news-300d-1M.vec",
}
4 changes: 2 additions & 2 deletions gcp/kubernetes/templates/run_batch.jsonnet
@@ -35,8 +35,8 @@ function(job_name, command, project_dir, uid, fsgroup,
value: jiant_env.jiant_data_dir,
},
{
name: "PYTORCH_PRETRAINED_BERT_CACHE",
value: jiant_env.pytorch_transformers_cache_path
name: "HUGGINGFACE_TRANSFORMERS_CACHE",
value: jiant_env.transformers_cache_path
},
{
name: "ELMO_SRC_DIR",
4 changes: 2 additions & 2 deletions gcp/set_up_workstation.sh
@@ -26,8 +26,8 @@ source /etc/profile.d/jiant_paths.sh
if [ ! -d "${JIANT_PROJECT_PREFIX}" ]; then
mkdir "${JIANT_PROJECT_PREFIX}"
fi
if [ ! -d "${PYTORCH_PRETRAINED_BERT_CACHE}" ]; then
sudo mkdir -m 0777 "${PYTORCH_PRETRAINED_BERT_CACHE}"
if [ ! -d "${HUGGINGFACE_TRANSFORMERS_CACHE}" ]; then
sudo mkdir -m 0777 "${HUGGINGFACE_TRANSFORMERS_CACHE}"
fi

# Build the conda environment, and activate
55 changes: 29 additions & 26 deletions jiant/config/defaults.conf
@@ -244,20 +244,23 @@ input_module = "" // The word embedding or contextual word representation layer
// - elmo-chars-only: The dynamic CNN-based word embedding layer of AllenNLP's
// ELMo, but not ELMo's LSTM layer hidden states. Use with
// tokenizer = MosesTokenizer.
// - bert-base-uncased, etc.: Any BERT model from pytorch_transformers.
// - bert-base-uncased, etc.: Any BERT model from transformers.
// - roberta-base / roberta-large / roberta-large-mnli: RoBERTa model from
// pytorch_transformers.
// transformers.
// - albert-base-v1 / albert-large-v1 / albert-xlarge-v1 / albert-xxlarge-v1
// - albert-base-v2 / albert-large-v2 / albert-xlarge-v2 / albert-xxlarge-v2:
// ALBERT model from transformers.
// - xlnet-base-cased / xlnet-large-cased: XLNet Model from
// pytorch_transformers.
// transformers.
// - openai-gpt: The OpenAI GPT language model encoder from
// pytorch_transformers.
// - gpt2 / gpt2-medium / gpt2-large: The OpenAI GPT-2 language model encoder from
// pytorch_transformers.
// transformers.
// - gpt2 / gpt2-medium / gpt2-large / gpt2-xl: The OpenAI GPT-2 language model
// encoder from transformers.
// - transfo-xl-wt103: The Transformer-XL language model encoder from
// pytorch_transformers.
// transformers.
// - xlm-mlm-en-2048: XLM english language model encoder from
// pytorch_transformers.
// Note: Any input_module from pytorch_transformers requires
// transformers.
// Note: Any input_module from transformers requires
// tokenizer = ${input_module} or auto.

tokenizer = auto // The name of the tokenizer, passed to the Task constructor for
@@ -269,7 +272,7 @@ tokenizer = auto // The name of the tokenizer, passed to the Task constructor for
// - MosesTokenizer: Our standard word tokenizer. (Support for
// other NLTK tokenizers is pending.)
// - bert-base-uncased, etc.: Use the tokenizer supplied with
// pytorch_transformers that corresponds to the input_module.
// transformers that corresponds to the input_module.
// - SplitChars: Splits the input into individual characters.

word_embs_file = ${WORD_EMBS_FILE} // Path to embeddings file, used with glove and fastText.
@@ -284,21 +287,21 @@ d_char = 100 // Dimension of trained char embeddings.
n_char_filters = 100 // Number of filters in trained char CNN.
char_filter_sizes = "2,3,4,5" // Size of char CNN filters.

pytorch_transformers_output_mode = "none" // How to handle the embedding layer of the
// BERT/XLNet model:
// "none" or "top" returns only top-layer activation,
// "cat" returns top-layer concatenated with
// lexical layer,
// "only" returns only lexical layer,
// "mix" uses ELMo-style scalar mixing (with learned
// weights) across all layers.
pytorch_transformers_max_layer = -1 // Maximum layer to return from BERT etc. encoder. Layer 0 is
// wordpiece embeddings. pytorch_transformers_embeddings_mode
// will behave as if the is truncated at this layer, so 'top'
// will return this layer, and 'mix' will return a mix of all
// layers up to and including this layer.
// Set to -1 to use all layers.
// Used for probing experiments.
transformers_output_mode = "none" // How to handle the embedding layer of the
// BERT/XLNet model:
// "none" or "top" returns only top-layer activation,
// "cat" returns top-layer concatenated with
// lexical layer,
// "only" returns only lexical layer,
// "mix" uses ELMo-style scalar mixing (with learned
// weights) across all layers.
transformers_max_layer = -1 // Maximum layer to return from BERT etc. encoder. Layer 0 is
// wordpiece embeddings. transformers_output_mode
// will behave as if the encoder is truncated at this layer, so 'top'
// will return this layer, and 'mix' will return a mix of all
// layers up to and including this layer.
// Set to -1 to use all layers.
// Used for probing experiments.

force_include_wsj_vocabulary = 0 // Set if using PTB parsing (grammar induction) task. Makes sure
// to include WSJ vocabulary.
@@ -365,7 +368,7 @@ pair_attn = 1 // If true, use attn in sentence-pair classification/regression tasks.
d_hid_attn = 512 // Post-attention LSTM state size.
shared_pair_attn = 0 // If true, share pair_attn parameters across all tasks that use it.
d_proj = 512 // Size of task-specific linear projection applied before pooling.
// Disabled when fine-tuning pytorch_transformers models.
// Disabled when fine-tuning transformers models.
pool_type = "auto" // Type of pooling to reduce sequences of vectors into a single vector.
// Options: "auto", "max", "mean", "first", "final"
// "auto" uses "first" for plain BERT (with no sent_enc), "final" for plain
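For reference, the transformers_output_mode = "mix" option documented above is ELMo-style scalar mixing over layer activations. The sketch below illustrates the idea only; it is not necessarily jiant's internal module:

```python
# Illustration of ELMo-style scalar mixing ("mix" mode): a softmax-normalized
# learned weight per layer plus a global scale. Not jiant's actual module.
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.gamma = nn.Parameter(torch.ones(1))  # global scale

    def forward(self, layer_states):
        # layer_states: list of [batch, seq_len, d_model] tensors, one per encoder layer.
        # With transformers_max_layer = k, only layers 0..k would be included here.
        norm_weights = torch.softmax(self.weights, dim=0)
        return self.gamma * sum(w * h for w, h in zip(norm_weights, layer_states))
```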
2 changes: 1 addition & 1 deletion jiant/config/examples/stilts_example.conf
@@ -18,7 +18,7 @@ batch_size = 24
write_preds = "val,test"

//BERT-specific parameters
pytorch_transformers_output_mode = "top"
transformers_output_mode = "top"
sep_embs_for_skip = 1
sent_enc = "none"
classifier = log_reg // following BERT paper
2 changes: 1 addition & 1 deletion jiant/config/superglue_bert.conf
@@ -10,7 +10,7 @@ max_seq_len = 256 // Mainly needed for MultiRC, to avoid over-truncating

// Model settings
input_module = "bert-large-cased"
pytorch_transformers_output_mode = "top"
transformers_output_mode = "top"
pair_attn = 0 // shouldn't be needed but JIC
s2s = {
attention = none
56 changes: 56 additions & 0 deletions jiant/huggingface_transformers_interface/__init__.py
@@ -0,0 +1,56 @@
"""
Warning: jiant currently depends on *both* pytorch_pretrained_bert > 0.6 _and_
transformers > 2.3.
These are the same package, though the name changed between these two versions. AllenNLP requires
0.6 to support the BertAdam optimizer, and jiant directly requires 2.3.
This AllenNLP issue is relevant: https://github.com/allenai/allennlp/issues/3067
TODO: We do not support non-English versions of XLM. If you need them, add some code in XLMEmbedderModule
to prepare the langs input to transformers.XLMModel.
"""

# All the supported input_module from huggingface transformers
# input_modules mapped to the same string share vocabulary
transformer_input_module_to_tokenizer_name = {
"bert-base-uncased": "bert_uncased",
"bert-large-uncased": "bert_uncased",
"bert-large-uncased-whole-word-masking": "bert_uncased",
"bert-large-uncased-whole-word-masking-finetuned-squad": "bert_uncased",
"bert-base-cased": "bert_cased",
"bert-large-cased": "bert_cased",
"bert-large-cased-whole-word-masking": "bert_cased",
"bert-large-cased-whole-word-masking-finetuned-squad": "bert_cased",
"bert-base-cased-finetuned-mrpc": "bert_cased",
"bert-base-multilingual-uncased": "bert_multilingual_uncased",
"bert-base-multilingual-cased": "bert_multilingual_cased",
"roberta-base": "roberta",
"roberta-large": "roberta",
"roberta-large-mnli": "roberta",
"xlnet-base-cased": "xlnet_cased",
"xlnet-large-cased": "xlnet_cased",
"openai-gpt": "openai_gpt",
"gpt2": "gpt2",
"gpt2-medium": "gpt2",
"gpt2-large": "gpt2",
"gpt2-xl": "gpt2",
"transfo-xl-wt103": "transfo_xl",
"xlm-mlm-en-2048": "xlm_en",
"albert-base-v1": "albert",
"albert-large-v1": "albert",
"albert-xlarge-v1": "albert",
"albert-xxlarge-v1": "albert",
"albert-base-v2": "albert",
"albert-large-v2": "albert",
"albert-xlarge-v2": "albert",
"albert-xxlarge-v2": "albert",
}


def input_module_uses_transformers(input_module):
    """Return True if input_module is a supported huggingface transformers model name."""
    return input_module in transformer_input_module_to_tokenizer_name


def input_module_tokenizer_name(input_module):
    """Return the shared tokenizer/vocabulary name for a supported input_module."""
    return transformer_input_module_to_tokenizer_name[input_module]
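A brief usage sketch of the new module (not part of the committed file): checkpoints that share a vocabulary map to the same tokenizer name, and any supported input_module can be paired with its pretrained tokenizer; AutoTokenizer in transformers 2.3.0 is assumed to cover the listed models.

```python
# Usage sketch (not part of the committed file).
from transformers import AutoTokenizer

from jiant.huggingface_transformers_interface import (
    input_module_tokenizer_name,
    input_module_uses_transformers,
)

assert input_module_uses_transformers("albert-base-v2")
# All ALBERT checkpoints share one vocabulary, so they map to the same name:
assert input_module_tokenizer_name("albert-base-v2") == "albert"
assert input_module_tokenizer_name("albert-xxlarge-v1") == "albert"

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
print(tokenizer.tokenize("jiant now supports ALBERT"))
```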
