Description
With a HF model class, one can resize the token embeddings to account for any number of added special tokens (there's no upper limit). In the usual scenario it looks something like this (not necessarily working code, I may have gotten the tokenizer APIs slightly wrong):
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
config_class = GPT2Config
model_class = GPT2LMHeadModel
tokenizer_class = GPT2Tokenizer
config = config_class.from_pretrained("gpt2-xl") # let's say we want to use the XL config for now, has its own vocab size
tokenizer = tokenizer_class.from_pretrained("gpt2-xl") # default XL vocab
tokenizer.add_special_tokens({"additional_special_tokens": ["<speaker1>", "<speaker2>"]})  # register the two new speaker tokens
model = model_class(config)
model.resize_token_embeddings(len(tokenizer))
The last line essentially allocates 2 new indices for the newly added special tokens in the input embeddings matrix, and initializes their embeddings with random weights.
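For concreteness, here is a quick sanity check of what that call does to the embedding matrix (continuing from the snippet above; get_input_embeddings() is the standard transformers accessor):
emb = model.get_input_embeddings()            # nn.Embedding of shape (len(tokenizer), hidden_size)
assert emb.weight.shape[0] == len(tokenizer)  # original GPT-2 vocab (50257) + the 2 new special tokens
print(emb.weight.shape)                       # the last 2 rows are the freshly initialized embeddings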
Now in the pipeline regime, one cannot simply resize the token embeddings after the PipelineModule has been initialized, since the module will have already split the model across pipeline stages. Is it possible to provide a callback/mechanism with PipelineModule that allows downstream users to resize the embedding matrix and freshly initialize the newly added special token embeddings?
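To make the request concrete, below is a rough sketch of the ordering such a mechanism would have to preserve: the embeddings are resized on the fully materialized HF model before its layers are handed to PipelineModule. The to_layers helper is hypothetical and heavily simplified (a real GPT-2 pipeline needs wrapper modules that thread hidden states and attention masks between stages), num_stages=2 is arbitrary, and constructing PipelineModule assumes torch.distributed has already been initialized:
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
from deepspeed.pipe import PipelineModule

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
tokenizer.add_special_tokens({"additional_special_tokens": ["<speaker1>", "<speaker2>"]})

model = GPT2LMHeadModel(GPT2Config.from_pretrained("gpt2-xl"))
# Resize on the fully materialized model, *before* any pipeline partitioning;
# once the layers are split across stages, each rank only holds its own shard
# and the embedding matrix can no longer be resized globally.
model.resize_token_embeddings(len(tokenizer))

# Hypothetical, simplified flattening into a sequential layer list.
def to_layers(m):
    return [m.transformer.wte, *m.transformer.h, m.transformer.ln_f, m.lm_head]

# Requires torch.distributed / deepspeed.init_distributed() to already be set up.
pipe_model = PipelineModule(layers=to_layers(model), num_stages=2)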
Also, shouldn't this be a problem for the implementation of pipeline (and more generally 3D) parallelism in the DeepSpeedExamples repo too? A user of a model that's been pre-trained with pipeline parallelism would certainly have basic downstream needs such as adding special tokens for fine-tuning.