
Dictionary not loaded when fine-tuning poly-encoder #2904

Closed
naitian opened this issue Jul 30, 2020 · 2 comments

naitian commented Jul 30, 2020

Bug description
I'm trying to fine-tune the poly-encoder from the poly_model_huge_reddit model. I'm getting a size mismatch error on the embeddings because (I think) a new dictionary is being generated instead of reusing the dictionary from the init model.

Reproduction steps
Here is the command I'm using to run:

python -m parlai.scripts.train_model \
   --init-model data/models/pretrained_transformers/poly_model_huge_reddit/model \
   --dict-file data/models/pretrained_transformers/poly_model_huge_reddit/model.dict \
   -t convai2 \
   -dp ../aloha/data/sdlong_config1/fold1_l4390/ \
   --model transformer/polyencoder --batchsize 64 --eval-batchsize 20 \
   --warmup_updates 100 --warmup_rate 0.0001 \
   --lr-scheduler reduceonplateau --lr-scheduler-patience 0 --lr-scheduler-decay 0.4 \
   -lr 5e-05 --data-parallel True --history-size 20 --label-truncate 72 \
   --text-truncate 360 --num-epochs 8.0 --max_train_time 200000 -veps 1 \
   -vme 8000 --validation-metric accuracy --validation-patience 2 --validation-metric-mode max \
   --save-after-valid True --log_every_n_secs 20 --candidates batch --fp16 True \
   --dict-tokenizer bpe --dict-lower True --optimizer adamax --output-scaling 0.06 \
   --variant xlm --reduction-type mean --share-encoders False \
   --learn-positional-embeddings True --n-layers 12 --n-heads 12 --ffn-size 3072 \
   --attention-dropout 0.1 --relu-dropout 0.0 --dropout 0.1 --n-positions 1024 \
   --embedding-size 768 --activation gelu --embeddings-scale False --n-segments 2 \
   --learn-embeddings True --polyencoder-type n_first --poly-n-codes 64 \
   --poly-attention-type basic --dict-endtoken __start__ \
   --model-file ../aloha/data/paper_model_no_finetune/model

Expected behavior
This should reuse the dictionary from data/models/pretrained_transformers/poly_model_huge_reddit/model.dict.

Logs
It looks like a new dictionary is being built (with 18899 words instead of the ~55k words from the reddit model):

2020-07-30 08:55:24,058 INFO     | building dictionary first...
2020-07-30 08:55:24,066 INFO     | creating task(s): convai2
2020-07-30 08:55:24,074 INFO     | loading fbdialog data: ../aloha/data/sdlong_config1/fold1_l4390/ConvAI2/train_self_original.txt
Building dictionary:  95%|██████
2020-07-30 08:56:00,031 INFO     | Saving dictionary to ../aloha/data/paper_model_no_finetune/model.dict
2020-07-30 08:56:00,082 INFO     | dictionary built with 18899 tokens in 0.0s
2020-07-30 08:56:00,084 INFO     | No model with opt yet at: ../aloha/data/paper_model_no_finetune/model(.opt)
2020-07-30 08:56:00,210 INFO     | Using CUDA
2020-07-30 08:56:00,213 ERROR    | You set --fp16 true with --fp16-impl apex, but fp16 with apex is unavailable. To use apex fp16, please install APEX from https://github.com/NVIDIA/apex.
2020-07-30 08:56:00,213 INFO     | loading dictionary from data/models/pretrained_transformers/poly_model_huge_reddit/model.dict
2020-07-30 08:56:00,267 INFO     | num words = 18899
2020-07-30 08:56:03,271 INFO     | Total parameters: 200,724,480 (200,724,480 trainable)
2020-07-30 08:56:03,271 INFO     | Loading existing model parameters from data/models/pretrained_transformers/poly_model_huge_reddit/model
Building dictionary: 100%|██████████| 131k/131k [00:08<00:00, 14.7kex/s]
no pair has frequency >= 2. Stopping
Traceback (most recent call last):
  File "/auto/nlg-05/naitian/parlai/parlai/core/torch_agent.py", line 1809, in load_state_dict
    self.model.load_state_dict(state_dict)
  File "/home/nlg-05/naitian/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for PolyEncoderModule:
        size mismatch for encoder_ctxt.embeddings.weight: copying a param with shape torch.Size([54944, 768]) from checkpoint, the shape in current model is torch.Size([18904, 768]).
        size mismatch for encoder_cand.embeddings.weight: copying a param with shape torch.Size([54944, 768]) from checkpoint, the shape in current model is torch.Size([18904, 768]).
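
(Side check, not part of the original logs: the shapes in the traceback can be compared directly against the checkpoint and the dictionary file on disk. The paths below are the ones from the command above; treating the checkpoint as a torch pickle that keeps its weights under a 'model' key is an assumption, though it matches the parameter names shown in the traceback.)

import torch

ckpt_path = "data/models/pretrained_transformers/poly_model_huge_reddit/model"
dict_path = ckpt_path + ".dict"

# Load the pretrained checkpoint on CPU and inspect the context-encoder embedding
# table; its row count is the vocabulary size the model was trained with (~54944).
states = torch.load(ckpt_path, map_location="cpu")
emb = states["model"]["encoder_ctxt.embeddings.weight"]
print("checkpoint embedding rows:", emb.shape[0])

# ParlAI .dict files store one entry per line, so the line count approximates the
# vocabulary size; ~19k lines here would mean the file no longer matches the model.
with open(dict_path, encoding="utf-8") as f:
    print("entries in dict file:", sum(1 for _ in f))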

This seems to be a very similar issue to #2539, but explicitly setting the dict file path didn't work for me.

Thanks!

stephenroller (Contributor) commented
Weird, it looks like the dict file is being ignored:

if 'dict_file' in opt:
    if opt['dict_file'] is None and opt.get('model_file'):
        opt['dict_file'] = opt['model_file'] + '.dict'
    logging.info("building dictionary first...")
    build_dict(opt, skip_if_built=True)

and

if skip_if_built and os.path.isfile(opt['dict_file']):
    # Dictionary already built, skip all loading or setup
    logging.debug("dictionary already built.")
    return None

Can you try ensuring that the file definitely exists, and that it wasn't overwritten or something?
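
(A minimal way to run that check, added here for anyone hitting the same thing. The path is the one passed as --dict-file in the command above, and the expected entry count is only a rough figure based on the ~55k-word reddit dictionary mentioned earlier; this is a sketch, not ParlAI tooling.)

import os

# Path passed as --dict-file in the training command above.
dict_file = "data/models/pretrained_transformers/poly_model_huge_reddit/model.dict"

if not os.path.isfile(dict_file):
    # Matches the skip_if_built check quoted above: a missing file means
    # build_dict() falls through and builds a fresh dictionary from the task data.
    print("dict file missing -- a new dictionary would be built")
else:
    # One entry per line in ParlAI .dict files, so the line count is the vocabulary
    # size. Roughly 19k entries here would mean the file was overwritten; the
    # pretrained reddit dictionary should be on the order of ~55k entries.
    with open(dict_file, encoding="utf-8") as f:
        print("dict file has", sum(1 for _ in f), "entries")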


naitian commented Jul 31, 2020

Thanks! It looks like the file was overwritten at some point. I re-downloaded the model files and it's working now.
