
Dictionary not loaded when fine-tuning poly-encoder #2904

Closed
naitian opened this issue Jul 30, 2020 · 2 comments

naitian commented Jul 30, 2020

Bug description
I'm trying to fine-tune the poly-encoder from the poly_model_huge_reddit model. I'm getting a size mismatch error on the embeddings because (I think) a new dictionary is being generated instead of reusing the dictionary from the init model.

Reproduction steps
Here is the command I'm using to run:

python -m parlai.scripts.train_model \
   --init-model data/models/pretrained_transformers/poly_model_huge_reddit/model \
   --dict-file data/models/pretrained_transformers/poly_model_huge_reddit/model.dict \
   -t convai2 \
   -dp ../aloha/data/sdlong_config1/fold1_l4390/ \
   --model transformer/polyencoder --batchsize 64 --eval-batchsize 20 \
   --warmup_updates 100 --warmup_rate 0.0001 \
   --lr-scheduler reduceonplateau --lr-scheduler-patience 0 --lr-scheduler-decay 0.4 \
   -lr 5e-05 --data-parallel True --history-size 20 --label-truncate 72 \
   --text-truncate 360 --num-epochs 8.0 --max_train_time 200000 -veps 1 \
   -vme 8000 --validation-metric accuracy --validation-patience 2 --validation-metric-mode max \
   --save-after-valid True --log_every_n_secs 20 --candidates batch --fp16 True \
   --dict-tokenizer bpe --dict-lower True --optimizer adamax --output-scaling 0.06 \
   --variant xlm --reduction-type mean --share-encoders False \
   --learn-positional-embeddings True --n-layers 12 --n-heads 12 --ffn-size 3072 \
   --attention-dropout 0.1 --relu-dropout 0.0 --dropout 0.1 --n-positions 1024 \
   --embedding-size 768 --activation gelu --embeddings-scale False --n-segments 2 \
   --learn-embeddings True --polyencoder-type n_first --poly-n-codes 64 \
   --poly-attention-type basic --dict-endtoken __start__ \
   --model-file ../aloha/data/paper_model_no_finetune/model

Expected behavior
This should reuse the dictionary from data/models/pretrained_transformers/poly_model_huge_reddit/model.dict.

Logs
It looks like a new dictionary is being built (with 18899 words instead of the ~55k words from the reddit model):

2020-07-30 08:55:24,058 INFO     | building dictionary first...
2020-07-30 08:55:24,066 INFO     | creating task(s): convai2
2020-07-30 08:55:24,074 INFO     | loading fbdialog data: ../aloha/data/sdlong_config1/fold1_l4390/ConvAI2/train_self_original.txt
Building dictionary:  95%|██████
2020-07-30 08:56:00,031 INFO     | Saving dictionary to ../aloha/data/paper_model_no_finetune/model.dict
2020-07-30 08:56:00,082 INFO     | dictionary built with 18899 tokens in 0.0s
2020-07-30 08:56:00,084 INFO     | No model with opt yet at: ../aloha/data/paper_model_no_finetune/model(.opt)
2020-07-30 08:56:00,210 INFO     | Using CUDA
2020-07-30 08:56:00,213 ERROR    | You set --fp16 true with --fp16-impl apex, but fp16 with apex is unavailable. To use apex fp16, please install APEX from https://github.com/NVIDIA/apex.
2020-07-30 08:56:00,213 INFO     | loading dictionary from data/models/pretrained_transformers/poly_model_huge_reddit/model.dict
2020-07-30 08:56:00,267 INFO     | num words = 18899
2020-07-30 08:56:03,271 INFO     | Total parameters: 200,724,480 (200,724,480 trainable)
2020-07-30 08:56:03,271 INFO     | Loading existing model parameters from data/models/pretrained_transformers/poly_model_huge_reddit/model
Building dictionary: 100%|██████████| 131k/131k [00:08<00:00, 14.7kex/s]
no pair has frequency >= 2. Stopping
Traceback (most recent call last):
  File "/auto/nlg-05/naitian/parlai/parlai/core/torch_agent.py", line 1809, in load_state_dict
    self.model.load_state_dict(state_dict)
  File "/home/nlg-05/naitian/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for PolyEncoderModule:
        size mismatch for encoder_ctxt.embeddings.weight: copying a param with shape torch.Size([54944, 768]) from checkpoint, the shape in current model is torch.Size([18904, 768]).
        size mismatch for encoder_cand.embeddings.weight: copying a param with shape torch.Size([54944, 768]) from checkpoint, the shape in current model is torch.Size([18904, 768]).
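
(Side check, not part of the original logs: the shapes in the traceback can be compared directly against the checkpoint and the dictionary file on disk. The paths below are the ones from the command above; treating the checkpoint as a torch pickle that keeps its weights under a 'model' key is an assumption, though it matches the parameter names shown in the traceback.)

import torch

ckpt_path = "data/models/pretrained_transformers/poly_model_huge_reddit/model"
dict_path = ckpt_path + ".dict"

# Load the pretrained checkpoint on CPU and inspect the context-encoder embedding
# table; its row count is the vocabulary size the model was trained with (~54944).
states = torch.load(ckpt_path, map_location="cpu")
emb = states["model"]["encoder_ctxt.embeddings.weight"]
print("checkpoint embedding rows:", emb.shape[0])

# ParlAI .dict files store one entry per line, so the line count approximates the
# vocabulary size; ~19k lines here would mean the file no longer matches the model.
with open(dict_path, encoding="utf-8") as f:
    print("entries in dict file:", sum(1 for _ in f))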

This seems to be a very similar issue to #2539, but explicitly setting the dict file path didn't work for me.

Thanks!

stephenroller (Contributor) commented
Weird, it looks like the dict file is being ignored:

if 'dict_file' in opt:
    if opt['dict_file'] is None and opt.get('model_file'):
        opt['dict_file'] = opt['model_file'] + '.dict'
    logging.info("building dictionary first...")
    build_dict(opt, skip_if_built=True)

and

if skip_if_built and os.path.isfile(opt['dict_file']):
    # Dictionary already built, skip all loading or setup
    logging.debug("dictionary already built.")
    return None

Can you try ensuring that the file definitely exists, and that it wasn't overwritten or something?
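
(A minimal way to run that check, added here for anyone hitting the same thing. The path is the one passed as --dict-file in the command above, and the expected entry count is only a rough figure based on the ~55k-word reddit dictionary mentioned earlier; this is a sketch, not ParlAI tooling.)

import os

# Path passed as --dict-file in the training command above.
dict_file = "data/models/pretrained_transformers/poly_model_huge_reddit/model.dict"

if not os.path.isfile(dict_file):
    # Matches the skip_if_built check quoted above: a missing file means
    # build_dict() falls through and builds a fresh dictionary from the task data.
    print("dict file missing -- a new dictionary would be built")
else:
    # One entry per line in ParlAI .dict files, so the line count is the vocabulary
    # size. Roughly 19k entries here would mean the file was overwritten; the
    # pretrained reddit dictionary should be on the order of ~55k entries.
    with open(dict_file, encoding="utf-8") as f:
        print("dict file has", sum(1 for _ in f), "entries")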


naitian commented Jul 31, 2020

Thanks! It looks like the file was overwritten at some point. I re-downloaded the model files and it's working now.
