Drastic difference between .nemo and HF checkpoint #11360

Open
rahul-sarvam opened this issue Nov 21, 2024 · 0 comments
Labels: bug (Something isn't working)

@rahul-sarvam

Describe the bug

I have trained a Llama-like model with NeMo using the model config below:

```yaml
model:
  mcore_gpt: True
  micro_batch_size: 1
  global_batch_size: 512
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  virtual_pipeline_model_parallel_size: null
  context_parallel_size: 1
  encoder_seq_length: 8192
  max_position_embeddings: ${.encoder_seq_length}
  num_layers: 28
  hidden_size: 2048
  ffn_hidden_size: 11008
  num_attention_heads: 16
  init_method_std: 0.02
  use_scaled_init_method: True
  hidden_dropout: 0.0
  attention_dropout: 0.0
  ffn_dropout: 0.0
  kv_channels: null
  apply_query_key_layer_scaling: True
  normalization: 'rmsnorm'
  layernorm_epsilon: 1e-6
  do_layer_norm_weight_decay: False
  make_vocab_size_divisible_by: 128
  pre_process: True
  post_process: True
  persist_layer_norm: True
  bias: False
  activation: 'fast-swiglu'
  headscale: False
  transformer_block_type: 'pre_ln'
  openai_gelu: False
  normalize_attention_scores: True
  position_embedding_type: 'rope'
  rotary_percentage: 1.0
  attention_type: 'multihead'
  share_embeddings_and_output_weights: False
  overlap_p2p_comm: False
  batch_p2p_comm: True
  num_query_groups: 8
  rotary_base: 10000.0
```

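For reference, assuming a standard NeMo-to-HF Llama field mapping, the converted checkpoint's `LlamaConfig` should line up with the values above roughly as follows. This is just a sketch for sanity-checking the exported `config.json`; `vocab_size` is a placeholder, since NeMo pads the embedding table to a multiple of `make_vocab_size_divisible_by` (128 here) and the HF value must match the padded size actually used in training:

```python
from transformers import LlamaConfig

# Sketch of the HF config the converted checkpoint should carry,
# mapped field-by-field from the NeMo config above.
hf_config = LlamaConfig(
    vocab_size=32000,              # placeholder: must match NeMo's padded vocab
    hidden_size=2048,              # hidden_size
    intermediate_size=11008,       # ffn_hidden_size
    num_hidden_layers=28,          # num_layers
    num_attention_heads=16,        # num_attention_heads
    num_key_value_heads=8,         # num_query_groups (GQA)
    max_position_embeddings=8192,  # encoder_seq_length
    rms_norm_eps=1e-6,             # layernorm_epsilon
    rope_theta=10000.0,            # rotary_base
    tie_word_embeddings=False,     # share_embeddings_and_output_weights
    attention_bias=False,          # bias
)
```
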
The model works well when I run inference using the .nemo checkpoint (script), but the converted HF checkpoint (script) shows a drastic drop in performance. Any ideas why this might be happening? My only hunch is that apply_query_key_layer_scaling=True in NeMo, which may not be mirrored on the HF side.
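
In case it helps narrow this down, here is a rough way to diagnose where the two checkpoints diverge (a sketch, not the linked scripts): save the logits for a fixed prompt from the .nemo inference run, then compare them position by position against the converted HF model. The checkpoint path and `nemo_logits.pt` are hypothetical names:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "path/to/converted_hf_checkpoint"  # hypothetical path
PROMPT = "The capital of France is"

tok = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype=torch.float32)
model.eval()

inputs = tok(PROMPT, return_tensors="pt")
with torch.no_grad():
    hf_logits = model(**inputs).logits  # shape [1, seq_len, vocab_size]

# Reference logits dumped from the .nemo inference run for the same
# prompt (hypothetical file; e.g. torch.save(logits, "nemo_logits.pt")
# on the NeMo side).
nemo_logits = torch.load("nemo_logits.pt")

diff = (hf_logits - nemo_logits).abs()
print("max abs diff per position:", diff.amax(dim=-1))
# A large diff already at the first position points at a weight- or
# config-mapping problem; small drift that grows with sequence length
# is more consistent with numerics (e.g. the fp32 softmax path that
# apply_query_key_layer_scaling forces in Megatron-style attention).
```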

Environment details
https://docs.nvidia.com/nemo-framework/user-guide/latest/softwarecomponentversions.html#nemo-framework-24-05

@rahul-sarvam added the bug label on Nov 21, 2024