Drastic difference between .nemo and HF checkpoint #11360

Open
rahul-sarvam opened this issue Nov 21, 2024 · 0 comments
Labels: bug (Something isn't working)

@rahul-sarvam

Describe the bug

I have trained a Llama-like model with NeMo using the model config below:

```yaml
model:
  mcore_gpt: True
  micro_batch_size: 1
  global_batch_size: 512
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  virtual_pipeline_model_parallel_size: null
  context_parallel_size: 1
  encoder_seq_length: 8192
  max_position_embeddings: ${.encoder_seq_length}
  num_layers: 28
  hidden_size: 2048
  ffn_hidden_size: 11008
  num_attention_heads: 16
  init_method_std: 0.02
  use_scaled_init_method: True
  hidden_dropout: 0.0
  attention_dropout: 0.0
  ffn_dropout: 0.0
  kv_channels: null
  apply_query_key_layer_scaling: True
  normalization: 'rmsnorm'
  layernorm_epsilon: 1e-6
  do_layer_norm_weight_decay: False
  make_vocab_size_divisible_by: 128
  pre_process: True
  post_process: True
  persist_layer_norm: True
  bias: False
  activation: 'fast-swiglu'
  headscale: False
  transformer_block_type: 'pre_ln'
  openai_gelu: False
  normalize_attention_scores: True
  position_embedding_type: 'rope'
  rotary_percentage: 1.0
  attention_type: 'multihead'
  share_embeddings_and_output_weights: False
  overlap_p2p_comm: False
  batch_p2p_comm: True
  num_query_groups: 8
  rotary_base: 10000.0
```

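For reference, assuming a standard NeMo-to-HF Llama field mapping, the converted checkpoint's `LlamaConfig` should line up with the values above roughly as follows. This is just a sketch for sanity-checking the exported `config.json`; `vocab_size` is a placeholder, since NeMo pads the embedding table to a multiple of `make_vocab_size_divisible_by` (128 here) and the HF value must match the padded size actually used in training:

```python
from transformers import LlamaConfig

# Sketch of the HF config the converted checkpoint should carry,
# mapped field-by-field from the NeMo config above.
hf_config = LlamaConfig(
    vocab_size=32000,              # placeholder: must match NeMo's padded vocab
    hidden_size=2048,              # hidden_size
    intermediate_size=11008,       # ffn_hidden_size
    num_hidden_layers=28,          # num_layers
    num_attention_heads=16,        # num_attention_heads
    num_key_value_heads=8,         # num_query_groups (GQA)
    max_position_embeddings=8192,  # encoder_seq_length
    rms_norm_eps=1e-6,             # layernorm_epsilon
    rope_theta=10000.0,            # rotary_base
    tie_word_embeddings=False,     # share_embeddings_and_output_weights
    attention_bias=False,          # bias
)
```
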
The model works well when I run inference using the .nemo checkpoint (script), but the converted HF checkpoint (script) shows a drastic drop in performance. Any ideas why this might be happening? My only hunch is that apply_query_key_layer_scaling=True in NeMo, which may not be mirrored on the HF side.
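
In case it helps narrow this down, here is a rough way to diagnose where the two checkpoints diverge (a sketch, not the linked scripts): save the logits for a fixed prompt from the .nemo inference run, then compare them position by position against the converted HF model. The checkpoint path and `nemo_logits.pt` are hypothetical names:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "path/to/converted_hf_checkpoint"  # hypothetical path
PROMPT = "The capital of France is"

tok = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype=torch.float32)
model.eval()

inputs = tok(PROMPT, return_tensors="pt")
with torch.no_grad():
    hf_logits = model(**inputs).logits  # shape [1, seq_len, vocab_size]

# Reference logits dumped from the .nemo inference run for the same
# prompt (hypothetical file; e.g. torch.save(logits, "nemo_logits.pt")
# on the NeMo side).
nemo_logits = torch.load("nemo_logits.pt")

diff = (hf_logits - nemo_logits).abs()
print("max abs diff per position:", diff.amax(dim=-1))
# A large diff already at the first position points at a weight- or
# config-mapping problem; small drift that grows with sequence length
# is more consistent with numerics (e.g. the fp32 softmax path that
# apply_query_key_layer_scaling forces in Megatron-style attention).
```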

Environment details
https://docs.nvidia.com/nemo-framework/user-guide/latest/softwarecomponentversions.html#nemo-framework-24-05

@rahul-sarvam added the bug label on Nov 21, 2024