[BUG] AssertionError: causal mask is only for self attention #13

Open
Sunt-ing opened this issue Jun 8, 2024 · 2 comments
@Sunt-ing

Sunt-ing commented Jun 8, 2024

Describe the bug

I tried to run a translation task on the checkpoint (the converted 7B model), but the bug occurs intermittently (not always; for some prompts the server works fine).

One such prompt:

prompt: translate the sentences to English.

example: 

source: 

西米诺夫说,2013 年他在《创智赢家》节目中露面后,公司的销售额大增,当时节目组拒绝向这家初创公司投资。

target: 

Siminoff said sales boosted after his 2013 appearance in a Shark Tank episode where the show panel declined funding the startup.

source: 

2017 年年末,西米诺夫出现在 QVC 电视销售频道。

target: 

In late 2017, Siminoff appeared on shopping television channel QVC.

source: 

铃声 (Ring) 公司还与竞争对手 ADT 安保公司在一起官司中达成了庭外和解。

target: 

Ring also settled a lawsuit with competing security company, the ADT Corporation.

translate the following sentences:

source: 

他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”

target: 

Arguments sent to the server:

 {'prompts': ['translate the sentences to English.\n\nexample: \n\nsource: \n\n西米诺夫说,2013 年他在《创智赢家》节目中露面后,公司的销售额大增,当时节目组拒绝向这家初创公司投资。\n\ntarget: \n\nSiminoff said sales boosted after his 2013 appearance in a Shark Tank episode where the show panel declined funding the startup.\n\nsource: \n\n2017 年年末,西米诺夫出现在 QVC 电视销售频道。\n\ntarget: \n\nIn late 2017, Siminoff appeared on shopping television channel QVC.\n\nsource: \n\n铃声 (Ring) 公司还与竞争对手 ADT 安保公司在一起官司中达成了庭外和解。\n\ntarget: \n\nRing also settled a lawsuit with competing security company, the ADT Corporation.\n\ntranslate the following sentences:\n\nsource: \n\n他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”\n\ntarget: \n\n'], 'tokens_to_generate': 50, 'top_k': 1, 'logprobs': True, 'random_seed': 42, 'echo_prompts': False, 'early_exit_thres': 0.2, 'exit_layers': [], 'use_early_exit': True}
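
For reference, a request with these arguments could be sent to the server roughly as follows. This is only a minimal sketch: the endpoint path (`/api`), the port (5000, as shown in the server startup log further down), and the use of an HTTP PUT follow Megatron-LM's standard text generation server conventions and are assumptions, not something confirmed in this report.

```python
# Hedged sketch: sending the arguments above to the EE-LLM text generation server.
# The URL and the PUT verb are assumptions based on Megatron-LM's standard server;
# the payload keys are copied from the request shown above (prompt truncated).
import requests

payload = {
    "prompts": ["translate the sentences to English.\n\n..."],  # full prompt omitted here
    "tokens_to_generate": 50,
    "top_k": 1,
    "logprobs": True,
    "random_seed": 42,
    "echo_prompts": False,
    "early_exit_thres": 0.2,
    "exit_layers": [],
    "use_early_exit": True,
}

response = requests.put("http://localhost:5000/api", json=payload)  # assumed endpoint
print(response.json())
```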

Error message in the server logs:

Traceback (most recent call last):
  File "/workspace/data/EE-LLM/megatron/text_generation/api.py", line 206, in generate
    output = generate_tokens_probs_and_return_on_first_stage(
  File "/workspace/data/EE-LLM/megatron/text_generation/generation.py", line 208, in generate_tokens_probs_and_return_on_first_stage
    logits = forward_step(tokens2use, positions2use, attention_mask2use)
  File "/workspace/data/EE-LLM/megatron/text_generation/forward_step.py", line 57, in __call__
    return _no_pipelining_forward_step(self.model,
  File "/workspace/data/EE-LLM/megatron/text_generation/forward_step.py", line 113, in _no_pipelining_forward_step
    output_tensor = _forward_step_helper(model, tokens, position_ids,
  File "/workspace/data/EE-LLM/megatron/text_generation/forward_step.py", line 99, in _forward_step_helper
    output_tensor = model(tokens, position_ids, attention_mask,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/data/EE-LLM/megatron/model/module.py", line 181, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/data/EE-LLM/megatron/model/early_exit_gpt_model.py", line 160, in forward
    lm_output, early_exit_output = self.language_model(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/data/EE-LLM/megatron/model/language_model.py", line 713, in forward
    encoder_output, early_exit_output = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/data/EE-LLM/megatron/model/transformer.py", line 2151, in forward
    hidden_states = layer(hidden_states,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/data/EE-LLM/megatron/model/transformer.py", line 1145, in forward
    self.self_attention(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/data/EE-LLM/megatron/model/transformer.py", line 790, in forward
    context_layer = self.core_attention(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/data/EE-LLM/megatron/model/transformer.py", line 376, in forward
    attention_probs = self.scale_mask_softmax(attention_scores,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1423, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/data/EE-LLM/megatron/model/fused_softmax.py", line 148, in forward
    return self.forward_fused_softmax(input, mask)
  File "/workspace/data/EE-LLM/megatron/model/fused_softmax.py", line 179, in forward_fused_softmax
    assert sq == sk, "causal mask is only for self attention"
AssertionError: causal mask is only for self attention
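
The assertion comes from the fused masked-softmax path, which only supports a causal mask when the attention-score matrix is square (query length equals key length). The sketch below illustrates that check; it is not the actual EE-LLM code, and the shapes are hypothetical, chosen to mirror a step where a single new query token attends to a longer cached key sequence.

```python
# Illustration only (assumed shapes, not EE-LLM source): the fused causal-softmax
# path requires a square score matrix, so sq != sk trips the assertion seen above.
import torch

def causal_softmax_guard(attention_scores: torch.Tensor) -> None:
    # attention_scores: [batch, num_heads, sq, sk]
    sq, sk = attention_scores.size(2), attention_scores.size(3)
    assert sq == sk, "causal mask is only for self attention"

scores = torch.randn(1, 32, 1, 128)  # 1 query token vs. 128 key positions
causal_softmax_guard(scores)         # raises AssertionError, matching the traceback
```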

To Reproduce
Steps to reproduce the behavior. The easier it is to reproduce, the faster it will get maintainer attention.

Expected behavior
A clear and concise description of what you expected to happen.

Stack trace/logs
If applicable, add the stack trace or logs from the time of the error.

Environment (please complete the following information):

  • Megatron-LM commit ID
  • PyTorch version: '2.3.1+cu118'
  • CUDA version: 11.8
  • NCCL version

Proposed fix
If you have a proposal for how to fix the issue state it here or link to a PR.

Additional context
Add any other context about the problem here.

@pan-x-c
Owner

pan-x-c commented Jun 8, 2024

The error message does not appear to be related to EE-LLM, but rather seems to be caused by the environment. My inference server can generate content normally without any errors.

The startup log of my server is as follows:

Zarr-based strategies will not be registered because of missing packages
load checkpoint args
Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 10880 from checkpoint
Setting seq_length to 2048 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting num_query_groups to 1 from checkpoint
Setting group_query_attention to False from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 2048 from checkpoint
Setting position_embedding_type to rope from checkpoint
Setting add_position_embedding to False from checkpoint
Setting use_rotary_position_embeddings to False from checkpoint
Setting rotary_percent to 1.0 from checkpoint
Setting add_bias_linear to False from checkpoint
Setting swiglu to True from checkpoint
Setting untie_embeddings_and_output_weights to True from checkpoint
Setting apply_layernorm_1p to False from checkpoint
Setting normalization to RMSNorm from checkpoint
Setting padded_vocab_size to 32128 from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting exit_layer_nums to [9, 17] from checkpoint
Setting exit_layer_weight to [0.1, 0.2] from checkpoint
Setting use_exit_mlp to False from checkpoint
Setting use_exit_block to False from checkpoint
Setting use_exit_norm to False from checkpoint
Setting untie_exit_output_weights to True from checkpoint
Setting pre_exit to True from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
Checkpoint did not provide arguments virtual_pipeline_model_parallel_size
Checkpoint did not provide arguments num_layers_per_virtual_pipeline_stage
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:SentencePieceTokenizer
setting global batch size to 1
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. False
  add_position_embedding .......................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... True
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  backward_forward_ratio .......................... 2.0
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ False
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. True
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 1
  data_path ....................................... None
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  delay_grad_reduce ............................... True
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 32
  encoder_seq_length .............................. 2048
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 100
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_layer_nums ................................. [9, 17]
  exit_layer_temperature .......................... [1.0, 1.0]
  exit_layer_weight ............................... [0.1, 0.2]
  exit_layer_weight_init .......................... [0.0, 0.0]
  exit_layer_weight_warmup_iters .................. 0
  exit_layer_weight_warmup_style .................. linear
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 1
  expert_parallel ................................. False
  ffn_hidden_size ................................. 10880
  fill_explicit_bubbles ........................... False
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 1
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 4096
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  iteration ....................................... 36000
  kv_channels ..................................... 128
  lazy_mpu_init ................................... None
  load ............................................ /home/data/shared/checkpoints/EE-LLM-release/EE-LLM-7B-dj-refine-150B/convert-1
  load_iteration .................................. 0
  local_rank ...................................... None
  log_batch_size_to_tracker ....................... False
  log_interval .................................... 100
  log_learning_rate_to_tracker .................... True
  log_loss_scale_to_tracker ....................... True
  log_memory_to_tracker ........................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_timers_to_tracker ........................... False
  log_validation_ppl_to_tracker ................... False
  log_world_size_to_tracker ....................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. None
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. linear
  lr_warmup_fraction .............................. None
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 2048
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 0.0
  mmap_warmup ..................................... False
  model_spec ...................................... None
  no_load_optim ................................... True
  no_load_rng ..................................... True
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... RMSNorm
  num_attention_heads ............................. 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... None
  num_fill_cooldown_microbatches .................. None
  num_fill_warmup_microbatches .................... None
  num_layers ...................................... 32
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  override_opt_param_scheduler .................... False
  padded_vocab_size ............................... 32128
  params_dtype .................................... torch.float32
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  port ............................................ 5000
  position_embedding_type ......................... rope
  pre_exit ........................................ True
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_return_doc_ids ............................ False
  retro_workdir ................................... None
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... None
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 2048
  sequence_parallel ............................... False
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  split ........................................... 969, 30, 1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  swiglu .......................................... True
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. /home/data/panxuchen.pxc/code/Megatron-LM/tokenizer/tokenizer.model
  tokenizer_type .................................. SentencePieceTokenizer
  tracker_log_interval ............................ 1
  train_data_path ................................. None
  train_iters ..................................... None
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  tune_exit ....................................... False
  tune_exit_pipeline_parallel_size ................ 1
  untie_embeddings_and_output_weights ............. True
  untie_exit_output_weights ....................... True
  use_checkpoint_args ............................. True
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_distributed_optimizer ....................... False
  use_dynamic_exit_layer_weight ................... False
  use_exit_block .................................. False
  use_exit_mlp .................................... False
  use_exit_norm ................................... False
  use_flash_attn .................................. False
  use_mcore_models ................................ False
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... None
  wandb_exp_name .................................. default
  wandb_group ..................................... None
  wandb_project ................................... None
  wandb_save_dir .................................. 
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
> building SentencePieceTokenizer tokenizer ...
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/mnt/data/panxuchen.pxc/dev/Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/mnt/data/panxuchen.pxc/dev/Megatron-LM/megatron/data'
>>> done with dataset index builder. Compilation time: 0.037 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
NCCL version 2.15.5+cuda11.8
>>> done with compiling and loading fused kernels. Compilation time: 0.266 seconds
WARNING: Forcing exit_on_missing_checkpoint to True for text generation.
building EarlyExitGPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 6952325120
 loading checkpoint from /home/data/shared/checkpoints/EE-LLM-release/EE-LLM-7B-dj-refine-150B/convert-1 at iteration 36000
 checkpoint version 3.0
  successfully loaded checkpoint from /home/data/shared/checkpoints/EE-LLM-release/EE-LLM-7B-dj-refine-150B/convert-1 at iteration 36000
 * Serving Flask app 'megatron.early_exit_text_generation_server'
 * Debug mode: off

The specific prompt request and the corresponding response log are as follows:

request IP: 127.0.0.1
{"prompts": ["translate the sentences to English.\n\nexample: \n\nsource: \n\n\u897f\u7c73\u8bfa\u592b\u8bf4\uff0c2013 \u5e74\u4ed6\u5728\u300a\u521b\u667a\u8d62\u5bb6\u300b\u8282\u76ee\u4e2d\u9732\u9762\u540e\uff0c\u516c\u53f8\u7684\u9500\u552e\u989d\u5927\u589e\uff0c\u5f53\u65f6\u8282\u76ee\u7ec4\u62d2\u7edd\u5411\u8fd9\u5bb6\u521d\u521b\u516c\u53f8\u6295\u8d44\u3002\n\ntarget: \n\nSiminoff said sales boosted after his 2013 appearance in a Shark Tank episode where the show panel declined funding the startup.\n\nsource: \n\n2017 \u5e74\u5e74\u672b\uff0c\u897f\u7c73\u8bfa\u592b\u51fa\u73b0\u5728 QVC \u7535\u89c6\u9500\u552e\u9891\u9053\u3002\n\ntarget: \n\nIn late 2017, Siminoff appeared on shopping television channel QVC.\n\nsource: \n\n\u94c3\u58f0 (Ring) \u516c\u53f8\u8fd8\u4e0e\u7ade\u4e89\u5bf9\u624b ADT \u5b89\u4fdd\u516c\u53f8\u5728\u4e00\u8d77\u5b98\u53f8\u4e2d\u8fbe\u6210\u4e86\u5ead\u5916\u548c\u89e3\u3002\n\ntarget: \n\nRing also settled a lawsuit with competing security company, the ADT Corporation.\n\ntranslate the following sentences:\n\nsource: \n\n\u4ed6\u8865\u5145\u9053\uff1a\u201c\u6211\u4eec\u73b0\u5728\u6709 4 \u4e2a\u6708\u5927\u6ca1\u6709\u7cd6\u5c3f\u75c5\u7684\u8001\u9f20\uff0c\u4f46\u5b83\u4eec\u66fe\u7ecf\u5f97\u8fc7\u8be5\u75c5\u3002\u201d\n\ntarget: \n\n"], "tokens_to_generate": 200, "top_k": 1, "logprobs": true, "random_seed": 9958, "echo_prompts": false, "early_exit_thres": 0.2, "exit_layers": [], "use_early_exit": true, "print_max_prob": false, "top_p": 0, "top_p_decay": 0.0, "top_p_bound": 0.0, "temperature": 0.0, "add_BOS": false, "stop_sequences": null, "prevent_newline_after_colon": false, "length_penalty": 1}
start time:  2024-06-08 10:26:42.342346
Response(use 3.3466479778289795s): ['The company has been dealing with a series of respiratory illnesses in recent years.\n\ntranslate the following sentences:\n\nsource: \n\n他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充']

The error might be related to flash-attention or PyTorch: the PyTorch version you are using is quite new, whereas EE-LLM was developed against an older one. I recommend trying the Docker image suggested in the README (nvcr.io/nvidia/pytorch:22.12-py3) to see whether it resolves the issue.
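
For completeness, the environment details requested in the issue template could be gathered with a short script like the one below; it is only a convenience sketch (the flash_attn import is optional and attempted only if the package is installed).

```python
# Convenience sketch for reporting environment details (PyTorch, CUDA, cuDNN,
# NCCL, flash-attn). All calls are standard PyTorch APIs; flash_attn is optional.
import torch

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn: not installed")
```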

github-actions bot commented Aug 7, 2024

Marking as stale. No activity in 60 days.

@github-actions github-actions bot added the stale label Aug 7, 2024