
Assertion srcIndex < srcSelectDimSize failed #24698

Closed
MaggieK410 opened this issue Jul 6, 2023 · 4 comments

Comments

@MaggieK410

Hi,
I am running medalpaca (but the error seems to come from llama) on 4 GPUs using device_map="auto" and the SFTTrainer, and I want to prompt-tune the model. I have written a custom Dataset class:

import torch

class DiagnosesDataset(torch.utils.data.Dataset):
    def __init__(self, instances, tokenizer):
        self.instances = instances
        self.tokenizer = tokenizer

    def __getitem__(self, idx):
        item = {}
        prompt = self.instances["prompt"][idx]
        labels = self.instances["label"][idx]

        item = self.tokenize(prompt + labels)
        tokenized_instruction = self.tokenize(prompt)
        label_instruction = self.tokenizer(labels)

        i = len(tokenized_instruction["input_ids"])
        item["labels"][i:] = label_instruction["input_ids"]
        return item

    def tokenize(self, prompt):
        result_prompt = self.tokenizer(prompt,
                                       truncation=True,
                                       max_length=2048,
                                       padding=False,
                                       return_tensors=None)

        result_prompt["labels"] = [-100] * len(result_prompt["input_ids"])
        return result_prompt

    def __len__(self):
        return len(self.instances)
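
For context, a minimal sketch of how this class might be instantiated; the structure of `instances` (a dict of parallel "prompt"/"label" lists) and the checkpoint name are assumptions on my side, since neither appears in the report:

```python
from transformers import AutoTokenizer

# Assumed structure: parallel "prompt"/"label" lists, matching the
# self.instances["prompt"][idx] / self.instances["label"][idx] lookups above.
instances = {
    "prompt": ["Patient presents with fever and cough. Diagnosis:"],
    "label": [" viral upper respiratory infection"],
}

# Hypothetical checkpoint; the report does not say which medalpaca model is used.
tokenizer = AutoTokenizer.from_pretrained("medalpaca/medalpaca-7b")

dataset = DiagnosesDataset(instances, tokenizer)
print(dataset[0]["input_ids"][:10], dataset[0]["labels"][:10])
```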

The TrainingArguments and PEFT config:

from transformers import TrainingArguments
from peft import LoraConfig, TaskType

training_arguments = TrainingArguments(
    output_dir="./falcon_output_dir",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-4,
    max_steps=10000,
    fp16=False,
    bf16=False,
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    group_by_length=True,
    remove_unused_columns=False)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=4,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "v_proj"])
The SFTTrainer I am using looks like this:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    packing=True,
    args=training_arguments)
trainer.train()
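
The model itself is not shown in the report; for illustration only, a minimal sketch of how a LLaMA-based medalpaca checkpoint might be loaded across the 4 GPUs with `device_map="auto"` (the checkpoint name is an assumption):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "medalpaca/medalpaca-7b"  # assumed checkpoint; the report does not name one

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # shards the model across the visible GPUs
)
# If tokens are ever added to the tokenizer elsewhere in the script, the
# embedding matrix must be resized to match (see the resolution below).
```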

However, when training the model, there seems to be an indexing issue somewhere (see https://discuss.pytorch.org/t/solved-assertion-srcindex-srcselectdimsize-failed-on-gpu-for-torch-cat/1804/27).

The error I am getting is this:
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/students/kulcsar/Bachelor/for_dataset/10000_diagnoses/falcon_model_pef │
│ t.py:544 in │
│ │
│ 541 │ │
│ 542 │ │
│ 543 │ args=parser.parse_args() │
│ ❱ 544 │ run() │
│ 545 │ #main() │
│ 546 │ │
│ 547 │ #all_data, prompts, golds=preprocess("./dataset.pkl") │
│ │
│ /home/students/kulcsar/Bachelor/for_dataset/10000_diagnoses/falcon_model_pef │
│ t.py:153 in run │
│ │
│ 150 │ │ packing=True, │
│ 151 │ │ data_collator=DataCollatorForSeq2Seq(tokenizer, pad_to_multipl │
│ 152 │ │ args=training_arguments) │
│ ❱ 153 │ trainer.train() │
│ 154 │ │
│ 155 │ logging.info("Run Train loop") │
│ 156 │ #model_updated=train(model, dataset, args.seed, args.batch_size, a │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/trainer.py:1537 in train │
│ │
│ 1534 │ │ inner_training_loop = find_executable_batch_size( │
│ 1535 │ │ │ self._inner_training_loop, self._train_batch_size, args.a │
│ 1536 │ │ ) │
│ ❱ 1537 │ │ return inner_training_loop( │
│ 1538 │ │ │ args=args, │
│ 1539 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1540 │ │ │ trial=trial, │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/trainer.py:1802 in _inner_training_loop │
│ │
│ 1799 │ │ │ │ │ self.control = self.callback_handler.on_step_begi │
│ 1800 │ │ │ │ │
│ 1801 │ │ │ │ with self.accelerator.accumulate(model): │
│ ❱ 1802 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1803 │ │ │ │ │
│ 1804 │ │ │ │ if ( │
│ 1805 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/trainer.py:2647 in training_step │
│ │
│ 2644 │ │ │ return loss_mb.reduce_mean().detach().to(self.args.device │
│ 2645 │ │ │
│ 2646 │ │ with self.compute_loss_context_manager(): │
│ ❱ 2647 │ │ │ loss = self.compute_loss(model, inputs) │
│ 2648 │ │ │
│ 2649 │ │ if self.args.n_gpu > 1: │
│ 2650 │ │ │ loss = loss.mean() # mean() to average on multi-gpu para │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/trainer.py:2672 in compute_loss │
│ │
│ 2669 │ │ │ labels = inputs.pop("labels") │
│ 2670 │ │ else: │
│ 2671 │ │ │ labels = None │
│ ❱ 2672 │ │ outputs = model(**inputs) │
│ 2673 │ │ # Save past state if it exists │
│ 2674 │ │ # TODO: this needs to be fixed and made cleaner later. │
│ 2675 │ │ if self.args.past_index >= 0: │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1502 in _wrapped_call_impl │
│ │
│ 1499 │ │ if self._compiled_call_impl is not None: │
│ 1500 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: │
│ 1501 │ │ else: │
│ ❱ 1502 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1503 │ │
│ 1504 │ def _call_impl(self, *args, **kwargs): │
│ 1505 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_s │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1511 in _call_impl │
│ │
│ 1508 │ │ if not (self._backward_hooks or self._backward_pre_hooks or s │
│ 1509 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hoo │
│ 1510 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1511 │ │ │ return forward_call(*args, **kwargs) │
│ 1512 │ │ # Do not call functions when jit is used │
│ 1513 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1514 │ │ backward_pre_hooks = [] │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/peft/peft_model.py:739 in forward │
│ │
│ 736 │ ): │
│ 737 │ │ peft_config = self.active_peft_config │
│ 738 │ │ if not isinstance(peft_config, PromptLearningConfig): │
│ ❱ 739 │ │ │ return self.base_model( │
│ 740 │ │ │ │ input_ids=input_ids, │
│ 741 │ │ │ │ attention_mask=attention_mask, │
│ 742 │ │ │ │ inputs_embeds=inputs_embeds, │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1502 in _wrapped_call_impl │
│ │
│ 1499 │ │ if self._compiled_call_impl is not None: │
│ 1500 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: │
│ 1501 │ │ else: │
│ ❱ 1502 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1503 │ │
│ 1504 │ def _call_impl(self, *args, **kwargs): │
│ 1505 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_s │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1511 in _call_impl │
│ │
│ 1508 │ │ if not (self._backward_hooks or self._backward_pre_hooks or s │
│ 1509 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hoo │
│ 1510 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1511 │ │ │ return forward_call(*args, **kwargs) │
│ 1512 │ │ # Do not call functions when jit is used │
│ 1513 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1514 │ │ backward_pre_hooks = [] │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module.hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/models/llama/modeling_llama.py:691 in │
│ forward │
│ │
│ 688 │ │ return_dict = return_dict if return_dict is not None else self │
│ 689 │ │ │
│ 690 │ │ # decoder outputs consists of (dec_features, layer_state, dec │
│ ❱ 691 │ │ outputs = self.model( │
│ 692 │ │ │ input_ids=input_ids, │
│ 693 │ │ │ attention_mask=attention_mask, │
│ 694 │ │ │ position_ids=position_ids, │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1502 in _wrapped_call_impl │
│ │
│ 1499 │ │ if self._compiled_call_impl is not None: │
│ 1500 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: │
│ 1501 │ │ else: │
│ ❱ 1502 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1503 │ │
│ 1504 │ def _call_impl(self, *args, **kwargs): │
│ 1505 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_s │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1511 in _call_impl │
│ │
│ 1508 │ │ if not (self._backward_hooks or self._backward_pre_hooks or s │
│ 1509 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hoo │
│ 1510 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1511 │ │ │ return forward_call(*args, **kwargs) │
│ 1512 │ │ # Do not call functions when jit is used │
│ 1513 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1514 │ │ backward_pre_hooks = [] │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/models/llama/modeling_llama.py:532 in │
│ forward │
│ │
│ 529 │ │ │ position_ids = position_ids.view(-1, seq_length).long() │
│ 530 │ │ │
│ 531 │ │ if inputs_embeds is None: │
│ ❱ 532 │ │ │ inputs_embeds = self.embed_tokens(input_ids) │
│ 533 │ │ # embed positions │
│ 534 │ │ if attention_mask is None: │
│ 535 │ │ │ attention_mask = torch.ones( │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1502 in _wrapped_call_impl │
│ │
│ 1499 │ │ if self._compiled_call_impl is not None: │
│ 1500 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: │
│ 1501 │ │ else: │
│ ❱ 1502 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1503 │ │
│ 1504 │ def _call_impl(self, *args, **kwargs): │
│ 1505 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_s │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1511 in _call_impl │
│ │
│ 1508 │ │ if not (self._backward_hooks or self._backward_pre_hooks or s │
│ 1509 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hoo │
│ 1510 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1511 │ │ │ return forward_call(*args, **kwargs) │
│ 1512 │ │ # Do not call functions when jit is used │
│ 1513 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1514 │ │ backward_pre_hooks = [] │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module.hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/sparse.py:162 in forward │
│ │
│ 159 │ │ │ │ self.weight[self.padding_idx].fill_(0) │
│ 160 │ │
│ 161 │ def forward(self, input: Tensor) -> Tensor: │
│ ❱ 162 │ │ return F.embedding( │
│ 163 │ │ │ input, self.weight, self.padding_idx, self.max_norm, │
│ 164 │ │ │ self.norm_type, self.scale_grad_by_freq, self.sparse) │
│ 165 │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/functional.py:2238 in embedding │
│ │
│ 2235 │ │ # torch.embedding_renorm_ │
│ 2236 │ │ # remove once script supports set_grad_enabled │
│ 2237 │ │ no_grad_embedding_renorm(weight, input, max_norm, norm_type │
│ ❱ 2238 │ return torch.embedding(weight, input, padding_idx, scale_grad_by │
│ 2239 │
│ 2240 │
│ 2241 def embedding_bag( │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Does anyone have an idea what the issue might be? Any help would be greatly appreciated!

@amyeroberts
Collaborator

Hi @MaggieK410, thanks for reporting this issue.

This is typically caused by an indexing issue in the code.

Could you follow the issue template and:

  • Provide information about the running environment: run transformers-cli env in the terminal and copy-paste the output
  • Format the code examples. All code should be sandwiched between three backticks ``` all code goes here ```
  • Could you also put the error message in code formatting please?
  • Provide a checkpoint - which medalpaca model is being tested?
  • Ensure the example code is runnable: dataset is not defined.
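
Since the assert in the traceback above fires inside `F.embedding`, one quick check is whether any input id is out of range for the model's embedding matrix. A minimal sketch, assuming `model`, `tokenizer`, and `dataset` are the objects from the report; running the script on CPU, or with the environment variable `CUDA_LAUNCH_BLOCKING=1`, also tends to give a clearer stack trace:

```python
# Compare the model's embedding size with the tokenizer and the dataset's ids.
# An input id >= the number of embedding rows triggers exactly this
# device-side assert (srcIndex < srcSelectDimSize).
embedding_rows = model.get_input_embeddings().weight.shape[0]
print("embedding rows:", embedding_rows, "| tokenizer size:", len(tokenizer))

for idx in range(len(dataset)):
    max_id = max(dataset[idx]["input_ids"])
    if max_id >= embedding_rows:
        print(f"example {idx}: max input id {max_id} is out of range ({embedding_rows})")
```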

@MaggieK410
Author

Hi, thank you very much for getting back to me! I made a mistake when initializing the tokenizer (I added tokens without resizing the embeddings). As it is solved, I will close this issue.

@SuperBruceJia

> Hi, thank you very much for getting back to me! I made a mistake when initializing the tokenizer (I added tokens without resizing the embeddings). As it is solved, I will close this issue.

May I know how you solve the problem? Thank you very much in advance!

@MaggieK410
Author

In another part of the code I added a token but did not resize the embeddings to match, which led to the issue above. Since I did not need that token, I just removed that line and the code worked. But if you do need to add the token, look into resizing your embeddings (https://stackoverflow.com/questions/72775559/resize-token-embeddings-on-the-a-pertrained-model-with-different-embedding-size).
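
For anyone hitting the same error, a minimal sketch of the fix described above, in case the extra token is actually needed (the token string here is hypothetical):

```python
# After adding tokens to the tokenizer, grow the model's embedding matrix so
# every new token id maps to a valid row; otherwise the embedding lookup
# fails with the srcIndex < srcSelectDimSize assert.
num_added = tokenizer.add_tokens(["<new_token>"])  # hypothetical token
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```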
