
Assertion srcIndex < srcSelectDimSize failed #24698

Closed
MaggieK410 opened this issue Jul 6, 2023 · 4 comments

Comments

@MaggieK410

Hi,
I am running medalpaca (but the error seems to come from llama) on 4 GPUs using device_map="auto" and the SFTTrainer, and I want to prompt-tune the model. I have written a custom Dataset class:

import torch

class DiagnosesDataset(torch.utils.data.Dataset):
    def __init__(self, instances, tokenizer):
        self.instances = instances
        self.tokenizer = tokenizer

    def __getitem__(self, idx):
        item = {}
        prompt = self.instances["prompt"][idx]
        labels = self.instances["label"][idx]

        item = self.tokenize(prompt + labels)
        tokenized_instruction = self.tokenize(prompt)
        label_instruction = self.tokenizer(labels)

        i = len(tokenized_instruction["input_ids"])
        item["labels"][i:] = label_instruction["input_ids"]
        return item

    def tokenize(self, prompt):
        result_prompt = self.tokenizer(prompt,
                                       truncation=True,
                                       max_length=2048,
                                       padding=False,
                                       return_tensors=None)

        result_prompt["labels"] = [-100] * len(result_prompt["input_ids"])
        return result_prompt

    def __len__(self):
        return len(self.instances)
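
For context, a minimal sketch of how this class might be instantiated; the structure of `instances` (a dict of parallel "prompt"/"label" lists) and the checkpoint name are assumptions on my side, since neither appears in the report:

```python
from transformers import AutoTokenizer

# Assumed structure: parallel "prompt"/"label" lists, matching the
# self.instances["prompt"][idx] / self.instances["label"][idx] lookups above.
instances = {
    "prompt": ["Patient presents with fever and cough. Diagnosis:"],
    "label": [" viral upper respiratory infection"],
}

# Hypothetical checkpoint; the report does not say which medalpaca model is used.
tokenizer = AutoTokenizer.from_pretrained("medalpaca/medalpaca-7b")

dataset = DiagnosesDataset(instances, tokenizer)
print(dataset[0]["input_ids"][:10], dataset[0]["labels"][:10])
```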

The TrainingArguments and PEFT config:

from transformers import TrainingArguments
from peft import LoraConfig, TaskType

training_arguments = TrainingArguments(
    output_dir="./falcon_output_dir",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-4,
    max_steps=10000,
    fp16=False,
    bf16=False,
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    group_by_length=True,
    remove_unused_columns=False)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=4,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "v_proj"])
The SFTTrainer I am using looks like this:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    packing=True,
    args=training_arguments)
trainer.train()
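
The model itself is not shown in the report; for illustration only, a minimal sketch of how a LLaMA-based medalpaca checkpoint might be loaded across the 4 GPUs with `device_map="auto"` (the checkpoint name is an assumption):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "medalpaca/medalpaca-7b"  # assumed checkpoint; the report does not name one

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # shards the model across the visible GPUs
)
# If tokens are ever added to the tokenizer elsewhere in the script, the
# embedding matrix must be resized to match (see the resolution below).
```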

However, when training the model, there seems to be an indexing issue somewhere (see https://discuss.pytorch.org/t/solved-assertion-srcindex-srcselectdimsize-failed-on-gpu-for-torch-cat/1804/27).

The error I am getting is this:
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/students/kulcsar/Bachelor/for_dataset/10000_diagnoses/falcon_model_pef │
│ t.py:544 in │
│ │
│ 541 │ │
│ 542 │ │
│ 543 │ args=parser.parse_args() │
│ ❱ 544 │ run() │
│ 545 │ #main() │
│ 546 │ │
│ 547 │ #all_data, prompts, golds=preprocess("./dataset.pkl") │
│ │
│ /home/students/kulcsar/Bachelor/for_dataset/10000_diagnoses/falcon_model_pef │
│ t.py:153 in run │
│ │
│ 150 │ │ packing=True, │
│ 151 │ │ data_collator=DataCollatorForSeq2Seq(tokenizer, pad_to_multipl │
│ 152 │ │ args=training_arguments) │
│ ❱ 153 │ trainer.train() │
│ 154 │ │
│ 155 │ logging.info("Run Train loop") │
│ 156 │ #model_updated=train(model, dataset, args.seed, args.batch_size, a │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/trainer.py:1537 in train │
│ │
│ 1534 │ │ inner_training_loop = find_executable_batch_size( │
│ 1535 │ │ │ self._inner_training_loop, self._train_batch_size, args.a │
│ 1536 │ │ ) │
│ ❱ 1537 │ │ return inner_training_loop( │
│ 1538 │ │ │ args=args, │
│ 1539 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1540 │ │ │ trial=trial, │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/trainer.py:1802 in _inner_training_loop │
│ │
│ 1799 │ │ │ │ │ self.control = self.callback_handler.on_step_begi │
│ 1800 │ │ │ │ │
│ 1801 │ │ │ │ with self.accelerator.accumulate(model): │
│ ❱ 1802 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1803 │ │ │ │ │
│ 1804 │ │ │ │ if ( │
│ 1805 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/trainer.py:2647 in training_step │
│ │
│ 2644 │ │ │ return loss_mb.reduce_mean().detach().to(self.args.device │
│ 2645 │ │ │
│ 2646 │ │ with self.compute_loss_context_manager(): │
│ ❱ 2647 │ │ │ loss = self.compute_loss(model, inputs) │
│ 2648 │ │ │
│ 2649 │ │ if self.args.n_gpu > 1: │
│ 2650 │ │ │ loss = loss.mean() # mean() to average on multi-gpu para │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/trainer.py:2672 in compute_loss │
│ │
│ 2669 │ │ │ labels = inputs.pop("labels") │
│ 2670 │ │ else: │
│ 2671 │ │ │ labels = None │
│ ❱ 2672 │ │ outputs = model(**inputs) │
│ 2673 │ │ # Save past state if it exists │
│ 2674 │ │ # TODO: this needs to be fixed and made cleaner later. │
│ 2675 │ │ if self.args.past_index >= 0: │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1502 in _wrapped_call_impl │
│ │
│ 1499 │ │ if self._compiled_call_impl is not None: │
│ 1500 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: │
│ 1501 │ │ else: │
│ ❱ 1502 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1503 │ │
│ 1504 │ def _call_impl(self, *args, **kwargs): │
│ 1505 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_s │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1511 in _call_impl │
│ │
│ 1508 │ │ if not (self._backward_hooks or self._backward_pre_hooks or s │
│ 1509 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hoo │
│ 1510 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1511 │ │ │ return forward_call(*args, **kwargs) │
│ 1512 │ │ # Do not call functions when jit is used │
│ 1513 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1514 │ │ backward_pre_hooks = [] │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/peft/peft_model.py:739 in forward │
│ │
│ 736 │ ): │
│ 737 │ │ peft_config = self.active_peft_config │
│ 738 │ │ if not isinstance(peft_config, PromptLearningConfig): │
│ ❱ 739 │ │ │ return self.base_model( │
│ 740 │ │ │ │ input_ids=input_ids, │
│ 741 │ │ │ │ attention_mask=attention_mask, │
│ 742 │ │ │ │ inputs_embeds=inputs_embeds, │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1502 in _wrapped_call_impl │
│ │
│ 1499 │ │ if self._compiled_call_impl is not None: │
│ 1500 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: │
│ 1501 │ │ else: │
│ ❱ 1502 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1503 │ │
│ 1504 │ def _call_impl(self, *args, **kwargs): │
│ 1505 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_s │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1511 in _call_impl │
│ │
│ 1508 │ │ if not (self._backward_hooks or self._backward_pre_hooks or s │
│ 1509 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hoo │
│ 1510 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1511 │ │ │ return forward_call(*args, **kwargs) │
│ 1512 │ │ # Do not call functions when jit is used │
│ 1513 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1514 │ │ backward_pre_hooks = [] │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module.hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/models/llama/modeling_llama.py:691 in │
│ forward │
│ │
│ 688 │ │ return_dict = return_dict if return_dict is not None else self │
│ 689 │ │ │
│ 690 │ │ # decoder outputs consists of (dec_features, layer_state, dec │
│ ❱ 691 │ │ outputs = self.model( │
│ 692 │ │ │ input_ids=input_ids, │
│ 693 │ │ │ attention_mask=attention_mask, │
│ 694 │ │ │ position_ids=position_ids, │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1502 in _wrapped_call_impl │
│ │
│ 1499 │ │ if self._compiled_call_impl is not None: │
│ 1500 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: │
│ 1501 │ │ else: │
│ ❱ 1502 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1503 │ │
│ 1504 │ def _call_impl(self, *args, **kwargs): │
│ 1505 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_s │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1511 in _call_impl │
│ │
│ 1508 │ │ if not (self._backward_hooks or self._backward_pre_hooks or s │
│ 1509 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hoo │
│ 1510 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1511 │ │ │ return forward_call(*args, **kwargs) │
│ 1512 │ │ # Do not call functions when jit is used │
│ 1513 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1514 │ │ backward_pre_hooks = [] │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/transformers/models/llama/modeling_llama.py:532 in │
│ forward │
│ │
│ 529 │ │ │ position_ids = position_ids.view(-1, seq_length).long() │
│ 530 │ │ │
│ 531 │ │ if inputs_embeds is None: │
│ ❱ 532 │ │ │ inputs_embeds = self.embed_tokens(input_ids) │
│ 533 │ │ # embed positions │
│ 534 │ │ if attention_mask is None: │
│ 535 │ │ │ attention_mask = torch.ones( │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1502 in _wrapped_call_impl │
│ │
│ 1499 │ │ if self._compiled_call_impl is not None: │
│ 1500 │ │ │ return self._compiled_call_impl(*args, **kwargs) # type: │
│ 1501 │ │ else: │
│ ❱ 1502 │ │ │ return self._call_impl(*args, **kwargs) │
│ 1503 │ │
│ 1504 │ def _call_impl(self, *args, **kwargs): │
│ 1505 │ │ forward_call = (self._slow_forward if torch._C._get_tracing_s │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/module.py:1511 in _call_impl │
│ │
│ 1508 │ │ if not (self._backward_hooks or self._backward_pre_hooks or s │
│ 1509 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hoo │
│ 1510 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1511 │ │ │ return forward_call(*args, **kwargs) │
│ 1512 │ │ # Do not call functions when jit is used │
│ 1513 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1514 │ │ backward_pre_hooks = [] │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module.hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/modules/sparse.py:162 in forward │
│ │
│ 159 │ │ │ │ self.weight[self.padding_idx].fill_(0) │
│ 160 │ │
│ 161 │ def forward(self, input: Tensor) -> Tensor: │
│ ❱ 162 │ │ return F.embedding( │
│ 163 │ │ │ input, self.weight, self.padding_idx, self.max_norm, │
│ 164 │ │ │ self.norm_type, self.scale_grad_by_freq, self.sparse) │
│ 165 │
│ │
│ /home/students/kulcsar/anaconda3/envs/software_bubble_updated_pytorch/lib/py │
│ thon3.9/site-packages/torch/nn/functional.py:2238 in embedding │
│ │
│ 2235 │ │ # torch.embedding_renorm_ │
│ 2236 │ │ # remove once script supports set_grad_enabled │
│ 2237 │ │ no_grad_embedding_renorm(weight, input, max_norm, norm_type │
│ ❱ 2238 │ return torch.embedding(weight, input, padding_idx, scale_grad_by │
│ 2239 │
│ 2240 │
│ 2241 def embedding_bag( │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Does anyone have an idea what the issue might be? Any help would be greatly appreciated!

@amyeroberts
Collaborator

Hi @MaggieK410, thanks for reporting this issue.

This is typically caused by an indexing issue in the code.

Could you follow the issue template and:

  • Provide information about the running environment: run transformers-cli env in the terminal and copy-paste the output
  • Format the code examples. All code should be sandwiched between three backticks ``` all code goes here ```
  • Could you also put the error message in code formatting please?
  • Provide a checkpoint - which medalpaca model is being tested?
  • Ensure the example code is runnable: dataset is not defined.
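
Since the assert in the traceback above fires inside `F.embedding`, one quick check is whether any input id is out of range for the model's embedding matrix. A minimal sketch, assuming `model`, `tokenizer`, and `dataset` are the objects from the report; running the script on CPU, or with the environment variable `CUDA_LAUNCH_BLOCKING=1`, also tends to give a clearer stack trace:

```python
# Compare the model's embedding size with the tokenizer and the dataset's ids.
# An input id >= the number of embedding rows triggers exactly this
# device-side assert (srcIndex < srcSelectDimSize).
embedding_rows = model.get_input_embeddings().weight.shape[0]
print("embedding rows:", embedding_rows, "| tokenizer size:", len(tokenizer))

for idx in range(len(dataset)):
    max_id = max(dataset[idx]["input_ids"])
    if max_id >= embedding_rows:
        print(f"example {idx}: max input id {max_id} is out of range ({embedding_rows})")
```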

@MaggieK410
Author

Hi, thank you very much for getting back to me! I made a mistake when initializing the tokenizer (I added tokens without resizing the embeddings). As it is solved, I will close this issue.

@SuperBruceJia

> Hi, thank you very much for getting back to me! I made a mistake when initializing the tokenizer (I added tokens without resizing the embeddings). As it is solved, I will close this issue.

May I know how you solve the problem? Thank you very much in advance!

@MaggieK410
Author

In another part of the code I added a token but did not resize the embeddings to match, which led to the issue above. Since I did not need that token, I just removed that line and the code worked. But if you do need to add the token, look into resizing your embeddings (https://stackoverflow.com/questions/72775559/resize-token-embeddings-on-the-a-pertrained-model-with-different-embedding-size).
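
For anyone hitting the same error, a minimal sketch of the fix described above, in case the extra token is actually needed (the token string here is hypothetical):

```python
# After adding tokens to the tokenizer, grow the model's embedding matrix so
# every new token id maps to a valid row; otherwise the embedding lookup
# fails with the srcIndex < srcSelectDimSize assert.
num_added = tokenizer.add_tokens(["<new_token>"])  # hypothetical token
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```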
