
metric_for_best_model set to eval_f1 raises KeyError because metric is not found in evaluation results #40217

@monochandan

Description

System Info

Environment: (Google Colab)

Python 3.11.13
torch==2.6.0+cu124
transformers==4.55.2
bitsandbytes==0.47.0
peft==0.17.0
accelerate==1.10.0
numpy==1.26.4
scipy==1.14.1

GPU
NVIDIA L4
Driver Version: 550.54.15
CUDA Version: 12.4

Model Quantized with QLoRA

Dataset:

Train Dataset
{'text': Value('string'), 'embeddings': List(Value('float64')), 'tfidf_vector': List(Value('float64')), 'roberta_sent_neg': Value('float64'), 'roberta_sent_pos': Value('float64'), 'names': Value('int64'), 'organizations': Value('int64'), 'dates': Value('int64'), 'count_tokens': Value('int64'), 'label': Value('int64'), 'input_ids': List(Value('int32')), 'token_type_ids': List(Value('int8')), 'attention_mask': List(Value('int8'))}

Val Dataset
{'text': Value('string'), 'embeddings': List(Value('float64')), 'tfidf_vector': List(Value('float64')), 'roberta_sent_neg': Value('float64'), 'roberta_sent_pos': Value('float64'), 'names': Value('int64'), 'organizations': Value('int64'), 'dates': Value('int64'), 'count_tokens': Value('int64'), 'label': Value('int64'), 'input_ids': List(Value('int32')), 'token_type_ids': List(Value('int8')), 'attention_mask': List(Value('int8'))}
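
For context, a schema like this is what results from mapping the tokenizer over the text column while keeping the precomputed feature columns. A minimal sketch, assuming standard truncation settings (the max_length value is an assumption, not taken from the report; tokenizer_bert is defined under Code Setup below):

# Hypothetical sketch of how the tokenized columns could have been produced;
# only the resulting column names ('input_ids', 'token_type_ids',
# 'attention_mask') are taken from the schemas above.
def tokenize(batch):
    return tokenizer_bert(batch["text"], truncation=True, max_length=512)

train_dataset = train_dataset.map(tokenize, batched=True)
dev_train_dataset = dev_train_dataset.map(tokenize, batched=True)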

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code Setup

import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

model_name = "google-bert/bert-base-uncased"
tokenizer_bert = AutoTokenizer.from_pretrained(model_name)
if tokenizer_bert.pad_token is None:
    tokenizer_bert.pad_token = tokenizer_bert.eos_token
tokenizer_bert.padding_side = "right"


compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=compute_dtype,
)

original_model_bert = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    quantization_config=bnb_config,
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
)
kbit_model_bert = prepare_model_for_kbit_training(original_model_bert)
kbit_model_bert.gradient_checkpointing_enable()
peft_model_bert = get_peft_model(kbit_model_bert, lora_config)
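
A quick sanity check after wrapping with PEFT, using the standard PeftModel helper, confirms that only the adapter (and classifier head) weights are trainable:

# With QLoRA, the 4-bit base weights stay frozen; only the LoRA adapters
# and the classification head should show up as trainable here.
peft_model_bert.print_trainable_parameters()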




def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision_score(labels, predictions),
        "recall": recall_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="binary"),
    }
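
compute_metrics can be sanity-checked in isolation with dummy values (illustrative only); if this prints an f1 key, the function itself is fine and the KeyError must come from the Trainer never invoking it:

# Standalone check with dummy binary-classification outputs.
dummy_logits = np.array([[0.2, 0.8], [0.9, 0.1], [0.3, 0.7]])
dummy_labels = np.array([1, 0, 1])
print(compute_metrics((dummy_logits, dummy_labels)))
# {'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}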


output_dir = "/content/drive/..."  # output path elided in the original report
args = TrainingArguments(
    output_dir=output_dir,
    weight_decay=0.22511642804764023,
    warmup_ratio=0.12890328790683203,
    adam_beta1=0.9348819720458172,
    adam_beta2=0.9285998615546803,
    adam_epsilon=1.9972958061508847e-07,
    max_grad_norm=4.222172817940239,
    gradient_accumulation_steps=2,
    max_steps=712,  # overrides num_train_epochs when both are set
    do_train=True,
    do_eval=True,
    lr_scheduler_type="polynomial",
    warmup_steps=488,  # takes precedence over warmup_ratio when nonzero
    metric_for_best_model="eval_f1",  # the key the Trainer looks up after each evaluation
    optim="paged_adamw_32bit",
    learning_rate=2.1106713456200193e-05,
    num_train_epochs=40,
    logging_dir="./logs/",
    logging_strategy="epoch",
    eval_strategy="epoch",
    save_strategy="epoch",
    label_names=["label"],
    load_best_model_at_end=True,
    save_total_limit=3,
)


trainer = Trainer(
    model=peft_model_bert,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=dev_train_dataset,
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer_bert, padding=True),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()
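
For reference, the KeyError comes from the Trainer's best-metric lookup after each evaluation; a paraphrased sketch of that logic (not a verbatim copy of the transformers source):

# Paraphrased from the Trainer's checkpoint logic: the configured metric
# name is prefixed with "eval_" if needed, then looked up directly.
metric_to_check = args.metric_for_best_model        # "eval_f1"
if not metric_to_check.startswith("eval_"):
    metric_to_check = f"eval_{metric_to_check}"
metric_value = metrics[metric_to_check]             # KeyError if compute_metrics never ran

A plausible trigger, assuming nothing else drops the labels: DataCollatorWithPadding renames the "label" column to "labels" when collating a batch, while label_names=["label"] tells the Trainer to look for "label" in the model inputs. Finding no labels at evaluation time, the Trainer skips compute_metrics, the metrics dict ends up containing only eval_loss and runtime entries, and the "eval_f1" lookup fails. Removing label_names (or setting it to ["labels"]) would be the first thing to try.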

Expected behavior

Error Message:

[Screenshot in the original issue: a traceback ending in a KeyError because "eval_f1" is not found in the evaluation metrics.]

Expected Behavior:

Training runs to completion, with eval_f1 computed at each evaluation and used to select the best checkpoint.
