MMLU answer extraction regex fails with repeated "Answer: LETTER" pattern

# Description

The regular expression used to extract answers for MMLU in `common.py` extracts the wrong letter when the pattern "Answer: LETTER" appears more than once in the LLM output, which lowers the measured model performance.

# Example

The following example demonstrates the issue with a German output. The model correctly selects "C", but the regex extracts "A" as the answer.

![unnamed](https://github.com/user-attachments/assets/59fe1617-451f-46fa-8614-913205ae6b3f)

# Explanation

Only the first match of the "Answer: LETTER" pattern is considered, and that first match is not always the model's actual answer.

https://github.com/openai/simple-evals/blob/a8e85cc8a5dea497d915f870895250e07f9cc737/common.py#L25-L71

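In the German example above, the output contains "Antwort:\n\nAntwort: C". The pattern matches the span marked in bold, "**Antwort:\n\nA**ntwort: C": `\s*` consumes the blank line after the first "Antwort:", and because the `(?i)` flag makes `[A-D]` case-insensitive, the letter group matches the initial "A" of the second "Antwort". The extracted answer is therefore "A" instead of "C".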
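The mismatch can be reproduced in isolation. The snippet below is a minimal sketch: it copies the pattern template and the German prefix from the `common.py` excerpt quoted in this issue, and it assumes the evaluation applies the pattern with `re.search`, which returns the first match. Raw strings are used here only to avoid escape warnings.

```python
import re

# Template and prefix copied from the quoted common.py excerpt
# (written as raw strings here; the pattern itself is unchanged).
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = (
    r"(?i){}\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[Ａ]|[Ｂ]|[Ｃ]|[Ｄ])"
)
regex = MULTILINGUAL_ANSWER_PATTERN_TEMPLATE.format(r"Antwort\s*:")

# Model output in which the answer prefix appears twice.
response = "Antwort:\n\nAntwort: C"

# Assumption: the eval uses re.search, i.e. the first match wins.
match = re.search(regex, response)
matched_span = match.group(0)  # text consumed by the whole pattern
extracted = match.group(1)     # letter taken as the model's answer

print(repr(matched_span), extracted)
```

The matched span ends at the first letter of the second "Antwort", so the extracted letter is "A" even though the model answered "C".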

# Impact

This bug significantly impacts the evaluation results for certain languages. In my experiments, ~20% of the German samples and ~4% of the Indonesian samples were affected. Other languages seem less affected.

	MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = (
	"(?i){}\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[Ａ]|[Ｂ]|[Ｃ]|[Ｄ])"
	)
	# All the different ways "Answer" is written in different languages
	MULTILINGUAL_ANSWER_REGEXES = [
	"Answer\s*:",
	"Answer\s*:", # Korean invisible character
	"উত্তর\s*:",
	"उत्तर\s*:",
	"উত্তরঃ",
	"উত্তর\s*:",
	"Antwort\s*:",
	"답변\s*:",
	"정답\s*:",
	"답\s*:",
	"答案\s*：",
	"答案\s*:",
	"答\s*：",
	"答\s*:",
	"答复\s*：",
	"答曰\s*：",
	"الإجابة:",
	"الجواب:",
	"إجابة:",
	"الإجابة النهائية:",
	"الإجابة الصحيحة:",
	"الإجابة الصحيحة هي:",
	"الإجابة هي:",
	"Respuesta\s*:",
	"Risposta\s*:",
	"答え\s*:",
	"答え\s*：",
	"回答\s*:",
	"回答\s*：",
	"解答\s*:",
	"Jawaban\s*:",
	"Réponse\s*:",
	"Resposta\s*:",
	"Jibu\s*:",
	"Idahun\s*:",
	"Ìdáhùn\s*:",
	"Idáhùn\s*:",
	"Àmọ̀nà\s*:",
	"Àdáhùn\s*:",
	"Ànúgọ\s*:",
	"Àṣàyàn\s*:",
	]
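
One possible mitigation, sketched here under assumptions and not taken from the repository, is to reject a letter that is immediately followed by another word character. `STRICT_ANSWER_PATTERN_TEMPLATE` and `extract_answer` are hypothetical names introduced for illustration. The only change to the quoted template is the trailing `(?!\w)` lookahead, and it has not been checked against every language in the list above.

```python
import re
from typing import Optional

# Hypothetical variant of the quoted template: the added (?!\w) lookahead
# rejects a letter that is immediately followed by another word character,
# such as the "A" that starts a second "Antwort".
STRICT_ANSWER_PATTERN_TEMPLATE = (
    r"(?i){}\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[Ａ]|[Ｂ]|[Ｃ]|[Ｄ])(?!\w)"
)


def extract_answer(response: str, answer_regex: str) -> Optional[str]:
    """Return the first answer letter that stands alone, or None if none is found."""
    regex = STRICT_ANSWER_PATTERN_TEMPLATE.format(answer_regex)
    match = re.search(regex, response)
    return match.group(1) if match else None


german = extract_answer("Antwort:\n\nAntwort: C", r"Antwort\s*:")
english = extract_answer("Answer: B", r"Answer\s*:")
print(german, english)
```

With the lookahead in place, the attempt starting at the first "Antwort:" fails, and the search moves on to the second occurrence, where the letter "C" stands alone.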
