Description
The regular expression used to extract answers for MMLU in common.py can return the wrong letter when the pattern "Answer: LETTER" appears more than once in the LLM output, which lowers the model's measured score.
Example
The following example demonstrates the issue with a German output. The model correctly selects "C", but the regex extracts "A" as the answer.
Explanation
The regular expression mistakenly only considers the first occurrence of "Answer: LETTER".
Lines 25 to 71 in a8e85cc
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = (
    "(?i){}\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
)
# All the different ways "Answer" is written in different languages
MULTILINGUAL_ANSWER_REGEXES = [
    "Answer\s*:",
    "Answer\s*:", # Korean invisible character
    "উত্তর\s*:",
    "उत्तर\s*:",
    "উত্তরঃ",
    "উত্তর\s*:",
    "Antwort\s*:",
    "답변\s*:",
    "정답\s*:",
    "답\s*:",
    "答案\s*:",
    "答案\s*:",
    "答\s*:",
    "答\s*:",
    "答复\s*:",
    "答曰\s*:",
    "الإجابة:",
    "الجواب:",
    "إجابة:",
    "الإجابة النهائية:",
    "الإجابة الصحيحة:",
    "الإجابة الصحيحة هي:",
    "الإجابة هي:",
    "Respuesta\s*:",
    "Risposta\s*:",
    "答え\s*:",
    "答え\s*:",
    "回答\s*:",
    "回答\s*:",
    "解答\s*:",
    "Jawaban\s*:",
    "Réponse\s*:",
    "Resposta\s*:",
    "Jibu\s*:",
    "Idahun\s*:",
    "Ìdáhùn\s*:",
    "Idáhùn\s*:",
    "Àmọ̀nà\s*:",
    "Àdáhùn\s*:",
    "Ànúgọ\s*:",
    "Àṣàyàn\s*:",
]
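In the German example above, it extracts the answer "A" from "Antwort:\n\nAntwort: C" because the first "Antwort:" matches the prefix, "\s*" then consumes the two newlines, and the case-insensitive class "[A-D]" matches the initial "A" of the second "Antwort".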
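A minimal sketch that reproduces the mismatch. It assumes a simplified template (Latin letters only) and a hypothetical model output, and the last-match variant at the end is only one possible mitigation, not the repository's fix.

```python
import re

# Template and prefix adapted from the excerpt above. Raw strings are used
# here, and the letter class is trimmed to Latin A-D for brevity.
TEMPLATE = r"(?i){}\s*([A-D])"
PREFIX = r"Antwort\s*:"

# Hypothetical model output: an empty "Antwort:" line, then the real answer.
output = "Antwort:\n\nAntwort: C"

pattern = TEMPLATE.format(PREFIX)

# re.search returns the first match: "\s*" consumes the blank lines and the
# case-insensitive "[A-D]" matches the "A" that starts the second "Antwort".
first = re.search(pattern, output).group(1)

# Possible mitigation (an assumption): require the letter to stand alone via
# a word boundary, then keep the last match instead of the first.
strict = re.compile(pattern + r"\b")
matches = strict.findall(output)
last = matches[-1] if matches else None

print(first)  # A
print(last)   # C
```

The word boundary rejects the "A" in "Antwort" because it is followed by another word character. Whether this behaves as intended for the non-Latin answer letters in the full template would need to be checked separately.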
Impact
This bug significantly impacts the evaluation results for certain languages. In my experiments, German experienced this issue with ~20% of the samples, and Indonesian showed a ~4% impact. Other languages seem less affected.
