I was checking the English-to-Arabic translation of MMLU (ammlu) and noticed that a question about masking IPs is repeated 1,568 times in the Arabic dataset. Each time, it overwrites the original question while keeping the original ground truth.
Interestingly, the question is phrased slightly differently each time. Below is a sample with frequency counts:
count, question
309, كيف يتم اكتشاف تزوير عنوان IP؟
217, كيف يتم اكتشاف تزييف عنوان IP؟
95, كيف يتم الكشف عن تزوير عنوان IP؟
85, كيف يتم الكشف عن تزييف عنوان IP؟
66, كيف يتم كشف تزوير عنوان IP؟
This does not appear in the original English dataset, where the most frequent questions are:
count, question
8, Which of the following is true?
6, Which of the following statements is false?
6, Which of the following statements is correct?
4, Which of the following statements is true?
4, Which of the following statements is correct?
My guess is that this is an artifact of the GPT translation, where the model may have been prompted with the IP masking question as an example.
I found that 1,568 out of all 14,042 questions contained a substring related to IP from the list ['IP', 'بروتوكول الإنترنت','آي بي'].
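For anyone who wants to reproduce the count, here is a minimal sketch of the substring check and the frequency tally. It assumes the questions are already loaded into a plain list of strings; the helper names and the toy sample (including the unrelated last question) are hypothetical and only illustrate the approach.

```python
from collections import Counter

# Substrings from the list above that flag an IP-related question.
IP_KEYWORDS = ["IP", "بروتوكول الإنترنت", "آي بي"]


def is_ip_related(question, keywords=IP_KEYWORDS):
    """Return True if the question contains any keyword (plain, case-sensitive substring match)."""
    return any(keyword in question for keyword in keywords)


def summarize_ip_questions(questions, keywords=IP_KEYWORDS):
    """Return the number of IP-related questions and a tally of each distinct phrasing."""
    matches = [q for q in questions if is_ip_related(q, keywords)]
    return len(matches), Counter(matches)


# Hypothetical toy sample, not the real dataset: three IP questions and one unrelated question.
sample_questions = [
    "كيف يتم اكتشاف تزوير عنوان IP؟",
    "كيف يتم اكتشاف تزوير عنوان IP؟",
    "كيف يتم الكشف عن تزييف عنوان IP؟",
    "ما هي عاصمة فرنسا؟",
]

total, phrasing_counts = summarize_ip_questions(sample_questions)
print(total)
for question, count in phrasing_counts.most_common():
    print(count, question)
```

Because this is a plain substring match, it can also flag legitimate questions that happen to contain "IP", so the matches are worth checking against the English originals.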
I found two datasets on Hugging Face:
Hennara/ammlu: The dataset used in LM Evaluation Harness. Claims that it was translated by GPT-4.
FreedomIntelligence/MMLU_Arabic: Claims that it was translated by GPT-3.5-Turbo.
However, both datasets seem identical.