Repeated Question in Arabic MMLU #1596

Closed
Yazeed7 opened this issue Mar 18, 2024 · 1 comment

Yazeed7 commented Mar 18, 2024

I was checking the English-to-Arabic translation of MMLU (ammlu) and noticed that a question about masking IPs is repeated 1,568 times in the Arabic dataset. Each occurrence overwrites the original question while keeping the original ground-truth answer.

Interestingly, the question is phrased slightly differently each time. Below is a sample with frequency counts (each variant translates roughly to "How is IP address spoofing detected?"):

count, question
  309, كيف يتم اكتشاف تزوير عنوان IP؟
  217, كيف يتم اكتشاف تزييف عنوان IP؟
   95, كيف يتم الكشف عن تزوير عنوان IP؟
   85, كيف يتم الكشف عن تزييف عنوان IP؟
   66, كيف يتم كشف تزوير عنوان IP؟

This repetition does not appear in the original English dataset, where the most frequent questions are generic and occur only a handful of times:

count, question
    8, Which of the following is true?
    6, Which of the following statements is false?
    6, Which of the following statements is correct?

    4, Which of the following statements is true?
    4, Which of the following statements is correct?
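
For reference, frequency tables like the two above can be produced by counting exact question strings. The following is a minimal sketch, assuming the questions are already loaded into a plain list of strings; the function name `top_questions` and the toy data are hypothetical and not part of the harness or the dataset tooling.

```python
from collections import Counter


def top_questions(questions, n=5):
    """Return the n most frequent question strings with their counts."""
    # Strip surrounding whitespace so trivially different copies are merged.
    counts = Counter(q.strip() for q in questions)
    return counts.most_common(n)


# Toy data standing in for the real question column (assumption).
sample = [
    "كيف يتم اكتشاف تزوير عنوان IP؟",
    "كيف يتم اكتشاف تزوير عنوان IP؟",
    "كيف يتم اكتشاف تزييف عنوان IP؟",
    "Which of the following is true?",
]

for question, count in top_questions(sample):
    print(f"{count:5d}, {question}")
```

Because the repeated question is reworded each time, exact-match counting splits it across many rows, which is why no single variant accounts for all of the affected questions.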

My guess is that this is an artifact of the GPT translation: the model may have been prompted with the IP masking question as an example and then reproduced it instead of translating the actual input.

I found that 1,568 out of all 14,042 questions contained a substring related to IP from the list ['IP', 'بروتوكول الإنترنت','آي بي'].
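
The substring check described above could look roughly like the following. This is a sketch under assumptions: `count_ip_questions` and `IP_MARKERS` are hypothetical names, and the questions are assumed to be available as a list of strings. A plain substring test for `IP` can also match legitimate questions (for example ones mentioning `TCP/IP`), so the matches are worth inspecting by hand.

```python
# Marker substrings taken from the list quoted in the issue.
IP_MARKERS = ["IP", "بروتوكول الإنترنت", "آي بي"]


def count_ip_questions(questions, markers=IP_MARKERS):
    """Count questions containing at least one of the marker substrings."""
    return sum(any(m in q for m in markers) for q in questions)


# Toy data standing in for the real question column (assumption).
sample = [
    "كيف يتم اكتشاف تزوير عنوان IP؟",
    "ما هو بروتوكول الإنترنت؟",
    "ما هي عاصمة فرنسا؟",
]
hits = count_ip_questions(sample)
print(f"{hits} of {len(sample)} questions match ({hits / len(sample):.1%})")
```

For scale, the reported 1,568 of 14,042 questions is about 11.2% of the dataset.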

I found two datasets on Hugging Face:

However, both datasets seem identical.

haileyschoelkopf commented

@khalil-Hennara do you have any insights here?

It may be worth removing the task in light of this.
