Duplicate paragraph in HotpotQA training set #7
Comments
@danqi @robinjia @alontalmor @ajfisch could you please help look into this issue? Thanks.
Hi @scissorsy! Could you give the qid for those examples? Thanks!
Hi @ajfisch! At first glance, there are many such cases, for example:
Found the bug and fixed it. A HotpotQA data update will follow soon.
Updated. Thanks for reporting!
Got it, thanks a lot!
Hi @ajfisch and @alontalmor, this doesn't seem to be fixed: I can still see duplicated sentences in the dev split. Manually checking just the first 10 examples, I found that 3 of them still contain duplicates, e.g. the first example (id=5a8c7595554299585d9e36b6)
and the 10th example (id=5a7166395542994082a3e814). The problem seems to still exist, or it was not fully fixed. Note: I downloaded the dev split fresh from https://s3.us-east-2.amazonaws.com/mrqa/release/v2/dev/HotpotQA.jsonl.gz. Is it possible that the problem was fixed but the S3 cache was not purged?
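For anyone who wants to repeat this check without reading examples by hand, here is a minimal sketch. It rests on assumptions the thread does not confirm: that each JSONL record stores its passage in a `context` string, that paragraphs inside it are delimited by a `[PAR]` marker, and that the file may start with a header record that has no context. The function names are hypothetical; adjust `context_key` and `sep` to match the actual file.

```python
import json
from collections import Counter


def duplicate_segments(context, sep="[PAR]"):
    """Return the segments that occur more than once in `context`.

    `sep` is an assumed paragraph delimiter; whitespace is normalized
    so that trivially different copies still compare equal.
    """
    segments = [" ".join(s.split()) for s in context.split(sep)]
    counts = Counter(s for s in segments if s)
    return [s for s, n in counts.items() if n > 1]


def scan_jsonl_lines(lines, context_key="context", sep="[PAR]"):
    """Yield (line_index, duplicates) for every record with repeated segments."""
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        context = record.get(context_key)
        if not isinstance(context, str):
            continue  # e.g. a header record without a context field
        dups = duplicate_segments(context, sep)
        if dups:
            yield i, dups
```

To run it on the downloaded split, open the file with `gzip.open(path, "rt", encoding="utf-8")` and pass the file handle to `scan_jsonl_lines`. This only catches exact repeats after whitespace normalization, so near-duplicates would need a fuzzier comparison.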
This problem still exists in 2021.
I noticed that some duplicate paragraphs exist in the HotpotQA training set, as shown in the following pictures. Is this expected? Thanks!