Duplicate paragraph in HotpotQA training set #7

Closed
boyang-nlp opened this issue Jun 10, 2019 · 8 comments

Comments

@boyang-nlp

boyang-nlp commented Jun 10, 2019

I noticed that some duplicate paragraphs exist in the HotpotQA training set, as shown in the following pictures. Is this expected? Thanks!
image
image

@boyang-nlp
Author

@danqi @robinjia @alontalmor @ajfisch could you please help look into this issue? Thanks.

@ajfisch
Collaborator

ajfisch commented Jun 13, 2019

Hi @scissorsy! Could you give the qid for those examples? Thanks!

@boyang-nlp
Author

boyang-nlp commented Jun 13, 2019

Hi @ajfisch! At first glance, there are many such cases, for example:
0e7b7238c366b988112468f3c
934c257e8b7d75e2a897aa2d4
6590d749f69bef962410664de
b801893bab83f760e3dcbc82a

@alontalmor
Collaborator

Found the bug and fixed it. A HotpotQA data update will follow soon.

@ajfisch
Collaborator

ajfisch commented Jun 13, 2019

Updated. Thanks for reporting!

@boyang-nlp
Author

Got it, thanks a lot!

@csarron

csarron commented Sep 12, 2019

Hi @ajfisch and @alontalmor, this doesn't seem to be fixed. I can still see duplicated sentences in the dev split. After manually checking just the first 10 examples, I found that 3 of them still contain duplicates.

e.g., the first example (id=5a8c7595554299585d9e36b6)

Screenshot 2019-09-11 21 05 09
and the fourth example (id=5a85ea095542994775f606a8):
Screenshot 2019-09-11 21 07 27

and the 10th example (id=5a7166395542994082a3e814):
Screenshot 2019-09-11 21 10 17

The problem still seems to exist, or it was not fully fixed.

Note: I downloaded the dev split fresh from https://s3.us-east-2.amazonaws.com/mrqa/release/v2/dev/HotpotQA.jsonl.gz.

Is it possible that the problem got fixed, but somehow the s3 cache was not purged?
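
For anyone who wants to repeat this check without eyeballing examples, here is a minimal sketch of how it could be automated. It rests on assumptions not confirmed in this thread: that the file holds one JSON object per line, that each example has a `context` string field, that a leading header line without that field may be present, and that paragraphs inside `context` are delimited by a `[PAR]` marker. The function names `duplicate_segments` and `scan_file` are hypothetical helpers, not part of any released tooling.

```python
import gzip
import json
from collections import Counter


def duplicate_segments(context, separator="[PAR]"):
    """Return the segments of `context` that occur more than once.

    Assumption: paragraphs are delimited by `separator`; change it if the
    actual file uses a different marker.
    """
    # Normalise whitespace so trivially different copies still match.
    segments = [" ".join(part.split()) for part in context.split(separator)]
    counts = Counter(segment for segment in segments if segment)
    return [segment for segment, count in counts.items() if count > 1]


def scan_file(path, limit=None):
    """Yield (example_index, duplicates) for examples with repeated segments.

    Assumption: gzip-compressed JSON Lines, with a "context" field on each
    example; lines without that field (e.g. a header) are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        index = 0
        for line in handle:
            if not line.strip():
                continue
            record = json.loads(line)
            if "context" not in record:
                continue
            if limit is not None and index >= limit:
                break
            duplicates = duplicate_segments(record["context"])
            if duplicates:
                yield index, duplicates
            index += 1
```

Iterating over `scan_file("HotpotQA.jsonl.gz", limit=10)` on a local copy of the download would then list which of the first 10 examples contain repeated segments, provided the format assumptions above hold.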

@cemilcengiz

This problem still exists in 2021.
