NQ320k preprocessing? #16
Comments
Hi @kisozinov, sorry for the late reply. I think the reason is that the NQ dataset on the Hugging Face dataset hub now only has 10.6k train examples, but originally it should have 307k. I am not sure what happened.
@ArvinZhuang This is not a problem. I've successfully downloaded this dataset again today, and it has 307k/7k samples (maybe an earlier bug?). In the case of the full dataset, did you use the standard train/test split (307k/7k)? The restriction to unique doc titles in your script is not suitable then, as far as I understand :)
@kisozinov Yes, I was using the standard train/test split. If nothing about the dataset is wrong, it is what it is. Maybe the proper way is to sample some other documents from Wikipedia and add them to the corpus to make up a corpus with 320k docs.
I got it, thanks for the answer :)
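For what it's worth, here is a rough sketch of the padding idea suggested in the reply above. It assumes the Hugging Face `natural_questions` and `wikipedia` datasets; the field names and the 320k target are illustrative and not the repository's actual preprocessing script.

```python
# Sketch: pad the NQ-derived corpus with sampled Wikipedia pages to reach ~320k docs.
# Assumes the Hugging Face `natural_questions` and `wikipedia` datasets; field names
# below are illustrative, not the exact schema used by the repository's script.
import random
from datasets import load_dataset

TARGET_CORPUS_SIZE = 320_000

# Load NQ and check the split sizes reported on the Hub.
nq = load_dataset("natural_questions")
print(len(nq["train"]), len(nq["validation"]))  # expect roughly 307k / 7.8k

# Titles already covered by the NQ train split.
nq_titles = {ex["document"]["title"] for ex in nq["train"]}

# Sample filler pages from a Wikipedia dump until the corpus reaches the target size.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
extra_docs = []
for idx in random.sample(range(len(wiki)), k=100_000):
    if len(nq_titles) + len(extra_docs) >= TARGET_CORPUS_SIZE:
        break
    page = wiki[idx]
    if page["title"] not in nq_titles:
        extra_docs.append({"title": page["title"], "text": page["text"]})

print(f"Added {len(extra_docs)} filler documents from Wikipedia")
```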
Hi, I tried to reproduce the results of your experiments on the NQ320k dataset as per the table from your paper.
To do this, I referred to your script from the old repository, but I ran into the problem that simply changing `NUM_TRAIN=307000` and `NUM_EVAL=7000` makes the script terminate in the middle, probably due to repeated titles (it stops at ~107000). Hence I have a question: what script or settings (train/val split) do you use to process NQ320k?
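For context, the early stop is what one would expect from a preprocessing loop that keeps only documents with previously unseen titles: once the stream runs out of unique titles (around ~107k here), no further training examples can be added. Below is a minimal sketch of that behaviour, reusing the `NUM_TRAIN` / `NUM_EVAL` values mentioned above but with otherwise illustrative field names; it is not the repository's actual script.

```python
# Sketch of how a unique-title restriction can stall the split well before 307k.
# NUM_TRAIN / NUM_EVAL mirror the variables mentioned above; everything else
# (field names, the iteration over nq["train"]) is illustrative only.
from datasets import load_dataset

NUM_TRAIN = 307_000
NUM_EVAL = 7_000

nq = load_dataset("natural_questions")

seen_titles = set()
train_docs = []
for example in nq["train"]:
    title = example["document"]["title"]
    if title in seen_titles:
        continue  # the unique-title filter: duplicate pages are skipped
    seen_titles.add(title)
    train_docs.append(example)
    if len(train_docs) >= NUM_TRAIN:
        break

# With only ~107k unique titles in the train split, this loop exhausts the data
# long before reaching NUM_TRAIN, which would explain the script stopping early.
print(f"Collected {len(train_docs)} unique-title documents (target was {NUM_TRAIN})")
```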