
NQ320k preprocessing? #16

Open
kisozinov opened this issue Apr 22, 2024 · 4 comments

kisozinov commented Apr 22, 2024

Hi, I tried to reproduce the results of your experiments on the NQ320k dataset as per the table from your paper:
[table screenshot from the paper]

To do this, I referred to your script from the old repository, but I ran into a problem: simply changing NUM_TRAIN=307000 and NUM_EVAL=7000 makes the script terminate partway through, probably because of repeated titles (it stops at ~107000).

```python
for ind in rand_inds:
    # we use the title as the doc identifier to prevent two docs having the same text
    title = data[ind]['document']['title']
    if title not in title_set:
        title_set.add(title)
```
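To illustrate why sampling stalls, here is a minimal, self-contained sketch of the same title-based deduplication; `data` below is a hypothetical stand-in list, not the actual NQ320k loader. When the dataset contains fewer unique titles than NUM_TRAIN, a loop that keeps drawing until it has NUM_TRAIN unique titles can never finish.

```python
# Hypothetical illustration of deduplicating documents by title.
# `data` is a toy stand-in, not the real NQ320k records.
data = [
    {'document': {'title': 'A', 'text': 'first'}},
    {'document': {'title': 'B', 'text': 'second'}},
    {'document': {'title': 'A', 'text': 'first (duplicate title)'}},
]

title_set = set()
unique_docs = []
for item in data:
    title = item['document']['title']
    if title not in title_set:  # keep only the first document per title
        title_set.add(title)
        unique_docs.append(item)

print(len(unique_docs))  # → 2 (only 2 unique titles exist among 3 records)
```

If only ~107k unique titles exist in the split, no setting of NUM_TRAIN above that can be satisfied by this filter.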

Hence my question: what script or settings (train/val split) do you use to preprocess NQ320k?

ArvinZhuang (Owner) commented May 2, 2024

Hi @kisozinov, sorry for the late reply.

I think the reason is that the NQ dataset on the Hugging Face dataset hub now only has 10.6k train examples:
[screenshot of the dataset card]

But originally it should have 307k. I am not sure what happened.

kisozinov (Author) commented May 2, 2024

@ArvinZhuang That is not the problem. I successfully downloaded the dataset again today and it has 307k/7k samples (maybe a transient bug?). Given the full dataset, did you use the standard train/test split (307k/7k)? The unique-title restriction from your script does not seem suitable then, as far as I understand :)

ArvinZhuang (Owner) commented:

@kisozinov Yes, I was using the standard train/test split. If nothing about the dataset is wrong, it is what it is.
The reason I filter on doc title is that it is not suitable to have two different document IDs for the same document (identified by its title); the filtering makes sure that cannot happen. So we end up with fewer documents because the train/test split had fewer unique documents.

Maybe the proper way is to sample additional documents from Wikipedia and add them to the corpus to bring it up to 320k docs.
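That top-up idea could be sketched as follows; this is a hypothetical helper (the function name, arguments, and toy data are all assumptions, not code from the repository), showing how one might sample extra unique-title documents until the corpus reaches a target size:

```python
import random

def top_up_corpus(corpus_titles, candidate_docs, target_size, seed=42):
    """Hypothetical sketch: sample candidate docs with unseen titles
    until the corpus holds target_size unique titles."""
    rng = random.Random(seed)                      # fixed seed for reproducibility
    titles = set(corpus_titles)
    # keep only candidates whose title is not already in the corpus
    pool = [d for d in candidate_docs if d['title'] not in titles]
    rng.shuffle(pool)
    added = []
    for doc in pool:
        if len(titles) >= target_size:             # corpus is large enough
            break
        titles.add(doc['title'])
        added.append(doc)
    return added

corpus = ['A', 'B']                                # 2 existing unique titles
candidates = [{'title': t} for t in ['A', 'C', 'D', 'E']]
extra = top_up_corpus(corpus, candidates, target_size=4)
print(len(extra))  # → 2 (corpus grows from 2 to 4 unique titles)
```

The same pattern would scale to topping up to 320k docs from a Wikipedia dump, with the title filter guaranteeing no duplicate identifiers are introduced.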

kisozinov (Author) commented:

I got it, thanks for the answer :)
