
NQ320k preprocessing? #16

Open
kisozinov opened this issue Apr 22, 2024 · 4 comments

kisozinov commented Apr 22, 2024

Hi, I tried to reproduce the results of your experiments on the NQ320k dataset as per the table from your paper:
[table screenshot from the paper]

To do this, I referred to your script from the old repository, but I ran into a problem: simply changing NUM_TRAIN=307000 and NUM_EVAL=7000 makes the script terminate partway through, probably because of repeated titles (it stops at ~107000).

```python
for ind in rand_inds:
    # we use the title as the doc identifier to prevent two docs having the same text
    title = data[ind]['document']['title']
    if title not in title_set:
        title_set.add(title)
```
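To illustrate why sampling stalls, here is a minimal, self-contained sketch of the same title-based deduplication; `data` below is a hypothetical stand-in list, not the actual NQ320k loader. When the dataset contains fewer unique titles than NUM_TRAIN, a loop that keeps drawing until it has NUM_TRAIN unique titles can never finish.

```python
# Hypothetical illustration of deduplicating documents by title.
# `data` is a toy stand-in, not the real NQ320k records.
data = [
    {'document': {'title': 'A', 'text': 'first'}},
    {'document': {'title': 'B', 'text': 'second'}},
    {'document': {'title': 'A', 'text': 'first (duplicate title)'}},
]

title_set = set()
unique_docs = []
for item in data:
    title = item['document']['title']
    if title not in title_set:  # keep only the first document per title
        title_set.add(title)
        unique_docs.append(item)

print(len(unique_docs))  # → 2 (only 2 unique titles exist among 3 records)
```

If only ~107k unique titles exist in the split, no setting of NUM_TRAIN above that can be satisfied by this filter.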

Hence my question: what script or settings (train/val split) do you use to preprocess NQ320k?

ArvinZhuang (Owner) commented May 2, 2024

Hi @kisozinov, sorry for the late reply.

I think the reason is that the NQ dataset on the Hugging Face dataset hub now only has 10.6k train examples:
[screenshot of the dataset card]

But originally it should have 307k. I am not sure what happened.

kisozinov (Author) commented May 2, 2024

@ArvinZhuang That is not the problem. I successfully downloaded the dataset again today and it has 307k/7k samples (maybe a transient bug?). Given the full dataset, did you use the standard train/test split (307k/7k)? The unique-title restriction from your script does not seem suitable then, as far as I understand :)

ArvinZhuang (Owner) commented:

@kisozinov Yes, I was using the standard train/test split. If nothing about the dataset is wrong, it is what it is.
The reason I filter on doc title is that it is not suitable to have two different document IDs for the same document (identified by its title); the filtering makes sure that cannot happen. So we end up with fewer documents because the train/test split had fewer unique documents.

Maybe the proper way is to sample additional documents from Wikipedia and add them to the corpus to bring it up to 320k docs.
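That top-up idea could be sketched as follows; this is a hypothetical helper (the function name, arguments, and toy data are all assumptions, not code from the repository), showing how one might sample extra unique-title documents until the corpus reaches a target size:

```python
import random

def top_up_corpus(corpus_titles, candidate_docs, target_size, seed=42):
    """Hypothetical sketch: sample candidate docs with unseen titles
    until the corpus holds target_size unique titles."""
    rng = random.Random(seed)                      # fixed seed for reproducibility
    titles = set(corpus_titles)
    # keep only candidates whose title is not already in the corpus
    pool = [d for d in candidate_docs if d['title'] not in titles]
    rng.shuffle(pool)
    added = []
    for doc in pool:
        if len(titles) >= target_size:             # corpus is large enough
            break
        titles.add(doc['title'])
        added.append(doc)
    return added

corpus = ['A', 'B']                                # 2 existing unique titles
candidates = [{'title': t} for t in ['A', 'C', 'D', 'E']]
extra = top_up_corpus(corpus, candidates, target_size=4)
print(len(extra))  # → 2 (corpus grows from 2 to 4 unique titles)
```

The same pattern would scale to topping up to 320k docs from a Wikipedia dump, with the title filter guaranteeing no duplicate identifiers are introduced.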

kisozinov (Author) commented:

I got it, thanks for the answer :)
