Question for the preprocess data in the reader #73

szhang42 · 2020-10-26T21:19:25Z

Hello,

Thanks for this nice work! I ran into a question while preprocessing data for the reader. Before training the reader, I first preprocessed the JSON files from the retriever (generated with the DPR-provided checkpoint) using the script below. The preprocessing generates nq-dev-dpr.0.pkl through nq-dev-dpr.15.pkl and nq-train-dpr.0.pkl through nq-train-dpr.15.pkl, so there are 16 pkl files for train and 16 for dev. However, if I directly use the checkpoints provided by DPR in data/download_data.py, shown below, the checkpoint for the train only has 9 pkl and the checkpoint for the dev only has 1 pkl.

Is there a reason why the number of pkl files I generate from preprocessing differs from the provided checkpoints?
Also, I didn't preprocess the retrieve nq-test json file in the reader. If I train the reader model on my own preprocessed train and dev JSON files, is it OK to directly use the DPR-provided test.0.pkl listed below (DPR's preprocessed data from retriever results) to test the reader? Thanks very much!

'data.reader.nq.single.train': {
    's3_url': ['https://dl.fbaipublicfiles.com/dpr/data/reader/nq/single/train.{}.pkl'.format(i) for i in range(8)],
    'original_ext': '.pkl',
    'compressed': False,
    'desc': 'Reader model NQ train dataset input data preprocessed from retriever results (also trained on NQ)',
    'license_files': NQ_LICENSE_FILES,
},

'data.reader.nq.single.dev': {
    's3_url': 'https://dl.fbaipublicfiles.com/dpr/data/reader/nq/single/dev.0.pkl',
    'original_ext': '.pkl',
    'compressed': False,
    'desc': 'Reader model NQ dev dataset input data preprocessed from retriever results (also trained on NQ)',
    'license_files': NQ_LICENSE_FILES,
},

'data.reader.nq.single.test': {
    's3_url': 'https://dl.fbaipublicfiles.com/dpr/data/reader/nq/single/test.0.pkl',
    'original_ext': '.pkl',
    'compressed': False,
    'desc': 'Reader model NQ test dataset input data preprocessed from retriever results (also trained on NQ)',
    'license_files': NQ_LICENSE_FILES,
},

python3 preprocess_reader_data.py \
    --retriever_results output/dpr_retrieval/nq-dev.json \
    --gold_passages output/data/gold_passages_info/nq_dev.json \
    --do_lower_case \
    --pretrained_model_cfg bert-base-uncased \
    --encoder_model_type hf_bert \
    --out_file output/dpr_retrieval/nq-dev-dpr
# For the train data, use nq-train.json and nq_train.json instead, and add --is_train_set
# (specify --is_train_set only when it is train data).
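
As a quick check on the file counts discussed here, the `s3_url` pattern from the `data.reader.nq.single.train` entry above can be expanded on its own. This is only a sketch that reuses the URL template and `range(8)` from the quoted config; the variable names are illustrative and not part of DPR.

```python
# Expand the s3_url pattern copied from the 'data.reader.nq.single.train' entry.
# URL_TEMPLATE and train_urls are illustrative names, not DPR identifiers.
URL_TEMPLATE = "https://dl.fbaipublicfiles.com/dpr/data/reader/nq/single/train.{}.pkl"

train_urls = [URL_TEMPLATE.format(i) for i in range(8)]

print(len(train_urls))  # 8
print(train_urls[0])    # first shard: train.0.pkl
print(train_urls[-1])   # last shard: train.7.pkl
```

With the config as quoted, the comprehension names shards `train.0.pkl` through `train.7.pkl`.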


vlad-karpukhin · 2020-10-28T18:47:40Z

Hello,

"... the checkpoint for the train only has 9 pkl and the checkpoint for the dev only has 1 pkl...." - there is a parameter that defines the number of parallel workers used to preprocess the data. Since the dev set is quite small, I just used 1 process, so there is only 1 pkl file.
So it doesn't really matter how many files you end up with; they are merged into one dataset when needed.
"Also, I didn't preprocess the retrieve nq-test json file in the reader." - then the reader code will trigger preprocessing automatically and place the .pkl files next to the json file. If you use our provided preprocessed file, you are using retrieval results from our retriever checkpoint. That is fine as long as it suits your research story: if you change something in the reader model, it is probably fine to use it; if you change the retriever model/checkpoint, then it is most likely not fair to use our results.
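
To illustrate why the shard count does not affect the resulting dataset, here is a minimal sketch. It assumes each `.pkl` shard holds a pickled Python list of samples; `write_shards` and `load_shards` are hypothetical helpers written for this example, not functions from the DPR codebase.

```python
import glob
import os
import pickle
import re
import tempfile

SHARD_RE = re.compile(r"\.(\d+)\.pkl$")


def write_shards(samples, out_prefix, num_shards):
    """Split samples into num_shards contiguous chunks, one <prefix>.<i>.pkl each."""
    chunk = max(1, -(-len(samples) // num_shards))  # ceiling division
    paths = []
    for i in range(num_shards):
        path = f"{out_prefix}.{i}.pkl"
        with open(path, "wb") as f:
            pickle.dump(samples[i * chunk:(i + 1) * chunk], f)
        paths.append(path)
    return paths


def load_shards(out_prefix):
    """Load every <prefix>.<n>.pkl shard in numeric order and merge into one list."""
    candidates = glob.glob(f"{glob.escape(out_prefix)}.*.pkl")
    paths = [p for p in candidates if SHARD_RE.search(p)]
    paths.sort(key=lambda p: int(SHARD_RE.search(p).group(1)))
    merged = []
    for path in paths:
        with open(path, "rb") as f:
            merged.extend(pickle.load(f))
    return merged


# Toy data: 16 shards and 1 shard give back the same merged dataset.
samples = list(range(100))
with tempfile.TemporaryDirectory() as tmp:
    write_shards(samples, os.path.join(tmp, "dev-16"), 16)
    write_shards(samples, os.path.join(tmp, "dev-1"), 1)
    same = load_shards(os.path.join(tmp, "dev-16")) == load_shards(os.path.join(tmp, "dev-1"))
print(same)
```

Sorting by the numeric suffix rather than by filename keeps shard 10 after shard 9, so the merged order matches the original sample order in this sketch.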

szhang42 · 2020-10-29T17:49:36Z

Hello,

Got it. Thanks very much for your responses!

szhang42 closed this as completed Oct 29, 2020

UmerTariq1 mentioned this issue Aug 25, 2022

Data from retriever is not working with reader #229

Open
