This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Question for the preprocess data in the reader #73

Closed
szhang42 opened this issue Oct 26, 2020 · 2 comments

Comments

szhang42 commented Oct 26, 2020

Hello,

Thanks for this nice work! I ran into a question while preprocessing data for the reader. Before training the reader, I first preprocessed the json files from the retriever (I used the DPR-provided checkpoint to produce these json files) using the script below. The preprocessing generates nq-dev-dpr.0.pkl -> nq-dev-dpr.15.pkl and nq-train-dpr.0.pkl -> nq-train-dpr.15.pkl, so there are 16 pkl files for train and 16 pkl files for dev. However, if I directly use the preprocessed files that DPR provides through data/download_data.py (entries shown below), the train set has only 9 pkl files and the dev set has only 1 pkl file.

  1. Is there a reason why the number of pkl files I generate with the preprocessing script differs from the number in the provided files?

  2. Also, I didn't preprocess the retrieved nq-test json file for the reader. If I train the reader model on my own preprocessed train and dev files, would it be OK to directly use the DPR-provided test.0.pkl shown below (the DPR-provided data preprocessed from retriever results) to test the reader? Thanks very much!

'data.reader.nq.single.train': {
    's3_url': ['https://dl.fbaipublicfiles.com/dpr/data/reader/nq/single/train.{}.pkl'.format(i) for i in range(8)],
    'original_ext': '.pkl',
    'compressed': False,
    'desc': 'Reader model NQ train dataset input data preprocessed from retriever results (also trained on NQ)',
    'license_files': NQ_LICENSE_FILES,
},

'data.reader.nq.single.dev': {
    's3_url': 'https://dl.fbaipublicfiles.com/dpr/data/reader/nq/single/dev.0.pkl',
    'original_ext': '.pkl',
    'compressed': False,
    'desc': 'Reader model NQ dev dataset input data preprocessed from retriever results (also trained on NQ)',
    'license_files': NQ_LICENSE_FILES,
},

'data.reader.nq.single.test': {
    's3_url': 'https://dl.fbaipublicfiles.com/dpr/data/reader/nq/single/test.0.pkl',
    'original_ext': '.pkl',
    'compressed': False,
    'desc': 'Reader model NQ test dataset input data preprocessed from retriever results (also trained on NQ)',
    'license_files': NQ_LICENSE_FILES,
},

python3 preprocess_reader_data.py \
    --retriever_results output/dpr_retrieval/nq-dev.json \
    --gold_passages output/data/gold_passages_info/nq_dev.json \
    --do_lower_case \
    --pretrained_model_cfg bert-base-uncased \
    --encoder_model_type hf_bert \
    --out_file output/dpr_retrieval/nq-dev-dpr
# For the train data, substitute the corresponding train json files and add --is_train_set
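
For reference, the `s3_url` list comprehension in the train entry above can be expanded to see exactly which shard files it names. This is a minimal sketch that reuses only the URL template from the snippet; the variable names are illustrative and not taken from `download_data.py`.

```python
# URL template copied from the 'data.reader.nq.single.train' entry above.
TEMPLATE = 'https://dl.fbaipublicfiles.com/dpr/data/reader/nq/single/train.{}.pkl'

# Same comprehension as in the config entry, bound to an illustrative name.
train_urls = [TEMPLATE.format(i) for i in range(8)]

for url in train_urls:
    print(url)
print(len(train_urls), "train shard URLs")
```

As quoted, `range(8)` produces eight URLs, `train.0.pkl` through `train.7.pkl`.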

@vlad-karpukhin
Contributor

Hello,

  1. "... the checkpoint for the train only has 9 pkl and the checkpoint for the dev only has 1 pkl...." - There is a parameter that sets the number of concurrent workers used to preprocess the data, and each worker writes its own pkl file. Since the dev set is quite small, I used just 1 process, so there is only 1 pkl file.
    It doesn't really matter how many files you end up with; they are merged into one dataset when needed.

  2. "Also, I didn't preprocess the retrieve nq-test json file in the reader." - In that case the reader code will trigger preprocessing automatically and place the .pkl files next to the json file. If you use our provided preprocessed file, you are using the retrieval results from our retriever checkpoint. That is fine as long as it suits your research story: if you change something in the reader model, it is probably fine to use it; if you change the retriever model/checkpoint, it is most likely not fair to use our results.
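
The point that the number of .pkl files does not matter can be illustrated with a small sketch. It is not the DPR implementation: `write_shards` and `load_shards` are hypothetical helpers, and the `<prefix>.<i>.pkl` naming is assumed only because it matches the file names mentioned in this thread. The same samples are written once as 1 shard and once as 16 shards, then merged back and compared.

```python
import glob
import os
import pickle
import tempfile


def write_shards(samples, out_prefix, num_shards):
    """Hypothetical helper: pickle samples into <out_prefix>.<i>.pkl chunks."""
    chunk = -(-len(samples) // num_shards)  # ceiling division
    for i in range(num_shards):
        with open(f"{out_prefix}.{i}.pkl", "wb") as f:
            pickle.dump(samples[i * chunk:(i + 1) * chunk], f)


def load_shards(out_prefix):
    """Hypothetical helper: read all shards in numeric order and concatenate."""
    paths = glob.glob(f"{glob.escape(out_prefix)}.*.pkl")
    # Sort by the integer shard index so that shard 10 comes after shard 9.
    paths.sort(key=lambda p: int(p.rsplit(".", 2)[1]))
    merged = []
    for path in paths:
        with open(path, "rb") as f:
            merged.extend(pickle.load(f))
    return merged


# Toy stand-in for preprocessed reader samples.
samples = [{"question": f"q{i}"} for i in range(100)]

with tempfile.TemporaryDirectory() as tmp:
    one = os.path.join(tmp, "dev")
    many = os.path.join(tmp, "train")
    write_shards(samples, one, num_shards=1)
    write_shards(samples, many, num_shards=16)
    # Both layouts merge back into the same list of samples.
    same = load_shards(one) == load_shards(many) == samples

print(same)
```

Under these assumptions, the merged dataset is identical whether it was stored in 1 file or 16, which is why differing shard counts between a local run and the provided files are not a concern by themselves.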

@szhang42
Author

Hello,

Got it. Thanks very much for your responses!
