[RAG] Generate Dense Embeddings Fix #3869

klshuster · 2021-07-27T21:24:37Z

Patch description
#3868 pointed out a couple of current shortcomings in the custom embeddings generation process for RAG/FiD DPR Models.

When specifying a ParlAI RAG or FiD model, one needs to set --dpr-model True; this is now clarified in the README.
When specifying a FiD-RAG model in particular, the document encoder must be initialized appropriately (this was not handled correctly before).
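
To illustrate the second point, here is a minimal sketch of reading the document encoder path from a saved options file and falling back to a default when it is absent. It assumes the `.opt` file next to the model is plain JSON and that the key `dpr_model_file` holds the document encoder path, as in the diff for `parlai/agents/rag/dpr.py`. The helper name `resolve_document_path` is hypothetical and is not part of ParlAI.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_document_path(
    query_path: str, default_document_path: Optional[str]
) -> Optional[str]:
    """Hypothetical helper: choose the document encoder path for a saved model.

    Assumes `<query_path>.opt` is a JSON file of saved options.
    """
    opt_file = Path(f"{query_path}.opt")
    try:
        with opt_file.open() as f:
            loaded_opt = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # No readable .opt file: keep whatever path was already chosen.
        return default_document_path
    # Prefer the path recorded in the saved options, if present.
    return loaded_opt.get("dpr_model_file", default_document_path)
```

Only the options file is read here; no model is built, so the lookup itself is cheap.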

Testing steps
After the change, I ran the command provided in #3868 with --dpr-model True, and the script appears to work correctly:

$ python parlai/agents/rag/scripts/generate_dense_embeddings.py -mf zoo:hallucination/bart_fid_rag/model --dpr-model true --passages-file data/models/hallucination/wow_passages/wow_articles.paragraphs.tsv --outfile /tmp/out --num-shards 1 --shard-id 0 -bs 32
17:12:05 | Overriding opt["model"] to dpr_agent (previously: fid)
17:12:05 | Overriding opt["interactive_candidates"] to inline (previously: fixed)
17:12:05 | Overriding opt["share_encoders"] to False (previously: True)
17:12:06 | Using CUDA
17:12:06 | DPR: full interactive mode on.
17:12:42 | Total parameters: 241,228,800 (240,442,368 trainable)
17:12:42 | Loading existing model parameters from /private/home/kshuster/ParlAI/data/models/hallucination/bart_fid_rag/model
17:12:49 | Loading data/models/hallucination/wow_passages/wow_articles.paragraphs.tsv
17:12:49 | Reading data from: data/models/hallucination/wow_passages/wow_articles.paragraphs.tsv
2863it [00:00, 105418.20it/s]
17:12:49 | Shard 0 of 1 encoding psg index 0 to 2862, out of 2862
17:12:51 | Encoded 320 out of 2862 passages
17:12:51 | Encoded 640 out of 2862 passages
17:12:52 | Encoded 960 out of 2862 passages
17:12:52 | Encoded 1280 out of 2862 passages
17:12:53 | Encoded 1600 out of 2862 passages
17:12:53 | Encoded 1920 out of 2862 passages
17:12:54 | Encoded 2240 out of 2862 passages
17:12:54 | Encoded 2560 out of 2862 passages
17:12:55 | Writing results to /tmp/out_0.pt
17:12:55 | Writing ids to /tmp/ids_0

mojtaba-komeili · 2021-07-27T21:36:02Z

parlai/agents/rag/dpr.py

@@ -203,13 +203,19 @@ def _get_build_options(cls, opt: Opt):
        try:
            # determine if loading a RAG model
            loaded_opt = Opt.load(f"{query_path}.opt")
-            if loaded_opt['model'] == 'rag' and loaded_opt['query_model'] in [
+            document_path = loaded_opt.get('dpr_model_file', document_path)


Does this slow things down if you have a FiD model and you load document_path again later? In that case, does it make sense to move this to the else branch of the following if?

I'm not entirely sure what you're asking here; we're only getting the document path from the .opt file. We should never actually build a FiD model in this scenario.

mojtaba-komeili

LGTM.

gen dense embs fix

b725cee

klshuster requested review from spencerp and mojtaba-komeili July 27, 2021 21:24

facebook-github-bot added the CLA Signed label Jul 27, 2021

klshuster mentioned this pull request Jul 27, 2021

generate_dense_embeddings.py doesn't load custom TSV? #3868

Closed

mojtaba-komeili reviewed Jul 27, 2021

mojtaba-komeili approved these changes Jul 27, 2021

klshuster merged commit ce02a0e into master Jul 27, 2021

klshuster deleted the rag_gen_dense_embs_Fix branch July 27, 2021 22:16
