This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[RAG] Generate Dense Embeddings Fix #3869

Merged
klshuster merged 1 commit into master from rag_gen_dense_embs_Fix on Jul 27, 2021

klshuster
Contributor

Patch description
#3868 pointed out a couple of current shortcomings in the custom embeddings generation process for RAG/FiD DPR Models.

  1. If specifying a ParlAI RAG or FiD model, one needs to set --dpr-model True; this has been clarified in the README.
  2. If specifying a FiD-RAG model in particular, the document encoder must be initialized appropriately (this was not handled correctly before).

Testing steps
After the change, I ran the command provided in #3868 with --dpr-model True, and the script appears to work correctly:

$ python parlai/agents/rag/scripts/generate_dense_embeddings.py -mf zoo:hallucination/bart_fid_rag/model --dpr-model true --passages-file data/models/hallucination/wow_passages/wow_articles.paragraphs.tsv --outfile /tmp/out --num-shards 1 --shard-id 0 -bs 32
17:12:05 | Overriding opt["model"] to dpr_agent (previously: fid)
17:12:05 | Overriding opt["interactive_candidates"] to inline (previously: fixed)
17:12:05 | Overriding opt["share_encoders"] to False (previously: True)
17:12:06 | Using CUDA
17:12:06 | DPR: full interactive mode on.
17:12:42 | Total parameters: 241,228,800 (240,442,368 trainable)
17:12:42 | Loading existing model parameters from /private/home/kshuster/ParlAI/data/models/hallucination/bart_fid_rag/model
17:12:49 | Loading data/models/hallucination/wow_passages/wow_articles.paragraphs.tsv
17:12:49 | Reading data from: data/models/hallucination/wow_passages/wow_articles.paragraphs.tsv
2863it [00:00, 105418.20it/s]
17:12:49 | Shard 0 of 1 encoding psg index 0 to 2862, out of 2862
17:12:51 | Encoded 320 out of 2862 passages
17:12:51 | Encoded 640 out of 2862 passages
17:12:52 | Encoded 960 out of 2862 passages
17:12:52 | Encoded 1280 out of 2862 passages
17:12:53 | Encoded 1600 out of 2862 passages
17:12:53 | Encoded 1920 out of 2862 passages
17:12:54 | Encoded 2240 out of 2862 passages
17:12:54 | Encoded 2560 out of 2862 passages
17:12:55 | Writing results to /tmp/out_0.pt
17:12:55 | Writing ids to /tmp/ids_0
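
For runs with more than one shard, the command above could be generated once per shard. The following is a minimal sketch, not part of the patch: `shard_commands` and `expected_outputs` are hypothetical helpers, the flags are only those that appear in the command above, and the output naming is an assumption inferred from the two "Writing ..." log lines rather than from the script's source.

```python
import os
import shlex
from typing import List, Tuple

# Script path as it appears in the command above.
SCRIPT = "parlai/agents/rag/scripts/generate_dense_embeddings.py"


def shard_commands(
    model_file: str,
    passages_file: str,
    outfile: str,
    num_shards: int,
    batch_size: int = 32,
) -> List[str]:
    """Hypothetical helper: build one shell command per shard."""
    commands = []
    for shard_id in range(num_shards):
        argv = [
            "python", SCRIPT,
            "-mf", model_file,
            "--dpr-model", "True",  # required for ParlAI RAG/FiD models
            "--passages-file", passages_file,
            "--outfile", outfile,
            "--num-shards", str(num_shards),
            "--shard-id", str(shard_id),
            "-bs", str(batch_size),
        ]
        commands.append(shlex.join(argv))
    return commands


def expected_outputs(outfile: str, shard_id: int) -> Tuple[str, str]:
    """Assumed naming, inferred from the log: /tmp/out -> /tmp/out_0.pt, /tmp/ids_0."""
    embeddings = f"{outfile}_{shard_id}.pt"
    ids = os.path.join(os.path.dirname(outfile), f"ids_{shard_id}")
    return embeddings, ids


if __name__ == "__main__":
    for cmd in shard_commands(
        "zoo:hallucination/bart_fid_rag/model",
        "data/models/hallucination/wow_passages/wow_articles.paragraphs.tsv",
        "/tmp/out",
        num_shards=2,
    ):
        print(cmd)
```

Each printed command can be run on its own; whether shards are launched sequentially or in parallel is left to the caller.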

@@ -203,13 +203,19 @@ def _get_build_options(cls, opt: Opt):
try:
# determine if loading a RAG model
loaded_opt = Opt.load(f"{query_path}.opt")
if loaded_opt['model'] == 'rag' and loaded_opt['query_model'] in [
document_path = loaded_opt.get('dpr_model_file', document_path)
Contributor

Does this slow things down if you have a FiD model and load document_path again later? If so, would it make sense to move this into the else branch of the following if?

Contributor Author

I'm not entirely sure what you're asking here: we're only getting the document path from the .opt file. We should never actually build a FiD model in this scenario.
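
To illustrate the point that only the option file is read, here is a minimal sketch of the lookup. It is not ParlAI's code: `resolve_document_path` is a hypothetical helper, and it assumes the `.opt` file is plain JSON, using `json.load` in place of `Opt.load` so that it runs without ParlAI installed. Only the `dpr_model_file` key and the fall-back to the existing `document_path` come from the diff above.

```python
import json
import os


def resolve_document_path(query_path: str, document_path: str) -> str:
    """Hypothetical helper: choose the document-encoder checkpoint path.

    Reads only the small `<query_path>.opt` file. No model is built here,
    so the cost is a single file read and a dictionary lookup.
    """
    opt_file = f"{query_path}.opt"
    if not os.path.isfile(opt_file):
        # No saved options: keep whatever path the caller already had.
        return document_path
    with open(opt_file) as f:
        loaded_opt = json.load(f)  # assumption: the .opt file is JSON
    # Mirrors the added line in the diff: prefer the saved value, else fall back.
    return loaded_opt.get("dpr_model_file", document_path)
```

With this shape, a model whose saved options carry `dpr_model_file` gets its own document encoder path, and every other model keeps the default.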

Contributor

@mojtaba-komeili mojtaba-komeili left a comment

LGTM.

@klshuster klshuster merged commit ce02a0e into master Jul 27, 2021
@klshuster klshuster deleted the rag_gen_dense_embs_Fix branch July 27, 2021 22:16