[Wizard of Internet] Gold retrieved documents FiD Agent. #4123
Conversation
Haven't finished the document chunk splitter yet. Will add that soon as well.
parlai/agents/rag/retrievers.py
Outdated
return self._delimiter

def pick_chunk(self, doc_text: str, selected_sentences: List[str]):
    # TODO replace with a better doc chunk selector.
we have an arg for doc chunk length, right? could probably use this to start
This method is still a place-holder until the rest of the code works. Trying to resolve another issue with the code.
this is the code for using the data from a teacher? yes, we won't need this if so because we can use the mutator i just wrote?
actually, yeah, let's have no model-side logic for this, don't want any unintended side effects
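As a starting point for the chunk-length suggestion above, a selector could split the document into fixed-size word chunks and return the one containing a selected sentence. A minimal sketch, not the PR's actual implementation (the `chunk_length` parameter, fallback behavior, and word-level splitting are all illustrative assumptions):

```python
from typing import List


def pick_chunk(doc_text: str, selected_sentences: List[str], chunk_length: int = 100) -> str:
    """
    Split doc_text into word-level chunks of roughly `chunk_length` words and
    return the first chunk containing one of the selected sentences.
    Falls back to the first chunk if no sentence matches.
    """
    words = doc_text.split()
    chunks = [
        ' '.join(words[i:i + chunk_length])
        for i in range(0, len(words), chunk_length)
    ]
    for chunk in chunks:
        if any(sent in chunk for sent in selected_sentences):
            return chunk
    return chunks[0] if chunks else ''
```

Note a selected sentence that straddles a chunk boundary would not be matched by this sketch; a real selector would need sentence-aware splitting.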
parlai/agents/fid/fid.py
Outdated
super().__init__(opt, shared=shared)

def observe(self, observation: Union[Dict, Message]) -> Message:
    self.model.retriever.set_retrieve_doc(
I'd check if 'text_vec' not in Batch before setting, so we can skip padding examples
Yeah, I should check for padding too.
ditto for model_api
I'm assuming we'll want this retriever to handle batching, yeah? I think the correct way to do this would be to handle it within batchify, rather than observe; at that point, we have access to all observations, so we can set the retrieved documents for each observation (you can check BlenderBot2RagAgent.batchify to see how this is handled)
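The batchify-based approach suggested above could look roughly like the following sketch. The retriever class and its internals here are simplified, hypothetical stand-ins for the PR's code; the `'text_vec'` membership check is the padding-skip idea from the earlier comment:

```python
from typing import Any, Dict, List


class ObservationEchoRetriever:
    """Illustrative stand-in: stores gold docs keyed by the query text."""

    def __init__(self) -> None:
        self._docs_by_query: Dict[str, List[str]] = {}

    def set_retrieve_doc(self, query: str, docs: List[str]) -> None:
        self._docs_by_query[query] = docs


def batchify_set_docs(retriever: ObservationEchoRetriever,
                      observations: List[Dict[str, Any]]) -> None:
    """Register gold docs for every real observation in the batch.

    Padding examples carry no 'text_vec', so they are skipped.
    """
    for obs in observations:
        if 'text_vec' not in obs:
            continue
        retriever.set_retrieve_doc(obs.get('text', ''), obs.get('retrieved_docs', []))
```

In the real agent this would live inside the `batchify` override, where all observations for the batch are available at once.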
parlai/agents/fid/fid.py
Outdated
    return observation

if not observation.get(consts.RETRIEVED_DOCS):
    self.model.retriever.set_retrieve_doc(
can we do self.model_api.retriever here? that's a protection against dataparallel
I moved the model_api to RagAgent. That should cover it for BB2 as well, so I removed it from there. Let's see how many things are gonna break now 😃
parlai/agents/fid/fid.py
Outdated
super().__init__(opt, shared=shared)

def observe(self, observation: Union[Dict, Message]) -> Message:
    self.model.retriever.set_retrieve_doc(
ditto for model_api
parlai/agents/fid/fid.py
Outdated
    retrieved_docs=None, selected_docs=None, selected_sentences=None
)
else:
    rertrieved_docs = []
nit: "retrieved*"
def retrieve_and_score(
    self, query: torch.LongTensor
) -> Tuple[List[List[Document]], torch.Tensor]:
hmm so this still doesn't handle batching?
thanks, batching looks great, although I had one minor concern related to the uniqueness of the queries...
have you encountered any issues with the same query mapping to 2 sets of documents? if not, perhaps we can just leave it as is and add a warning or something?
parlai/agents/rag/retrievers.py
Outdated
def add_retrieve_doc(self, query: str, retrieved_docs: List[Document]):
    new_idx = len(self._querie_ids)
    self._querie_ids[query] = new_idx
what if query is already in querie_ids? we might want to make this more robust --> maybe hash the observation? or use the query<sep>label?
i.e., lots of conversation starters can be similar for some datasets
I see what you mean. I believe I checked this for WizInt and we didn't have collisions. My reason for ignoring it was that it makes sense for them to have the same knowledge if the conversation is the same. Given that retrieve_and_score only sees the query (not the label), it gets slightly more complicated to include the label in the query encoding of this agent.
yeah, i suppose we're operating under the (strong) assumption that the query is going to be the full dialogue context, and not the last turn of dialogue.
for example, one could imagine a mid-conversation utterance, "that's so cool!", eliciting different documents depending on the preceding dialogue context
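A collision-robust variant of the registration discussed above might hash the full query and warn when the same key maps to a different document set, as suggested earlier in the thread. The class name and internals below are illustrative assumptions, not the PR's actual code:

```python
import hashlib
import warnings
from typing import Dict, List


class GoldDocStore:
    """Illustrative sketch: index retrieved docs by a hash of the full query,
    warning when the same query is re-registered with different documents."""

    def __init__(self) -> None:
        self._docs: Dict[str, List[str]] = {}

    def add_retrieve_doc(self, query: str, retrieved_docs: List[str]) -> None:
        # Hash the full query so long dialogue contexts make cheap dict keys.
        key = hashlib.sha256(query.encode('utf-8')).hexdigest()
        if key in self._docs and self._docs[key] != retrieved_docs:
            warnings.warn(f'Query collision: overwriting docs for {query[:40]!r}')
        self._docs[key] = retrieved_docs
```

Using the full dialogue context as the query (as the thread concludes below with the full_history constraint) makes such collisions unlikely in practice, so the warning is just a safety net.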
One more major change: the
parlai/agents/fid/fid.py
Outdated
if observation.get(consts.RETRIEVED_DOCS):
    for doc_id, doc_title, doc_txt in zip(
        observation[consts.RETRIEVED_DOCS_URLS],
        observation[consts.RETRIEVED_DOCS_TITLES],
        observation[consts.RETRIEVED_DOCS],
    ):
        retrieved_docs.append(
            Document(docid=doc_id, title=doc_title, text=doc_txt)
        )
this is redundant now, right?
def __init__(self, opt: Opt, shared: TShared = None):
    opt = deepcopy(opt)
    opt['rag_retriever_type'] = RetrieverType.OBSERVATION_ECHO_RETRIEVER.value
    if opt['rag_retriever_query'] != 'full_history':
@klshuster regarding our previous conversation on the possibility of collisions between contexts for document retrieval, I see this as an easy solution to avoid that. I believe forcing the use of full history for this type of agent makes sense, given that the whole retrieval process is recalling some past conversation. Let me know what you think.
yep, makes sense to me
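The full_history constraint agreed on above amounts to a small option guard. A sketch under the assumption that the guard overrides any conflicting setting (the helper name is hypothetical; the PR does this inside `__init__`):

```python
from copy import deepcopy


def force_full_history(opt: dict) -> dict:
    """Ensure the retrieval query is the full dialogue history.

    The echo retriever keys gold documents by the query, so using anything
    shorter than the full context (e.g. only the last turn) risks collisions
    between different conversations sharing an utterance.
    """
    opt = deepcopy(opt)  # avoid mutating the caller's opt
    if opt.get('rag_retriever_query') != 'full_history':
        opt['rag_retriever_query'] = 'full_history'
    return opt
```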
Patch description
GoldDocRetrieverFiDAgent
: It uses a custom retriever that, instead of retrieving from a corpus, fetches the documents included in the example itself. For the Wizard of Internet dataset, these are the documents that were shown to the crowdworkers at the time of data collection (hence, gold retrieved docs).