FiD gold retrieved docs agent n-docs debug #4146
This looks good for the `WizIntGoldDocRetriever`, but this should probably be respected at the top level as well --> i.e., in `_set_query_vec`, we should cut the retrieved documents to be no more than `opt['n_docs']`
parlai/agents/fid/fid.py
Outdated
    continue

retrieved_docs.append(self._extract_doc_from_message(message, doc_idx))
if len(retrieved_docs) == self._n_docs:
why is this not checked before this for-loop?
I don't think we gain much by adding it before the for loop. Keeping the check inside the loop, so that the loop breaks right after it starts, also makes the code a bit more succinct. But I can move it to the top of the loop to avoid an extra check on `already_added_doc_idx`.
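For reference, a minimal standalone sketch of the "check at the top of the loop" variant discussed here. It is not the code in `fid.py`: the function name `collect_docs` and its arguments are hypothetical, and documents are plain strings for illustration.

```python
from typing import List, Sequence, Set


def collect_docs(doc_ids: Sequence[int], docs: Sequence[str], n_docs: int) -> List[str]:
    """Collect at most n_docs unique documents (hypothetical helper)."""
    retrieved_docs: List[str] = []
    already_added_doc_idx: Set[int] = set()
    for doc_idx in doc_ids:
        # Limit check first, so no duplicate lookup happens once we are full.
        if len(retrieved_docs) >= n_docs:
            break
        if doc_idx in already_added_doc_idx:
            continue
        already_added_doc_idx.add(doc_idx)
        retrieved_docs.append(docs[doc_idx])
    return retrieved_docs
```

With the check at the top, `n_docs = 0` also yields an empty list without a separate guard before the loop.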
@@ -352,6 +353,14 @@ def get_retrieved_knowledge(self, message):

def _set_query_vec(self, observation: Message) -> Message:
    retrieved_docs = self.get_retrieved_knowledge(observation)
    if len(retrieved_docs) > self._n_docs:
@klshuster picking the right documents, if there are more than `n_docs`, depends a lot on the dataset. I added a warning here and trim the list down to the first `n_docs`. I hope this properly addresses your comment. LMK.
sounds good to me. given how interactive this retriever is (i.e., interactive with the user --> the user has total control over what documents are being passed), it makes sense that the onus is on them to either 1) provide the right number of documents, or 2) set `--n-docs` correctly
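A rough, self-contained sketch of the warn-and-trim behavior agreed on in this thread. The helper name `trim_retrieved_docs` and the use of the standard `logging` module are assumptions for illustration; the actual change lives inside `_set_query_vec` and may log differently.

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


def trim_retrieved_docs(retrieved_docs: List[str], n_docs: int) -> List[str]:
    """Warn and keep only the first n_docs documents (hypothetical helper)."""
    if len(retrieved_docs) > n_docs:
        logger.warning(
            "Retrieved %d documents instead of the expected %d; "
            "using only the first %d.",
            len(retrieved_docs), n_docs, n_docs,
        )
        return retrieved_docs[:n_docs]
    # Nothing to trim when the caller already respected the limit.
    return retrieved_docs
```

Keeping the first `n_docs` is a deliberately simple policy: as noted above, which documents are the "right" ones depends on the dataset, so the caller remains responsible for ordering or for setting `--n-docs`.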
parlai/agents/fid/fid.py
Outdated
f'Your `get_retrieved_knowledge` method returned {len(retrieved_docs)} Documents, '
f'instead of the expected {self._n_docs}. '
f'This agent will only use the first {self._n_docs} Documents. '
'Consider modifying your implementation of `get_retrieved_knowledge` to avoid unexpected results.'
can we also say 'or set the --n-docs parameter accordingly' to inform the user that they can do that too?
Patch description
Working with `WizIntGoldDocRetrieverFiDAgent`, I noticed it OOMing more often than it should. Looking into it, I realized that the agent was including all the retrieved docs. WoI doc chunk mutators split a document into 100 chunks or more, and `ObservationEchoRetriever` returns all of these docs, ignoring the `n_docs` parameter.
This patch fixes this issue by picking `n_docs` docs from the list of docs and then adding some filler docs.
NOTE: this problem could also be fixed upstream by setting `--woi-doc-max-chunks` from `woi_dropout_retrieved_docs`. But this patch makes sure that it is safe to run with other mutators and teachers as well.
Testing it
Ran 100 steps of training with a debug level message that shows the number of docs before and after trimming.
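As an illustration of the pick-then-pad behavior from the patch description, together with the kind of debug message used in testing, here is a minimal sketch. The name `pick_and_pad`, the empty-string filler, and the log format are all assumptions made for this example and are not taken from the patch.

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


def pick_and_pad(docs: List[str], n_docs: int, filler: str = "") -> List[str]:
    """Return exactly n_docs documents: trim extras, pad with filler (hypothetical)."""
    picked = docs[:n_docs]
    # Pad with filler docs when fewer than n_docs were provided.
    picked = picked + [filler] * (n_docs - len(picked))
    logger.debug("n docs before trimming: %d, after: %d", len(docs), len(picked))
    return picked
```

For example, a chunk mutator that emits 100 chunks with `n_docs = 5` would leave 5 documents here, which is the bound that keeps memory use predictable.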