This repository has been archived by the owner on Nov 3, 2023. It is now read-only.
Patch description
This is a follow-up PR to #4169. In that PR, I added support for logging token probabilities and token ranks for the outputs of ParlAI models. However, after using it, it became clear that we would like to log additional token-level metadata, such as the top 10 tokens and the top-ranked token (relevant for sampling-based decoding methods).
Rather than add these features directly, I am instead making the token-level metadata object more flexible. In this PR, each token has associated with it a typed dictionary, `_PathSelectionTokenDetails`, that contains the `token_score` and `token_rank` of the relevant token. No code outside the `TreeSearch:select_paths` method implementations and this typed dictionary's definition makes any reference to the specific fields in the dictionary. This makes it easy to override the dictionary's definition and a `TreeSearch:select_paths` implementation to add more verbose metadata. Since different research use cases may want to generate different token-level metadata, this approach will be more future-proof.

Additionally, I make a small change to how token probabilities are logged in nucleus sampling. Instead of logging token probabilities from the truncated (nucleus) distribution, we now log token probabilities from the non-truncated distribution.
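The pattern can be sketched as follows. The `_PathSelectionTokenDetails`, `token_score`, and `token_rank` names come from this PR; the extension class and its fields are hypothetical illustrations, not ParlAI's actual definitions:

```python
from typing import List, TypedDict


class _PathSelectionTokenDetails(TypedDict, total=False):
    """Metadata attached to each generated token by select_paths."""

    token_score: float  # normalized probability of the chosen token
    token_rank: int  # rank of the chosen token in the model's distribution


# Hypothetical extension for sampling-based decoding: a TreeSearch subclass
# could pair a widened dict with its own select_paths override, and no other
# code needs to change because nothing else reads the specific fields.
class _SamplingTokenDetails(_PathSelectionTokenDetails, total=False):
    top_tokens: List[str]  # e.g. the top 10 candidate tokens at this step
    top_token: str  # the single highest-ranked token


details: _SamplingTokenDetails = {
    "token_score": 0.42,
    "token_rank": 3,
    "top_token": "hello",
}
```

Since `TypedDict` is a plain `dict` at runtime, existing logging code that serializes the metadata keeps working regardless of which fields a given `select_paths` implementation fills in.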
Finally, token-level scores are now returned as normalized probabilities rather than log probabilities.
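The nucleus-sampling change can be illustrated with a toy distribution (made-up logits and a stdlib-only sketch, not ParlAI's implementation): the logged probability comes from the full softmax, not from the renormalized nucleus.

```python
import math

logits = [2.0, 1.0, 0.5, -1.0]

# Full softmax over all tokens: these normalized probabilities are what
# gets logged per token after this PR.
z = sum(math.exp(x) for x in logits)
full_probs = [math.exp(x) / z for x in logits]

# Nucleus (top-p) truncation with p = 0.9: keep the smallest prefix of
# tokens, sorted by probability, whose total mass reaches p, then
# renormalize within that prefix.
p = 0.9
sorted_probs = sorted(full_probs, reverse=True)
kept, mass = [], 0.0
for q in sorted_probs:
    kept.append(q)
    mass += q
    if mass >= p:
        break
nucleus_probs = [q / mass for q in kept]

# Sampling still draws from nucleus_probs, but the logged token_score for
# a sampled token is its entry in full_probs, not the inflated
# renormalized value.
```

The renormalized nucleus probabilities are always at least as large as the full-distribution ones, so logging from the truncated distribution would overstate how confident the model actually was in each sampled token.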
Testing steps
pytest tests/test_tga.py
parlai dm --model-file zoo:unittest/transformer_generator2/model --truncate 1024 -v --task integration_tests:multiturn_nocandidate -ne 1 --inference beam --beam-size 3