Open-sourcing SM-Turn and SM-Dialog from the human eval paper #4333
Conversation
Mephisto stuff looks good to me! Cool to see the ModelChatBlueprint
is successfully working in a re-use/extension pattern.
import parlai.utils.logging as logging
from parlai.crowdsourcing.tasks.model_chat.model_chat_blueprint import (
    BLUEPRINT_TYPE,
)  # noqa: F401  # For registering the blueprint
Annoyingly, I think you need to put this comment on line 16 for the linter to successfully ignore this import.
Ah, good point - just changed.
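For reference, flake8 matches a noqa comment against the physical line it reports the error on, so the suppression has to sit on whichever line gets flagged (which can vary by flake8 version for a multi-line import). A sketch of one such placement, with the comment moved onto the imported name's line:

    from parlai.crowdsourcing.tasks.model_chat.model_chat_blueprint import (
        BLUEPRINT_TYPE,  # noqa: F401  # For registering the blueprint
    )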
Looks great! Besides unit tests, there should be just a few minor improvements left in the next steps.
worker_id = task_unit['worker_id']
assignment_id = task_unit['assignment_id']

## Determining whether the task unit should be skipped
Nit: Let's stick to one # for comments?
Yeah good point - so, my convention has often been to use two hash symbols for a new section of code: i.e., this comment is indicating that the following 100-ish lines deal with checking whether this task should be skipped. But if this isn't clear to others, then maybe something more obvious should be used. Is there a particular notation that you use for this?
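(For concreteness, that two-tier convention would look like the following illustrative lines, which are not code from this PR:

    ## Determining whether the task unit should be skipped
    # Single-# comments annotate individual lines within the section
    worker_id = task_unit['worker_id']
)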
As an aside, often if I find a chunk of code is complex enough to warrant something like that, it's complex enough to move into a nicely-named helper function.
    if should_skip_unit(...):
        continue
I understand that in this scripting setup, this chunk also has some external effects (changing conversation counts, doing some data extraction, etc.). But in that case, those effects are critical to its function, and thus the whole code block is doing more than "determining whether the task unit should be skipped". Overall it feels a little like an anti-pattern to me.
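A minimal sketch of that refactor, with the helper name, dictionary fields, and skip conditions all illustrative rather than taken from this PR:

    from typing import Dict, List

    def should_skip_unit(task_unit: Dict, conversations_needed: Dict[str, int]) -> bool:
        """Illustrative only: decide whether a unit should be skipped in analysis."""
        if task_unit.get('status') != 'completed':
            return True
        model = task_unit.get('model_name', '')
        return conversations_needed.get(model, 0) <= 0

    def compile_units(task_units: List[Dict], conversations_needed: Dict[str, int]):
        for task_unit in task_units:
            if should_skip_unit(task_unit, conversations_needed):
                continue
            # Side effects (decrementing conversation counts, extracting data)
            # stay in the loop body, or move into their own named helpers
            conversations_needed[task_unit['model_name']] -= 1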
Yes, @JackUrb - definitely agreed that the original phrasing of that comment was incomplete. I've just removed that comment entirely to avoid confusion. Thanks!
conversation_start_mode: 'hi'
annotation_question: "Please answer the following:"
conversations_needed_string: "blender_90M:10"
final_rating_question: "Please rate how much you'd prefer to talk to your partner for a long conversation. (1: Would not at all prefer, 5: Would very much prefer)|Please rate how human your partner sounds. (1: Very inhuman, 5: Very human)|Please rate how interesting your partner is. (1: Very boring, 5: Very interesting)"
Can we wrap this line?
Yeah, good call - fixed this and then re-tested to make sure the questions still render properly.
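For reference, one way such a line can be wrapped is with a YAML folded scalar, which joins the lines back into a single string with single spaces (a sketch only, not necessarily the fix applied here; the breaks must fall on existing spaces so the pipe-delimited questions come out unchanged):

    final_rating_question: >-
      Please rate how much you'd prefer to talk to your partner for a long
      conversation. (1: Would not at all prefer, 5: Would very much
      prefer)|Please rate how human your partner sounds. (1: Very inhuman,
      5: Very human)|Please rate how interesting your partner is. (1: Very
      boring, 5: Very interesting)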
Patch description
Open-sourcing the single-model technique from the human eval comparison paper: this technique runs the evaluations for SM-Turn and SM-Dialog in that work. A sample YAML file, the model opt file and onboarding examples used in the paper, and an analysis script are also released.
We also modify the existing model chat task (in parlai/crowdsourcing/tasks/model_chat/worlds.py) to fix an error when loading in HITs using Mephisto's DataBrowser.
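For context, loading HIT data through Mephisto's DataBrowser looks roughly like the following (a sketch; the task name is a placeholder, and the import paths reflect Mephisto's layout at the time of this PR):

    from mephisto.abstractions.databases.local_database import LocalMephistoDB
    from mephisto.tools.data_browser import DataBrowser

    db = LocalMephistoDB()
    browser = DataBrowser(db=db)
    # 'model_chat' is a placeholder task name
    for unit in browser.get_units_for_task_name('model_chat'):
        unit_data = browser.get_data_from_unit(unit)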
Testing steps
Commands
To launch local HITs:
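(A hypothetical command shape only - the path and conf name below are assumptions, not taken from this PR; ParlAI crowdsourcing tasks are typically launched via a run.py with a Hydra conf:

    python parlai/crowdsourcing/tasks/model_chat/run.py conf=example
)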
To launch analysis of HITs:
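(Again a hypothetical shape - the script path is an assumption; ParlAI crowdsourcing tasks are typically analyzed with a compile_results script:

    python parlai/crowdsourcing/tasks/model_chat/analysis/compile_results.py
)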
Unit tests are forthcoming (prioritizing the initial code release).
Screenshots
Onboarding:
Conversation flow: