This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Open-sourcing SM-Turn and SM-Dialog from the human eval paper #4333

Merged
merged 17 commits into main from smturn on Feb 2, 2022

Conversation

EricMichaelSmith
Contributor

Patch description

Open-sourcing the single-model evaluation technique from the human eval comparison paper: this technique runs the SM-Turn and SM-Dialog evaluations from that work. A sample YAML config file, the model opt file and onboarding examples used in the paper, and an analysis script are also released.

We also modify the existing model chat task (in parlai/crowdsourcing/tasks/model_chat/worlds.py) to fix an error when loading HITs with Mephisto's DataBrowser.
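For context, a minimal sketch of the loading pattern in question, assuming Mephisto's standard DataBrowser API (illustrative only, not the actual fix):

from mephisto.abstractions.databases.local_database import LocalMephistoDB
from mephisto.tools.data_browser import DataBrowser

# Load all units that were run under a given Mephisto task name
db = LocalMephistoDB()
data_browser = DataBrowser(db=db)
units = data_browser.get_units_for_task_name('single_model_eval')
for unit in units:
    # Each unit's saved data includes worker info and the conversation itself
    unit_data = data_browser.get_data_from_unit(unit)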

Testing steps

Commands

To launch local HITs:

python parlai/crowdsourcing/projects/humaneval/single_model_eval/run.py

To launch analysis of HITs:

python parlai/crowdsourcing/projects/humaneval/single_model_eval/analysis/compile_results.py \
--task-name single_model_eval \
--output-folder ${ANALYSIS_SAVE_FOLDER}

Unit tests are forthcoming (prioritizing the initial code release).

Screenshots

Onboarding:
[Screenshot: Screen Shot 2022-01-20 at 5.00.29 PM]

Conversation flow:
[Screenshot: Screen Shot 2022-01-20 at 5.01.50 PM]
[Screenshot: Screen Shot 2022-01-20 at 5.03.47 PM]

Contributor
@JackUrb left a comment


Mephisto stuff looks good to me! Cool to see the ModelChatBlueprint is successfully working in a re-use/extension pattern.

import parlai.utils.logging as logging
from parlai.crowdsourcing.tasks.model_chat.model_chat_blueprint import (
BLUEPRINT_TYPE,
) # noqa: F401 # For registering the blueprint
Contributor


Annoyingly I think you need to put this comment on line 16 for the lint to successfully ignore this import.
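For example (a sketch, assuming flake8 attaches the F401 warning to the line where the import statement begins):

from parlai.crowdsourcing.tasks.model_chat.model_chat_blueprint import (  # noqa: F401
    BLUEPRINT_TYPE,
)  # For registering the blueprint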

Contributor Author


ah good point, just changed

Contributor
@Rebecca-Qian left a comment


Looks great! Besides unit tests, there should be just a few minor improvements left as next steps.

worker_id = task_unit['worker_id']
assignment_id = task_unit['assignment_id']

# # Determining whether the task unit should be skipped
Contributor


Nit: Let's stick to one # for comments?

Contributor Author


Yeah good point - so, my convention has often been to use two hash symbols for a new section of code: i.e., this comment is indicating that the following 100-ish lines deal with checking whether this task should be skipped. But if this isn't clear to others, then maybe something more obvious should be used. Is there a particular notation that you use for this?

Contributor
@JackUrb Jan 31, 2022


As an aside, often if I find a chunk of code is complex enough to warrant something like that, it's complex enough to move into a nicely-named helper function.

if should_skip_unit(...):
  continue

I understand that in this scripting setup, this chunk also has some external effects (changing convos counts, doing some data extraction, etc). But in that case, those are critical to its function, and thus the whole code block is doing more than "determining whether the task unit should be skipped". Overall feels a little like an anti-pattern to me.
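A hypothetical sketch of that pattern (all names here are illustrative, not taken from the actual script):

from typing import Any, Dict, Optional

def extract_unit_data(
    task_unit: Dict[str, Any], convo_counts: Dict[str, int], max_convos: int
) -> Optional[Dict[str, Any]]:
    """Return the unit's data for analysis, or None if it should be skipped.

    The side effect (updating convo_counts) lives inside the helper so that
    the calling loop stays simple.
    """
    model_name = task_unit.get('model_name')
    if model_name is None or convo_counts.get(model_name, 0) >= max_convos:
        return None
    convo_counts[model_name] = convo_counts.get(model_name, 0) + 1
    return {
        'worker_id': task_unit['worker_id'],
        'assignment_id': task_unit['assignment_id'],
    }

# The calling loop then reduces to:
#   data = extract_unit_data(task_unit, convo_counts, max_convos)
#   if data is None:
#       continue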

Contributor Author
@EricMichaelSmith Jan 31, 2022


Yes - @JackUrb definitely agreed that the original phrasing of that comment was incomplete. I've just removed that comment entirely to avoid confusion. Thanks!

conversation_start_mode: 'hi'
annotation_question: "Please answer the following:"
conversations_needed_string: "blender_90M:10"
final_rating_question: "Please rate how much you'd prefer to talk to your partner for a long conversation. (1: Would not at all prefer, 5: Would very much prefer)|Please rate how human your partner sounds. (1: Very inhuman, 5: Very human)|Please rate how interesting your partner is. (1: Very boring, 5: Very interesting)"
Contributor


Can we wrap this line?
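For example, a folded block scalar would keep the string identical while wrapping the source line (a sketch; the merged fix itself isn't shown in this thread):

final_rating_question: >-
  Please rate how much you'd prefer to talk to your partner for a long
  conversation. (1: Would not at all prefer, 5: Would very much
  prefer)|Please rate how human your partner sounds. (1: Very inhuman, 5:
  Very human)|Please rate how interesting your partner is. (1: Very
  boring, 5: Very interesting)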

Contributor Author


Yeah, good call - fixed this and then re-tested to make sure the questions still render properly.

@EricMichaelSmith merged commit 3d15cea into main on Feb 2, 2022
@EricMichaelSmith deleted the smturn branch on February 2, 2022