This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Create Per-Turn Evaluation Folder in ParlAI #4323

Merged: 13 commits into main on Feb 4, 2022

Conversation


@Rebecca-Qian Rebecca-Qian commented Jan 26, 2022

Patch description
This PR open-sources the initial version of the Per-Turn Evaluation project from the parlai-internal repo, creating a new pairwise_per_turn_eval directory in parlai/crowdsourcing/tasks.

Refactors and further cleanup will come in a separate PR.

File structure

  • analysis/: Analysis code for compiling per-turn evaluation experiment results.
  • frontend/: All task UX components used to render the chat task, including onboarding and error panes.
  • hydra_configs/conf/: Hydra configs used to specify the parameters of individual experiment runs, initialized with an example example_model_comparison.yaml file. Some task-specific parameters have been removed or stubbed out.
  • task_config/: Task data and configs, e.g., onboarding data and JSON configs.
  • README.md: Shortened README giving an overview of the per-turn eval project.
  • bot_agent.py
  • impl.py
  • per_turn_eval_blueprint.py
  • run.py
  • utils.py
  • worlds.py
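
As a rough illustration of what a config in hydra_configs/conf/ might look like, here is a minimal sketch in the usual Mephisto/Hydra layout; the field names and values below are illustrative assumptions, not the actual contents of example_model_comparison.yaml (only the task name per_turn_eval__engaging appears in this PR, in the analysis command further down):

```yaml
# Illustrative sketch only -- not the real example_model_comparison.yaml.
defaults:
  - /mephisto/blueprint: pairwise_per_turn_eval
  - /mephisto/architect: local
  - /mephisto/provider: mock
mephisto:
  task:
    task_name: per_turn_eval__engaging
    task_title: "Chat and choose the better response"
```

A config like this would then be selected at launch time via Hydra, e.g. by passing conf=example_model_comparison to run.py.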

Testing steps
Ran HITs end-to-end in MTurk sandbox to verify correctness.

Onboarding:

[Screenshot: onboarding pane (Jan 25, 2022)]

Chat task:

[Screenshot: chat task (Jan 25, 2022)]

Analysis script run:

(per_turn_eval_env_clone) ➜  ParlAI git:(rebeccaqian/refactor_pw_turn) ✗ CONF=example_model_comparison &&
python parlai/crowdsourcing/tasks/pairwise_per_turn_eval/analysis/compile_results.py \
--task-name per_turn_eval__engaging \
--output-folder /checkpoint/rebeccaqian/test/${CONF}
18:53:11 | Retrieving task data from Mephisto.
18:53:13 | Data for 2 units loaded successfully.
18:53:13 | 0 conversations found with no save data.
18:53:13 | 0 conversations found with the wrong status.
18:53:13 | 2 complete conversations found:
18:53:13 | 	0 unacceptable conversations.
18:53:13 | 	2 acceptable conversations.
18:53:13 | ---blender_3B:blender_90M---
18:53:13 | Turn 1, blender_3B: 1 (50.00%), blender_90M: 1 (50.00%)
18:53:13 | Turn 2, blender_3B: 2 (100.00%), blender_90M: 0 (0.00%)
18:53:13 | Turn 3, blender_3B: 2 (100.00%), blender_90M: 0 (0.00%)
18:53:13 | Turn 4, blender_3B: 1 (50.00%), blender_90M: 1 (50.00%)
18:53:13 | Turn 5, blender_3B: 1 (50.00%), blender_90M: 1 (50.00%)
18:53:13 | Turn 6, blender_3B: 1 (50.00%), blender_90M: 1 (50.00%)
18:53:13 | human_utterance_count: 12
18:53:13 | human_word_count: 74 (6.17)
18:53:13 | human_question_count: 1 (0.0833)
18:53:13 | total: 12 (100.00%)
18:53:13 | blender_3B: 8 (66.67%)
18:53:13 | blender_90M: 4 (33.33%)
18:53:13 | acceptable_convos: 2
18:53:13 | Printing worker IDs not already in block list to add...
18:53:13 | Done printing bad workers.
18:53:13 |
Worker conversation counts: {'4950': 2}
18:53:13 | Saving worker statistical results to /checkpoint/rebeccaqian/test/example_model_comparison/worker_results.csv.
18:53:13 | Saving MTurk IDs of workers with unacceptable conversations to /checkpoint/rebeccaqian/test/example_model_comparison/unacceptable_worker_ids.txt.
18:53:13 | Saving win rates cut by date to /checkpoint/rebeccaqian/test/example_model_comparison/win_rates_by_date.csv.
18:53:13 | Saving mean word count of different stats, cut by date, to /checkpoint/rebeccaqian/test/example_model_comparison/stat_mean_length_by_date.csv.
18:53:13 | Saving mean completion time stats to /checkpoint/rebeccaqian/test/example_model_comparison/mean_completion_times.csv.
18:53:13 | Wrote results file to /checkpoint/rebeccaqian/test/example_model_comparison/results.csv.
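
The per-turn win rates in the log above are simple tallies of which model's response the human chose at each turn. The following standalone sketch illustrates that computation (it is not the actual compile_results.py logic, whose internals are not shown in this PR); the record format, a (turn index, chosen model) pair per human choice, is an assumption for illustration:

```python
from collections import Counter, defaultdict


def per_turn_win_rates(selections, models):
    """Tally, per turn index, how often each model's response was chosen.

    selections: iterable of (turn_index, chosen_model) pairs, one per
    human choice; models: the pair of model names being compared.
    Returns {turn: {model: (count, percentage)}}.
    """
    by_turn = defaultdict(Counter)
    for turn, winner in selections:
        by_turn[turn][winner] += 1
    rates = {}
    for turn, counts in sorted(by_turn.items()):
        total = sum(counts.values())
        rates[turn] = {
            m: (counts[m], 100.0 * counts[m] / total) for m in models
        }
    return rates


# Two conversations' choices at turns 1-2, mirroring the log above.
choices = [
    (1, "blender_3B"), (1, "blender_90M"),
    (2, "blender_3B"), (2, "blender_3B"),
]
for turn, stats in per_turn_win_rates(
    choices, ["blender_3B", "blender_90M"]
).items():
    line = ", ".join(f"{m}: {c} ({p:.2f}%)" for m, (c, p) in stats.items())
    print(f"Turn {turn}, {line}")
# Turn 1, blender_3B: 1 (50.00%), blender_90M: 1 (50.00%)
# Turn 2, blender_3B: 2 (100.00%), blender_90M: 0 (0.00%)
```

With two acceptable conversations there are two choices per turn, which is why every percentage in the log is a multiple of 50%.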


@EricMichaelSmith EricMichaelSmith left a comment

Yeah looks great, thanks for adding this @Rebecca-Qian ! So I know I have a ton of comments - most of these are little peculiarities of the code that you would have had no way of knowing. And, before merging, it'll be useful to test out that compile_results.py script on dummy data so we know that it works correctly =) And don't worry about the unittests_osx check failing - if you look at https://github.com/facebookresearch/ParlAI/commits/main you'll see that that test has been failing for a week now =/

@@ -0,0 +1,29 @@
# Per-turn Evaluation Crowdsourcing Task

Oh ha, almost forgot - would be good to link to the paper itself at the top of this README :P I'll be doing that too with my SM-Turn README


Include a BibTeX entry too, please.

@EricMichaelSmith EricMichaelSmith self-requested a review February 1, 2022 18:13
@Rebecca-Qian Rebecca-Qian merged commit 2d06290 into main Feb 4, 2022
@Rebecca-Qian Rebecca-Qian deleted the rebeccaqian/refactor_pw_turn branch February 4, 2022 03:23