This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Create Per-Turn Evaluation Folder in ParlAI #4323

Merged: 13 commits into main on Feb 4, 2022

Conversation


@Rebecca-Qian Rebecca-Qian commented Jan 26, 2022

Patch description
This PR open-sources the initial version of the Per-Turn Evaluation project from the parlai-internal repo, creating a new pairwise_per_turn_eval directory in parlai/crowdsourcing/tasks.

Refactors and further cleanup will come in a separate PR.

File structure

  • analysis/: Analysis code for compiling per-turn evaluation experiment results.
  • frontend/: All task UX components used to render the chat task, including onboarding and error panes.
  • hydra_configs/conf/: Hydra configs used to specify the parameters of individual experiment runs, initialized with an example example_model_comparison.yaml file. Some task-specific parameters have been removed or stubbed out.
  • task_config/: Task data and configs, e.g., onboarding data and JSON configs.
  • README.md: Shortened README giving an overview of the per-turn eval project.
  • bot_agent.py
  • impl.py
  • per_turn_eval_blueprint.py
  • run.py
  • utils.py
  • worlds.py
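
As a rough illustration of what a config in hydra_configs/conf/ might look like, here is a minimal sketch in the usual Mephisto/Hydra layout; the field names and values below are illustrative assumptions, not the actual contents of example_model_comparison.yaml (only the task name per_turn_eval__engaging appears in this PR, in the analysis command further down):

```yaml
# Illustrative sketch only -- not the real example_model_comparison.yaml.
defaults:
  - /mephisto/blueprint: pairwise_per_turn_eval
  - /mephisto/architect: local
  - /mephisto/provider: mock
mephisto:
  task:
    task_name: per_turn_eval__engaging
    task_title: "Chat and choose the better response"
```

A config like this would then be selected at launch time via Hydra, e.g. by passing conf=example_model_comparison to run.py.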

Testing steps
Ran HITs end-to-end in MTurk sandbox to verify correctness.

Onboarding:

[Screenshot: onboarding pane (Jan 25, 2022)]

Chat task:

[Screenshot: chat task (Jan 25, 2022)]

Analysis script run:

(per_turn_eval_env_clone) ➜  ParlAI git:(rebeccaqian/refactor_pw_turn) ✗ CONF=example_model_comparison &&
python parlai/crowdsourcing/tasks/pairwise_per_turn_eval/analysis/compile_results.py \
--task-name per_turn_eval__engaging \
--output-folder /checkpoint/rebeccaqian/test/${CONF}
18:53:11 | Retrieving task data from Mephisto.
18:53:13 | Data for 2 units loaded successfully.
18:53:13 | 0 conversations found with no save data.
18:53:13 | 0 conversations found with the wrong status.
18:53:13 | 2 complete conversations found:
18:53:13 | 	0 unacceptable conversations.
18:53:13 | 	2 acceptable conversations.
18:53:13 | ---blender_3B:blender_90M---
18:53:13 | Turn 1, blender_3B: 1 (50.00%), blender_90M: 1 (50.00%)
18:53:13 | Turn 2, blender_3B: 2 (100.00%), blender_90M: 0 (0.00%)
18:53:13 | Turn 3, blender_3B: 2 (100.00%), blender_90M: 0 (0.00%)
18:53:13 | Turn 4, blender_3B: 1 (50.00%), blender_90M: 1 (50.00%)
18:53:13 | Turn 5, blender_3B: 1 (50.00%), blender_90M: 1 (50.00%)
18:53:13 | Turn 6, blender_3B: 1 (50.00%), blender_90M: 1 (50.00%)
18:53:13 | human_utterance_count: 12
18:53:13 | human_word_count: 74 (6.17)
18:53:13 | human_question_count: 1 (0.0833)
18:53:13 | total: 12 (100.00%)
18:53:13 | blender_3B: 8 (66.67%)
18:53:13 | blender_90M: 4 (33.33%)
18:53:13 | acceptable_convos: 2
18:53:13 | Printing worker IDs not already in block list to add...
18:53:13 | Done printing bad workers.
18:53:13 |
Worker conversation counts: {'4950': 2}
18:53:13 | Saving worker statistical results to /checkpoint/rebeccaqian/test/example_model_comparison/worker_results.csv.
18:53:13 | Saving MTurk IDs of workers with unacceptable conversations to /checkpoint/rebeccaqian/test/example_model_comparison/unacceptable_worker_ids.txt.
18:53:13 | Saving win rates cut by date to /checkpoint/rebeccaqian/test/example_model_comparison/win_rates_by_date.csv.
18:53:13 | Saving mean word count of different stats, cut by date, to /checkpoint/rebeccaqian/test/example_model_comparison/stat_mean_length_by_date.csv.
18:53:13 | Saving mean completion time stats to /checkpoint/rebeccaqian/test/example_model_comparison/mean_completion_times.csv.
18:53:13 | Wrote results file to /checkpoint/rebeccaqian/test/example_model_comparison/results.csv.
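
The per-turn win rates in the log above are simple tallies of which model's response the human chose at each turn. The following standalone sketch illustrates that computation (it is not the actual compile_results.py logic, whose internals are not shown in this PR); the record format, a (turn index, chosen model) pair per human choice, is an assumption for illustration:

```python
from collections import Counter, defaultdict


def per_turn_win_rates(selections, models):
    """Tally, per turn index, how often each model's response was chosen.

    selections: iterable of (turn_index, chosen_model) pairs, one per
    human choice; models: the pair of model names being compared.
    Returns {turn: {model: (count, percentage)}}.
    """
    by_turn = defaultdict(Counter)
    for turn, winner in selections:
        by_turn[turn][winner] += 1
    rates = {}
    for turn, counts in sorted(by_turn.items()):
        total = sum(counts.values())
        rates[turn] = {
            m: (counts[m], 100.0 * counts[m] / total) for m in models
        }
    return rates


# Two conversations' choices at turns 1-2, mirroring the log above.
choices = [
    (1, "blender_3B"), (1, "blender_90M"),
    (2, "blender_3B"), (2, "blender_3B"),
]
for turn, stats in per_turn_win_rates(
    choices, ["blender_3B", "blender_90M"]
).items():
    line = ", ".join(f"{m}: {c} ({p:.2f}%)" for m, (c, p) in stats.items())
    print(f"Turn {turn}, {line}")
# Turn 1, blender_3B: 1 (50.00%), blender_90M: 1 (50.00%)
# Turn 2, blender_3B: 2 (100.00%), blender_90M: 0 (0.00%)
```

With two acceptable conversations there are two choices per turn, which is why every percentage in the log is a multiple of 50%.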


@EricMichaelSmith EricMichaelSmith left a comment

Yeah looks great, thanks for adding this @Rebecca-Qian ! So I know I have a ton of comments - most of these are little peculiarities of the code that you would have had no way of knowing. And, before merging, it'll be useful to test out that compile_results.py script on dummy data so we know that it works correctly =) And don't worry about the unittests_osx check failing - if you look at https://github.com/facebookresearch/ParlAI/commits/main you'll see that that test has been failing for a week now =/

@@ -0,0 +1,29 @@
# Per-turn Evaluation Crowdsourcing Task

Oh ha, almost forgot - would be good to link to the paper itself at the top of this README :P I'll be doing that too with my SM-Turn README


Include a BibTeX entry too, please.

@EricMichaelSmith EricMichaelSmith self-requested a review February 1, 2022 18:13
@Rebecca-Qian Rebecca-Qian merged commit 2d06290 into main Feb 4, 2022
@Rebecca-Qian Rebecca-Qian deleted the rebeccaqian/refactor_pw_turn branch February 4, 2022 03:23