ROSCOE suite of metrics #4839

Golovneva · 2022-10-25T15:14:31Z

Patch description

This is a set of scripts and data supporting the paper "ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning".
The new directory "projects/roscoe" contains the code and reproduction steps for the paper.
New tasks support data loading, as well as the synthetic perturbations described in the paper.
Also fixes a protobuf dependency issue that was failing CircleCI builds.
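
For readers unfamiliar with the perturbation names that appear in the generated file names further down (for example DuplicateOneStep and SwapOneStep), here is a minimal sketch of what such step-level perturbations could look like. The function names and signatures are hypothetical and are not the teacher implementations added in this PR; they only illustrate the idea on a plain list of reasoning steps.

```python
import random
from typing import List


def duplicate_one_step(steps: List[str], rng: random.Random) -> List[str]:
    """Repeat one randomly chosen step right after itself (hypothetical helper)."""
    if not steps:
        return []
    i = rng.randrange(len(steps))
    return steps[: i + 1] + [steps[i]] + steps[i + 1 :]


def swap_one_step(steps: List[str], rng: random.Random) -> List[str]:
    """Exchange the positions of two randomly chosen steps (hypothetical helper)."""
    if len(steps) < 2:
        return list(steps)
    i, j = rng.sample(range(len(steps)), 2)
    swapped = list(steps)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


# Illustrative reasoning chain; not taken from any of the datasets in this PR.
chain = ["2 + 3 = 5", "5 * 4 = 20", "20 - 1 = 19"]
rng = random.Random(0)  # fixed seed so the perturbation is reproducible
duplicated = duplicate_one_step(chain, rng)
swapped = swap_one_step(chain, rng)
```

Both helpers return a new list and leave the input untouched, which keeps the original chain available as the unperturbed reference.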

Testing steps

Unit testing:

olggol@learnfair0614:~/ParlAI$ pytest tests/nightly/cpu/test_roscoe.py
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.8.13, pytest-5.3.2, py-1.11.0, pluggy-0.13.1
rootdir: /private/home/olggol/ParlAI, inifile: pytest.ini
plugins: anyio-3.6.1, requests-mock-1.7.0, datadir-1.3.1, hydra-core-1.2.0, regressions-2.1.1
collected 6 items

tests/nightly/cpu/test_roscoe.py ......                                                                                                                                                                                             [100%]

======================================================================================================== slowest 10 test durations ========================================================================================================
0.10s setup    tests/nightly/cpu/test_roscoe.py::TestEvaluator::test_compute_ppl_scores

(0.00 durations hidden.  Use -vv to show these durations.)
====================================================================================================== 6 passed, 3 warnings in 2.13s ======================================================================================================

olggol@learnfair0614:~/ParlAI$ pytest tests/tasks/reasoning_teacher/test_abstract_reasoning_teacher.py
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.8.13, pytest-5.3.2, py-1.11.0, pluggy-0.13.1
rootdir: /private/home/olggol/ParlAI, inifile: pytest.ini
plugins: anyio-3.6.1, requests-mock-1.7.0, datadir-1.3.1, hydra-core-1.2.0, regressions-2.1.1
collected 1 item

tests/tasks/reasoning_teacher/test_abstract_reasoning_teacher.py .                                                                                                                                                                  [100%]

======================================================================================================== slowest 10 test durations ========================================================================================================
0.04s call     tests/tasks/reasoning_teacher/test_abstract_reasoning_teacher.py::TestAbstractReasoningTeacher::test_cases

(0.00 durations hidden.  Use -vv to show these durations.)
====================================================================================================== 1 passed, 2 warnings in 3.47s ======================================================================================================

olggol@learnfair0614:~/ParlAI$ pytest tests/tasks/reasoning_teacher/test_math_dataset_step_by_step_teacher.py
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.8.13, pytest-5.3.2, py-1.11.0, pluggy-0.13.1
rootdir: /private/home/olggol/ParlAI, inifile: pytest.ini
plugins: anyio-3.6.1, requests-mock-1.7.0, datadir-1.3.1, hydra-core-1.2.0, regressions-2.1.1
collected 1 item

tests/tasks/reasoning_teacher/test_math_dataset_step_by_step_teacher.py .                                                                                                                                                           [100%]

======================================================================================================== slowest 10 test durations ========================================================================================================
10.40s call     tests/tasks/reasoning_teacher/test_math_dataset_step_by_step_teacher.py::TestMathDatasetStepByStepReasoningTeacher::test_get_boxed_answer

(0.00 durations hidden.  Use -vv to show these durations.)
===================================================================================================== 1 passed, 4 warnings in 18.86s ======================================================================================================
olggol@learnfair0614:~/ParlAI$ pytest tests/tasks/reasoning_teacher/test_question_answer.py
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.8.13, pytest-5.3.2, py-1.11.0, pluggy-0.13.1
rootdir: /private/home/olggol/ParlAI, inifile: pytest.ini
plugins: anyio-3.6.1, requests-mock-1.7.0, datadir-1.3.1, hydra-core-1.2.0, regressions-2.1.1
collected 1 item

tests/tasks/reasoning_teacher/test_question_answer.py .                                                                                                                                                                             [100%]

======================================================================================================== slowest 10 test durations ========================================================================================================

(0.00 durations hidden.  Use -vv to show these durations.)
====================================================================================================== 1 passed, 2 warnings in 1.45s ======================================================================================================
olggol@learnfair0614:~/ParlAI$ pytest tests/tasks/reasoning_teacher/test_step_by_step.py
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.8.13, pytest-5.3.2, py-1.11.0, pluggy-0.13.1
rootdir: /private/home/olggol/ParlAI, inifile: pytest.ini
plugins: anyio-3.6.1, requests-mock-1.7.0, datadir-1.3.1, hydra-core-1.2.0, regressions-2.1.1
collected 1 item

tests/tasks/reasoning_teacher/test_step_by_step.py .                                                                                                                                                                                [100%]

======================================================================================================== slowest 10 test durations ========================================================================================================
0.89s call     tests/tasks/reasoning_teacher/test_step_by_step.py::TestStepPertubations::test_cases

(0.00 durations hidden.  Use -vv to show these durations.)
====================================================================================================== 1 passed, 4 warnings in 6.16s ======================================================================================================

Making sure all commands are runnable:

olggol@learnfair0541:~/ParlAI$ bash projects/roscoe/roscoe_data/generate_perturbed_data.sh
…
Writing 4546 samples to ./projects/roscoe/roscoe_data//synthetic_sentinel_50%/math_dataset_synthetic/50%_DuplicateOneStep_SwapOneStep_ShuffleNumbers_test.jsonl
Writing 2825 samples to ./projects/roscoe/roscoe_data//synthetic_sentinel_50%/math_dataset_synthetic/50%_DuplicateOneStep_SwapOneStep_ShuffleOperations_test.jsonl
olggol@learnfair0493:~/ParlAI$ ls projects/roscoe/roscoe_data//synthetic_50%
aqua_synthetic  asdiv_synthetic  entailment_bank_synthetic  eqasc_synthetic  math_dataset_synthetic  proofwriter_synthetic

olggol@learnfair0493:~/ParlAI$ python projects/roscoe/roscoe.py
10/25/2022 06:51:15 - INFO - sentence_transformers.SentenceTransformer -   Load pretrained SentenceTransformer: all-mpnet-base-v2
10/25/2022 06:51:18 - INFO - sentence_transformers.SentenceTransformer -   Use pytorch device: cuda
…
Scores written to ./projects/roscoe/scores/all-mpnet-base-v2/scores_cosmosqa_valid_gpt3_expl.tsv
Max GPU Memory Allocated: 7494 MB
olggol@learnfair0493:~/ParlAI$ ls projects/roscoe/scores/all-mpnet-base-v2/
scores_cosmos_valid_gpt3_expl.tsv  scores_cosmosqa_valid_gpt3_expl.tsv  scores_drop_valid_gpt3_expl.tsv  scores_esnli_valid_gpt3_expl.tsv  scores_gsm8k_valid_gpt3_expl.tsv  scores_semevalcommonsense_gpt3_expl.tsv

olggol@learnfair0493:~/ParlAI$ bash projects/roscoe/synthetic_evaluation/score_all.sh sim_sce ./projects/roscoe/model/roscoe-512-roberta-base
…
Evaluating ./projects/roscoe/roscoe_data/synthetic_50%/aqua_synthetic/50%_DuplicateOneStep_test.jsonl
…


olggol@learnfair0493:~/ParlAI$ python projects/roscoe/meta_evaluation/roscoe_correlations.py
Dataset: drop
ended up with 210 rows.
Scores written to: ./projects/roscoe/correlations/drop_all_scores_roscoe-512-roberta-base.txt
…
Correlations written to: ./projects/roscoe/correlations/drop_all_scores.CORRELS.txt
Granular summary of drop written to: ./projects/roscoe/correlations/drop_summary_granular.csv
Granular summary of drop written to: ./projects/roscoe/correlations/drop_summary_granular.tex
…

olggol@learnfair0614:~/ParlAI$ bash projects/roscoe/meta_evaluation/run_synthetic_correlations.sh
Reading scores: roscoe-512-roberta-base 50%_DuplicateOneStep_test
Reading scores: roscoe-512-roberta-base 50%_ExtrinsicHallucinatedStep_test
Reading scores: inference 50%_DuplicateOneStep_test
Reading scores: inference 50%_ExtrinsicHallucinatedStep_test
Reading scores: language 50%_DuplicateOneStep_test
Getting summary
Final results path is  ./projects/roscoe/correlations/final/aqua.csv
summary written to ./projects/roscoe/correlations/final/aqua_summary.csv
…

moyapchen

Reviewed the baseline code and some of the synthetic code. There are a couple of places with hard-coded local home directories. :)

moyapchen · 2022-10-27T16:40:58Z

projects/roscoe/baselines/scores.py

+        )
+        # Path here to fine-tuned BART Model
+        self.scorer.load(
+            "/private/home/mpchen/BARTScore/train/reproduce/trained/bart_6000.pth"


Ahhh we might need to upload this to the AWS bucket as well and provide a URL to it here (or otherwise download it)

moyapchen · 2022-10-27T16:41:20Z

projects/roscoe/baselines/scores.py

+# sacrebleu>=1.4.8#
+# torch>=1.4.0
+prism = SourceFileLoader(
+    "prism", "/private/home/aslic/Evaluation/BARTScore/SUM/prism.py"


Ditto here: this path also needs to be a constant in the file...
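
One way to act on this suggestion, sketched under stated assumptions: keep the repository location in a single module-level constant and derive the `prism.py` path from it. `BART_SCORE_REPO` is the constant name that appears in a later revision of this file; reading it from an environment variable of the same name, the placeholder default, and the `load_module_from_path` helper are all hypothetical and not part of the PR.

```python
import importlib.util
import os
from types import ModuleType

# Hypothetical: read the checkout location from an environment variable.
# The default below is a placeholder, not a real location.
BART_SCORE_REPO = os.environ.get("BART_SCORE_REPO", "/path/to/BARTScore")
PRISM_PATH = os.path.join(BART_SCORE_REPO, "SUM", "prism.py")


def load_module_from_path(name: str, path: str) -> ModuleType:
    """Import a Python source file that lives outside the package tree."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Expected {name} at {path}; point BART_SCORE_REPO at a local checkout."
        )
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return module
```

With this in place, a call such as `load_module_from_path("prism", PRISM_PATH)` could stand in for the hard-coded `SourceFileLoader` path, and a missing checkout fails with an explicit message instead of a path pointing into someone's home directory.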

moyapchen · 2022-10-27T16:41:30Z

projects/roscoe/baselines/scores.py

+class PrismBaselineScorer(BaselineScorer):
+    def __init__(self):
+        self.scorer = prism.Prism(
+            model_dir='/private/home/aslic/Evaluation/BARTScore/SUM/models/m39v1/',


moyapchen · 2022-10-27T16:42:05Z

projects/roscoe/baselines/scores.py

+class BleurtBaselineScorer(BaselineScorer):
+    def __init__(self):
+        self.scorer = bleurt_score.BleurtScorer(
+            "/private/home/aslic/Evaluation/BARTScore_old/bleurt/bleurt/test_checkpoint"


lol, another part where we might just need to have the model checkpoint path installed

moyapchen · 2022-10-27T16:45:16Z

Thanks for putting together this behemoth of a diff. :)

moyapchen

Accepting to unblock (also see the comment about where the fine-tuned model is uploaded)

moyapchen · 2022-10-28T13:42:58Z

projects/roscoe/baselines/scores.py

+            device=DEFAULT_DEVICE, checkpoint='facebook/bart-large-cnn'
+        )
+        # Path here to fine-tuned BART Model
+        self.scorer.load(BART_SCORE_REPO + "/train/reproduce/trained/bart_6000.pth")


Model is here: https://dl.fbaipublicfiles.com/parlai/projects/roscoe/fine_tuned_bartscore.pth
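
A minimal sketch of how the scorer could pick up this checkpoint without a local path, assuming a download-once-and-cache approach. The URL is the one given above; the `ensure_checkpoint` helper, its parameters, and the cache directory are hypothetical and are not what the PR implements.

```python
import os
import urllib.request
from typing import Callable

FINE_TUNED_BARTSCORE_URL = (
    "https://dl.fbaipublicfiles.com/parlai/projects/roscoe/fine_tuned_bartscore.pth"
)


def ensure_checkpoint(
    cache_dir: str,
    url: str = FINE_TUNED_BARTSCORE_URL,
    fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> str:
    """Return a local path to the checkpoint, downloading it only if absent.

    `fetch` is injectable so the function can be exercised without network access.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        fetch(url, local_path)
    return local_path
```

The returned path could then be handed to `self.scorer.load(...)` in place of the concatenated repository path. A production version would likely also verify the file size or a checksum, which this sketch omits.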

ROSCOE suite of metrics

eeeba1e

Golovneva requested review from spencerp and moyapchen October 25, 2022 15:14

facebook-github-bot added the CLA Signed label Oct 25, 2022

Golovneva added 26 commits October 25, 2022 08:24

updating tests

5357ce6

lint

65ccb86

fixing protobuf version to stop cleaninstall failures

16a5c9c

updating requirements

f144fe6

convert to absolute path

529eda3

moving tests because of the dependency issues

fb81f87

adding new dependencies in tests

a6bcd83

add test dependencies

9667593

fixing deps

b4820fc

updating task list

89a6d9f

checklist deps can't be installed on circleci

a594e21

actually fix protobuf version

7343f21

protobuf range

9b7e70d

protobuf conflict with google-api-core

ad1ffbe

return tests

c4e344e

convert imports to absolute path

f90359b

trying checklist again

f7f7659

trying to avoid checklist failures

50967f0

checklist to teacher tests

974efad

add user option to avoid installation failure

51dce30

jupiter as well

1eef47d

typo

a8a9fa9

moving into virtual env setup

77bbd9d

user param not allowed in virtual env

39f69bf

move spacy to circleCI because it's big

99e9085

replace local model with HF

6ff94a1

spencerp reviewed Oct 27, 2022

.circleci/config.yml (resolved)

parlai/tasks/reasoning/reason_types/step_by_step.py (outdated, resolved)

projects/roscoe/score.py (outdated, resolved)

moyapchen reviewed Oct 27, 2022

Golovneva and others added 2 commits October 27, 2022 09:54

fixes based on comments

f6ce201

remove unused nli scores, fix tests

c43706f

moyapchen approved these changes Oct 28, 2022

Added path to BART model

0bdcb42

Golovneva merged commit 0f129e9 into main Oct 28, 2022

Golovneva deleted the olggol/roscoe branch October 28, 2022 18:01
