This repository was archived by the owner on Nov 3, 2023. It is now read-only.

ROSCOE suite of metrics #4839

Merged
merged 30 commits into from
Oct 28, 2022

Conversation

@Golovneva Golovneva commented Oct 25, 2022

Patch description

  1. This is a set of scripts and data supporting the paper "ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning".
    The new directory "projects/roscoe" contains the code and reproduction steps for the paper.
    New tasks support data loading, as well as the synthetic perturbations described in the paper.

  2. Fixed a protobuf dependency issue that was causing CircleCI builds to fail.
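
To make "synthetic perturbations" in item 1 more concrete, here is a minimal sketch of two step-level perturbations. The names `DuplicateOneStep` and `SwapOneStep` appear in the output file names in the testing log below, but the behavior shown here is only inferred from those names. The functions `duplicate_one_step` and `swap_one_step` are hypothetical and are not the code added by this PR.

```python
import random
from typing import List, Optional


def duplicate_one_step(steps: List[str], rng: Optional[random.Random] = None) -> List[str]:
    """Return a copy of `steps` with one randomly chosen step repeated right after itself.

    Assumption: this mirrors what a "DuplicateOneStep" perturbation might do.
    """
    rng = rng or random.Random()
    if not steps:
        return []
    i = rng.randrange(len(steps))
    return steps[: i + 1] + [steps[i]] + steps[i + 1 :]


def swap_one_step(steps: List[str], rng: Optional[random.Random] = None) -> List[str]:
    """Return a copy of `steps` with two distinct positions exchanged.

    Assumption: this mirrors what a "SwapOneStep" perturbation might do.
    """
    rng = rng or random.Random()
    if len(steps) < 2:
        return list(steps)
    i, j = rng.sample(range(len(steps)), 2)
    out = list(steps)
    out[i], out[j] = out[j], out[i]
    return out


if __name__ == "__main__":
    chain = ["x = 2 + 3", "x = 5", "answer is 5"]
    rng = random.Random(0)  # seeded so repeated runs perturb the same way
    print(duplicate_one_step(chain, rng))
    print(swap_one_step(chain, rng))
```

Passing a seeded `random.Random` keeps the perturbed data reproducible, which matters when the perturbed files are later scored and compared.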

Testing steps

  1. Unit testing:
olggol@learnfair0614:~/ParlAI$ pytest tests/nightly/cpu/test_roscoe.py
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.8.13, pytest-5.3.2, py-1.11.0, pluggy-0.13.1
rootdir: /private/home/olggol/ParlAI, inifile: pytest.ini
plugins: anyio-3.6.1, requests-mock-1.7.0, datadir-1.3.1, hydra-core-1.2.0, regressions-2.1.1
collected 6 items

tests/nightly/cpu/test_roscoe.py ......                                                                                                                                                                                             [100%]

======================================================================================================== slowest 10 test durations ========================================================================================================
0.10s setup    tests/nightly/cpu/test_roscoe.py::TestEvaluator::test_compute_ppl_scores

(0.00 durations hidden.  Use -vv to show these durations.)
====================================================================================================== 6 passed, 3 warnings in 2.13s ======================================================================================================

olggol@learnfair0614:~/ParlAI$ pytest tests/tasks/reasoning_teacher/test_abstract_reasoning_teacher.py
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.8.13, pytest-5.3.2, py-1.11.0, pluggy-0.13.1
rootdir: /private/home/olggol/ParlAI, inifile: pytest.ini
plugins: anyio-3.6.1, requests-mock-1.7.0, datadir-1.3.1, hydra-core-1.2.0, regressions-2.1.1
collected 1 item

tests/tasks/reasoning_teacher/test_abstract_reasoning_teacher.py .                                                                                                                                                                  [100%]

======================================================================================================== slowest 10 test durations ========================================================================================================
0.04s call     tests/tasks/reasoning_teacher/test_abstract_reasoning_teacher.py::TestAbstractReasoningTeacher::test_cases

(0.00 durations hidden.  Use -vv to show these durations.)
====================================================================================================== 1 passed, 2 warnings in 3.47s ======================================================================================================

olggol@learnfair0614:~/ParlAI$ pytest tests/tasks/reasoning_teacher/test_math_dataset_step_by_step_teacher.py
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.8.13, pytest-5.3.2, py-1.11.0, pluggy-0.13.1
rootdir: /private/home/olggol/ParlAI, inifile: pytest.ini
plugins: anyio-3.6.1, requests-mock-1.7.0, datadir-1.3.1, hydra-core-1.2.0, regressions-2.1.1
collected 1 item

tests/tasks/reasoning_teacher/test_math_dataset_step_by_step_teacher.py .                                                                                                                                                           [100%]

======================================================================================================== slowest 10 test durations ========================================================================================================
10.40s call     tests/tasks/reasoning_teacher/test_math_dataset_step_by_step_teacher.py::TestMathDatasetStepByStepReasoningTeacher::test_get_boxed_answer

(0.00 durations hidden.  Use -vv to show these durations.)
===================================================================================================== 1 passed, 4 warnings in 18.86s ======================================================================================================
olggol@learnfair0614:~/ParlAI$ pytest tests/tasks/reasoning_teacher/test_question_answer.py
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.8.13, pytest-5.3.2, py-1.11.0, pluggy-0.13.1
rootdir: /private/home/olggol/ParlAI, inifile: pytest.ini
plugins: anyio-3.6.1, requests-mock-1.7.0, datadir-1.3.1, hydra-core-1.2.0, regressions-2.1.1
collected 1 item

tests/tasks/reasoning_teacher/test_question_answer.py .                                                                                                                                                                             [100%]

======================================================================================================== slowest 10 test durations ========================================================================================================

(0.00 durations hidden.  Use -vv to show these durations.)
====================================================================================================== 1 passed, 2 warnings in 1.45s ======================================================================================================
olggol@learnfair0614:~/ParlAI$ pytest tests/tasks/reasoning_teacher/test_step_by_step.py
=========================================================================================================== test session starts ===========================================================================================================
platform linux -- Python 3.8.13, pytest-5.3.2, py-1.11.0, pluggy-0.13.1
rootdir: /private/home/olggol/ParlAI, inifile: pytest.ini
plugins: anyio-3.6.1, requests-mock-1.7.0, datadir-1.3.1, hydra-core-1.2.0, regressions-2.1.1
collected 1 item

tests/tasks/reasoning_teacher/test_step_by_step.py .                                                                                                                                                                                [100%]

======================================================================================================== slowest 10 test durations ========================================================================================================
0.89s call     tests/tasks/reasoning_teacher/test_step_by_step.py::TestStepPertubations::test_cases

(0.00 durations hidden.  Use -vv to show these durations.)
====================================================================================================== 1 passed, 4 warnings in 6.16s ======================================================================================================
  2. Making sure all commands are runnable:
olggol@learnfair0541:~/ParlAI$ bash projects/roscoe/roscoe_data/generate_perturbed_data.sh
…
Writing 4546 samples to ./projects/roscoe/roscoe_data//synthetic_sentinel_50%/math_dataset_synthetic/50%_DuplicateOneStep_SwapOneStep_ShuffleNumbers_test.jsonl
Writing 2825 samples to ./projects/roscoe/roscoe_data//synthetic_sentinel_50%/math_dataset_synthetic/50%_DuplicateOneStep_SwapOneStep_ShuffleOperations_test.jsonl
olggol@learnfair0493:~/ParlAI$ ls projects/roscoe/roscoe_data//synthetic_50%
aqua_synthetic  asdiv_synthetic  entailment_bank_synthetic  eqasc_synthetic  math_dataset_synthetic  proofwriter_synthetic

olggol@learnfair0493:~/ParlAI$ python projects/roscoe/roscoe.py
10/25/2022 06:51:15 - INFO - sentence_transformers.SentenceTransformer -   Load pretrained SentenceTransformer: all-mpnet-base-v2
10/25/2022 06:51:18 - INFO - sentence_transformers.SentenceTransformer -   Use pytorch device: cuda
…
Scores written to ./projects/roscoe/scores/all-mpnet-base-v2/scores_cosmosqa_valid_gpt3_expl.tsv
Max GPU Memory Allocated: 7494 MB
olggol@learnfair0493:~/ParlAI$ ls projects/roscoe/scores/all-mpnet-base-v2/
scores_cosmos_valid_gpt3_expl.tsv  scores_cosmosqa_valid_gpt3_expl.tsv  scores_drop_valid_gpt3_expl.tsv  scores_esnli_valid_gpt3_expl.tsv  scores_gsm8k_valid_gpt3_expl.tsv  scores_semevalcommonsense_gpt3_expl.tsv

olggol@learnfair0493:~/ParlAI$ bash projects/roscoe/synthetic_evaluation/score_all.sh sim_sce ./projects/roscoe/model/roscoe-512-roberta-base
…
Evaluating ./projects/roscoe/roscoe_data/synthetic_50%/aqua_synthetic/50%_DuplicateOneStep_test.jsonl
…


olggol@learnfair0493:~/ParlAI$ python projects/roscoe/meta_evaluation/roscoe_correlations.py
Dataset: drop
ended up with 210 rows.
Scores written to: ./projects/roscoe/correlations/drop_all_scores_roscoe-512-roberta-base.txt
…
Correlations written to: ./projects/roscoe/correlations/drop_all_scores.CORRELS.txt
Granular summary of drop written to: ./projects/roscoe/correlations/drop_summary_granular.csv
Granular summary of drop written to: ./projects/roscoe/correlations/drop_summary_granular.tex
…

olggol@learnfair0614:~/ParlAI$ bash projects/roscoe/meta_evaluation/run_synthetic_correlations.sh
Reading scores: roscoe-512-roberta-base 50%_DuplicateOneStep_test
Reading scores: roscoe-512-roberta-base 50%_ExtrinsicHallucinatedStep_test
Reading scores: inference 50%_DuplicateOneStep_test
Reading scores: inference 50%_ExtrinsicHallucinatedStep_test
Reading scores: language 50%_DuplicateOneStep_test
Getting summary
Final results path is  ./projects/roscoe/correlations/final/aqua.csv
summary written to ./projects/roscoe/correlations/final/aqua_summary.csv
…

@moyapchen moyapchen left a comment

Reviewed the baseline code and some of the synthetic code. There are a couple of places with hardcoded local home-directory paths. :)

)
# Path here to fine-tuned BART model
self.scorer.load(
    "/private/home/mpchen/BARTScore/train/reproduce/trained/bart_6000.pth"

Ahhh we might need to upload this to the AWS bucket as well and provide a URL to it here (or otherwise download it)
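
One way to act on this suggestion is to fetch the checkpoint into a local cache on first use. The sketch below uses only the standard library. The function name `ensure_checkpoint`, the cache layout, and the example URL are assumptions made for illustration, not the location where the model was actually uploaded.

```python
import os
import shutil
import urllib.request
from typing import Callable


def _download(url: str, dest: str) -> None:
    """Stream `url` to the local file `dest`."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


def ensure_checkpoint(
    url: str,
    cache_dir: str,
    fetch: Callable[[str, str], None] = _download,
) -> str:
    """Return a local path for the checkpoint at `url`, downloading it only if missing."""
    os.makedirs(cache_dir, exist_ok=True)
    dest = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(dest):
        # Write to a temporary name first so an interrupted download
        # is never mistaken for a complete checkpoint.
        tmp = dest + ".part"
        fetch(url, tmp)
        os.replace(tmp, dest)
    return dest


# Hypothetical usage; the URL below is a placeholder, not a real bucket:
# path = ensure_checkpoint("https://example.invalid/roscoe/bart_6000.pth", "data/models")
# self.scorer.load(path)
```

The `fetch` parameter exists so the download step can be replaced in tests or swapped for a project's own download helper.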

# sacrebleu>=1.4.8#
# torch>=1.4.0
prism = SourceFileLoader(
    "prism", "/private/home/aslic/Evaluation/BARTScore/SUM/prism.py"

Ditto here - this one also needs to be a constant in the file...
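
A minimal sketch of what such constants could look like, assuming the BARTScore checkout location is read from an environment variable with a fallback. `BART_SCORE_REPO` is the constant name a later revision in this thread uses. The environment variable of the same name, the `~/BARTScore` default, and the helper `resolve_repo_root` are assumptions for illustration only.

```python
import os
from typing import Mapping, Optional


def resolve_repo_root(
    env_var: str = "BART_SCORE_REPO",
    default: str = "~/BARTScore",
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the absolute path of the BARTScore checkout.

    The environment variable wins if set; otherwise the default is used.
    """
    environ = os.environ if environ is None else environ
    return os.path.abspath(os.path.expanduser(environ.get(env_var, default)))


# Module-level constants: every scorer builds its paths from one root,
# so no user-specific home directory is baked into the code.
BART_SCORE_REPO = resolve_repo_root()
PRISM_SCRIPT = os.path.join(BART_SCORE_REPO, "SUM", "prism.py")
PRISM_MODEL_DIR = os.path.join(BART_SCORE_REPO, "SUM", "models", "m39v1")
BART_CHECKPOINT = os.path.join(
    BART_SCORE_REPO, "train", "reproduce", "trained", "bart_6000.pth"
)
```

The relative sub-paths are taken from the hardcoded paths quoted in this review; only the root directory is made configurable.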

class PrismBaselineScorer(BaselineScorer):
    def __init__(self):
        self.scorer = prism.Prism(
            model_dir='/private/home/aslic/Evaluation/BARTScore/SUM/models/m39v1/',

aslic

class BleurtBaselineScorer(BaselineScorer):
    def __init__(self):
        self.scorer = bleurt_score.BleurtScorer(
            "/private/home/aslic/Evaluation/BARTScore_old/bleurt/bleurt/test_checkpoint"

lol, another part where we might just need to have the model checkpoint path installed

@moyapchen

Thanks for putting together this (behemoth!) of a diff. :)

@moyapchen moyapchen left a comment

Accepting to unblock (also see the comment about where the fine-tuned model is uploaded)

    device=DEFAULT_DEVICE, checkpoint='facebook/bart-large-cnn'
)
# Path here to fine-tuned BART model
self.scorer.load(BART_SCORE_REPO + "/train/reproduce/trained/bart_6000.pth")

@Golovneva Golovneva merged commit 0f129e9 into main Oct 28, 2022
@Golovneva Golovneva deleted the olggol/roscoe branch October 28, 2022 18:01
4 participants