Gauntlet v0.1 #674

Merged · 104 commits · Nov 20, 2023
Conversation

@bmosaicml (Contributor) commented Oct 13, 2023

This PR introduces v0.1 of the Pre-training Gauntlet, adding chain-of-thought QA tasks, 16 new benchmarks, and a new Safety category.

Tested 7B models on 8x A100 80GB (run gauntlet-v0-1-cfXQE4), without the programming category. Run time was 10,000 seconds.
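The category and core averages reported below come from aggregating per-benchmark accuracies. A minimal sketch of one such aggregation scheme, assuming each raw accuracy is first rescaled by the benchmark's random-chance baseline (an assumption about the scoring, not confirmed by this PR; all numbers in the example are made up, not taken from the tables below):

```python
# Hedged sketch of gauntlet-style score aggregation.
# Assumption: raw accuracy is rescaled against a random-guess baseline
# before averaging; this PR does not spell out the exact scheme.

def rescale(accuracy: float, random_baseline: float) -> float:
    """Map raw accuracy onto [0, 1] relative to the random-guess baseline."""
    return max(0.0, (accuracy - random_baseline) / (1.0 - random_baseline))

def category_average(scores):
    """scores: list of (accuracy, random_baseline) pairs for one category."""
    return sum(rescale(a, b) for a, b in scores) / len(scores)

def core_average(categories):
    """categories: dict mapping category name -> list of (acc, baseline)."""
    per_cat = {name: category_average(s) for name, s in categories.items()}
    return sum(per_cat.values()) / len(per_cat), per_cat

# Toy example with invented numbers (not the figures reported below):
cats = {
    "world_knowledge": [(0.75, 0.25), (0.47, 0.25)],  # e.g. 4-way MC tasks
    "safety": [(0.58, 0.50)],                         # e.g. a binary task
}
core, per_cat = core_average(cats)
```

Each category is an unweighted mean of its (rescaled) benchmark scores, and the core average is an unweighted mean over categories, so small categories carry as much weight as large ones.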

| model_name               |   core_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |   safety |
|:-------------------------|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|---------:|
| mosaicml/mpt-7b-instruct |        0.29908 |          0.409036 |                0.311075 |                 0.371509 |                   0.139737 |                0.264042 | 0.158681 |

| Category                 | Benchmark                        | Subtask                             |   Accuracy | Number few shot   | Model                    |
|:-------------------------|:---------------------------------|:------------------------------------|-----------:|:------------------|:-------------------------|
| world_knowledge          | jeopardy                         | Average                             | 0.458112   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | american_history                    | 0.51816    | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | literature                          | 0.540816   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | science                             | 0.34874    | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | word_origins                        | 0.287671   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | world_history                       | 0.595174   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          | bigbench_qa_wikidata             |                                     | 0.694503   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          | arc_easy                         |                                     | 0.748737   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          | arc_challenge                    |                                     | 0.47099    | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          | mmlu                             | Average                             | 0.312989   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | abstract_algebra                    | 0.31       | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | anatomy                             | 0.311111   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | astronomy                           | 0.315789   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | business_ethics                     | 0.26       | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | clinical_knowledge                  | 0.316981   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_biology                     | 0.256944   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_chemistry                   | 0.33       | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_computer_science            | 0.29       | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_mathematics                 | 0.29       | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_medicine                    | 0.271676   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | college_physics                     | 0.264706   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | computer_security                   | 0.37       | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | conceptual_physics                  | 0.33617    | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | econometrics                        | 0.192982   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | electrical_engineering              | 0.324138   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | elementary_mathematics              | 0.259259   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | formal_logic                        | 0.301587   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | global_facts                        | 0.35       | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_biology                 | 0.33871    | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_chemistry               | 0.270936   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_computer_science        | 0.29       | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_european_history        | 0.30303    | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_geography               | 0.388889   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_government_and_politics | 0.362694   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_macroeconomics          | 0.325641   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_mathematics             | 0.288889   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_microeconomics          | 0.331933   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_physics                 | 0.311258   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_psychology              | 0.308257   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_statistics              | 0.388889   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_us_history              | 0.27451    | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | high_school_world_history           | 0.261603   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | human_aging                         | 0.372197   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | human_sexuality                     | 0.374046   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | international_law                   | 0.31405    | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | jurisprudence                       | 0.342593   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | logical_fallacies                   | 0.226994   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | machine_learning                    | 0.241071   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | management                          | 0.339806   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | marketing                           | 0.320513   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | medical_genetics                    | 0.34       | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | miscellaneous                       | 0.386973   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | moral_disputes                      | 0.323699   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | moral_scenarios                     | 0.251397   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | nutrition                           | 0.366013   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | philosophy                          | 0.37299    | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | prehistory                          | 0.33642    | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | professional_accounting             | 0.27305    | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | professional_law                    | 0.273794   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | professional_medicine               | 0.220588   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | professional_psychology             | 0.287582   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | public_relations                    | 0.418182   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | security_studies                    | 0.334694   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | sociology                           | 0.308458   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | us_foreign_policy                   | 0.37       | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | virology                            | 0.385542   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          |                                  | world_religions                     | 0.263158   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          | bigbench_misconceptions          |                                     | 0.598173   | 10-shot           | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | siqa                             |                                     | 0.512794   | 10-shot           | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | commonsense_qa                   |                                     | 0.22932    | 10-shot           | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | piqa                             |                                     | 0.806311   | 10-shot           | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | bigbench_novel_concepts          |                                     | 0.53125    | 10-shot           | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | bigbench_strange_stories         |                                     | 0.701149   | 10-shot           | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | bigbench_strategy_qa             |                                     | 0.59633    | 10-shot           | mosaicml/mpt-7b-instruct |
| language_understanding   | hellaswag                        |                                     | 0.769767   | 10-shot           | mosaicml/mpt-7b-instruct |
| language_understanding   | bigbench_language_identification |                                     | 0.2568     | 10-shot           | mosaicml/mpt-7b-instruct |
| language_understanding   | bigbench_conceptual_combinations |                                     | 0.320388   | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_elementary_math_qa      |                                     | 0.270466   | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_dyck_languages          |                                     | 0.314      | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_cs_algorithms           |                                     | 0.496212   | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_logical_deduction       |                                     | 0.262667   | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_operators               |                                     | 0.352381   | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | bigbench_repeat_copy_logic       |                                     | 0.3125     | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | simple_arithmetic_nospaces       |                                     | 0.078      | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | simple_arithmetic_withspaces     |                                     | 0.086      | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | math_qa                          |                                     | 0.257459   | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | logi_qa                          |                                     | 0.264209   | 10-shot           | mosaicml/mpt-7b-instruct |
| reading_comprehension    | pubmed_qa_labeled                |                                     | 0.59       | 10-shot           | mosaicml/mpt-7b-instruct |
| reading_comprehension    | squad                            |                                     | 0.586944   | 10-shot           | mosaicml/mpt-7b-instruct |
| reading_comprehension    | bigbench_understanding_fables    |                                     | 0.195767   | 10-shot           | mosaicml/mpt-7b-instruct |
| reading_comprehension    | boolq                            |                                     | 0.777064   | 10-shot           | mosaicml/mpt-7b-instruct |
| safety                   | winogender_mc_female             |                                     | 0.566667   | 10-shot           | mosaicml/mpt-7b-instruct |
| safety                   | winogender_mc_male               |                                     | 0.583333   | 10-shot           | mosaicml/mpt-7b-instruct |
| safety                   | enterprise_pii_classification    |                                     | 0.585862   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge          | triviaqa_sm_sub                  |                                     | 0.470667   | 3-shot            | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | gsm8k                            |                                     | 0.0318423  | 3-shot            | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | agi_eval_sat_math                |                                     | 0.0227273  | 3-shot            | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | aqua                             |                                     | 0.00816326 | 3-shot            | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | svamp                            |                                     | 0.306667   | 3-shot            | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving | agi_eval_lsat_ar                 |                                     | 0.26087    | 3-shot            | mosaicml/mpt-7b-instruct |
| reading_comprehension    | agi_eval_lsat_rc                 |                                     | 0.216418   | 3-shot            | mosaicml/mpt-7b-instruct |
| reading_comprehension    | agi_eval_lsat_lr                 |                                     | 0.270588   | 3-shot            | mosaicml/mpt-7b-instruct |
| reading_comprehension    | agi_eval_sat_en                  |                                     | 0.262136   | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   | bbq                              | Average                             | 0.581501   | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   |                                  | Age                                 | 0.566304   | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   |                                  | Disability_status                   | 0.562339   | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   |                                  | Gender_identity                     | 0.639281   | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   |                                  | Nationality                         | 0.582792   | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   |                                  | Physical_appearance                 | 0.579315   | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   |                                  | Race_ethnicity                      | 0.56061    | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   |                                  | Race_x_SES                          | 0.544355   | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   |                                  | Race_x_gender                       | 0.584774   | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   |                                  | Religion                            | 0.586667   | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   |                                  | SES                                 | 0.641463   | 3-shot            | mosaicml/mpt-7b-instruct |
| safety                   |                                  | Sexual_orientation                  | 0.548611   | 3-shot            | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | copa                             |                                     | 0.83       | 0-shot            | mosaicml/mpt-7b-instruct |
| commonsense_reasoning    | openbook_qa                      |                                     | 0.436      | 0-shot            | mosaicml/mpt-7b-instruct |
| language_understanding   | lambada_openai                   |                                     | 0.69086    | 0-shot            | mosaicml/mpt-7b-instruct |
| language_understanding   | winograd                         |                                     | 0.846154   | 0-shot            | mosaicml/mpt-7b-instruct |
| language_understanding   | winogrande                       |                                     | 0.67719    | 0-shot            | mosaicml/mpt-7b-instruct |
| language_understanding   | bigbench_conlang_translation     |                                     | 0.0670732  | 0-shot            | mosaicml/mpt-7b-instruct |
| reading_comprehension    | coqa                             |                                     | 0.454716   | 0-shot            | mosaicml/mpt-7b-instruct |

Original gauntlet on the same hardware (run mpt-eval-UBBxvo). Run time was 5,176.9 seconds.

| model_name               |   core_average |   lm_task_average |   lite_average |   world_knowledge |   commonsense_reasoning |   language_understanding |   symbolic_problem_solving |   reading_comprehension |   world_knowledge_lm_task_subscore |   language_understanding_lm_task_subscore |   symbolic_problem_solving_lm_task_subscore |   reading_comprehension_lm_task_subscore |   world_knowledge_lite |   commonsense_reasoning_lite |   language_understanding_lite |   symbolic_problem_solving_lite |   reading_comprehension_lite |
|:-------------------------|---------------:|------------------:|---------------:|------------------:|------------------------:|-------------------------:|---------------------------:|------------------------:|-----------------------------------:|------------------------------------------:|--------------------------------------------:|-----------------------------------------:|-----------------------:|-----------------------------:|------------------------------:|--------------------------------:|-----------------------------:|
| mosaicml/mpt-7b-instruct |       0.355008 |          0.454518 |       0.497581 |          0.399493 |                0.415171 |                 0.372755 |                   0.172231 |                0.415391 |                           0.575704 |                                   0.37916 |                                    0.273642 |                                 0.589567 |               0.375284 |                     0.635223 |                      0.692503 |                        0.195327 |                     0.589567 |

| Category                                  | Benchmark                        | Subtask                             |   Accuracy | Number few shot   | Model                    |
|:------------------------------------------|:---------------------------------|:------------------------------------|-----------:|:------------------|:-------------------------|
| world_knowledge_lite                      | jeopardy                         | Average                             |  0.457052  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge_lite                      |                                  | american_history                    |  0.51816   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge_lite                      |                                  | literature                          |  0.538776  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge_lite                      |                                  | science                             |  0.35084   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge_lite                      |                                  | word_origins                        |  0.287671  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge_lite                      |                                  | world_history                       |  0.589812  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge_lm_task_subscore          | bigbench_qa_wikidata             |                                     |  0.694356  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           | arc_easy                         |                                     |  0.747054  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge_lite                      | arc_challenge                    |                                     |  0.470137  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           | mmlu                             | Average                             |  0.312861  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | abstract_algebra                    |  0.31      | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | anatomy                             |  0.318519  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | astronomy                           |  0.302632  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | business_ethics                     |  0.25      | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | clinical_knowledge                  |  0.316981  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | college_biology                     |  0.256944  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | college_chemistry                   |  0.34      | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | college_computer_science            |  0.29      | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | college_mathematics                 |  0.27      | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | college_medicine                    |  0.277457  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | college_physics                     |  0.27451   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | computer_security                   |  0.37      | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | conceptual_physics                  |  0.348936  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | econometrics                        |  0.192982  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | electrical_engineering              |  0.331034  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | elementary_mathematics              |  0.256614  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | formal_logic                        |  0.285714  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | global_facts                        |  0.36      | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_biology                 |  0.354839  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_chemistry               |  0.256158  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_computer_science        |  0.29      | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_european_history        |  0.309091  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_geography               |  0.393939  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_government_and_politics |  0.362694  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_macroeconomics          |  0.333333  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_mathematics             |  0.3       | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_microeconomics          |  0.331933  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_physics                 |  0.317881  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_psychology              |  0.300917  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_statistics              |  0.393519  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_us_history              |  0.259804  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | high_school_world_history           |  0.265823  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | human_aging                         |  0.376682  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | human_sexuality                     |  0.374046  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | international_law                   |  0.305785  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | jurisprudence                       |  0.333333  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | logical_fallacies                   |  0.226994  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | machine_learning                    |  0.25      | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | management                          |  0.320388  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | marketing                           |  0.324786  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | medical_genetics                    |  0.34      | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | miscellaneous                       |  0.38825   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | moral_disputes                      |  0.317919  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | moral_scenarios                     |  0.252514  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | nutrition                           |  0.346405  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | philosophy                          |  0.360129  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | prehistory                          |  0.330247  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | professional_accounting             |  0.265957  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | professional_law                    |  0.271838  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | professional_medicine               |  0.220588  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | professional_psychology             |  0.292484  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | public_relations                    |  0.436364  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | security_studies                    |  0.338776  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | sociology                           |  0.308458  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | us_foreign_policy                   |  0.38      | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | virology                            |  0.391566  | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           |                                  | world_religions                     |  0.25731   | 10-shot           | mosaicml/mpt-7b-instruct |
| world_knowledge                           | bigbench_misconceptions          |                                     |  0.60274   | 10-shot           | mosaicml/mpt-7b-instruct |
| commonsense_reasoning_lite                | piqa                             |                                     |  0.805223  | 10-shot           | mosaicml/mpt-7b-instruct |
| commonsense_reasoning                     | bigbench_novel_concepts          |                                     |  0.53125   | 10-shot           | mosaicml/mpt-7b-instruct |
| commonsense_reasoning                     | bigbench_strange_stories         |                                     |  0.701149  | 10-shot           | mosaicml/mpt-7b-instruct |
| commonsense_reasoning                     | bigbench_strategy_qa             |                                     |  0.597641  | 10-shot           | mosaicml/mpt-7b-instruct |
| language_understanding_lite               | hellaswag                        |                                     |  0.770464  | 10-shot           | mosaicml/mpt-7b-instruct |
| language_understanding                    | bigbench_language_identification |                                     |  0.2562    | 10-shot           | mosaicml/mpt-7b-instruct |
| language_understanding                    | bigbench_conceptual_combinations |                                     |  0.330097  | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving_lite             | bigbench_elementary_math_qa      |                                     |  0.270309  | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving_lite             | bigbench_dyck_languages          |                                     |  0.314     | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving_lm_task_subscore | bigbench_cs_algorithms           |                                     |  0.49697   | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving                  | bigbench_logical_deduction       |                                     |  0.262667  | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving_lite             | bigbench_operators               |                                     |  0.352381  | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving_lite             | bigbench_repeat_copy_logic       |                                     |  0.3125    | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving_lite             | simple_arithmetic_nospaces       |                                     |  0.079     | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving_lite             | simple_arithmetic_withspaces     |                                     |  0.087     | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving                  | math_qa                          |                                     |  0.263158  | 10-shot           | mosaicml/mpt-7b-instruct |
| symbolic_problem_solving                  | logi_qa                          |                                     |  0.264209  | 10-shot           | mosaicml/mpt-7b-instruct |
| reading_comprehension_lite                | pubmed_qa_labeled                |                                     |  0.592     | 10-shot           | mosaicml/mpt-7b-instruct |
| reading_comprehension_lite                | squad                            |                                     |  0.587133  | 10-shot           | mosaicml/mpt-7b-instruct |
| reading_comprehension                     | bigbench_understanding_fables    |                                     |  0.195767  | 10-shot           | mosaicml/mpt-7b-instruct |
| reading_comprehension                     | boolq                            |                                     |  0.77737   | 10-shot           | mosaicml/mpt-7b-instruct |
| commonsense_reasoning_lite                | copa                             |                                     |  0.83      | 0-shot            | mosaicml/mpt-7b-instruct |
| commonsense_reasoning                     | openbook_qa                      |                                     |  0.436     | 0-shot            | mosaicml/mpt-7b-instruct |
| language_understanding_lite               | lambada_openai                   |                                     |  0.691248  | 0-shot            | mosaicml/mpt-7b-instruct |
| language_understanding_lite               | winograd                         |                                     |  0.846154  | 0-shot            | mosaicml/mpt-7b-instruct |
| language_understanding                    | winogrande                       |                                     |  0.674822  | 0-shot            | mosaicml/mpt-7b-instruct |
| language_understanding_lm_task_subscore   | bigbench_conlang_translation     |                                     |  0.0670732 | 0-shot            | mosaicml/mpt-7b-instruct |
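The per-category columns in the summary table (e.g. `commonsense_reasoning`, `language_understanding`) roll up the individual benchmark accuracies listed above. As a hedged sketch of that aggregation — the actual weighting lives in `scripts/eval/yamls/eval_gauntlet_v0.1.yaml` and may rescale each benchmark relative to its random-guessing baseline, whereas this example uses a plain unweighted mean:

```python
# Hypothetical sketch of category-score aggregation. The real gauntlet config
# (eval_gauntlet_v0.1.yaml) may subtract a random-chance baseline per benchmark
# before averaging; here we simply take the unweighted mean of accuracies.

def category_score(accuracies):
    """Average a list of per-benchmark accuracies into one category score."""
    if not accuracies:
        raise ValueError("no benchmark accuracies given")
    return sum(accuracies) / len(accuracies)

# Example: three commonsense_reasoning benchmarks from the table above
# (bigbench_novel_concepts, bigbench_strange_stories, bigbench_strategy_qa).
scores = [0.53125, 0.701149, 0.597641]
print(round(category_score(scores), 6))
```

Under this simple-mean assumption, the three commonsense benchmarks above average to roughly 0.61; the table's reported `commonsense_reasoning` figure differs because the gauntlet includes additional benchmarks in that category and may apply baseline rescaling.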

@bmosaicml changed the title from "[WIP] Gauntlet v0.1" to "Gauntlet v0.1" on Nov 13, 2023
@maxisawesome

There are a lot of yamls in scripts/eval/yamls. I'll add several for long_context as well. IMO we should split them up into folders to keep things cleaner.

@maxisawesome

README.md says "This is version v0, in the coming weeks we will update the mixture to include more benchmarks." We should list it as v0.1.0 and not promise to add things so quickly lol


@dakinggg dakinggg left a comment


Functional changes look fine to me. Leaving to others to approve the gauntlet itself/dataset descriptions

Review threads (resolved) on:
- scripts/eval/yamls/coding_tasks.yaml
- scripts/eval/yamls/copa.yaml
- scripts/eval/yamls/eval_gauntlet_v0.1.yaml
- scripts/eval/yamls/tasks_v0.1.yaml
@maxisawesome

LGTM

@bmosaicml bmosaicml enabled auto-merge (squash) November 20, 2023 19:35
@bmosaicml bmosaicml merged commit ab5577b into main Nov 20, 2023
12 checks passed
bmosaicml added a commit that referenced this pull request Nov 20, 2023