Add code eval #587

samhavens · 2023-09-08T01:13:28Z

This is a re-do of Rishab's PR #441 , but now we have a stable version of Composer.

This has been tested somewhat but there are issues with nondeterminism. As well, we are still figuring out how to eval using the secure sandbox on AWS.

bmosaicml

Lgtm, just left some comments

scripts/train/yamls/finetune/mpt-7b-code.yaml

scripts/eval/yamls/tasks.yaml

scripts/eval/yamls/eval_gauntlet.yaml

add three shot to coding tasks

make 0 and 3 shot for gauntlet, add programming_lite

samhavens · 2023-09-12T07:18:31Z

I used the following YAML to run the new coding evals against MPT-7b-8k. I reproduced our previous value for HumanEval (Python) of 11.6%

integrations:
- integration_type: git_repo
  git_repo: mosaicml/llm-foundry
  git_branch: sam/add-coding-eval
  pip_install: -e ".[gpu]"
  ssh_clone: false
- integration_type: wandb
  project: code-lora
  tags:
  - peft
  - eval
  entity: mosaic-ml

command: |
  cd llm-foundry/scripts
  composer eval/eval.py /mnt/config/parameters.yaml  || (echo "Command failed - killing python" && pkill python && exit 1)


run_name: humaneval-mpt-7b-8k
gpu_num: 8
gpu_type: a100_40gb
cluster: r7z2

resumable: false
priority: medium

image: mosaicml/examples:llm-latest

env_variables:
  - key: DATABRICKS_HOST
    value: redacted
  - key: DATABRICKS_TOKEN
    value: redacted
  - key: CODE_EVAL_DEVICE
    value: LOCAL

parameters:
  dist_timeout: 6000
  seed: 1
  max_seq_len: 1024
  device_eval_batch_size: 4
  precision: amp_fp16

  loggers:
    wandb: {}
    mlflow:
      experiment_name: redacted
      tracking_uri: databricks

  model_name_or_path: mosaicml/mpt-7b-8k

  models:
  -
    model_name: ${model_name_or_path}
    model:
      name: hf_causal_lm
      pretrained_model_name_or_path: ${model_name_or_path}
      init_device: cpu
      pretrained: true
    tokenizer:
      name: ${model_name_or_path}
      kwargs:
        model_max_length: ${max_seq_len}

  fsdp_config:
    sharding_strategy: FULL_SHARD
    mixed_precision: PURE
    activation_checkpointing: true
    activation_checkpointing_reentrant: false
    activation_cpu_offload: false
    limit_all_gathers: true
    verbose: false

  icl_tasks: 'eval/yamls/coding_tasks.yaml'

| Category   | Benchmark      | Subtask   |   Accuracy | Number few shot   | Model              |
|:-----------|:---------------|:----------|-----------:|:------------------|:-------------------|
|            | human_eval     |           |  0.115854  | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_cpp |           |  0.0621118 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_js  |           |  0.0487805 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_c   |           |  0.222222  | 0-shot            | mosaicml/mpt-7b-8k |

Note that this runs the code eval "locally" (on the same machine as eval.py). We need to very this works in Lambdas as well.

mcarbin · 2023-09-12T19:25:31Z

My repro using @samhavens yaml produces different results:

Benchmark	Accuracy	Number few shot	Model
human_eval	0.0914634	0-shot	mosaicml/mpt-7b-8k
human_eval_cpp	0.0372671	0-shot	mosaicml/mpt-7b-8k
human_eval_js	0.0426829	0-shot	mosaicml/mpt-7b-8k
human_eval_c	0.111111	0-shot	mosaicml/mpt-7b-8k

@samhavens and I suspected generate/decode params and @samhavens found sampling is enabled for code eval, hence introducing nondeterminism:

https://github.com/mosaicml/composer/blob/71cdfad53c5962fed8bcf340bb2bbfe403472678/composer/datasets/in_context_learning_evaluation.py#L1048

dakinggg · 2023-09-12T19:46:46Z

@samhavens @mcarbin I would still expect it to be deterministic if we are seeding everything properly...if you run on the same hardaware, batch size, etc, the data order should be the same. Perhaps we should see closer to the generate call to make it more deterministic.

mcarbin · 2023-09-14T13:52:36Z

non-determinism due to a subtle bug in seeding eval trainer. fixed with a2c877b

before

Benchmark	Accuracy	Number few shot	Model
human_eval	0.0914634	0-shot	mosaicml/mpt-7b-8k
human_eval_cpp	0.0372671	0-shot	mosaicml/mpt-7b-8k
human_eval_js	0.0487805	0-shot	mosaicml/mpt-7b-8k
human_eval_c	0	0-shot	mosaicml/mpt-7b-8k

Benchmark	Accuracy	Number few shot	Model
human_eval	0.0731707	0-shot	mosaicml/mpt-7b-8k
human_eval_cpp	0.0496894	0-shot	mosaicml/mpt-7b-8k
human_eval_js	0.0243902	0-shot	mosaicml/mpt-7b-8k
human_eval_c	0	0-shot	mosaicml/mpt-7b-8k

Benchmark	Accuracy	Number few shot	Model
human_eval	0.109756	0-shot	mosaicml/mpt-7b-8k
human_eval_cpp	0.0248447	0-shot	mosaicml/mpt-7b-8k
human_eval_js	0.0426829	0-shot	mosaicml/mpt-7b-8k
human_eval_c	0	0-shot	mosaicml/mpt-7b-8k

after

Benchmark	Accuracy	Number few shot	Model
human_eval	0.109756	0-shot	mosaicml/mpt-7b-8k
human_eval_cpp	0.0372671	0-shot	mosaicml/mpt-7b-8k
human_eval_js	0.0365854	0-shot	mosaicml/mpt-7b-8k
human_eval_c	0	0-shot	mosaicml/mpt-7b-8k

Benchmark	Accuracy	Number few shot	Model
human_eval	0.109756	0-shot	mosaicml/mpt-7b-8k
human_eval_cpp	0.0372671	0-shot	mosaicml/mpt-7b-8k
human_eval_js	0.0365854	0-shot	mosaicml/mpt-7b-8k
human_eval_c	0	0-shot	mosaicml/mpt-7b-8k

Benchmark	Accuracy	Number few shot	Model
human_eval	0.109756	0-shot	mosaicml/mpt-7b-8k
human_eval_cpp	0.0372671	0-shot	mosaicml/mpt-7b-8k
human_eval_js	0.0365854	0-shot	mosaicml/mpt-7b-8k
human_eval_c	0	0-shot	mosaicml/mpt-7b-8k

mcarbin · 2023-09-14T13:54:28Z

Non-determinism also exposed an issue with code eval pass@k metric. fixing is WIP

dakinggg · 2023-09-16T00:21:21Z

Closes #362

mcarbin · 2023-09-22T18:14:48Z

Waiting on PR to composer: mosaicml/composer#2550

dakinggg

@bmosaicml could you rereview please? want to make sure the gauntlet changes are ok with you

llmfoundry/utils/builders.py

scripts/eval/eval.py

dakinggg · 2023-09-26T01:29:17Z

Also, please bump the mosaicml version pin in setup.py to >=0.16.3. Then once its released CI should pass

dakinggg

Will approve once composer release is out, version is bumped, and CI passes

scripts/eval/yamls/eval_gauntlet.yaml

bmosaicml · 2023-09-26T01:47:08Z

Lgtm, any chance you could include the time to eval for whichever model/num GPUs you tested this with? It'll be useful reference to know once we run >30B models

…al' into sam/add-coding-eval

scripts/eval/yamls/tasks.yaml

mcarbin · 2023-09-26T19:34:01Z

For MPT30b, 16xA100-80g is the minimum spec. Here are reference results:

Ran mosaicml/mpt-30b eval in: 15220.648153305054 seconds 
Printing complete results for all models 
| Category   | Benchmark      | Subtask   |   Accuracy | Number few shot   | Model            | 
|:-----------|:---------------|:----------|-----------:|:------------------|:-----------------| 
|            | human_eval     |           |  0.143598  | 0-shot            | mosaicml/mpt-30b | 
|            | human_eval_cpp |           |  0.0928571 | 0-shot            | mosaicml/mpt-30b | 
|            | human_eval_js  |           |  0.0951219 | 0-shot            | mosaicml/mpt-30b | 
|            | human_eval_c   |           |  0.0888889 | 0-shot            | mosaicml/mpt-30b |

llmfoundry/utils/builders.py

add code eval

a57dde8

samhavens requested review from dakinggg and bmosaicml September 8, 2023 01:13

bmosaicml approved these changes Sep 8, 2023

View reviewed changes

scripts/train/yamls/finetune/mpt-7b-code.yaml Outdated Show resolved Hide resolved

scripts/eval/yamls/tasks.yaml Show resolved Hide resolved

scripts/eval/yamls/eval_gauntlet.yaml Show resolved Hide resolved

samhavens added 4 commits September 11, 2023 22:10

Create coding_tasks.yaml

a63c4aa

Update coding_tasks.yaml

71de0bf

Update tasks.yaml

85b727f

add three shot to coding tasks

Update eval_gauntlet.yaml

b0670e8

make 0 and 3 shot for gauntlet, add programming_lite

samhavens and others added 4 commits September 13, 2023 00:19

update readme add code eval

1e868d8

torch nograd eval_model

c6e54ae

integrate seed

a2c877b

Merge branch 'mosaicml:sam/add-coding-eval' into sam/add-coding-eval

071ddde

pass_at_k threading

12c0175

mcarbin added 5 commits September 22, 2023 09:14

remove extraneous print

2755e8d

update num_beams to default of 20

06ae4fb

update num_beamss for coding eval to 20 as the standard for pass@1

bfbbf17

remove C from default coding tasks because it's too small

e042eb7

update num_beams in full task list to default of 20

1fa60ec

mcarbin added 5 commits September 25, 2023 15:04

Merge branch 'main' into sam/add-coding-eval

9e241f5

fix break from merge and precommit

4b5f2d9

yamllint

c3819de

remove code finetuning

cb45e4b

Merge branch 'main' into sam/add-coding-eval

e3d7dda

mcarbin marked this pull request as ready for review September 25, 2023 19:35

mcarbin self-assigned this Sep 25, 2023

Merge branch 'main' into sam/add-coding-eval

0acec3d

dakinggg reviewed Sep 26, 2023

View reviewed changes

llmfoundry/utils/builders.py Show resolved Hide resolved

scripts/eval/eval.py Outdated Show resolved Hide resolved

Merge branch 'main' into sam/add-coding-eval

79d1d8c

dakinggg requested changes Sep 26, 2023

View reviewed changes

bmosaicml reviewed Sep 26, 2023

View reviewed changes

scripts/eval/yamls/eval_gauntlet.yaml Outdated Show resolved Hide resolved

bmosaicml reviewed Sep 26, 2023

View reviewed changes

scripts/eval/yamls/eval_gauntlet.yaml Outdated Show resolved Hide resolved

bmosaicml reviewed Sep 26, 2023

View reviewed changes

scripts/eval/yamls/eval_gauntlet.yaml Outdated Show resolved Hide resolved

mcarbin added 4 commits September 25, 2023 22:21

bump version for CI

628e020

Merge remote-tracking branch 'refs/remotes/mosaicml/sam/add-coding-ev…

797c690

…al' into sam/add-coding-eval

update gauntlet

0860f91

update tasks

fbb7342

bmosaicml reviewed Sep 26, 2023

View reviewed changes

scripts/eval/yamls/tasks.yaml Outdated Show resolved Hide resolved

Merge branch 'main' into sam/add-coding-eval

b10beaf

mcarbin and others added 2 commits September 26, 2023 15:43

remove torch.no_grad

63470b4

Merge branch 'main' into sam/add-coding-eval

707a2e7

dakinggg reviewed Sep 26, 2023

View reviewed changes

llmfoundry/utils/builders.py Show resolved Hide resolved

dakinggg approved these changes Sep 26, 2023

View reviewed changes

mcarbin merged commit fd36398 into main Sep 26, 2023
9 checks passed

dakinggg mentioned this pull request Oct 5, 2023

HumanEval Benchmark #362

Closed

dakinggg deleted the sam/add-coding-eval branch October 11, 2023 21:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add code eval #587

Add code eval #587

samhavens commented Sep 8, 2023 •

edited

Loading

bmosaicml left a comment

samhavens commented Sep 12, 2023 •

edited

Loading

mcarbin commented Sep 12, 2023 •

edited

Loading

dakinggg commented Sep 12, 2023

mcarbin commented Sep 14, 2023

mcarbin commented Sep 14, 2023

dakinggg commented Sep 16, 2023

mcarbin commented Sep 22, 2023

dakinggg left a comment

dakinggg commented Sep 26, 2023 •

edited

Loading

dakinggg left a comment

bmosaicml commented Sep 26, 2023

mcarbin commented Sep 26, 2023

Add code eval #587

Add code eval #587

Conversation

samhavens commented Sep 8, 2023 • edited Loading

bmosaicml left a comment

Choose a reason for hiding this comment

samhavens commented Sep 12, 2023 • edited Loading

mcarbin commented Sep 12, 2023 • edited Loading

dakinggg commented Sep 12, 2023

mcarbin commented Sep 14, 2023

before

after

mcarbin commented Sep 14, 2023

dakinggg commented Sep 16, 2023

mcarbin commented Sep 22, 2023

dakinggg left a comment

Choose a reason for hiding this comment

dakinggg commented Sep 26, 2023 • edited Loading

dakinggg left a comment

Choose a reason for hiding this comment

bmosaicml commented Sep 26, 2023

mcarbin commented Sep 26, 2023

samhavens commented Sep 8, 2023 •

edited

Loading

samhavens commented Sep 12, 2023 •

edited

Loading

mcarbin commented Sep 12, 2023 •

edited

Loading

dakinggg commented Sep 26, 2023 •

edited

Loading