
Allows MMLU to have the system_prompt provided to it #197

Open · wants to merge 2 commits into main

Conversation

RobotSail (Member)

The MMLU evaluator currently performs MMLU evaluation by calling out to the lm-eval harness. The harness handles the evaluation by building its own prompt client-side and passing it to the /v1/completions endpoint of whatever OpenAI-compatible API it expects to be listening at the other end.

Certain models, however, require their chat template to be applied during inference in order to get the best results. This PR adjusts the MMLU evaluator so that a system prompt can be provided to the model, and enables chat-template mode whenever one is present.
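
For illustration, here is a rough sketch of how the evaluator might be called once the new argument is in place (the model path, task name, and server URL are placeholders; the constructor and run() shape follow the existing MMLUEvaluator usage shown later in this thread):

    # Sketch only: pass a system prompt so the chat template is applied during MMLU.
    from instructlab.eval.mmlu import MMLUEvaluator

    evaluator = MMLUEvaluator(
        model_path="path/to/model",           # placeholder
        tasks=["mmlu_abstract_algebra"],      # placeholder task list
        system_prompt="You are a helpful AI assistant.",  # new optional argument from this PR
    )
    overall_score, individual_scores = evaluator.run("http://127.0.0.1:8000/v1")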

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>

@mergify mergify bot added the CI/CD and documentation labels Dec 12, 2024
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
@relyt0925

I am currently running an e2e test on granite-8b-starter and am still seeing failures (I manually brought in the patches).

@relyt0925

(app-root) /$ cat /opt/app-root/lib/python3.11/site-packages/instructlab/eval/mmlu.py | grep system_prompt
        system_prompt   system prompt to be used when applying the chat template
        system_prompt: Optional[str] = None,
        self.system_prompt = system_prompt
        should_apply_chat_template = self.system_prompt is not None
            system_instruction=self.system_prompt,
        system_prompt   system prompt to be used when applying the chat template
        system_prompt: Optional[str] = None,
            system_prompt=system_prompt,
        system_prompt   system prompt to be used when applying the chat template
(app-root) /$ cat /opt/app-root/lib/python3.11/site-packages/instructlab/model/evaluate.py | grep system_prompt
            system_prompt = get_sysprompt(get_model_arch(model_path))
                system_prompt=system_prompt,
            base_model_system_prompt = get_sysprompt(get_model_arch(base_model_path))
            model_system_prompt = get_sysprompt(get_model_arch(model_path))
                    system_prompt=model_system_prompt,
                    base_model_system_prompt,
                    system_prompt=model_system_prompt,
(app-root) /$ 

Confirmed my image had the patches for both libraries, then proceeded to run the MMLU eval.

@danmcp (Member) left a comment


Changes look good; it might be good to pass an example system_prompt from a unit test:

mmlu = MMLUEvaluator(model_path=MODEL_EXAMPLE, tasks=tasks)

It wouldn't test much, but it would at least make sure the arg passing works.
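
For illustration, such a test might look roughly like this (a sketch only; the test name and assertion are assumptions, and MODEL_EXAMPLE and tasks refer to the constants already used in the unit tests):

    # Hypothetical test sketch: only verifies that the new argument is accepted and stored.
    def test_mmlu_accepts_system_prompt():
        mmlu = MMLUEvaluator(
            model_path=MODEL_EXAMPLE,
            tasks=tasks,
            system_prompt="You are a helpful AI assistant.",
        )
        assert mmlu.system_prompt == "You are a helpful AI assistant."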

The other place to potentially add an example would be in the test scripts:

https://github.com/instructlab/eval/blob/d8d609780ed035a533be1b1fdd7166d263092461/scripts/test_mmlu.py

That's less valuable since they aren't run anywhere, but it could be a helpful example.

@relyt0925

Hit the same error:

Requesting API:   0%|                                                                                             | 52/56168 [00:16<2:48:58,  5.54it/s]WARNING 2024-12-12 20:57:37,031 lm-eval:347: API request failed with error message: Internal Server Error. Retrying...
WARNING 2024-12-12 20:57:38,197 lm-eval:347: API request failed with error message: Internal Server Error. Retrying...
WARNING 2024-12-12 20:57:39,361 lm-eval:347: API request failed with error message: Internal Server Error. Retrying...
INFO 2024-12-12 20:57:47,539 instructlab.model.backends.vllm:475: Waiting for GPU VRAM reclamation...
Traceback (most recent call last):
  File "/opt/app-root/bin/ilab", line 8, in <module>
    sys.exit(ilab())
             ^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py", line 323, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/evaluate.py", line 817, in evaluate
    overall_score, individual_scores = evaluator.run(api_base)
                                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/eval/mmlu.py", line 147, in run
    results = self._run_mmlu(server_url)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/eval/mmlu.py", line 175, in _run_mmlu
    mmlu_output = self._simple_evaluate_with_error_handling(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/eval/mmlu.py", line 193, in _simple_evaluate_with_error_handling
    return simple_evaluate(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/evaluator.py", line 301, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/evaluator.py", line 500, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/api/model.py", line 378, in loglikelihood
    return self._loglikelihood_tokens(new_reqs, disable_tqdm=disable_tqdm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/models/api_models.py", line 502, in _loglikelihood_tokens
    outputs = retry(
              ^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
          ^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 185, in reraise
    raise self.last_attempt.result()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/models/api_models.py", line 350, in model_call
    response.raise_for_status()
  File "/opt/app-root/lib64/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://127.0.0.1:43683/v1/completions
Requesting API:   0%|                                                                                            | 52/56168 [00:36<11:00:59,  1.41it/s]

@relyt0925

I am happy to bring in the patches and do an e2e test to ensure everything is in a good state on granite-8b-starter.

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
@mergify mergify bot added the testing label Dec 13, 2024
@RobotSail (Member, Author) left a comment


Changes look good; it might be good to pass an example system_prompt from a unit test:

mmlu = MMLUEvaluator(model_path=MODEL_EXAMPLE, tasks=tasks)

I've updated the PR to include these changes; please take a look and let me know if there's anything I missed.

Regarding this comment: Tyler and I had a debug session where we found that adding the new prompt template causes MMLU to exceed the model's existing context window (4,096 tokens) when --few-shot=5 is used. This happens because the lm-eval harness appears to truncate the prompt by default and send only the last 4,096 tokens that fit into the context, which causes the system prompt to be dropped and certain models to throw an error.

The resolution here is to use a lower value for --few-shot, such as 1-3.
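
For reference, a minimal sketch of that workaround on the evaluator side (the few_shots parameter name is an assumption mirroring the CLI's --few-shot flag; the other arguments are placeholders):

    # Sketch only: lower the few-shot count so the system prompt and chat template
    # still fit within a 4,096-token context window.
    evaluator = MMLUEvaluator(
        model_path="path/to/model",
        tasks=["mmlu_abstract_algebra"],
        system_prompt="You are a helpful AI assistant.",
        few_shots=2,  # assumed evaluator-side counterpart of --few-shot; lower than 5
    )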

@mergify mergify bot added the one-approval label Dec 13, 2024