
Allows MMLU to have the system_prompt provided to it #197

Open · wants to merge 2 commits into main

Conversation

RobotSail (Member)

The MMLU evaluator currently performs MMLU evaluation by calling out to the lm-eval harness. The harness handles the evaluation by building its own prompt client-side and passing it to the /v1/completions endpoint of whatever OpenAI-compatible API it expects to be listening at the other end.

Certain models, however, require their chat template to be applied during inference in order to get the best results. This PR adjusts the MMLU evaluator so that a system prompt can be provided to the model, and enables chat-template mode whenever one is present.
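
For illustration, here is a rough sketch of how the evaluator might be called once the new argument is in place (the model path, task name, and server URL are placeholders; the constructor and run() shape follow the existing MMLUEvaluator usage shown later in this thread):

    # Sketch only: pass a system prompt so the chat template is applied during MMLU.
    from instructlab.eval.mmlu import MMLUEvaluator

    evaluator = MMLUEvaluator(
        model_path="path/to/model",           # placeholder
        tasks=["mmlu_abstract_algebra"],      # placeholder task list
        system_prompt="You are a helpful AI assistant.",  # new optional argument from this PR
    )
    overall_score, individual_scores = evaluator.run("http://127.0.0.1:8000/v1")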

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>

@mergify mergify bot added the CI/CD and documentation labels Dec 12, 2024
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
@relyt0925

I am currently running an e2e test on granite-8b-starter and am still seeing failures (I manually brought in the patches).

@relyt0925

(app-root) /$ cat /opt/app-root/lib/python3.11/site-packages/instructlab/eval/mmlu.py | grep system_prompt
        system_prompt   system prompt to be used when applying the chat template
        system_prompt: Optional[str] = None,
        self.system_prompt = system_prompt
        should_apply_chat_template = self.system_prompt is not None
            system_instruction=self.system_prompt,
        system_prompt   system prompt to be used when applying the chat template
        system_prompt: Optional[str] = None,
            system_prompt=system_prompt,
        system_prompt   system prompt to be used when applying the chat template
(app-root) /$ cat /opt/app-root/lib/python3.11/site-packages/instructlab/model/evaluate.py | grep system_prompt
            system_prompt = get_sysprompt(get_model_arch(model_path))
                system_prompt=system_prompt,
            base_model_system_prompt = get_sysprompt(get_model_arch(base_model_path))
            model_system_prompt = get_sysprompt(get_model_arch(model_path))
                    system_prompt=model_system_prompt,
                    base_model_system_prompt,
                    system_prompt=model_system_prompt,
(app-root) /$ 

Confirmed my image had the patches for both libraries, then proceeded to run the MMLU eval.

@danmcp (Member) left a comment


Changes look good; it might be good to pass an example system_prompt from a unit test:

mmlu = MMLUEvaluator(model_path=MODEL_EXAMPLE, tasks=tasks)

It wouldn't test much, but it would at least make sure the arg passing works.
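
For illustration, such a test might look roughly like this (a sketch only; the test name and assertion are assumptions, and MODEL_EXAMPLE and tasks refer to the constants already used in the unit tests):

    # Hypothetical test sketch: only verifies that the new argument is accepted and stored.
    def test_mmlu_accepts_system_prompt():
        mmlu = MMLUEvaluator(
            model_path=MODEL_EXAMPLE,
            tasks=tasks,
            system_prompt="You are a helpful AI assistant.",
        )
        assert mmlu.system_prompt == "You are a helpful AI assistant."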

The other place to potentially add an example would be in the test scripts:

https://github.com/instructlab/eval/blob/d8d609780ed035a533be1b1fdd7166d263092461/scripts/test_mmlu.py

That's less valuable since they aren't run anywhere, but it could be a helpful example.

@relyt0925

Hit the same error:

Requesting API:   0%|                                                                                             | 52/56168 [00:16<2:48:58,  5.54it/s]WARNING 2024-12-12 20:57:37,031 lm-eval:347: API request failed with error message: Internal Server Error. Retrying...
WARNING 2024-12-12 20:57:38,197 lm-eval:347: API request failed with error message: Internal Server Error. Retrying...
WARNING 2024-12-12 20:57:39,361 lm-eval:347: API request failed with error message: Internal Server Error. Retrying...
INFO 2024-12-12 20:57:47,539 instructlab.model.backends.vllm:475: Waiting for GPU VRAM reclamation...
Traceback (most recent call last):
  File "/opt/app-root/bin/ilab", line 8, in <module>
    sys.exit(ilab())
             ^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py", line 323, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/evaluate.py", line 817, in evaluate
    overall_score, individual_scores = evaluator.run(api_base)
                                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/eval/mmlu.py", line 147, in run
    results = self._run_mmlu(server_url)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/eval/mmlu.py", line 175, in _run_mmlu
    mmlu_output = self._simple_evaluate_with_error_handling(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/eval/mmlu.py", line 193, in _simple_evaluate_with_error_handling
    return simple_evaluate(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/evaluator.py", line 301, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/evaluator.py", line 500, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/api/model.py", line 378, in loglikelihood
    return self._loglikelihood_tokens(new_reqs, disable_tqdm=disable_tqdm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/models/api_models.py", line 502, in _loglikelihood_tokens
    outputs = retry(
              ^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
          ^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 185, in reraise
    raise self.last_attempt.result()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/opt/app-root/lib64/python3.11/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/lm_eval/models/api_models.py", line 350, in model_call
    response.raise_for_status()
  File "/opt/app-root/lib64/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://127.0.0.1:43683/v1/completions
Requesting API:   0%|                                                                                            | 52/56168 [00:36<11:00:59,  1.41it/s]

@relyt0925

I am happy to bring in the patches and do an e2e test to ensure everything is in a good state on granite-8b-starter.

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
@mergify mergify bot added the testing label Dec 13, 2024
@RobotSail (Member, Author) left a comment


Changes look good; it might be good to pass an example system_prompt from a unit test:

mmlu = MMLUEvaluator(model_path=MODEL_EXAMPLE, tasks=tasks)

I've updated the PR to include these changes; please take a look and let me know if there's anything I missed.

Regarding this comment: Tyler and I had a debug session where we found that adding the new prompt template causes MMLU to exceed the model's existing context window (4,096 tokens) when --few-shot=5 is used. This happens because the lm-eval harness appears to truncate the prompt by default and send only the last 4,096 tokens that fit into the context, which causes the system prompt to be dropped and certain models to throw an error.

The resolution here is to use a lower value for --few-shot, such as 1-3.
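
For reference, a minimal sketch of that workaround on the evaluator side (the few_shots parameter name is an assumption mirroring the CLI's --few-shot flag; the other arguments are placeholders):

    # Sketch only: lower the few-shot count so the system prompt and chat template
    # still fit within a 4,096-token context window.
    evaluator = MMLUEvaluator(
        model_path="path/to/model",
        tasks=["mmlu_abstract_algebra"],
        system_prompt="You are a helpful AI assistant.",
        few_shots=2,  # assumed evaluator-side counterpart of --few-shot; lower than 5
    )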

@mergify mergify bot added the one-approval label Dec 13, 2024