Distributed mlx_lm.evaluate #1174

Open · wants to merge 3 commits into main
Conversation

@barronalex (Collaborator) commented Dec 19, 2024

Add a distributed version of mlx_lm.evaluate that runs on multiple nodes and produces identical outputs.
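For readers skimming the diff, here is a rough sketch of the idea (not the code in this PR; it assumes MLX's mx.distributed API and a hypothetical score_fn): each rank scores an equal-sized slice of the requests and the per-request scores are gathered back in request order, so the merged result is identical to a single-node run. The multi-node numbers below would come from launching the script under MPI (e.g. mpirun with a hostfile), which is how MLX's distributed backend is typically started.

```python
# Rough sketch only (not this PR's implementation): shard the eval requests
# across ranks with MLX's distributed API and gather the per-request scores
# back in request order, so the merged output matches a single-node run.
import mlx.core as mx

def distributed_scores(requests, score_fn):
    # score_fn is a hypothetical callable returning one float per request.
    group = mx.distributed.init()
    rank, size = group.rank(), group.size()

    # Pad so every rank gets the same number of requests; padded entries are
    # dropped again after gathering, keeping all_gather shapes identical.
    per_rank = (len(requests) + size - 1) // size
    padded = requests + [requests[-1]] * (per_rank * size - len(requests))
    local = padded[rank * per_rank : (rank + 1) * per_rank]

    local_scores = mx.array([score_fn(r) for r in local])   # (per_rank,)
    all_scores = mx.distributed.all_gather(local_scores)    # (size * per_rank,) in rank order
    return all_scores[: len(requests)]                      # original order, padding removed
```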

Also fix a few bugs:

  • Add masking so that changing the batch_size no longer affects the output (see the sketch after this list)
  • Fix a bug in loglikelihood_rolling tasks, e.g. WikiText
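
As a side note on the first fix, here is a minimal sketch of the masking idea (illustrative only, not the PR's code; masked_logprob_sum and its arguments are made up for the example): padded positions are excluded from the per-token log-probability sum, so a sequence's score no longer depends on how it was batched.

```python
# Minimal sketch of length masking for batched loglikelihood scoring
# (illustrative; not the code in this PR).
import mlx.core as mx

def masked_logprob_sum(logits, targets, lengths):
    """logits: (B, T, V), targets: (B, T), lengths: (B,) valid tokens per row."""
    logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)         # log-softmax
    tok = mx.take_along_axis(logprobs, targets[..., None], axis=-1)[..., 0]  # (B, T)
    mask = mx.arange(tok.shape[1])[None, :] < lengths[:, None]               # True on real tokens
    return mx.where(mask, tok, 0.0).sum(axis=-1)  # per-sequence score; padding contributes nothing
```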
Benchmark command:

mlx_lm.evaluate --model mlx-community/Qwen2.5-7B-Instruct-bf16 --tasks winogrande

On 1 M2 Ultra:

  Acc: 0.6992896606156275
  Time (post init): 64 sec

On 4 M2 Ultra:

  Acc: 0.6985003946329913
  Time (post init): 16 sec

@ivanfioravanti (Contributor) commented

This is great! I'm testing it with M2 Ultra + 2 M4 Max. WOW! Great job @barronalex
When will this be reviewed and merged?

llms/mlx_lm/evaluate.py — several review comments, now outdated and resolved
@ivanfioravanti (Contributor) commented Jan 22, 2025

Any news on this PR? It would be great to speed up some distributed evals on DeepSeek R1 😜
I will give it a try.

@barronalex (Collaborator, Author) commented

That’s awesome! I’ll get it in later today.

@barronalex (Collaborator, Author) commented

@awni thanks for the comments! I think this is good to merge now.

@@ -346,11 +361,8 @@ def main():
)
parser.add_argument(
"--apply-chat-template",
@barronalex (Collaborator, Author) commented on this hunk:

It was impossible to disable this before, so I've changed it to be off by default (which mirrors the lm_eval behavior)
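
Concretely, the change makes the flag opt-in; a sketch of what that looks like (illustrative, the exact help text and wiring in the diff may differ):

```python
# Sketch: chat templating becomes opt-in rather than always applied.
parser.add_argument(
    "--apply-chat-template",
    action="store_true",
    default=False,
    help="Apply the model's chat template to each prompt (off by default).",
)
```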

@awni (Member) replied:
I'm not sure about defaulting it to off for instruct models.. it seems like you would always want this on for most models that are used regularly? Does it make sense to change this to --ignore-chat-template instead to be able to shut it off if needed?

@awni (Member) left a review:

Looks great! Just one comment. Let me know what you think. Otherwise LGTM!
