Distributed LORA training fails with METAL error #1185

Open
ivanfioravanti opened this issue Dec 30, 2024 · 14 comments

Comments

@ivanfioravanti
Contributor

ivanfioravanti commented Dec 30, 2024

In the past I ran mlx_lm.lora distributed across an M2 Ultra and an M3 Max and everything worked well.
Today I tried again with 1 M2 Ultra and 2 M4 Max and I get the following error.

A single run works on each host; I tried them one by one using:

    mpirun --hostfile ~/hosts.txt -n 1 /Users/ivan/.pyenv/shims/mlx_lm.lora --model mistralai/Mistral-7B-v0.3 --train --data ~/Desktop/data --max-seq-length 8192 --batch-size 1

As soon as I start a run on 2+ hosts the error appears. For 3 hosts I used:

    mpirun --hostfile ~/hosts.txt -n 3 --wdir /Users/ivan/.pyenv/shims mlx_lm.lora --model mistralai/Mistral-7B-v0.3 --train --data ~/Desktop/data --max-seq-length 8192 --batch-size 3

libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
[Mac:46703] *** Process received signal ***
[Mac:46703] Signal: Abort trap: 6 (6)
[Mac:46703] Signal code: (0)
[Mac:46703] [ 0] 0 libsystem_platform.dylib 0x000000018ff66de4 _sigtramp + 56
[Mac:46703] [ 1] 0 libsystem_pthread.dylib 0x000000018ff2ff70 pthread_kill + 288
[Mac:46703] [ 2] 0 libsystem_c.dylib 0x000000018fe3c908 abort + 128
[Mac:46703] [ 3] 0 libc++abi.dylib 0x000000018fee644c _ZN10__cxxabiv130__aligned_malloc_with_fallbackEm + 0
[Mac:46703] [ 4] 0 libc++abi.dylib 0x000000018fed4a24 _ZL28demangling_terminate_handlerv + 320
[Mac:46703] [ 5] 0 libobjc.A.dylib 0x000000018fb7d3f4 _ZL15_objc_terminatev + 172
[Mac:46703] [ 6] 0 libc++abi.dylib 0x000000018fee5710 _ZSt11__terminatePFvvE + 16
[Mac:46703] [ 7] 0 libc++abi.dylib 0x000000018fee56b4 _ZSt9terminatev + 108
[Mac:46703] [ 8] 0 libdispatch.dylib 0x000000018fd7d688 _dispatch_client_callout4 + 40
[Mac:46703] [ 9] 0 libdispatch.dylib 0x000000018fd99c88 _dispatch_mach_msg_invoke + 464
[Mac:46703] [10] 0 libdispatch.dylib 0x000000018fd84a38 _dispatch_lane_serial_drain + 352
[Mac:46703] [11] 0 libdispatch.dylib 0x000000018fd9a9dc _dispatch_mach_invoke + 456
[Mac:46703] [12] 0 libdispatch.dylib 0x000000018fd84a38 _dispatch_lane_serial_drain + 352
[Mac:46703] [13] 0 libdispatch.dylib 0x000000018fd85764 _dispatch_lane_invoke + 432
[Mac:46703] [14] 0 libdispatch.dylib 0x000000018fd84a38 _dispatch_lane_serial_drain + 352
[Mac:46703] [15] 0 libdispatch.dylib 0x000000018fd85730 _dispatch_lane_invoke + 380
[Mac:46703] [16] 0 libdispatch.dylib 0x000000018fd909a0 _dispatch_root_queue_drain_deferred_wlh + 288
[Mac:46703] [17] 0 libdispatch.dylib 0x000000018fd901ec _dispatch_workloop_worker_thread + 540
[Mac:46703] [18] 0 libsystem_pthread.dylib 0x000000018ff2c3d8 _pthread_wqthread + 288
[Mac:46703] [19] 0 libsystem_pthread.dylib 0x000000018ff2b0f0 start_wqthread + 8
[Mac:46703] *** End of error message ***
/Users/ifioravanti/.pyenv/versions/3.12.7/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/Users/ivan/.pyenv/versions/3.12.8/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/Users/ivan/.pyenv/versions/3.12.7/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

prterun noticed that process rank 0 with PID 46703 on node Mac exited on
signal 6 (Abort trap: 6).

@ivanfioravanti
Contributor Author

I am using the macOS 15.3 beta; could this be the culprit?

@ivanfioravanti
Contributor Author

Happy to help debug/test if needed. I tried all possible combinations, but no luck.

@ivanfioravanti
Contributor Author

I did it! After hours of trying and debugging everything, I have finally found a working configuration!
I had to compile open-mpi 5.0.5 from source on all 3 machines, and after that everything is working.
So the problem lies in open-mpi 5.0.6.

Later I will try to compile 5.0.6 from source to see whether the issue is in the brew package or there really is an incompatibility with METAL/MLX.

Keep you posted!

@ivanfioravanti
Contributor Author

False positive... with 5.0.5 it started, but it's not actually running with mx.distributed, because the nodes are not present in the output. The processes start, but they are just 3 independent training runs in parallel.

Node 0 of 3
Node 2 of 3
Node 1 of 3

@ivanfioravanti
Contributor Author

I dug into this like crazy and found something that can help.
If I keep validation below 5 seconds (my guess) there is no timeout; otherwise the M2 Ultra finishes first and gets stuck at lines 168-169 in trainer.py.

    all_losses = mx.distributed.all_sum(all_losses)
    ntokens = mx.distributed.all_sum(ntokens)

val_batches = 3 works; anything above that leads to the timeout shown earlier. Hope this helps.

@ivanfioravanti
Contributor Author

I have finally been able to set up a pure Thunderbolt connection between the 3 Macs and things are slightly better. I still need to keep the val_batches value low, but at least it works. Should we keep this open or do you prefer to close it?

@awni
Member

awni commented Jan 4, 2025

@ivanfioravanti sorry for the delayed response here. This is a known issue with Metal: the GPU can only wait up to ~5 seconds, after which it times out and you get an error.

Your case especially exposes this because you are on heterogeneous machines: the Ultra is much faster, finishes the validation early, and then waits > 5 seconds for the other machines to finish.
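To make the mechanism concrete, here is a toy sketch in plain Python (no MLX or GPU involved). It assumes each rank blocks at the collective until the slowest rank arrives, and it treats the "~5 seconds" figure above as a hard limit. The function names and timings are hypothetical and only for illustration.

```python
# Toy model of the wait at a collective; it does not touch MLX or Metal.
# GPU_WAIT_LIMIT_S is an assumption based on the "~5 seconds" quoted above.
GPU_WAIT_LIMIT_S = 5.0


def collective_wait_times(validation_seconds):
    """Seconds each rank spends blocked until the slowest rank arrives."""
    slowest = max(validation_seconds)
    return [slowest - t for t in validation_seconds]


def ranks_at_risk(validation_seconds, limit=GPU_WAIT_LIMIT_S):
    """Indices of ranks whose wait would exceed the assumed limit."""
    waits = collective_wait_times(validation_seconds)
    return [rank for rank, wait in enumerate(waits) if wait > limit]


# Hypothetical timings: rank 0 is the fast machine, ranks 1 and 2 are slower.
timings = [4.0, 11.5, 12.0]
print(collective_wait_times(timings))  # [8.0, 0.5, 0.0]
print(ranks_at_risk(timings))          # [0]
```

In this model, only the fastest rank is exposed to the timeout, which matches the observation that the M2 Ultra is the one that gets stuck, and that shortening validation (fewer val_batches) shrinks the gap.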

A simple fix is to do the reduction on the CPU stream, mx.distributed.all_sum(ntokens, stream=mx.cpu), in which case there should not be a timeout anywhere. I'm actively investigating a cleaner solution for this, but I can't make any promises yet. For now, for communications where low latency is not critical, running them on the CPU is the safer option.
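A minimal sketch of that workaround follows, assuming MLX is installed and the script is launched under mpirun as in the original report. The helper name all_sum_on_cpu and the demo values are hypothetical; only the all_sum(..., stream=mx.cpu) call comes from the suggestion above.

```python
def all_sum_on_cpu(*arrays):
    """Sum each array across all ranks, running the communication on the CPU stream."""
    import mlx.core as mx  # imported lazily; requires MLX to be installed

    summed = [mx.distributed.all_sum(a, stream=mx.cpu) for a in arrays]
    mx.eval(*summed)  # force evaluation so the reduction actually runs here
    return summed


def demo():
    """Hypothetical usage; run one copy per rank, e.g. under mpirun."""
    import mlx.core as mx

    world = mx.distributed.init()
    # Made-up per-rank values standing in for the validation loss sum and token count.
    all_losses = mx.array(1.0 * (world.rank() + 1))
    ntokens = mx.array(10)
    all_losses, ntokens = all_sum_on_cpu(all_losses, ntokens)
    print(f"Node {world.rank()} of {world.size()}:", all_losses.item(), ntokens.item())
```

The reduction here happens once per validation pass, so any extra latency from the CPU stream should be negligible next to the validation itself.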

@fblissjr

fblissjr commented Jan 7, 2025

I hit what I think might be a similar error when trying to mlx_lm.convert / quantize a giant model, referenced here: [Tencent HunYuan MOE model](https://github.com//pull/1100#issuecomment-2576017636)

libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
[1]    95682 abort      python -m mlx_lm.convert --hf-path ./models/hy --mlx-path ./models/hy-mlx-3bit  --q-bits 3

Any thoughts? I am adding a ton of logging and some mx.metal.x performance collection everywhere. I have 192 GB of memory on my M2 Ultra Mac Studio, with the wired limit set to 190.

Can conversion/quantization happen in chunks given the model I'm converting from bf16 is ~800 GB?

This isn't a distributed error (at least not to my knowledge) since I'm not using any distributed functions, but open to ideas here.

@awni
Member

awni commented Jan 7, 2025

Quantization should always happen in chunks, so you shouldn't have memory issues quantizing that model. Are you using the latest MLX / MLX LM? If not, try upgrading to make sure that fixes things.

Do you have the model on an external hard disk? I've seen timeouts before when loading from disk takes too long, but we added some patches to help fix that. The error you are getting, while it is the same timeout, is most likely from a different cause than the issues in this thread.

@fblissjr

fblissjr commented Jan 7, 2025

@awni Yes, I do have the model on my NAS due to the size. I bet that's it - it's timing out on read.

I last upgraded mlx and mlx-lm yesterday from a git clone - any other PRs out there?

@awni
Member

awni commented Jan 7, 2025

You can try running the load on the CPU to see if that is the issue. It shouldn't time out on the CPU.

Change this line to:

        weights.update(mx.load(wf, stream=mx.cpu))
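As a rough sketch of how that change fits into a shard-loading loop: the helper below takes the loader as a parameter so the loop can be exercised without MLX. The names load_weight_shards and mlx_cpu_loader, and the "*.safetensors" pattern, are assumptions for illustration and are not the actual mlx_lm code.

```python
from pathlib import Path


def load_weight_shards(model_dir, load_fn, pattern="*.safetensors"):
    """Merge the dicts returned by load_fn for every shard file in model_dir."""
    weights = {}
    for wf in sorted(Path(model_dir).glob(pattern)):
        weights.update(load_fn(str(wf)))
    return weights


def mlx_cpu_loader(path):
    """Load one shard on the CPU stream, as suggested above (requires MLX)."""
    import mlx.core as mx  # imported lazily so the loop above works without MLX

    return mx.load(path, stream=mx.cpu)
```

With MLX installed, calling load_weight_shards("./models/hy", mlx_cpu_loader) would route every shard load through the CPU stream, which is one way to check whether slow reads from the NAS are what triggers the timeout.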

@fblissjr

fblissjr commented Jan 7, 2025

> You can try running the load on the CPU to see if that is the issue. It shouldn't time out on the CPU.
>
> Change this line to:
>
>         weights.update(mx.load(wf, stream=mx.cpu))

Thanks. I ended up forcing it to cpu with a much worse method that just crashed my machine. 😂

Best part of this is at least I'm finally learning the internals of mlx core now.

@ivanfioravanti
Contributor Author

mx.distributed.all_sum(ntokens, stream=mx.cpu) works! Thanks @awni

@angeloskath
Member

That would actually be a good thing to PR as well if you feel like it (otherwise I'll get around to it).

Unless some communication is latency sensitive, there is no benefit in having it in the GPU stream. I should have written it that way in the first place. (For example, the average gradients function runs the communication in the CPU stream.)
