Distributed LORA training fails with METAL error #1185
Comments
I am using macOS 15.3 beta; can this be the culprit?
Happy to help debug/test if needed. I tried all possible combinations but no luck.
I did it! After hours of trying and debugging everything, I have finally found a working configuration! Later I will try to compile 5.0.6 from source to see if the issue is in the brew package or there is really an incompatibility with Metal/MLX. I'll keep you posted!
False positive... with 5.0.5 it starts, but it's not actually running with mx.distributed, because the nodes are not present in the output. The processes start, but they are just 3 parallel trainings. Node 0 of 3
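One quick way to confirm whether mx.distributed is actually initialized across hosts is to print the group rank and size from each process; a minimal sketch (the script name and hostfile path are placeholders):

```python
# check_dist.py - launch with: mpirun --hostfile ~/hosts.txt -n 3 python check_dist.py
import mlx.core as mx

world = mx.distributed.init()
# If MPI is wired up correctly, each process reports a distinct rank and
# size() equals the number of hosts; size() == 1 means the processes are
# running as independent, non-distributed trainings.
print(f"Node {world.rank()} of {world.size()}")
```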
I dug deep into this like crazy and I found something that may help. With these reductions:

all_losses = mx.distributed.all_sum(all_losses)
ntokens = mx.distributed.all_sum(ntokens)

val_batches = 3 works; anything above that leads to the timeout above. Hope this helps.
I have finally been able to set up a pure Thunderbolt connection between the 3 Macs, and things are slightly better. I still need to keep the val_batches value low, but at least it works. Should we keep this open, or do you prefer to close it?
@ivanfioravanti sorry for the delayed response here. This is a known issue with Metal: the GPU can only wait up to ~5 seconds, after which it times out and you get an error. Your case especially exposes this since you are on heterogeneous machines: the Ultra is much faster, finishes the validation early, and then just waits > 5 seconds for the other machines to finish. A simple fix is to do the reduction on the CPU stream.
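For reference, the suggested change boils down to passing a CPU stream to the collectives in the validation step; a minimal sketch based on the snippet quoted above (the helper name is illustrative, not the exact mlx-lm code):

```python
import mlx.core as mx

def reduce_val_stats(all_losses: mx.array, ntokens: mx.array):
    # Running the collectives on the CPU stream avoids the ~5 second Metal
    # GPU timeout when a fast node finishes validation early and has to
    # wait for slower nodes to reach the reduction.
    all_losses = mx.distributed.all_sum(all_losses, stream=mx.cpu)
    ntokens = mx.distributed.all_sum(ntokens, stream=mx.cpu)
    return all_losses, ntokens
```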
I hit what I think might be a similar error when trying to mlx_lm.convert / quantize a giant model - referenced here: [Tencent HunYuan MOE model](https://github.com//pull/1100#issuecomment-2576017636)
Any thoughts? I am adding a ton of logging and some mx.metal.x performance collection everywhere. I have 192 GB of memory on my M2 Ultra Mac Studio, with the wired limit set to 190. Can conversion/quantization happen in chunks, given that the model I'm converting from bf16 is ~800 GB? This isn't a distributed error (at least not to my knowledge) since I'm not using any distributed functions, but I'm open to ideas here.
Quantization should always happen in chunks, so you shouldn't have memory issues quantizing that model. Are you using the latest MLX / MLX LM? If not, try upgrading to see if that fixes things. Do you have the model on an external hard disk? I've seen timeouts before when loading from disk takes too long... but we added some patches to help fix that. The error you are getting, while it is the same timeout, is most likely from a different cause than the issues in this thread.
@awni Yes, I do have the model on my NAS due to the size. I bet that's it - it's timing out on read. I last upgraded mlx and mlx-lm yesterday from a git clone - any other PRs out there? |
You can try running the load on the CPU to see if that is the issue. It shouldn't time out on the CPU. Change this line to: weights.update(mx.load(wf, stream=mx.cpu)) |
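In context, the load loop would look roughly like this (a sketch; the weight_files glob is a placeholder for however the shard paths are collected):

```python
import glob
import mlx.core as mx

# Placeholder: wherever the model's *.safetensors shards live (e.g. on the NAS).
weight_files = glob.glob("model/*.safetensors")

weights = {}
for wf in weight_files:
    # Loading on the CPU stream avoids the Metal command-buffer timeout
    # when reading shards from slow storage.
    weights.update(mx.load(wf, stream=mx.cpu))
```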
Thanks. I ended up forcing it to cpu with a much worse method that just crashed my machine. 😂 Best part of this is at least I'm finally learning the internals of mlx core now. |
mx.distributed.all_sum(ntokens, stream=mx.cpu) works! Thanks @awni |
That would actually be a good thing to PR as well if you feel like it (otherwise I'll get around to it). Unless some communication is latency sensitive, there is no benefit to having it in the GPU stream. I should have written it that way in the first place. (For example, the average gradients function runs the communication in the CPU stream.)
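A rough sketch of what that looks like for gradient averaging, with the communication pinned to the CPU stream (illustrative only, not the actual mlx implementation):

```python
import mlx.core as mx
from mlx.utils import tree_map

def average_gradients(grads):
    # Sum each gradient leaf across ranks on the CPU stream, then divide by
    # the group size. Gradient averaging is not latency sensitive, so there
    # is no benefit to running the communication on the GPU stream.
    size = mx.distributed.init().size()
    return tree_map(lambda g: mx.distributed.all_sum(g, stream=mx.cpu) / size, grads)
```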
In the past I tried mlx_lm.lora distributed with an M2 Ultra and an M3 Max and everything was working well.
Today I tried again with 1 M2 Ultra and 2 M4 Max and I get the following error.
A single run works on each host; I tried them one by one using:

mpirun --hostfile ~/hosts.txt -n 1 /Users/ivan/.pyenv/shims/mlx_lm.lora --model mistralai/Mistral-7B-v0.3 --train --data ~/Desktop/data --max-seq-length 8192 --batch-size 1
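The ~/hosts.txt passed via --hostfile is a standard Open MPI hostfile; a minimal sketch with placeholder hostnames for the three Macs might look like:

```
# ~/hosts.txt - hostnames or IPs reachable from the launching machine,
# one slot each since every host runs a single MLX process
m2-ultra.local slots=1
m4-max-1.local slots=1
m4-max-2.local slots=1
```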
As soon as I start a run on 2+ hosts, the error appears. I used this for 3 hosts:

mpirun --hostfile ~/hosts.txt -n 3 --wdir /Users/ivan/.pyenv/shims mlx_lm.lora --model mistralai/Mistral-7B-v0.3 --train --data ~/Desktop/data --max-seq-length 8192 --batch-size 3
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
[Mac:46703] *** Process received signal ***
[Mac:46703] Signal: Abort trap: 6 (6)
[Mac:46703] Signal code: (0)
[Mac:46703] [ 0] 0 libsystem_platform.dylib 0x000000018ff66de4 _sigtramp + 56
[Mac:46703] [ 1] 0 libsystem_pthread.dylib 0x000000018ff2ff70 pthread_kill + 288
[Mac:46703] [ 2] 0 libsystem_c.dylib 0x000000018fe3c908 abort + 128
[Mac:46703] [ 3] 0 libc++abi.dylib 0x000000018fee644c _ZN10__cxxabiv130__aligned_malloc_with_fallbackEm + 0
[Mac:46703] [ 4] 0 libc++abi.dylib 0x000000018fed4a24 _ZL28demangling_terminate_handlerv + 320
[Mac:46703] [ 5] 0 libobjc.A.dylib 0x000000018fb7d3f4 _ZL15_objc_terminatev + 172
[Mac:46703] [ 6] 0 libc++abi.dylib 0x000000018fee5710 _ZSt11__terminatePFvvE + 16
[Mac:46703] [ 7] 0 libc++abi.dylib 0x000000018fee56b4 _ZSt9terminatev + 108
[Mac:46703] [ 8] 0 libdispatch.dylib 0x000000018fd7d688 _dispatch_client_callout4 + 40
[Mac:46703] [ 9] 0 libdispatch.dylib 0x000000018fd99c88 _dispatch_mach_msg_invoke + 464
[Mac:46703] [10] 0 libdispatch.dylib 0x000000018fd84a38 _dispatch_lane_serial_drain + 352
[Mac:46703] [11] 0 libdispatch.dylib 0x000000018fd9a9dc _dispatch_mach_invoke + 456
[Mac:46703] [12] 0 libdispatch.dylib 0x000000018fd84a38 _dispatch_lane_serial_drain + 352
[Mac:46703] [13] 0 libdispatch.dylib 0x000000018fd85764 _dispatch_lane_invoke + 432
[Mac:46703] [14] 0 libdispatch.dylib 0x000000018fd84a38 _dispatch_lane_serial_drain + 352
[Mac:46703] [15] 0 libdispatch.dylib 0x000000018fd85730 _dispatch_lane_invoke + 380
[Mac:46703] [16] 0 libdispatch.dylib 0x000000018fd909a0 _dispatch_root_queue_drain_deferred_wlh + 288
[Mac:46703] [17] 0 libdispatch.dylib 0x000000018fd901ec _dispatch_workloop_worker_thread + 540
[Mac:46703] [18] 0 libsystem_pthread.dylib 0x000000018ff2c3d8 _pthread_wqthread + 288
[Mac:46703] [19] 0 libsystem_pthread.dylib 0x000000018ff2b0f0 start_wqthread + 8
[Mac:46703] *** End of error message ***
/Users/ifioravanti/.pyenv/versions/3.12.7/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/Users/ivan/.pyenv/versions/3.12.8/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/Users/ivan/.pyenv/versions/3.12.7/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
prterun noticed that process rank 0 with PID 46703 on node Mac exited on
signal 6 (Abort trap: 6).