
Accelerate refuses to work on balanced_low_0 when GPU 0 is not filled. #2429

Closed
xkszltl opened this issue Feb 8, 2024 · 12 comments

Comments

@xkszltl

xkszltl commented Feb 8, 2024

System Info

- `Accelerate` version: 0.26.1
- Platform: Linux-3.10.0-1160.105.1.el7.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- GPU type: NVIDIA TITAN V
- `Accelerate` default config:
        Not found

```python
current_device = list(model_devices)[0]
current_device_index = current_device.index if isinstance(current_device, torch.device) else current_device

if torch.device(current_device_index) != self.device:
    # if on the first device (GPU 0) we don't care
    if (self.device.index is not None) or (current_device_index != 0):
        raise ValueError(
            "You can't train a model that has been loaded in 8-bit precision on a different device than the one "
            "you're training on. Make sure you loaded the model on the correct device using for example "
            "`device_map={'': torch.cuda.current_device()}` or `device_map={'': torch.xpu.current_device()}`"
        )
```

This part throws if the model is loaded with `transformers.AutoModelForCausalLM.from_pretrained(..., device_map="balanced_low_0", ...)`, because GPU 0 may be left completely unused.
This doesn't seem like good behavior, as there's no way to tell which device is the "first device" without computing the device map first.
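
For illustration, a minimal reproduction along these lines should hit the check above (the model id and quantization settings below are placeholders, not taken from the original report):

```python
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical repro on a multi-GPU node; the model id is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="balanced_low_0",  # balance layers over GPUs 1..N, keep GPU 0 light
)

accelerator = Accelerator()
model = accelerator.prepare(model)  # raises the ValueError above if GPU 0 got no layers
```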

@SunMarc
Member

SunMarc commented Feb 23, 2024

Hi @xkszltl, could you tell me more about your specific use case and the issue that you are facing? I don't understand the part about the device map, since we do compute it beforehand with `model_devices = set(model.hf_device_map.values())`.
Note that the related PR to this section is this one.
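
For context, a `balanced_low_0` device map can leave GPU 0 entirely out of the placement. The snippet below is a hand-written illustration (module names and indices are hypothetical), not output from a real run:

```python
# Hypothetical hf_device_map for a 32-layer model loaded with
# device_map="balanced_low_0" on an 8-GPU node; no module lands on GPU 0.
hf_device_map = {
    "model.embed_tokens": 1,
    "model.layers.0": 1,
    # ... layers 1-30 spread over GPUs 1-7 ...
    "model.layers.31": 7,
    "model.norm": 7,
    "lm_head": 7,
}

model_devices = set(hf_device_map.values())  # {1, 7} in this toy example
current_device = list(model_devices)[0]      # a non-zero index, so the check raises
```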

@xkszltl
Author

xkszltl commented Feb 24, 2024

We are loading LLaMA 2 7B and Mistral for fine-tuning in a single-node, 8-GPU setup.
It works with `balanced` but not with `balanced_low_0`, because of the exception thrown here.
I assume that means the model device is not GPU 0, probably to make room for the low_0 requirement?

@xkszltl
Author

xkszltl commented Feb 24, 2024

The part I don't understand is: what makes GPU 0 so special and worth asserting here?

@SunMarc
Member

SunMarc commented Feb 26, 2024

Hi @xkszltl, this check might be outdated since we added the possibility to fine-tune BNB models with naive PP. I'll let @younesbelkada confirm this point!
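
For what it's worth, one way the check could be relaxed is to skip it whenever the model is sharded across several devices (i.e. naive PP). This is only a sketch of that idea with my own naming, not necessarily what the eventual fix does:

```python
import torch

def check_model_device(model_devices, accelerator_device):
    """Sketch: only complain when the 8-bit model sits on exactly one
    device and that device differs from the training device."""
    if len(model_devices) > 1:
        return  # sharded across GPUs (e.g. balanced_low_0): nothing to check
    current_device = next(iter(model_devices))
    current_device_index = (
        current_device.index if isinstance(current_device, torch.device) else current_device
    )
    if torch.device(current_device_index) != accelerator_device:
        # being on the first device (GPU 0) is still fine
        if (accelerator_device.index is not None) or (current_device_index != 0):
            raise ValueError("Model is loaded on a different device than the training device.")
```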


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@xkszltl
Author

xkszltl commented Mar 22, 2024

Not stale

@xkszltl
Author

xkszltl commented Mar 22, 2024

@SunMarc @younesbelkada
Any update on this?

@SunMarc
Member

SunMarc commented Mar 26, 2024

Hi @xkszltl, sorry for the delay. Would you like to submit a PR to fix this and check that the tests pass on transformers and accelerate? Thanks!

xkszltl added a commit to xkszltl/accelerate that referenced this issue Mar 27, 2024
Seems `balanced_low_0` can leave GPU 0 empty and break this check.
According to the discussion, this check may be outdated.

Resolve huggingface#2429

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@xkszltl
Author

xkszltl commented Apr 19, 2024

Not stale


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@SunMarc
Member

SunMarc commented May 14, 2024

Closing this since this issue should be solved by this PR.

@SunMarc SunMarc closed this as completed May 14, 2024