
Accelerate refuses to work on balanced_low_0 when GPU 0 is not filled. #2429

Closed
xkszltl opened this issue Feb 8, 2024 · 12 comments

Comments

@xkszltl

xkszltl commented Feb 8, 2024

System Info

- `Accelerate` version: 0.26.1
- Platform: Linux-3.10.0-1160.105.1.el7.x86_64-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- GPU type: NVIDIA TITAN V
- `Accelerate` default config:
        Not found

```python
current_device = list(model_devices)[0]
current_device_index = current_device.index if isinstance(current_device, torch.device) else current_device

if torch.device(current_device_index) != self.device:
    # if on the first device (GPU 0) we don't care
    if (self.device.index is not None) or (current_device_index != 0):
        raise ValueError(
            "You can't train a model that has been loaded in 8-bit precision on a different device than the one "
            "you're training on. Make sure you loaded the model on the correct device using for example "
            "`device_map={'': torch.cuda.current_device()}` or `device_map={'': torch.xpu.current_device()}`"
        )
```

This part throws if the model is loaded with `transformers.AutoModelForCausalLM.from_pretrained(..., device_map="balanced_low_0", ...)`, because GPU 0 may be left completely unused.
This doesn't seem like good behavior, as there's no way to tell which device is the "first device" without computing the device map first.
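
For illustration, a minimal reproduction along these lines should hit the check above (the model id and quantization settings below are placeholders, not taken from the original report):

```python
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical repro on a multi-GPU node; the model id is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="balanced_low_0",  # balance layers over GPUs 1..N, keep GPU 0 light
)

accelerator = Accelerator()
model = accelerator.prepare(model)  # raises the ValueError above if GPU 0 got no layers
```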

@SunMarc
Member

SunMarc commented Feb 23, 2024

Hi @xkszltl, could you tell me more about your specific use case and the issue that you are facing? I don't understand the part about the device map, since we do compute it beforehand with `model_devices = set(model.hf_device_map.values())`.
Note that the related PR to this section is this one.
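
For context, a `balanced_low_0` device map can leave GPU 0 entirely out of the placement. The snippet below is a hand-written illustration (module names and indices are hypothetical), not output from a real run:

```python
# Hypothetical hf_device_map for a 32-layer model loaded with
# device_map="balanced_low_0" on an 8-GPU node; no module lands on GPU 0.
hf_device_map = {
    "model.embed_tokens": 1,
    "model.layers.0": 1,
    # ... layers 1-30 spread over GPUs 1-7 ...
    "model.layers.31": 7,
    "model.norm": 7,
    "lm_head": 7,
}

model_devices = set(hf_device_map.values())  # {1, 7} in this toy example
current_device = list(model_devices)[0]      # a non-zero index, so the check raises
```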

@xkszltl
Author

xkszltl commented Feb 24, 2024

We are loading LLaMA 2 7B and Mistral for fine-tuning in a single-node, 8-GPU setup.
It works with `balanced` but not with `balanced_low_0`, because of the exception thrown here.
I assume that means the model device is not GPU 0, probably to make room for the low_0 requirement?

@xkszltl
Author

xkszltl commented Feb 24, 2024

The part I don't understand is: what makes GPU 0 so special and worth asserting here?

@SunMarc
Member

SunMarc commented Feb 26, 2024

Hi @xkszltl, this check might be outdated since we added the possibility to fine-tune BNB models with naive PP. I'll let @younesbelkada confirm this point!
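
For what it's worth, one way the check could be relaxed is to skip it whenever the model is sharded across several devices (i.e. naive PP). This is only a sketch of that idea with my own naming, not necessarily what the eventual fix does:

```python
import torch

def check_model_device(model_devices, accelerator_device):
    """Sketch: only complain when the 8-bit model sits on exactly one
    device and that device differs from the training device."""
    if len(model_devices) > 1:
        return  # sharded across GPUs (e.g. balanced_low_0): nothing to check
    current_device = next(iter(model_devices))
    current_device_index = (
        current_device.index if isinstance(current_device, torch.device) else current_device
    )
    if torch.device(current_device_index) != accelerator_device:
        # being on the first device (GPU 0) is still fine
        if (accelerator_device.index is not None) or (current_device_index != 0):
            raise ValueError("Model is loaded on a different device than the training device.")
```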


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@xkszltl
Author

xkszltl commented Mar 22, 2024

Not stale

@xkszltl
Author

xkszltl commented Mar 22, 2024

@SunMarc @younesbelkada
Any update on this?

@SunMarc
Member

SunMarc commented Mar 26, 2024

Hi @xkszltl, sorry for the delay. Would you like to submit a PR to fix this and check that the tests pass on transformers and accelerate? Thanks!

xkszltl added a commit to xkszltl/accelerate that referenced this issue Mar 27, 2024
Seems `balanced_low_0` can leave GPU 0 empty and break this check.
According to the discussion, this check may be outdated.

Resolve huggingface#2429

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@xkszltl
Author

xkszltl commented Apr 19, 2024

Not stale


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@SunMarc
Member

SunMarc commented May 14, 2024

Closing this since this issue should be solved by this PR.

@SunMarc SunMarc closed this as completed May 14, 2024