Inference on multi-gpu #19
Comments
Same issue.
Is the model at (It is stated here in case you missed it.)
Thanks for your reminder.
Good to hear! IIRC it is not a quick fix to change the model parallel configuration, as the code expects the exact name and number of layers indicated in the model files. But if all you want to do is run inference with the 13B model on an 8-GPU system, maybe you could launch 4 processes, each taking 2 GPUs (using something like CUDA_VISIBLE_DEVICES to assign them), and split the inputs into 4 chunks of (almost) equal size? The throughput should be similar.
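To make that concrete, here is a minimal sketch of the idea. The `example_completion.py` entry point and the `--ckpt_dir`/`--tokenizer_path` flags mirror this repo's README, while the `--prompts_file` flag and the chunk files are hypothetical placeholders you would have to wire up yourself:

```python
# Sketch: run 4 independent torchrun jobs, each pinned to 2 GPUs, over 4 chunks
# of the prompts. Assumes the 13B model (MP=2) and 8 visible GPUs.
import json
import os
import subprocess

prompts = [f"prompt_{i}" for i in range(16)]      # replace with your real inputs
chunks = [prompts[i::4] for i in range(4)]        # 4 (almost) equal chunks

procs = []
for i, chunk in enumerate(chunks):
    # Hypothetical: persist each chunk so the worker script can read it back.
    with open(f"chunk_{i}.json", "w") as f:
        json.dump(chunk, f)

    # Each job only sees its own pair of GPUs.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=f"{2 * i},{2 * i + 1}")
    cmd = [
        "torchrun",
        "--nproc_per_node", "2",                  # matches MP=2 for the 13B model
        "--master_port", str(29500 + i),          # avoid port clashes between jobs
        "example_completion.py",
        "--ckpt_dir", "CodeLlama-13b/",
        "--tokenizer_path", "CodeLlama-13b/tokenizer.model",
        "--prompts_file", f"chunk_{i}.json",      # hypothetical flag
    ]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()
```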
The problem is that one input (prompt) cannot be split into many chunks. The method you mentioned does not extend the GPU memory available to a single generation.
I see. I'm afraid I am not familiar with that kind of setup, but there is already a HuggingFace version of Code Llama, so you may try running that instead and see if it fits your use case: https://huggingface.co/docs/transformers/main/model_doc/code_llama
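If it helps, a rough sketch of loading the HF port across several GPUs could look like the following; it assumes the `transformers` and `accelerate` packages are installed, and the Hub id shown follows the usual naming for the converted checkpoints (double-check it for your model size):

```python
# Sketch: let accelerate shard the fp16 weights across all visible GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-13b-hf"  # assumed Hub id for the HF conversion

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                   # spread layers over the available GPUs
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With `device_map="auto"` the memory of all visible GPUs is pooled for a single generation, which is the part the native MP checkpoints cannot do on a mismatched GPU count.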
Can you please guide me on how to run the 13B and 34B models on Windows? I have a single GPU, so I am able to run the 7B model, whose model parallel (MP) value is 1. The 13B model requires MP=2, but I only have 1 GPU on which I want to run inference. What changes should I make in the code, and in which file, so that I can run the 13B model?
The model parallel (MP) size is fixed for each model size: 7B uses MP=1, 13B uses MP=2, and 34B uses MP=4.
Sadly, if you don't change the Llama loading code, you have to set the number of GPUs (nproc_per_node) equal to the MP size. What I tried with DeepSpeed looks roughly like this:

```python
import torch
import deepspeed

# Build the model and load the (single-shard) checkpoint on CPU first.
model = Transformer(model_args)
checkpoint = torch.load(ckpt_dir + '/consolidated.00.pth', map_location="cpu")

# Let DeepSpeed wrap the model with fp16 enabled.
deepspeed_generator, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=checkpoint,
    config={"fp16": {"enabled": True}},
)

model = Llama(deepspeed_generator, tokenizer)
```

PS. If anyone finds an open source framework for this, please share it with me. Thanks in advance. (╥﹏╥)
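As a usage note, DeepSpeed scripts like the one above are normally started with the `deepspeed` launcher (for example `deepspeed --num_gpus 2 your_script.py`), which spawns one process per GPU; that is standard DeepSpeed CLI behaviour rather than anything verified for this particular snippet.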
Could you please give an example of the full code in llama/generation.py? I tried to change the code, but I get the same error with memory allocation. I want to use codellama-7b on 2 GPUs.
Tried to run:
I have a long prompt (4000 tokens).
I have 4 Nvidia A10G GPUs, each with 300 W and 24 GB of VRAM. However, I only see one GPU being used (in nvidia-smi).
The error I get is:
The whole tracelog is: