anyone tried batch inference? #20

Closed
deep-diver opened this issue Mar 16, 2023 · 10 comments

@deep-diver
Contributor

When I set the pad token id to 0 and padding=True,
the generated text for the padded prompt always shows

@deep-diver
Contributor Author

Setting padding_side="left" does the trick.
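
To illustrate why the padding side matters for a decoder-only model, here is a minimal sketch in plain Python, so it runs without a model. The helper pad_batch and the token ids are hypothetical and only mimic what a tokenizer does when padding=True; this is not the tokenizer's actual implementation.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length; return (input_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, real = [pad_id] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append([0] * n_pad + real)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(real + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token ids for two prompts of different lengths.
batch = [[1, 9508, 368], [1, 9508]]

left_ids, left_mask = pad_batch(batch, pad_id=0, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=0, side="right")

print(left_ids)   # [[1, 9508, 368], [0, 1, 9508]]
print(right_ids)  # [[1, 9508, 368], [1, 9508, 0]]
```

With left padding every row ends in a real token, so generation continues from the prompt itself. With right padding the shorter row ends in a pad token, and the model is asked to continue from padding.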

@benob

benob commented Mar 16, 2023

I am getting the following error when trying batched inference. Did you need any trick?

../aten/src/ATen/native/cuda/Indexing.cu:1088: indexSelectSmallIndex: block: [4,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.                      
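
This assertion is raised by an index-select kernel. One common cause (an assumption here, since the failing input is not shown) is a token id that is greater than or equal to the number of rows in the embedding table, for example after adding a pad token without resizing the embeddings. Below is a minimal sketch of a pre-flight check in plain Python; the function name and the numbers are hypothetical.

```python
def find_out_of_range_ids(input_ids, num_embeddings):
    """Return (row, col, token_id) for every id the embedding table cannot index."""
    return [
        (r, c, tok)
        for r, row in enumerate(input_ids)
        for c, tok in enumerate(row)
        if tok < 0 or tok >= num_embeddings
    ]


# Hypothetical case: an embedding table with 32000 rows (valid ids 0..31999)
# and a newly added pad token that was assigned id 32000.
batch_ids = [[32000, 1, 9508], [1, 9508, 368]]

bad = find_out_of_range_ids(batch_ids, num_embeddings=32000)
print(bad)  # [(0, 0, 32000)]
```

If a check like this reports anything, either resize the model's embeddings to match the tokenizer or reuse an existing token id as the pad token.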

@deep-diver
Contributor Author

@benob You could do something like the following:

def evaluate(instructions, input=None):
    # Build one prompt per instruction.
    prompts = [generate_prompt(instruction) for instruction in instructions]
    # Tokenize the whole batch at once; padding=True pads to the longest prompt.
    encodings = tokenizer(prompts, return_tensors="pt", padding=True).to('cuda')
    generation_outputs = model.generate(
        **encodings,
        generation_config=generation_config,
        max_new_tokens=256
    )

    return tokenizer.batch_decode(generation_outputs)
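
Note that for a decoder-only model, generate returns the prompt tokens followed by the new tokens, so the decoded strings include the prompt. A minimal sketch of dropping the prompt part, using plain lists in place of tensors; strip_prompt and the token ids are hypothetical. It assumes every row has the same padded prompt length, which holds when the batch is padded to a common length.

```python
def strip_prompt(sequences, prompt_length):
    """Keep only the tokens generated after the (padded) prompt."""
    return [list(seq[prompt_length:]) for seq in sequences]


# Hypothetical left-padded prompts (pad id 0) and the sequences returned for them.
padded_prompts = [[0, 1, 9508], [1, 9508, 368]]
generated = [[0, 1, 9508, 11, 12], [1, 9508, 368, 21, 22]]

new_tokens = strip_prompt(generated, prompt_length=len(padded_prompts[0]))
print(new_tokens)  # [[11, 12], [21, 22]]
```

With tensors, the same idea would be a slice along the sequence dimension, starting at the width of the padded input_ids.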

@deep-diver
Contributor Author

I created a Gradio app for this:

https://github.com/deep-diver/Alpaca-LoRA-Serve

@benob

benob commented Mar 17, 2023

Thanks, the problem came from elsewhere. Note that I had to use

tokenizer = LlamaTokenizer.from_pretrained(config.backbone, padding_side='left')

@fringe-k

Hello, I have tried batch decoding with the following settings:
tokenizer.padding_side = "left"
tokenizer.pad_token_id = tokenizer.bos_token_id
But the inference results are quite different from those with batch_size=1. When batch_size>1, the end of the output shows a sequence of "????", e.g. Hello. ?? ?? ??
Do you have any idea about this problem?

@AngainorDev
Contributor

the end of the output shows a sequence of "????" eg. Hello. ?? ?? ??

I had the same issue when using batch decoding and beam search with multiple beams.
I ended up filtering out those (fake) ? characters by calling .replace("\u2047", "").strip() on the outputs.
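
As a rough sketch, that filtering can be applied to a whole batch of decoded strings like this. The helper name clean_outputs and the sample strings are hypothetical; only the replace-and-strip step comes from the comment above.

```python
def clean_outputs(texts, artifact="\u2047"):
    """Remove the U+2047 artifact from each decoded string and trim whitespace."""
    return [text.replace(artifact, "").strip() for text in texts]


# Hypothetical decoded batch: the first item ends with the artifact characters.
decoded = ["Hello. \u2047 \u2047 \u2047", "Hi there."]

cleaned = clean_outputs(decoded)
print(cleaned)  # ['Hello.', 'Hi there.']
```

This only hides the symptom in the decoded text; the extra tokens are still generated.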

@T-Atlas
Contributor

T-Atlas commented Mar 28, 2023

@benob You could do something like the following:

def evaluate(instructions, input=None):
    # Build one prompt per instruction.
    prompts = [generate_prompt(instruction) for instruction in instructions]
    # Tokenize the whole batch at once; padding=True pads to the longest prompt.
    encodings = tokenizer(prompts, return_tensors="pt", padding=True).to('cuda')
    generation_outputs = model.generate(
        **encodings,
        generation_config=generation_config,
        max_new_tokens=256
    )

    return tokenizer.batch_decode(generation_outputs)

Hello @deep-diver, I tried batch decoding with your settings, and it helps performance a lot. But I found something strange: suppose you have four prompts. The results you get by generating for them one at a time differ from the results you get by generating for them in a single batch. I asked a detailed question on the Hugging Face forum and will copy it here later:
https://discuss.huggingface.co/t/results-of-model-generate-are-different-for-different-batch-sizes-of-the-decode-only-model/34878
You can try this in a notebook:

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
from peft import PeftModel
import transformers
import gradio as gr

assert (
        "LlamaTokenizer" in transformers._import_structure["models.llama"]
), "LLaMA is now in HuggingFace's main branch.\nPlease reinstall it: pip uninstall transformers && pip install git+https://github.com/huggingface/transformers.git"
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig

BASE_MODEL = "decapoda-research/llama-7b-hf"
LORA_WEIGHTS = "tloen/alpaca-lora-7b"

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL, cache_dir="./cache/")

if torch.cuda.is_available():
    device = "cuda"
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
    cache_dir="./cache/",
)
model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16)


model.eval()
if torch.__version__ >= "2":
    model = torch.compile(model)
inputs = tokenizer("prompt", return_tensors="pt")
input_ids = inputs["input_ids"].to(device)
input_ids,inputs

(tensor([[ 1, 9508]], device='cuda:0'),
{'input_ids': tensor([[ 1, 9508]]), 'attention_mask': tensor([[1, 1]])})

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))
inputs_b = tokenizer(["prompt", "prompt", "prompt"], return_tensors="pt", padding=True).to(device)
input_idsb = inputs_b["input_ids"].to(device)
input_idsb, inputs_b

(tensor([[ 1, 9508],
         [ 1, 9508],
         [ 1, 9508]], device='cuda:0'),
 {'input_ids': tensor([[ 1, 9508],
         [ 1, 9508],
         [ 1, 9508]], device='cuda:0'), 'attention_mask': tensor([[1, 1],
         [1, 1],
         [1, 1]], device='cuda:0')})

generation_config = GenerationConfig(
    temperature=1,
    top_p=1,
    top_k=50,
    num_beams=1,
    max_new_tokens=128,
)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )
generation_output

(Output truncated.)
GreedySearchDecoderOnlyOutput(sequences=tensor([[ 1, 9508, 368, 322, 29497, 29889, 13, 1576, 6938, 4091,
451, 367, 619, 519, 304, 278, 21886, 363, 738, 6410,
470, 18658, 17654, 491, 278, 21886, 408, 263, 1121, 310,
738, 9055, 297, 278, 28289, 310, 278, 7197, 29879, 313,

s = generation_output.sequences[0]
output = tokenizer.decode(s)
output

" promptly and efficiently.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the Customer has given written notice to the Company of the delay within 7 days of the date when the Goods were due to be delivered.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the"

generation_config = GenerationConfig(
    temperature=1,
    top_p=1,
    top_k=50,
    num_beams=1,
    max_new_tokens=128,
)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_idsb,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
    )
generation_output

(Output truncated.)
GreedySearchDecoderOnlyOutput(sequences=tensor([[ 1, 9508, 368, 322, 29497, 29889, 13, 1576, 6938, 338,
19355, 304, 5662, 3864, 393, 727, 338, 694, 5400, 8370,
1201, 470, 5199, 1020, 600, 860, 292, 297, 967, 11421,
521, 2708, 470, 297, 738, 760, 310, 967, 5381, 29889,
450, 6938, 5936, 4637, 393, 372, 756, 263, 23134, 304,

s = generation_output.sequences
output = tokenizer.batch_decode(s, skip_special_tokens=True)
output

[' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in',
' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in',
' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in']
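
Both runs above report GreedySearchDecoderOnlyOutput, so sampling randomness is not involved. To see where the two runs part ways, here is a small sketch that compares the first token ids printed above for the single-prompt run and the batched run. The helper first_divergence is hypothetical; the ids are copied from the truncated outputs.

```python
def first_divergence(a, b):
    """Return the first index where two token sequences differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other, or they are identical.
    return None if len(a) == len(b) else min(len(a), len(b))


# First tokens of the single-prompt output and of the batched output.
single = [1, 9508, 368, 322, 29497, 29889, 13, 1576, 6938, 4091, 451, 367]
batched = [1, 9508, 368, 322, 29497, 29889, 13, 1576, 6938, 338, 19355, 304]

idx = first_divergence(single, batched)
print(idx)  # 9
```

The sequences agree for the first nine tokens and then split. A plausible but unconfirmed explanation is that small numerical differences between batch shapes in float16 or 8-bit arithmetic flip a near-tie between two candidate tokens, after which greedy decoding follows a different path.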

@louisoutin

louisoutin commented Apr 11, 2023

@benob You could do something like the following:

def evaluate(instructions, input=None):
    # Build one prompt per instruction.
    prompts = [generate_prompt(instruction) for instruction in instructions]
    # Tokenize the whole batch at once; padding=True pads to the longest prompt.
    encodings = tokenizer(prompts, return_tensors="pt", padding=True).to('cuda')
    generation_outputs = model.generate(
        **encodings,
        generation_config=generation_config,
        max_new_tokens=256
    )

    return tokenizer.batch_decode(generation_outputs)

I was using that a few days ago and it worked fine. But now, when generating with batch_size > 1, I get this error:

File "/llama-training/lora_finetuning/lora_finetuning/inference.py", line 156, in __call__
    generation_output = self.model.generate(
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 627, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1524, in generate
    return self.beam_search(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2810, in beam_search
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 289, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 84, in forward
    variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
RuntimeError: CUDA error: device-side assert triggered

Has anyone had the same error and knows how to fix it? (I suspect a version update in the peft or transformers library.)
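
A general debugging note, not a fix: device-side asserts are reported asynchronously, so the Python frame in a traceback like the one above is not necessarily the operation that failed. A common first step is to make kernel launches synchronous by setting CUDA_LAUNCH_BLOCKING before CUDA is initialized. A minimal sketch follows; where it goes in the inference script is an assumption.

```python
import os

# Must be set before the first CUDA call; the safest place is before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

launch_blocking = os.environ["CUDA_LAUNCH_BLOCKING"]
print(launch_blocking)  # 1
```

Running the same batch on CPU is another way to get a readable error, since an out-of-range index raises an ordinary Python exception there.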

@smallccn

Marking this; I'm hitting the same issue on my side...
