anyone tried batch inference? #20
padding_side="left" does the trick.
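For context, a minimal sketch of that setting, assuming a standard Hugging Face tokenizer (the checkpoint name is a placeholder): decoder-only models generate from the end of the sequence, so padding has to go on the left, or generation continues from pad tokens.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; use whichever tokenizer your model actually needs.
tokenizer = AutoTokenizer.from_pretrained("some/causal-lm-checkpoint")

# Put padding on the left so the model sees real tokens at the end
# of every sequence in the batch.
tokenizer.padding_side = "left"

# LLaMA-style tokenizers ship without a pad token; 0 matches the
# pad-token choice mentioned later in this thread.
tokenizer.pad_token_id = 0
```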
I am getting the following error when trying batched inference. Did you need any trick?
@benob You could do something like this:

```python
def evaluate(instructions, input=None):
    prompts = [generate_prompt(instruction) for instruction in instructions]
    encodings = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
    generation_outputs = model.generate(
        **encodings,
        generation_config=generation_config,
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(generation_outputs)
```
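A brief usage sketch of the snippet above; `generate_prompt`, `model`, `tokenizer`, and `generation_config` are assumed to be defined elsewhere in the script:

```python
instructions = [
    "Explain what LoRA is in one sentence.",
    "Give three reasons to batch inference requests.",
]

for text in evaluate(instructions):
    print(text)
```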
I created a Gradio app for this: https://github.com/deep-diver/Alpaca-LoRA-Serve
Thanks, the problem came from elsewhere. Note that I had to use
Hello, I have tried batch decoding, and I set
I had the same issue, using batch decoding and beam search with multiple beams.
Hello @deep-diver, I tried batch decoding with your settings, which helps performance a lot. But I found a strange phenomenon: suppose you have four pieces of content; the results you get by generating them one at a time differ from the results you get by batch-decoding them all at once. I asked a detailed question in the Hugging Face discussion forum; I'll copy it here later.
```
(tensor([[ 1, 9508]], device='cuda:0'),
(tensor([[ 1, 9508],

" promptly and efficiently.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the Customer has given written notice to the Company of the delay within 7 days of the date when the Goods were due to be delivered.\nThe Company shall not be liable to the Customer for any loss or damage suffered by the Customer as a result of any delay in the delivery of the Goods (even if caused by the Company's negligence) unless the"

[' promptly and efficiently.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business. The Company recognises that it has a responsibility to be proactive in ensuring that modern slavery is not taking place within its business or in its supply chains.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in its supply chains or in any part of its business.\nThe Company is committed to ensuring that there is no modern slavery or human trafficking in',
```
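As an aside, a minimal repro sketch for that comparison, assuming greedy decoding and left padding (`model` and `tokenizer` are placeholders). With right padding, or without the attention mask, batched results can legitimately diverge from one-at-a-time generation:

```python
prompts = ["prompt one", "prompt two", "prompt three", "prompt four"]

# Generate one prompt at a time.
single = []
for p in prompts:
    enc = tokenizer(p, return_tensors="pt").to("cuda")
    out = model.generate(**enc, max_new_tokens=64, do_sample=False)
    single.append(tokenizer.decode(out[0], skip_special_tokens=True))

# Generate the whole batch at once.
enc = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
out = model.generate(**enc, max_new_tokens=64, do_sample=False)
batched = tokenizer.batch_decode(out, skip_special_tokens=True)

# With correct left padding and masking, these should match.
for s, b in zip(single, batched):
    print(s == b)
```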
I was using this a few days ago and it was working fine. But now, when generating with batch_size > 1, I get this error:
Has anyone had the same error and figured out how to fix it? (I suspect a version update in the peft or transformers library.)
Marking this; I'm seeing the same issue on my side...
When I set the pad token to 0 and padding=True, the generated text for the padded prompt always shows
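A guess rather than a confirmed fix, but worth checking when padded prompts generate garbage: make sure the attention mask reaches `generate` along with the pad token id, so padded positions are actually masked out. A sketch with placeholder names:

```python
enc = tokenizer(prompts, return_tensors="pt", padding=True)
print(enc["attention_mask"])  # padded positions should be 0 here

outputs = model.generate(
    input_ids=enc["input_ids"].to("cuda"),
    attention_mask=enc["attention_mask"].to("cuda"),
    pad_token_id=0,  # matches the pad token set above
    max_new_tokens=64,
)
```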