support whisper pipeline #1160

Merged
11 changes: 11 additions & 0 deletions .github/workflows/llm_bench-python.yml
@@ -82,6 +82,17 @@ jobs:
run: |
wget -O ./ov_models/soulcard.safetensors https://civitai.com/api/download/models/72591
python ./tools/llm_bench/benchmark.py -m ./ov_models/dreamlike-art-dreamlike-anime-1.0/FP16/ -pf ./tools/llm_bench/prompts/stable-diffusion.jsonl -d cpu -n 1 --genai --lora ./ov_models/soulcard.safetensors --lora_alphas 0.7
- name: Test whisper-tiny on Linux
run: |
GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 --branch main --single-branch https://huggingface.co/datasets/facebook/multilingual_librispeech
cd multilingual_librispeech
git lfs pull -I /data/mls_polish/train/audio/3283_1447_000.tar.gz
mkdir data/mls_polish/train/audio/3283_1447_000
tar zxvf data/mls_polish/train/audio/3283_1447_000.tar.gz -C data/mls_polish/train/audio/3283_1447_000/
cd ..
optimum-cli export openvino --trust-remote-code --model openai/whisper-tiny ./ov_models/whisper-tiny
python ./tools/llm_bench/benchmark.py -m ./ov_models/whisper-tiny --media multilingual_librispeech/data/mls_polish/train/audio/3283_1447_000/3283_1447_000000.flac -d cpu -n 1
python ./tools/llm_bench/benchmark.py -m ./ov_models/whisper-tiny --media multilingual_librispeech/data/mls_polish/train/audio/3283_1447_000/3283_1447_000000.flac -d cpu -n 1 --genai
- name: WWB Tests
run: |
GIT_CLONE_PROTECTION_ACTIVE=false pip install -r ${{ env.WWB_PATH }}/requirements.txt
2 changes: 1 addition & 1 deletion tools/llm_bench/README.md
@@ -170,4 +170,4 @@ OpenVINO is by default built with [oneTBB](https://github.com/oneapi-src/oneTBB/
## 7. Additional Resources

- **Error Troubleshooting:** Check the [NOTES.md](./doc/NOTES.md) for solutions to known issues.
- **Image Generation Configuration:** Refer to [IMAGE_GEN.md](./doc/IMAGE_GEN.md) for setting parameters for image generation models.
- **Prompt File Syntax and Attributes:** Refer to [PROMPT.md](./doc/PROMPT.md) for how to write a prompt file.
3 changes: 3 additions & 0 deletions tools/llm_bench/benchmark.py
@@ -15,6 +15,7 @@
import task.text_generation as bench_text
import task.image_generation as bench_image
import task.super_resolution_generation as bench_ldm_sr
import task.speech_to_text_generation as bench_speech

DEFAULT_TORCH_THREAD_NUMS = 16
mem_consumption = MemConsumption()
@@ -46,6 +47,7 @@ def get_argprser():
help='Prompt file(s) in jsonl format. Multiple prompt files should be separated with space(s).')
parser.add_argument('-pi', '--prompt_index', nargs='+', type=num_iters_type, default=None,
help='Run the specified prompt index. You can specify multiple prompt indexes, separated by spaces.')
parser.add_argument('--media', default=None, help='Media file path for speech or visual models.')
parser.add_argument(
'-ic',
'--infer_count',
Expand Down Expand Up @@ -153,6 +155,7 @@ def get_argprser():
'image_gen': bench_image.run_image_generation_benchmark,
'code_gen': bench_text.run_text_generation_benchmark,
'ldm_super_resolution': bench_ldm_sr.run_ldm_super_resolution_benchmark,
'speech2text': bench_speech.run_speech_2_txt_benchmark,
}


74 changes: 74 additions & 0 deletions tools/llm_bench/doc/NOTES.md
@@ -0,0 +1,74 @@
# Notes
## chatglm2-6b - AttributeError: can't set attribute
Download chatglm2-6b from Hugging Face, convert it to OpenVINO IR files, and run it with benchmark.py; the following error may occur:
```bash
AttributeError: can't set attribute
```
Reproduced with https://huggingface.co/THUDM/chatglm2-6b 7fabe56db91e085c9c027f56f1c654d137bdba40 <br />
See https://huggingface.co/THUDM/chatglm2-6b/discussions/99 <br />
Solution: update `tokenization_chatglm.py` as follows: <br />
```Python
self.vocab_file = vocab_file
self.tokenizer = SPTokenizer(vocab_file)
+ kwargs.pop("eos_token", None)
+ kwargs.pop("pad_token", None)
+ kwargs.pop("unk_token", None)
self.special_tokens = {
"<bos>": self.tokenizer.bos_id,
"<eos>": self.tokenizer.eos_id,
```

> The solution works for chatglm3-6b as well.

## Qwen-7B-Chat-Int4 - Torch not compiled with CUDA enabled
When converting Qwen-7B-Chat-Int4 to OpenVINO IR files with convert.py, the following error may occur:
```bash
raise AssertionError("Torch not compiled with CUDA enabled")
```
Reproduced with https://huggingface.co/Qwen/Qwen-7B-Chat-Int4 8750247cc50f2a7bb84bef322f7707159b700723 <br />
Solution: update `modeling_qwen.py` as follows: <br />
```Python
-SUPPORT_CUDA = torch.cuda.is_available()
+SUPPORT_CUDA = False
SUPPORT_BF16 = SUPPORT_CUDA and torch.cuda.is_bf16_supported()
```

## Baichuan2-7B-Chat - AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'
When converting Baichuan2-7B-Chat to OpenVINO IR files with convert.py, the following error may occur:
```bash
AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'
```
Reproduced with https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat 84603cde5ebffb6084e476cfaeceaf0b8b91fe54 <br />
See https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/discussions/2 <br />
Solution: update `tokenization_baichuan.py` as follows: <br />
```Python
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
+ self.vocab_file = vocab_file
+ self.add_bos_token = add_bos_token
+ self.add_eos_token = add_eos_token
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+ self.sp_model.Load(vocab_file)
super().__init__(
bos_token=bos_token,
eos_token=eos_token,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
**kwargs,
)
- self.vocab_file = vocab_file
- self.add_bos_token = add_bos_token
- self.add_eos_token = add_eos_token
- self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
- self.sp_model.Load(vocab_file)
```

## CompressWeights Mode INT4 - ConnectionError: Couldn't reach 'wikitext' on the Hub (SSLError)
When downloading an LLM from Hugging Face, converting it to OpenVINO IR files with convert.py, and compressing weights to INT4, the following error may occur:
```bash
raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
ConnectionError: Couldn't reach 'wikitext' on the Hub (SSLError)
```
Root cause: the wikitext dataset was not downloaded correctly, or the Hugging Face Hub could not be reached. <br />
Solution: <br />
Refer to https://huggingface.co/docs/datasets/loading#arrow, copy the wikitext dataset to the `~/.cache/huggingface/datasets/` folder, and set the environment variable `HF_DATASETS_OFFLINE` to 1.
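
A minimal sketch of that offline setup, assuming the `datasets` package is installed and the wikitext files are already present in the local cache (the config name `wikitext-2-raw-v1` below is only an example):
```python
import os

# Use only the local dataset cache; no requests go to the Hugging Face Hub.
# Must be set before the datasets library is imported.
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset

# Assumes the wikitext dataset was already copied into
# ~/.cache/huggingface/datasets/ as described above.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(f"Loaded {len(wikitext)} samples from the local cache")
```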
39 changes: 39 additions & 0 deletions tools/llm_bench/doc/PROMPT.md
@@ -0,0 +1,39 @@
> [!NOTE]
> Currently llm_bench only supports prompt files in JSON Lines format (suffix `.jsonl`).
> If there is no prompt file, the default value is used.
> A prompt file can contain multiple prompts; specify which prompt to run with the option `--prompt_index`.

## 1. Text Generation
Supported parameters that can be set are:
* `prompt` - input prompt text for the text generation
Prompt file example:
{"prompt": "what is openvino?"}
{"prompt": "A chat between a curious user and an artificial intelligence assistant."}

## 2. Stable Diffusion
Supported parameters that can be set are:
* `steps` - inference steps (default 20)
* `width` - resolution width (default 512)
* `height` - resolution height (default 512)
* `guidance_scale` - guidance scale
* `prompt` - input prompt text for the image generation
Prompt file example:
{"steps":"10", "width":"256", "height":"256", "guidance_scale":"1.0", "prompt": "side profile centered painted portrait, Gandhi rolling a blunt, Gloomhaven, matte painting concept art, art nouveau, 8K HD Resolution, beautifully background"}

## 3. LDM Super Resolution
Supported parameters that can be set are:
* `steps` - inference steps (default 50)
* `width` - resize image width (default 128)
* `height` - resize image height (default 128)
* `prompt` - image path
Prompt file example:
{"steps": "20", "width": "256", "height": "256", "prompt": "./image_256x256_size/4.png"}

## 4. Whisper
Supported parameters that can be set are:
* `media` - audio file path
* `language` - language of the audio (default `<|en|>`)
* `timestamp` - whether to return timestamps (default true)
Prompt file example:
{"media": "./audio/intel_ad_90s_128kbps.mp3", "language": "<|en|>", "timestamp":false}
{"media": "./audio/intel_ad_120s_128kbps.mp3", "language": "<|en|>", "timestamp":true}
6 changes: 4 additions & 2 deletions tools/llm_bench/llm_bench_utils/config_class.py
@@ -9,7 +9,8 @@
OVModelForSeq2SeqLM,
OVStableDiffusionPipeline,
OVLatentConsistencyModelPipeline,
OVStableDiffusionXLPipeline
OVStableDiffusionXLPipeline,
OVModelForSpeechSeq2Seq
)
from llm_bench_utils.ov_model_classes import OVMPTModel, OVLDMSuperResolutionPipeline, OVChatGLMModel

@@ -41,6 +42,7 @@
'chatglm2': OVModelForCausalLM,
'chatglm3': OVModelForCausalLM,
'chatglm': OVChatGLMModel,
'whisper': OVModelForSpeechSeq2Seq,
}

PT_MODEL_CLASSES_MAPPING = {
@@ -56,7 +58,7 @@

USE_CASES = {
'image_gen': ['stable-diffusion-', 'ssd-', 'deepfloyd-if', 'tiny-sd', 'small-sd', 'lcm-', 'sdxl', 'dreamlike'],
'text2speech': ['whisper'],
'speech2text': ['whisper'],
'image_cls': ['vit'],
'code_gen': ['replit', 'codegen2', 'codegen', 'codet5', "stable-code"],
'text_gen': [
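
The `USE_CASES` table is presumably consulted by substring matching against the model name; a small sketch under that assumption, showing how a Whisper model would now resolve to the `speech2text` case (the `resolve_use_case` helper is illustrative, not the actual lookup code):
```python
# Trimmed copy of the mapping from the diff above; only two entries are shown.
USE_CASES = {
    'image_gen': ['stable-diffusion-', 'ssd-', 'lcm-', 'sdxl', 'dreamlike'],
    'speech2text': ['whisper'],
}

def resolve_use_case(model_name):
    """Illustrative lookup: first use case whose keyword occurs in the model name."""
    lowered = model_name.lower()
    for use_case, keywords in USE_CASES.items():
        if any(keyword in lowered for keyword in keywords):
            return use_case
    return 'text_gen'  # assumed fallback

print(resolve_use_case('openai/whisper-tiny'))  # -> speech2text
```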
23 changes: 4 additions & 19 deletions tools/llm_bench/llm_bench_utils/gen_output_data.py
@@ -5,7 +5,6 @@

def gen_iterate_data(
iter_idx='',
loop_idx='',
in_size='',
infer_count='',
out_size='',
@@ -17,37 +16,23 @@ def gen_iterate_data(
max_uss_mem='',
prompt_idx='',
tokenization_time=[],
loop_data=None
):
iter_data = {}
iter_data['iteration'] = iter_idx
iter_data['loop_idx'] = loop_idx
iter_data['input_size'] = in_size
iter_data['infer_count'] = infer_count
iter_data['output_size'] = out_size
iter_data['generation_time'] = gen_time
iter_data['latency'] = latency
iter_data['result_md5'] = res_md5
iter_data['first_token_latency'] = ''
iter_data['other_tokens_avg_latency'] = ''
iter_data['first_token_infer_latency'] = ''
iter_data['other_tokens_infer_avg_latency'] = ''
iter_data['max_rss_mem_consumption'] = max_rss_mem
iter_data['max_shared_mem_consumption'] = max_shared_mem
iter_data['max_uss_mem_consumption'] = max_uss_mem
iter_data['prompt_idx'] = prompt_idx
iter_data['tokenization_time'] = tokenization_time[0] if len(tokenization_time) > 0 else ''
iter_data['detokenization_time'] = tokenization_time[1] if len(tokenization_time) > 1 else ''

if loop_data is not None:
iter_data['enc_token_latency'] = loop_data['enc_token_time']
iter_data['enc_infer_latency'] = loop_data['enc_infer_time']
iter_data['first_token_latency'] = loop_data['dec_1st_token_time']
iter_data['other_tokens_avg_latency'] = loop_data['dec_2nd_tokens_time']
iter_data['first_token_infer_latency'] = loop_data['dec_1st_infer_time']
iter_data['other_tokens_infer_avg_latency'] = loop_data['dec_2nd_infers_time']
else:
iter_data['enc_token_latency'] = ''
iter_data['enc_infer_latency'] = ''
iter_data['first_token_latency'] = ''
iter_data['other_tokens_avg_latency'] = ''
iter_data['first_token_infer_latency'] = ''
iter_data['other_tokens_infer_avg_latency'] = ''

return iter_data
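
A sketch of how the new `loop_data` argument might be filled at the call site, with one entry per Whisper loop; the keys match the diff above, but the numbers and the import path are illustrative:
```python
# Assumes the script runs from tools/llm_bench so the module is importable.
from llm_bench_utils.gen_output_data import gen_iterate_data

# Per-loop latencies as produced by the Whisper hooks (values are made up).
loop_data = {
    'enc_token_time': 12.3,        # encoder token latency, ms
    'enc_infer_time': 10.1,        # encoder infer latency, ms
    'dec_1st_token_time': 151.4,   # first decoder token latency, ms
    'dec_2nd_tokens_time': 25.7,   # average latency of the other decoder tokens, ms
    'dec_1st_infer_time': 149.8,
    'dec_2nd_infers_time': 24.2,
}

iter_data = gen_iterate_data(iter_idx=1, in_size=3000, out_size=64,
                             gen_time=2.1, prompt_idx=0, loop_data=loop_data)
print(iter_data['first_token_latency'], iter_data['other_tokens_avg_latency'])
```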
136 changes: 136 additions & 0 deletions tools/llm_bench/llm_bench_utils/hook_forward_whisper.py
@@ -0,0 +1,136 @@
import time
import copy
import llm_bench_utils.hook_greedy_search


class WhisperHook:
def __init__(self):
self.enc_infer_count = 0
self.time_data = []
self.latency_list = []
self.tm_list = []
self.tm_infer_list = []
self.greedy_hook = None

def get_time_list(self):
first_token_latency = 0
for data in self.time_data:
if 'enc_token_time' in data:
first_token_latency += data['enc_token_time']
if 'dec_token_time' in data:
first_token_latency += data['dec_token_time'][0]
self.tm_list.extend(copy.deepcopy(data['dec_token_time'][1:]))
self.tm_list.insert(0, first_token_latency)
return self.tm_list

def get_time_infer_list(self):
first_infer_latency = 0
for data in self.time_data:
if 'enc_infer_time' in data:
first_infer_latency += data['enc_infer_time']
if 'dec_infer_time' in data:
first_infer_latency += data['dec_infer_time'][0]
self.tm_infer_list.extend(copy.deepcopy(data['dec_infer_time'][1:]))
self.tm_infer_list.insert(0, first_infer_latency)
return self.tm_infer_list

def get_whisper_latency(self):
self.latency_list.clear()
for data in self.time_data:
latency_data = {}
if 'enc_token_time' in data and 'enc_infer_time' in data:
latency_data['enc_token_time'] = round(data['enc_token_time'] * 1000, 2)
latency_data['enc_infer_time'] = round(data['enc_infer_time'] * 1000, 2)
if 'dec_token_time' in data:
dec_token_count = len(data['dec_token_time'])
dec_infer_count = len(data['dec_infer_time'])
latency_data['dec_token_count'] = dec_token_count
latency_data['dec_infer_count'] = dec_infer_count
latency_data['dec_1st_token_time'] = round(data['dec_token_time'][0] * 1000, 2) if dec_token_count > 0 else 'NA'
latency_data['dec_2nd_tokens_time'] = round(sum(data['dec_token_time'][1:]) * 1000 / (dec_token_count - 1), 2) if dec_token_count > 1 else 'NA'
latency_data['dec_1st_infer_time'] = round(data['dec_infer_time'][0] * 1000, 2) if dec_infer_count > 0 else 'NA'
latency_data['dec_2nd_infers_time'] = round(sum(data['dec_infer_time'][1:]) * 1000 / (dec_infer_count - 1), 2) if dec_infer_count > 1 else 'NA'
self.latency_list.append(latency_data)

def print_whisper_latency(self, iter, prompt_idx):
self.get_whisper_latency()
str = ''
for idx, data in enumerate(self.latency_list):
title = f'[ INFO ] [{iter}][P{prompt_idx}][L{idx}]'
if 'enc_token_time' in data and 'enc_infer_time' in data:
str += \
f"{title} encoder token latency: {data['enc_token_time']:.2f} ms/token, " \
f"encoder infers latency: {data['enc_infer_time']:.2f} ms/infer"
if 'dec_1st_token_time' in data and 'dec_2nd_tokens_time' in data:
str += \
f"\n{title} decoder first token latency: {data['dec_1st_token_time']} ms/token, " \
f"decoder other tokens latency: {data['dec_2nd_tokens_time']} ms/token, " \
f"decoder tokens count: {data['dec_token_count']}\n"
if 'dec_1st_infer_time' in data and 'dec_2nd_infers_time' in data:
str += \
f"{title} decoder first infer latency: {data['dec_1st_infer_time']} ms/infer, " \
f"decoder other infers latency: {data['dec_2nd_infers_time']} ms/infer, " \
f"decoder infers count: {data['dec_infer_count']}"
if idx < len(self.latency_list) - 1:
str += '\n'
return str

def clear_statistics(self):
self.enc_infer_count = 0
self.time_data.clear()
self.tm_list.clear()
self.tm_infer_list.clear()
if self.greedy_hook is not None:
self.greedy_hook.clear_time_list()
self.greedy_hook.clear_time_infer_list()

def new_text_encoder(self, pipe):
old_text_encoder = pipe.model.encoder.forward

def my_text_encoder(*args, **kwargs):
t1 = time.time()
r = old_text_encoder(*args, **kwargs)
t2 = time.time()
text_encoder_token_time = t2 - t1
if self.enc_infer_count > 0:
prev_loop_data = self.time_data[self.enc_infer_count - 1]
prev_loop_data['enc_token_time'] = text_encoder_token_time
return r
pipe.model.encoder.forward = my_text_encoder

def new_text_encoder_request(self, pipe):
old_text_encoder_request = pipe.model.encoder.request

def my_text_encoder_request(*args, **kwargs):
loop_data = {}
t1 = time.time()
r = old_text_encoder_request(*args, **kwargs)
t2 = time.time()
text_encoder_infer_time = t2 - t1
loop_data['enc_infer_time'] = text_encoder_infer_time
self.time_data.append(loop_data)
self.enc_infer_count += 1
return r
pipe.model.encoder.request = my_text_encoder_request

def new_text_sample(self, pipe):
self.greedy_hook = llm_bench_utils.hook_greedy_search.GreedySearchHook()
self.greedy_hook.new_forward(pipe.model)

def new_generate(self, pipe):
old_generate = pipe.model.generate

def my_generate(attention_mask, **kwargs):
r = old_generate(attention_mask, **kwargs)
self.set_decoder_time_data()
return r
pipe.model.generate = my_generate

def set_decoder_time_data(self):
if self.enc_infer_count > 0:
prev_loop_data = self.time_data[self.enc_infer_count - 1]
prev_loop_data['dec_token_time'] = copy.deepcopy(self.greedy_hook.get_time_list())
prev_loop_data['dec_infer_time'] = copy.deepcopy(self.greedy_hook.get_time_infer_list())
if self.greedy_hook is not None:
self.greedy_hook.clear_time_list()
self.greedy_hook.clear_time_infer_list()
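
A hedged sketch of how these hooks would presumably be attached to a pipeline-like object that exposes `pipe.model`, as the methods above expect; the helper and its caller are illustrative, not the benchmark's actual wiring:
```python
def attach_whisper_hooks(pipe):
    """Install all timing hooks on a pipeline-like object exposing `pipe.model`."""
    hook = WhisperHook()
    hook.new_text_encoder(pipe)          # times encoder forward() calls
    hook.new_text_encoder_request(pipe)  # times encoder request() calls
    hook.new_text_sample(pipe)           # installs the greedy-search decoder hook
    hook.new_generate(pipe)              # snapshots decoder timings after generate()
    return hook
```
After a transcription run, `get_time_list()` and `get_time_infer_list()` return per-token latencies with the encoder time folded into the first entry, `print_whisper_latency(iter, prompt_idx)` formats them for logging, and `clear_statistics()` resets the hook between prompts.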