Hello.
I want to run an on-device sLM on the NPU of an "Intel(R) Core(TM) Ultra 5" processor.
However, while I have confirmed that the code below works on the CPU and iGPU, no answer is produced when I select the NPU.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
import time
def make_template(context):
    # Build the translation prompt and tokenize it with the model's chat template.
    instruction = f"""You are an assistant who translates meeting contents.
Translate the meeting contents given after #Context into English.
#Context:{context}
#Translation:"""
    messages = [{"role": "user", "content": instruction}]
    input_ids = tokenizer.apply_chat_template(messages,
                                              add_generation_prompt=True,
                                              return_tensors="pt")
    return input_ids
def translate(context):
    # Generate a completion and decode only the newly generated tokens.
    input_ids = make_template(context=context)
    outputs = model.generate(input_ids,
                             max_new_tokens=max_new_tokens,
                             do_sample=do_sample,
                             temperature=temperature,
                             top_p=top_p)
    answer = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return answer.rstrip()
if __name__ == "__main__":
    model_id = "AIFunOver/gemma-2-2b-it-openvino-8bit"
    model = OVModelForCausalLM.from_pretrained(model_id, device="npu")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"Model Device : {model.device}")

    max_new_tokens = 1024
    do_sample = False
    temperature = 0.1
    top_p = 0.9

    context = '''A: Hello.
B: Oh, yes, hello. I'm contacting you because I have a question. They're doing water pipe construction in my neighborhood, and I'm curious as to how long it will take.
A: Where is your area?
B: Daejeon Byeundae-dong.
A: The construction will continue until tomorrow, sir.
B: Oh really? Oh, but won't there be muddy water after the construction is over?
A: It's better to let out enough water before using it after the construction is over, sir.
B: How much water should I drain?
A: Let out for 2~3 minutes.
B: Okay, I understand. Then, can there be another problem?
A: The water pressure may temporarily drop slightly.
B: Temporarily?
A: Yes, it's a temporary phenomenon and will return to normal pressure right away.
B: What should I do if it lasts a long time?
A: In that case, you can report it to the Waterworks Headquarters.
B: Yes, I understand.
B: But they say it's going to rain tomorrow, so can the construction be finished tomorrow? I think they usually don't do construction on rainy days?
A: In case of rain, construction may be slightly delayed. If it doesn't rain too much, construction will proceed as scheduled. Customer, please don't worry too much.
B: Oh, yes, I understand. Thank you.
A: Yes, thank you.'''
    start_time = time.time()
    generated_text = translate(context)
    end_time = time.time()
    print("generated_text:", generated_text)

    num_generated_tokens = len(tokenizer.tokenize(generated_text))
    total_time = end_time - start_time
    # Note: this value is the average time per generated token (seconds/token), not tokens/second.
    avg_token_speed = total_time / num_generated_tokens if num_generated_tokens > 0 else float('inf')
    print(f"Total Inference Time : {total_time} s")
    print(f"Average token generation speed: {avg_token_speed:.4f} seconds/token")
However, the NPU does appear in the list of devices that OpenVINO reports as available.
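For reference, the device list can be confirmed with the standard OpenVINO runtime API; a minimal sketch (using openvino.Core().available_devices, independent of the script above):

import openvino as ov

core = ov.Core()
# Prints the device plugins visible to the runtime, e.g. ['CPU', 'GPU', 'NPU']
print(core.available_devices)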
If there is a way to run this model on the NPU, could you let me know?
Thank you.