This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

[Model Enabling] llama3-8b-instruct-chat Enabling #225

Merged · 10 commits · Apr 19, 2024

Conversation

@Zhenzhong1 (Contributor) commented Apr 18, 2024

Type of Change

Supported llama3

Description

  • Validated model: llama3_8b_instruct-chat
  • Supported MHA (multi-head attention) to accelerate llama3_8b_instruct-chat
  • Supported FFN (feed-forward network) to accelerate llama3_8b_instruct-chat
  • Q4_J inference passes
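Q4_J is neural_speed's 4-bit weight-only quantization format; its exact bit layout lives in the C++ kernels, but the general idea behind group-wise int4 weight quantization can be sketched in plain Python (an illustrative sketch only — the group size, symmetric scaling, and function names here are assumptions, not the real Q4_J layout):

```python
def quantize_int4(weights, group_size=32):
    """Group-wise symmetric int4 quantization: each group of `group_size`
    weights shares one floating-point scale; values map to ints in [-8, 7]."""
    quants, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        quants.append([max(-8, min(7, round(w / scale))) for w in group])
    return quants, scales

def dequantize_int4(quants, scales):
    """Recover approximate fp weights: integer value times its group scale."""
    return [q * s for group, s in zip(quants, scales) for q in group]

weights = [0.12, -0.5, 0.33, 0.07, -0.21, 0.44, -0.05, 0.18]
q, s = quantize_int4(weights, group_size=4)
restored = dequantize_int4(q, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The reconstruction error per weight is bounded by half a quantization step (scale / 2), which is why smaller group sizes trade more scale storage for better accuracy.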

Expected Behavior & Potential Risk

N/A

How has this PR been tested?

Perf: -m 0 -C 0-55 m4
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
model.generate(inputs, streamer=streamer, max_new_tokens=33, threads=56, ctx_size=1062, do_sample=False)
32 in 32 out

1024 in 32 out
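As a quick sanity check on the parameters above (assuming ctx_size, the prompt lengths, and max_new_tokens are all counted in tokens), the context window must cover the prompt plus all newly generated tokens:

```python
# Generation settings from the perf runs above.
ctx_size = 1062
max_new_tokens = 33

# The two prompt lengths tested: 32-token and 1024-token inputs.
for prompt_tokens in (32, 1024):
    needed = prompt_tokens + max_new_tokens
    assert needed <= ctx_size, f"ctx_size too small: need {needed}"
    print(f"{prompt_tokens} in: {needed}/{ctx_size} context slots used")
```

The 1024-token case leaves only 5 tokens of headroom, so a longer prompt or larger max_new_tokens would require raising ctx_size.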

model.init(model_name, weight_dtype="int4", compute_dtype="int8", scale_dtype="bf16")
(benchmark screenshots)

Inference screenshots:
FP32: (screenshot)
Q4_J: (screenshot)

Dependency Change?

N/A

@Zhenzhong1 changed the title from [Model Enabling] llama3 Enabling to [Model Enabling] llama3-8b-instruct-chat Enabling Apr 18, 2024
@kevinintel (Contributor):
Please update the supported-model list.

neural_speed/models/llama/llama.cpp
import faulthandler
import functools
import itertools
import json
Contributor:
How about reusing the existing llama convert script rather than adding a new one?

Contributor Author:
Will merge into one file when refactoring.

@zhentaoyu (Contributor) left a comment:
Add <eot_id> processing for the end of each message turn; see https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/ and ggerganov/llama.cpp#6751 (comment).
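The Meta Llama 3 prompt format referenced here wraps each turn in header tokens and terminates it with <|eot_id|>, so generation should stop on <|eot_id|> in addition to the regular EOS token. A minimal sketch of assembling such a prompt (string assembly only; the function name is illustrative, and mapping these markers to token IDs and stop criteria is left to the converter/tokenizer):

```python
def build_llama3_prompt(messages):
    """Assemble a Llama 3 chat prompt: each (role, content) turn is wrapped in
    <|start_header_id|>role<|end_header_id|> and closed with <|eot_id|>;
    a trailing assistant header cues the model to respond."""
    parts = ["<|begin_of_text|>"]
    for role, content in messages:
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = build_llama3_prompt([("user", "Hello!")])
```

Without <|eot_id|> handling, the model keeps generating past the end of its turn, which is the failure mode the linked llama.cpp issue describes.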

@a32543254 (Contributor) left a comment:
LGTM

@VincyZhang VincyZhang merged commit fb7d16d into main Apr 19, 2024
11 checks passed
5 participants