Update cached models and benchmarks (#705)
* fix(setup): specify protobuf minimum version

A minimum version is required by torch_neuronx (but not enforced).

* feat(decoder): always fuse qkv

* perf(decoder): use newest models in benchmarks

* ci: update llm cache files

* test(decoder): use llama model compatible with fuse_qkv

* test(docker): update mistral expectation

* chore(ami): use AWS DLAMI 2.20

* test(tgi): update decode expectation
dacorvo authored Sep 27, 2024
1 parent 6ba8b37 commit 7180d48
Showing 33 changed files with 42 additions and 258 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/inference_cache_llm.yml
@@ -21,9 +21,9 @@ jobs:
matrix:
config: [
gpt2,
llama3-8b,
llama,
llama3.1-70b,
llama3-70b,
llama2-7b-13b,
llama2-70b,
mistral,
llama-variants,
43 changes: 0 additions & 43 deletions benchmark/text-generation/llama2-7b.py

This file was deleted.

@@ -8,16 +8,14 @@


def main():
NUM_CORES = 8
NUM_CORES = 12
num_cores = get_available_cores()
if num_cores < NUM_CORES:
raise ValueError(f"This benchmark can only run on an instance with at least {NUM_CORES} cores.")

model_configurations = {
"Llama-2-13B-BS1": ["meta-llama/Llama-2-13b-chat-hf", 1, 4096],
"Llama-2-13B-BS4": ["meta-llama/Llama-2-13b-chat-hf", 4, 4096],
"Llama-2-13B-BS8": ["meta-llama/Llama-2-13b-chat-hf", 8, 4096],
"Llama-2-13B-BS16": ["meta-llama/Llama-2-13b-chat-hf", 16, 4096],
"Mistral-Small-2409-BS1": ["mistralai/Mistral-Small-Instruct-2409", 1, 4096],
"Mistral-Small-2409-BS4": ["mistralai/Mistral-Small-Instruct-2409", 4, 4096],
}

for model_name, model_configuration in model_configurations.items():
@@ -27,7 +25,7 @@ def main():
export=True,
batch_size=batch_size,
sequence_length=seq_length,
auto_cast_type="fp16",
auto_cast_type="bf16",
num_cores=NUM_CORES,
)
with TemporaryDirectory() as tmpdir:
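The call that consumes these arguments is cut off by the hunk above. As a rough sketch only (assuming the usual optimum-neuron export path through `NeuronModelForCausalLM.from_pretrained`, which is not visible in the hunk), the updated benchmark loop would look something like this:

```python
# Sketch of the export step driven by the configuration dict above.
# Assumption: the elided call is NeuronModelForCausalLM.from_pretrained,
# the usual optimum-neuron export entry point; the benchmark loop itself
# is not shown here.
from tempfile import TemporaryDirectory

from optimum.neuron import NeuronModelForCausalLM

NUM_CORES = 12

model_configurations = {
    "Mistral-Small-2409-BS1": ["mistralai/Mistral-Small-Instruct-2409", 1, 4096],
    "Mistral-Small-2409-BS4": ["mistralai/Mistral-Small-Instruct-2409", 4, 4096],
}

for model_name, (model_id, batch_size, seq_length) in model_configurations.items():
    model = NeuronModelForCausalLM.from_pretrained(
        model_id,
        export=True,                 # compile the checkpoint for Neuron
        batch_size=batch_size,
        sequence_length=seq_length,
        auto_cast_type="bf16",       # matches the change in this commit
        num_cores=NUM_CORES,
    )
    with TemporaryDirectory() as tmpdir:
        # save the compiled artifacts, then benchmark from that directory
        model.save_pretrained(tmpdir)
```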
43 changes: 0 additions & 43 deletions benchmark/text-generation/mistralv2.py

This file was deleted.

Binary file removed docs/assets/benchmarks/inferentia-llama2-7b/ttft.png
Binary file removed docs/assets/benchmarks/inferentia-llama3-8b/ttft.png
12 changes: 4 additions & 8 deletions docs/source/_toctree.yml
@@ -46,14 +46,10 @@
title: NeuronX Text-generation-inference for AWS inferentia2
title: How-To Guides
- sections:
- local: benchmarks/inferentia-llama2-7b
title: Llama2 7b on AWS Inferentia2
- local: benchmarks/inferentia-llama2-13b
title: Llama2 13b on AWS Inferentia2
- local: benchmarks/inferentia-mistral-v2
title: Mistral v0.2 7b on AWS Inferentia2
- local: benchmarks/inferentia-llama3-8b
title: Llama-3 8B on AWS Inferentia2
- local: benchmarks/inferentia-mistral-small
title: Mistral Small on AWS Inferentia2
- local: benchmarks/inferentia-llama3.1-8b
title: Llama-3.1 8B on AWS Inferentia2
title: Benchmarks
- sections:
- local: community/contributing
60 changes: 0 additions & 60 deletions docs/source/benchmarks/inferentia-llama2-13b.mdx

This file was deleted.

@@ -14,19 +14,19 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

# Llama-3-8b performance on AWS Inferentia2 (Latency & Throughput)
# Llama-3.1-8b performance on AWS Inferentia2 (Latency & Throughput)

How fast is Llama-3-8b on Inferentia2? Let's find out!
How fast is Llama-3.1-8b on Inferentia2? Let's find out!

For this benchmark we will use the following configurations:

| Model type | batch_size | sequence_length |
|----------------|------------|-----------------|
| Llama3 8b BS1 | 1 | 4096 |
| Llama3 8b BS4 | 4 | 4096 |
| Llama3 8b BS8 | 8 | 4096 |
| Llama3 8b BS16 | 16 | 4096 |
| Llama3 8b BS32 | 32 | 4096 |
| Model type | batch_size | sequence_length |
|------------------|------------|-----------------|
| Llama3.1 8b BS1 | 1 | 4096 |
| Llama3.1 8b BS4 | 4 | 4096 |
| Llama3.1 8b BS8 | 8 | 4096 |
| Llama3.1 8b BS16 | 16 | 4096 |
| Llama3.1 8b BS32 | 32 | 4096 |

*Note: all models are compiled to use 4 devices corresponding to 8 cores on the `inf2.48xlarge` instance.*

@@ -41,15 +41,15 @@ We test the time to first token for increasing context sizes, from a typical Q/A

Time to first token is expressed in **seconds**.

![Llama3 8b inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3-8b/ttft.png "Time to first token")
![Llama3.1 8b inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3.1-8b/ttft.png "Time to first token")

## Inter-token Latency

The inter-token latency corresponds to the average time elapsed between two generated tokens.

It is expressed in **milliseconds**.

![Llama3 8b inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3-8b/latency.png "Inter-token latency")
![Llama3.1 8b inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3.1-8b/latency.png "Inter-token latency")

### Throughput

@@ -58,4 +58,4 @@ by the end-to-end latency.

Throughput is expressed in **tokens/second**.

![Llama3 8b inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3-8b/throughput.png "Throughput")
![Llama3.1 8b inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama3.1-8b/throughput.png "Throughput")
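The three metrics reported in this page follow directly from the definitions above. As an illustration only (this helper is hypothetical and not part of the repository), the per-request arithmetic is:

```python
# Illustrative only: computes the three benchmark metrics from raw timings,
# following the definitions given in the page above.

def benchmark_metrics(token_timestamps: list[float], request_start: float) -> dict:
    """token_timestamps: wall-clock time at which each generated token was emitted."""
    end_to_end_latency = token_timestamps[-1] - request_start           # seconds
    ttft = token_timestamps[0] - request_start                          # seconds
    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    inter_token_latency_ms = 1000 * sum(gaps) / max(len(gaps), 1)       # milliseconds
    throughput = len(token_timestamps) / end_to_end_latency             # tokens/second
    return {
        "time_to_first_token_s": ttft,
        "inter_token_latency_ms": inter_token_latency_ms,
        "throughput_tokens_per_s": throughput,
    }
```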
@@ -14,19 +14,16 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

# Llama-2-7b performance on AWS Inferentia2 (Latency & Throughput)
# Mistral-Small-Instruct performance on AWS Inferentia2 (Latency & Throughput)

How fast is Llama-2-7b on Inferentia2? Let's find out!
How fast is Mistral on Inferentia2? Let's find out!

For this benchmark we will use the following configurations:

| Model type | batch_size | sequence_length |
|----------------|------------|-----------------|
| Llama2 7B BS1 | 1 | 4096 |
| Llama2 7B BS4 | 4 | 4096 |
| Llama2 7B BS8 | 8 | 4096 |
| Llama2 7B BS16 | 16 | 4096 |
| Llama2 7B BS32 | 24 | 4096 |
| Model type | batch_size | sequence_length |
|--------------------|------------|-----------------|
| Mistral-Small BS1 | 1 | 4096 |
| Mistral-Small BS4 | 4 | 4096 |

*Note: all models are compiled to use 6 devices corresponding to 12 cores on the `inf2.48xlarge` instance.*

@@ -41,15 +38,15 @@ We test the time to first token for increasing context sizes, from a typical Q/A

Time to first token is expressed in **seconds**.

![Llama2 7b inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2-7b/ttft.png "Time to first token")
![Mistral Small inferentia2 TTFT](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-mistral-small/ttft.png "Time to first token")

## Inter-token Latency

The inter-token latency corresponds to the average time elapsed between two generated tokens.

It is expressed in **milliseconds**.

![Llama2 7b inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2-7b/latency.png "Inter-token latency")
![Mistral Small inferentia2 inter-token latency](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-mistral-small/latency.png "Inter-token latency")

### Throughput

@@ -58,4 +55,4 @@ by the end-to-end latency.

Throughput is expressed in **tokens/second**.

![Llama2 7b inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2-7b/throughput.png "Throughput")
![Mistral Small inferentia2 throughput](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-mistral-small/throughput.png "Throughput")
61 changes: 0 additions & 61 deletions docs/source/benchmarks/inferentia-mistral-v2.mdx

This file was deleted.

2 changes: 1 addition & 1 deletion infrastructure/ami/hcl2-files/variables.pkr.hcl
@@ -10,7 +10,7 @@ variable "instance_type" {
}

variable "source_ami" {
default = "ami-0bcb701dd3cace633"
default = "ami-0980ce83654efe544"
description = "Base Image"
type = string
/*
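The base AMI for the DLAMI 2.20 update is pinned by ID here. If you would rather resolve the latest Neuron DLAMI programmatically than hard-code the ID, a hedged sketch using boto3 (the name filter below is an assumption and may need adjusting to the exact DLAMI naming scheme for your region and OS) could look like:

```python
# Hedged sketch: look up the most recent AWS Deep Learning AMI with Neuron support
# instead of hard-coding an AMI ID. The name pattern is an assumption.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
images = ec2.describe_images(
    Owners=["amazon"],
    Filters=[{"Name": "name", "Values": ["Deep Learning AMI Neuron*Ubuntu*"]}],
)["Images"]
latest = max(images, key=lambda img: img["CreationDate"])
print(latest["ImageId"], latest["Name"])
```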
3 changes: 2 additions & 1 deletion optimum/neuron/modeling_decoder.py
@@ -181,11 +181,12 @@ def __init__(
tnx_kwargs["neuron_config"] = NeuronConfig(
continuous_batching=ContinuousBatchingConfig(batch_size_for_shared_caches=batch_size),
attention_layout=exporter.attention_layout,
fuse_qkv=True,
)
tnx_kwargs["n_positions"] = [sequence_length]
tnx_kwargs["context_length_estimate"] = [sequence_length]
else:
tnx_kwargs["neuron_config"] = NeuronConfig(attention_layout=exporter.attention_layout)
tnx_kwargs["neuron_config"] = NeuronConfig(attention_layout=exporter.attention_layout, fuse_qkv=True)
tnx_kwargs["n_positions"] = sequence_length

# Instantiate neuronx model
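With this change, both the continuous-batching and the static-batching branches now build their `NeuronConfig` with `fuse_qkv=True`. For context, a minimal sketch of the two configurations (using only the objects visible in the hunk; the import path and the example values are assumptions):

```python
# Sketch of the two NeuronConfig variants built above, now both with fuse_qkv=True.
# The import path is assumed and may differ across transformers-neuronx versions;
# batch_size, sequence_length and attention_layout stand in for values computed
# earlier in __init__.
from transformers_neuronx.config import ContinuousBatchingConfig, NeuronConfig

batch_size, sequence_length, attention_layout = 4, 4096, "BSH"  # assumed example values

# Continuous-batching branch (e.g. for TGI-style serving)
cb_config = NeuronConfig(
    continuous_batching=ContinuousBatchingConfig(batch_size_for_shared_caches=batch_size),
    attention_layout=attention_layout,
    fuse_qkv=True,  # query/key/value projections are fused into a single matmul
)

# Static-batching branch
static_config = NeuronConfig(
    attention_layout=attention_layout,
    fuse_qkv=True,
)
```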