This page is a project tracker for getting halo models such as llama3, Flux.1, and Mistral working on one or more MI3xx GPUs using shark/iree.
- Shark V3.1.0 (Jan 6, 2025): llama3.1 405B sharded across 8 MI300X GPUs, performant at the level of vLLM PyTorch; Flux.1 dev
- Shark V3.2.0 (Feb 2025): Grok-1 and Mixtral 8x7B performant
TPn: Tensor Parallel across n GPUs, where a large tensor is sharded across multiple GPUs using sharktank and the scatter/gather to/from the GPUs is expressed in a single MLIR module
TTFT: Time To First Token (time from the start of prompt processing to the first token generated by the prefill stage)
ITL: Average time between each new token generated in the decode phase (second token onwards)
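For illustration (made-up numbers): if prefill of a prompt completes in 1.2 s and a 257-token completion finishes at 14.0 s, then TTFT is about 1.2 s and ITL is about (14.0 - 1.2) / 256 ≈ 50 ms.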
- Read cookbooks for user-like inference run instructions.
- Read benchmarking to get set up for collecting performance numbers.
Model | Tracy Profile |
---|---|
llama3.1 8B FP16 Unsharded nondecomposed (i.e. using flash attention 2) | Tracy Profile |
llama3.1 8B FP16 Unsharded decode (i.e. using flash attention 2) | Tracy Profile |
(Model is assumed to be llama3.1 in the following table, e.g. "8B FP8" means "llama3.1 8B FP8 model")
Item | Current Week (Dec 9-13) | Next Week (Dec 16-20) |
---|---|---|
Sharktank Modeling | - @Ian finish Flux VAE decode (DONE 12/11) - @Kyle finish Flux model (DONE 12/11) - @Boian Flux CLIP model export and compile for bf16 (DONE 12/11) - @Dan finish and merge FP8 llama PR (ETA 12/12) | - @Rob multi-device fixes (ETA 12/17) - @Boian landing Flux transformer model (DONE 12/16) - @Boian updating CLIP and T5 tests (ETA 12/16) - @Kyle help grok-1 (ETA 12/20) - @Archana bring up grok-1 Q4_K (ETA 12/20) - @Boian debug Flux compile-time issue (ETA 12/20) |
IREE code generation | - @Kunwar decode flash attention (DONE 12/11) | - @Dan reworking FP8 attention for Stan (ETA 12/17) - @Dan lowering issue for FP8 (ETA 12/17) |
Serving | - @Ean flesh out bf16 Flux in shortfin (ETA 12/12) - @Xida fix flakiness in batch handling (DONE 12/12) - @Stephen test and ensure sglang/shortfin batch runs work (ETA 12/12) | - @Stephen debugging multi-device LLMs in shortfin (ETA 12/20) - @Ean debugging fp16 Flux pipeline (ETA 12/17) - @Xida debugging batching issue found in prefill (ETA 12/17) - @Xida ramping up and helping on multi-device shortfin issue (ETA 12/20) |
Test Automation | - @Avi refresh benchmarking decode and prefill for 8B, 70B (ETA 12/12) - @Archana shortfin PPL debugging (ETA 12/10) - @Rob debug multi-device (ETA 12/11) | - @Archana triaging PPL breakages from block size and device affinities (DONE 12/17) - @Archana shortfin PPL integration (ETA 12/18) |
Performance Tuning | - @Avi tracy profile for decode (ETA 12/11) | - @Avi landing fixes for block size changes (ETA 12/16) - @Avi tracy profiling updates (DONE 12/17) - @Avi benchmark with new rotary embedding (ETA 12/20) |
See the latest CI/Nightly Test Report. Use the Nod.AI Lab page to ssh into the machine SharkMi300X to find logs and artifacts for triaging the failures. File an issue (if not already filed/listed) and add it to the Issues table below.
category | issue link | assigned to | status |
---|---|---|---|
quark quantization | QUARK-71 | Bowen Bow | FP8 matmul should be used in attention |
iree codegen | 18864 | Ian Wood | OOM for 70B |
The following naming convention should be used for weights and artifacts (on SharkMI300x and other similar machines).
Unsharded Weights:
/data/<model_name>/weights/<model_size>/<data_type>/<modelname_modelsize_datatype>.irpa
Example: /data/llama-3.1/weights/405b/fp16/llama3.1_405b_fp16.irpa
Sharded Weights:
/data/<model_name>/weights/<model_size>/<data_type>/<shard_size>/<modelname_modelsize_datatype_shardsize>_parameters.<rank_suffix>.irpa
Example: /data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank0.irpa
Artifacts:
/data/<model_name>/artifacts/<model_size>/<model_name>_<model_size>_<data_type>_<attention_kind>_<sharding>_<batch_size>.[mlir | vmfb]
Example: /data/llama-3.1/artifacts/405b/llama3.1_405b_fp16_nondecomposed_tp8_bs4.mlir
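For illustration, a TP8-sharded weight set under this convention would consist of one parameter .irpa file per rank (ranks 0 through 7 are an assumption based on the rank0 example above). A small shell sketch of the implied file names:

for rank in $(seq 0 7); do  # one parameter file per rank is assumed for TP8
  echo "/data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank${rank}.irpa"
done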
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
llama3.1-8B-FP16 bs4 TP1 (prefill) | PASS mlir_tp1 irpa | PASS compile command | PASS benchmark command numpy inputs | tbd | tbd |
llama3.1-8B-FP16 bs4 TP8 (prefill) | PASS mlir_tp8 | PASS | PASS | tbd | tbd |
llama3.1-8B-FP16 | PASS mlir | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-70B-FP16 | PASS mlir | Fails in iree, patch | tbd | tbd | tbd |
llama3.1-405B-FP16 bs4 TP8 (prefill) | PASS mlir_tp8 | PASS w/ patch compile command | PASS benchmark command numpy inputs | tbd | tbd |
llama3.1-8B-FP8 | PASS mlir | tbd | tbd | tbd | tbd |
llama3.1-70B-FP8 | ETA: 11/1 | tbd | tbd | tbd | tbd |
llama3.1-405B-FP8 | ETA: 11/5 | tbd | tbd | tbd | tbd |
llama-toy-size-FP32-TP2-CPU | PASS | PASS | tbd | tbd | tbd |
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
sharktank black-forest-labs--FLUX.1-dev--transformer-single-layer-bf16 | MLIR IRPA | tbd | tbd | N/A | N/A |
sharktank black-forest-labs--FLUX.1-dev--black-forest-labs-transformer-bf16 (this is the real production model) | MLIR IRPA | tbd | tbd | tbd | tbd |
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
sharktank black-forest-labs--FLUX.1-schnell--transformer-single-layer-bf16 | MLIR IRPA | tbd | tbd | N/A | N/A |
sharktank black-forest-labs--FLUX.1-schnell--black-forest-labs-transformer-bf16 | MLIR IRPA | tbd | tbd | tbd | tbd |
Schnell is almost the same as Dev: Dev has a guidance layer and a guidance parameter, while Schnell does not.
black-forest-labs--FLUX.1-<schnell/dev>--transformer-single-layer-bf16 is a single layer with random weights. It is meant to enable faster iteration when working with the model.
The actual models black-forest-labs--FLUX.1-<dev/schnell>--black-forest-labs-transformer-bf16 use real pretrained parameters and have 19 MMDiT layers.
Only the xxl variant is actually used in FLUX. The small variant is provided for faster iteration if needed.
iree-compile \
google__t5_v1_1_xxl_encoder_fp32.mlir \
--iree-hal-target-device=hip \
--iree-hip-target=gfx942 \
-o google__t5_v1_1_xxl_encoder_fp32.vmfb
iree-run-module \
--device=hip \
--module=google__t5_v1_1_xxl_encoder_fp32.vmfb \
--parameters=model=google__t5_v1_1_xxl_encoder_fp32.irpa \
--function=forward_bs4 \
--input=@google__t5_v1_1_xxl_iree_forward_bs4_arg0.npy
(MI300X GPU, SPX Mode)
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
t5-v1.1-small-encoder-bf16 | PASS mlir gguf irpa | PASS | PASS args expected_result | FAIL | tbd |
t5-v1.1-xxl-encoder-bf16 | PASS mlir gguf irpa | PASS | PASS args expected_result | FAIL | tbd |
t5-v1.1-small-encoder-f32 | PASS mlir gguf irpa | PASS | PASS args expected_result | PASS tol < (atol=1e-4, rtol=1.5e-3) | tbd |
t5-v1.1-xxl-encoder-f32 | PASS mlir gguf irpa | PASS | PASS args expected_result | PASS tol < (atol=1e-4, rtol=1.5e-3) | tbd |
Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
---|---|---|---|---|---|
Mixtral 8x7B ONNX | tbd | tbd | tbd | tbd | tbd |
Generate IR
python3 -m sharktank.examples.export_paged_llm_v1 \
--irpa-file <input_irpa path with correct sharding and dtype> --output-mlir <output-mlir> \
--bs <batch size> --tensor-parallelism-size <TP size if sharding> \
--attention-kernel <decomposed or torch_sdpa> --no-fake-quant  # pass --no-fake-quant only for fp8
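As a concrete sketch, a hypothetical export of the unsharded llama3.1 8B FP16 model with batch size 4 and torch_sdpa (nondecomposed) attention, using illustrative paths that follow the naming convention above, could look like:

python3 -m sharktank.examples.export_paged_llm_v1 \
  --irpa-file /data/llama-3.1/weights/8b/fp16/llama3.1_8b_fp16.irpa \
  --output-mlir /data/llama-3.1/artifacts/8b/llama3.1_8b_fp16_nondecomposed_tp1_bs4.mlir \
  --bs 4 \
  --attention-kernel torch_sdpa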
Generate vmfb
iree-compile <input-mlir path> --iree-hal-target-backends=rocm --iree-hip-target=gfx942 -o <output-vmfb path>
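For example, compiling the MLIR exported in the sketch above into a vmfb (paths are illustrative):

iree-compile /data/llama-3.1/artifacts/8b/llama3.1_8b_fp16_nondecomposed_tp1_bs4.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  -o /data/llama-3.1/artifacts/8b/llama3.1_8b_fp16_nondecomposed_tp1_bs4.vmfb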
Follow the steps here
In a browser, click on sharkblobs, then click on "Blob containers" and then click on "halo-models".
Or, use the command line by first installing the az CLI:
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
Then, get the account key for the storage account by clicking on "Storage Accounts" in Azure Services or searching for "sharkblobs" in the top search bar. Click on sharkblobs, then, in the left sidebar under "Security + networking", click on "Access keys". Copy the account key from there and use it in the following commands. To upload:
az storage blob upload --account-name sharkblobs --container-name halo-models --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
To download:
az storage blob download --account-name sharkblobs --container-name halo-models --name <azure path, example: halo-models/llama3_8b/tp1/llama.mlir> --file <local_path_on_computer> --account-key <key_retrieved_from_directions_above>
If you are downloading from "sharkpublic", replace sharkblobs with sharkpublic in the instructions above and use your account access key for sharkpublic. Example:
az storage blob download --account-name sharkpublic --container-name sharkpublic --name ian/llama8b_f16.gguf \
--file llama8b_f16.gguf --account-key <key string>
Follow the steps here.
Follow the steps here.
Feature | Description | Enabled | Enablement Requirements | Reference(s) |
---|---|---|---|---|
gen | Generate shortfin completion, given a prompt | Yes | Enabled | Shortfin Implementation |
streaming | Stream shortfin completion, given a prompt | Yes | Enabled | Shortfin Implementation |
run_batch | Run batch of disjoint requests with continuous batching | Yes | Enabled | Batch Docs |
fork | Launch parallel prompts | Yes | Enabled | Fork Docs |
choices | Given a set of choices, generate a response based on best log probs | No | Should work with greedy; needs backend implementation | Greedy Token Selection OpenAI Implementation |
image | Pass an image as part of a multi-modal prompt | No | Multi-modal not supported by SF | sgl.image Docs |
regex | Specify a regular expression as a decoding constraint | No | Only supported for local models | Regex Docs |
The latest benchmark results for the SGLang integration can be found here.
(Note: Do not update this one)
Models | compile | inference (SPX mode) | tracy |
---|---|---|---|
llama3.1-8b-Q4_1 | PASS | prefill (1817 ms), decode (57.3 ms), commands | prefill decode |
llama3.1-8b-Q4_k | PASS | ||
llama3.1-70b-Q4_1 | PASS | prefill (3543 ms), decode (213 ms), commands | prefill decode |
grok-1-Q4_1 | PASS | FAIL, out of memory | prefill decode |
(Note: Update Schedule-Numerics table for llama3.1 artifacts instead of this table (10/20/2024 onwards))
- Small files and MLIR files: check into llm-dev
- Large files: upload to the sharkblobs -> "halo-models" container on Azure and put a link to them in the table(s) below
- Very large files: store on the GPU server and note the name/location of/on the machine in the table(s) below
Note: If a link to an Azure sharkblobs artifact below gives you an error, either use the az CLI to download it (see the section "Accessing sharkblobs on Azure") or click on sharkblobs, then click on "Blob containers", and then navigate to the file manually and download it.
Models | FP16 | FP8 | Q4_1 | Q4_K | Attention IRs |
---|---|---|---|---|---|
llama2-7b | irpa mlir | | | | Attention IRs |
llama3-8b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-70b | mlir gguf | mlir irpa | mlir gguf | mlir gguf | |
llama3-405b | mlir gguf | mlir gguf | mlir gguf | ||
grok-1 | mlir gguf | NA | mlir gguf | gguf |