Inference tutorial - Part 3 of e2e series [WIP] #2343

jainapurva · 2025-06-09T23:18:31Z

No description provided.

pytorch-bot · 2025-06-09T23:18:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2343

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ce675b8 with merge base 101c039 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

docs/source/inference.rst

jainapurva · 2025-06-17T20:41:57Z

docs/source/inference.rst

+    print("Response:", output_text[0][len(prompt):])
+
+
+[Optional] Float8 Dynamic Quantization + Semi-structured (2:4) sparsity


@jerryzh168 @jcaip Does this look good? Should I keep sparsity as a optional section or just mention it in note

can we just add to huggingface torchao page?

Remove it from here, and just add keep the note for hf-torchao page ?

docs/source/inference.rst

jerryzh168 · 2025-06-17T20:44:43Z

docs/source/inference.rst

+
+    vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
+
+Inference with vLLM


should we move this after Inference with Transformers

cc @jainapurva I think if vLLM is our recommended serving solution, this should go before transformers.

jerryzh168 · 2025-06-17T20:45:36Z

docs/source/inference.rst

+
+vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements.
+
+Setting up vLLM with Quantized Models


nit: this doesn't have to be a new section I think

jerryzh168 · 2025-06-17T20:51:11Z

docs/source/inference.rst

+Performance Breakdown
+=====================
+
+When using vLLM with torchao:


this is not a comprehensive list, probably just remove, do we have a exhaustive list of all the techniques that we support?

I don't think we've a comprehensive list. If we decide to make it, that could be another doc page or readme

I think we can remove this section in that case

andrewor14 · 2025-06-17T21:48:34Z

Hi @jainapurva, by the way I'm adding a serving.rst here: #2394. It uses the same template as parts 1 and 2. After that's landed, do you mind updating your PR to use that file instead? Right now it's a blank page with the template:

docs/source/inference.rst

jerryzh168 · 2025-06-18T23:51:43Z

docs/source/inference.rst

+.. note::
+    For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs <https://huggingface.co/docs/transformers/main/en/quantization/torchao>`_.
+
+Inference with vLLM


for this section, can you replace with https://huggingface.co/pytorch/Qwen3-8B-int4wo-hqq#inference-with-vllm

it might be easier to do command line compared to code

Preliminary structure for tutorial

c0584b4

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 9, 2025

jainapurva added the topic: documentation Use this tag if this PR adds or improves documentation label Jun 10, 2025

jainapurva and others added 8 commits June 16, 2025 09:59

Updates

f4e8f2d

Update

7c2332e

Update

942a02b

Update

888fd4c

Update

c200cd2

Merge remote-tracking branch 'origin/main' into inference_tutorial

4f76b23

Update

c52e6f8

Update

de160b1

jerryzh168 reviewed Jun 17, 2025

View reviewed changes

docs/source/inference.rst Outdated Show resolved Hide resolved

jerryzh168 reviewed Jun 17, 2025

View reviewed changes

docs/source/inference.rst Outdated Show resolved Hide resolved

jerryzh168 reviewed Jun 17, 2025

View reviewed changes

docs/source/inference.rst Outdated Show resolved Hide resolved

jerryzh168 reviewed Jun 17, 2025

View reviewed changes

docs/source/inference.rst Show resolved Hide resolved

jerryzh168 reviewed Jun 17, 2025

View reviewed changes

docs/source/inference.rst Outdated Show resolved Hide resolved

jainapurva added 2 commits June 17, 2025 12:11

Update

e8f5e53

Update

bbd567d

jainapurva commented Jun 17, 2025

View reviewed changes

jainapurva requested review from jerryzh168, andrewor14, drisspg and jcaip June 17, 2025 20:42

jerryzh168 reviewed Jun 17, 2025

View reviewed changes

docs/source/inference.rst Outdated Show resolved Hide resolved

jerryzh168 reviewed Jun 17, 2025

View reviewed changes

Update notes

6a96697

jerryzh168 reviewed Jun 17, 2025

View reviewed changes

jcaip reviewed Jun 18, 2025

View reviewed changes

docs/source/inference.rst Outdated Show resolved Hide resolved

jainapurva added 3 commits June 18, 2025 12:11

Updates

06612d3

Merge remote-tracking branch 'origin/main' into inference_tutorial

a3aa301

Updates

ce675b8

jainapurva force-pushed the inference_tutorial branch from b93b892 to ce675b8 Compare June 18, 2025 21:05

jerryzh168 reviewed Jun 18, 2025

View reviewed changes

docs/source/inference.rst Show resolved Hide resolved

jerryzh168 reviewed Jun 18, 2025

View reviewed changes

		print("Response:", output_text[0][len(prompt):])


		[Optional] Float8 Dynamic Quantization + Semi-structured (2:4) sparsity


		vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3

		Inference with vLLM


		vLLM automatically leverages torchao's optimized kernels when serving quantized models, providing significant throughput improvements.

		Setting up vLLM with Quantized Models

Inference tutorial - Part 3 of e2e series [WIP] #2343

Are you sure you want to change the base?

Inference tutorial - Part 3 of e2e series [WIP] #2343

Conversation

jainapurva commented Jun 9, 2025

Uh oh!

pytorch-bot bot commented Jun 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2343

✅ No Failures

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andrewor14 commented Jun 17, 2025

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pytorch-bot bot commented Jun 9, 2025 •

edited

Loading