
Commit 3c86db5

Fix
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
DarkLight1337 committed Feb 8, 2025
1 parent d47a589 commit 3c86db5
Showing 4 changed files with 16 additions and 14 deletions.
.pre-commit-config.yaml: 5 additions & 3 deletions
```diff
@@ -33,9 +33,11 @@ repos:
   rev: v0.9.27
   hooks:
   - id: pymarkdown
-    # NOTE: If you get an AssertionError when applying fixes,
-    # try setting args to [scan] and fix the lint errors manually
-    args: [fix]
+    # Conflicts with pyml disable, so we flag this to be fixed manually
+    args: [fix, -d, md007]
+  hooks:
+  - id: pymarkdown
+    args: [scan]
 - repo: https://github.com/rhysd/actionlint
   rev: v1.7.7
   hooks:
```
csrc/quantization/machete/Readme.md: 7 additions & 7 deletions
````diff
@@ -6,25 +6,25 @@ Machete is a spiritual successor to the Marlin kernel but optimized for Hopper a
 
 Machete effectively performs
 
-```
+```python
 scale_type = w_s.dtype
 compute_type = a.dtype
 out = (w_q.to(scale_type) * w_s - w_z.to(scale_type)) @ a
 ```
 
-Where `w_q` is a quantized weight matrix, `w_s` is the quantization scales, and 
+Where `w_q` is a quantized weight matrix, `w_s` is the quantization scales, and
 `w_z` is the quantization zeropoints.
 
-> **_NOTE:_** `w_z` is added after the scales so we can 
+> **_NOTE:_** `w_z` is added after the scales so we can
 use FMA operations, but this means they must have the scales pre-applied if the
-supplied zeropoints assume that they will be subtracted before the scales are 
+supplied zeropoints assume that they will be subtracted before the scales are
 applied.
 
 ## API
 
 The main optimization within Machete is prepacking the weight matrix to more closely match the tensor core layouts, allowing for wider shared memory loads when loading the weight matrix. This means that the weight matrix must be prepacked before calling `machete_gemm`. The flow looks something like:
 
-```
+```python
 from vllm import _custom_ops as ops
 
 ...
````
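The zeropoint note in the hunk above is easy to check numerically. Here is a minimal PyTorch sketch, not the Machete kernel itself; dense float tensors and arbitrary shapes stand in for the real quantized layout:

```python
# Minimal sketch: Machete computes (w_q * w_s - w_z) @ a as a fused FMA,
# so zeropoints defined in the "subtract before scaling" convention must
# be pre-multiplied by the scales to give the same result.
import torch

a = torch.randn(16, 8)                                  # activations
w_q = torch.randint(0, 16, (4, 16)).to(torch.float32)   # "quantized" weights
w_s = torch.rand(4, 1)                                  # per-row scales
w_z = torch.randint(0, 16, (4, 1)).to(torch.float32)    # zeropoints (pre-scale convention)

ref = ((w_q - w_z) * w_s) @ a        # subtract zeropoints, then scale
out = (w_q * w_s - (w_z * w_s)) @ a  # FMA-friendly form with pre-scaled zeropoints

torch.testing.assert_close(ref, out)  # equal up to floating-point rounding
```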
```diff
@@ -40,6 +40,6 @@ output = ops.machete_gemm(
 
 ## Code Generation
 
-Since Machete is based on Cutlass, we can generate multiple type pairs and different tile shapes using the same kernel template. We generate multiple instantiations of this template using `generate.py`. 
+Since Machete is based on Cutlass, we can generate multiple type pairs and different tile shapes using the same kernel template. We generate multiple instantiations of this template using `generate.py`.
 
-New type pairs (`TypeConfig`s) can be appended to `impl_configs` (in `generate()`), and these will get automatically generated (assuming they can be supported without issues). For each `TypeConfig`, you must also provide an `ImplConfig`, which bundles a `TypeConfig` with a list of `ScheduleConfig`s, `Specialization`s, and a default heuristic. The `ScheduleConfig`s (which contain info on tile shapes, tile scheduler, etc.) can perform differently for different problem shapes, and there is almost never one `ScheduleConfig` that works well for all problem shapes, so it is generally beneficial to generate different `ScheduleConfig`s for different potential problem shapes. This is where the heuristic comes in. For each `TypeConfig`, a default heuristic should be provided. This maps different problem shapes to different `ScheduleConfig`s and is used when the user does not provide the `schedule` parameter to `machete_gemm`. The `Specialization`s define what feature combinations to generate, i.e., `with_zeropoints`, `with_scales`, etc. We can reduce compile times and the final binary size by limiting the set of feature combinations we generate. 
+New type pairs (`TypeConfig`s) can be appended to `impl_configs` (in `generate()`), and these will get automatically generated (assuming they can be supported without issues). For each `TypeConfig`, you must also provide an `ImplConfig`, which bundles a `TypeConfig` with a list of `ScheduleConfig`s, `Specialization`s, and a default heuristic. The `ScheduleConfig`s (which contain info on tile shapes, tile scheduler, etc.) can perform differently for different problem shapes, and there is almost never one `ScheduleConfig` that works well for all problem shapes, so it is generally beneficial to generate different `ScheduleConfig`s for different potential problem shapes. This is where the heuristic comes in. For each `TypeConfig`, a default heuristic should be provided. This maps different problem shapes to different `ScheduleConfig`s and is used when the user does not provide the `schedule` parameter to `machete_gemm`. The `Specialization`s define what feature combinations to generate, i.e., `with_zeropoints`, `with_scales`, etc. We can reduce compile times and the final binary size by limiting the set of feature combinations we generate.
```
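To make the structure described in that paragraph concrete, here is a hypothetical Python outline. It is not the actual code in `generate.py`; every field name, threshold, and value below is an assumption made only to illustrate how the pieces relate:

```python
# Hypothetical outline of TypeConfig -> ImplConfig -> ScheduleConfig and
# the default heuristic. The real definitions live in generate.py and
# their fields differ.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TypeConfig:
    a_type: str  # activation dtype, e.g. "bfloat16" (assumed field)
    w_type: str  # quantized weight dtype, e.g. "uint4" (assumed field)

@dataclass
class ScheduleConfig:
    tile_shape: tuple[int, int, int]  # (M, N, K) tile sizes (assumed field)
    tile_scheduler: str               # scheduler variant (assumed field)

@dataclass
class ImplConfig:
    types: TypeConfig
    schedules: list[ScheduleConfig]  # instantiations to generate
    specializations: list[dict]      # feature combos, e.g. {"with_scales": True}
    heuristic: Callable[[int], ScheduleConfig]  # problem shape -> default schedule

def default_heuristic(m: int) -> ScheduleConfig:
    # Picks a schedule when the caller passes no `schedule` to machete_gemm:
    # small-M (decode-like) and large-M (prefill-like) problems want
    # different tiles. Thresholds and shapes here are made up.
    if m <= 16:
        return ScheduleConfig((16, 128, 64), "non-persistent")
    return ScheduleConfig((128, 128, 64), "persistent")

impl_configs = [
    ImplConfig(
        types=TypeConfig(a_type="bfloat16", w_type="uint4"),
        schedules=[ScheduleConfig((16, 128, 64), "non-persistent"),
                   ScheduleConfig((128, 128, 64), "persistent")],
        specializations=[{"with_scales": True, "with_zeropoints": False}],
        heuristic=default_heuristic,
    ),
]
```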
docs/source/serving/engine_args.md: 2 additions & 2 deletions
````diff
@@ -4,7 +4,7 @@
 
 Below, you can find an explanation of every engine argument for vLLM:
 
-<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
+<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
 ```{eval-rst}
 .. argparse::
     :module: vllm.engine.arg_utils
@@ -17,7 +17,7 @@
 
 Below are the additional arguments related to the asynchronous engine:
 
-<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
+<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
 ```{eval-rst}
 .. argparse::
     :module: vllm.engine.arg_utils
````
examples/offline_inference/openai/openai_batch.md: 2 additions & 2 deletions
````diff
@@ -182,7 +182,7 @@ aws s3 cp s3://MY_BUCKET/MY_OUTPUT_FILE.jsonl -
 
 Add embedding requests to your batch file. The following is an example:
 
-```jsonl
+```text
 {"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are a helpful assistant."}}
 {"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are an unhelpful assistant."}}
 ```
````
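For larger batches it can be convenient to generate this file from Python. A minimal sketch that writes the two example requests above, one JSON object per line (the file name `input.jsonl` is an arbitrary choice):

```python
# Minimal sketch: write an embedding batch file in JSON-lines form.
import json

requests = [
    {"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings",
     "body": {"model": "intfloat/e5-mistral-7b-instruct",
              "input": "You are a helpful assistant."}},
    {"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings",
     "body": {"model": "intfloat/e5-mistral-7b-instruct",
              "input": "You are an unhelpful assistant."}},
]

with open("input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")  # one JSON object per line
```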
````diff
@@ -213,7 +213,7 @@ $ cat results.jsonl
 
 Add score requests to your batch file. The following is an example:
 
-```jsonl
+```text
 {"custom_id": "request-1", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
 {"custom_id": "request-2", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
 ```
````
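Results come back in the same one-object-per-line layout. A small sketch for reading them, assuming the output file is named `results.jsonl` as in the surrounding docs and that each result line carries the request's `custom_id`:

```python
# Minimal sketch: iterate over batch results and match them to requests
# by custom_id. The exact result schema is an assumption here.
import json

with open("results.jsonl") as f:
    for line in f:
        result = json.loads(line)
        print(result["custom_id"])
```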
