Merged
Commits
55 commits
0842597
initial commit (data synthesizer)
PeaBrane May 14, 2025
9e69e50
benchmarks with an s
PeaBrane May 14, 2025
0837619
decrement 1 for core radix tree size
PeaBrane May 14, 2025
69b3971
rename to data_utils
PeaBrane May 14, 2025
b4341d9
change paths in tests
PeaBrane May 14, 2025
c0a4b3c
fix edge case of contraction (with -1 head), and added type hints
PeaBrane May 14, 2025
25ebbb6
more stringent graph structure testing
PeaBrane May 14, 2025
80ecb22
cleanups
PeaBrane May 15, 2025
b8179f2
licenses
PeaBrane May 15, 2025
152b1f1
black (blank line at end)
PeaBrane May 15, 2025
759fabd
fix prompt len bug
PeaBrane May 15, 2025
a57d8ae
no tabulate dep
PeaBrane May 15, 2025
4c32c52
pandas license
PeaBrane May 15, 2025
905d04b
rm extra new lines
PeaBrane May 15, 2025
4e41f5b
pre commits
PeaBrane May 15, 2025
20040eb
make mypy happy
PeaBrane May 15, 2025
4d10bf8
rolling hasher
PeaBrane May 15, 2025
1517f1f
readme update
PeaBrane May 15, 2025
ec738d6
license in test_hasher.py
PeaBrane May 15, 2025
34caf41
copyright in hasher.py
PeaBrane May 15, 2025
e065c54
typo synthesizer
PeaBrane May 18, 2025
1c85d8d
logging into main dir
PeaBrane May 18, 2025
0e73683
rename to data_generator
PeaBrane May 18, 2025
fa88dcb
separate requirements for benchmarks
PeaBrane May 18, 2025
8577a76
link to mooncake trace + explanation
PeaBrane May 18, 2025
a4ba2f3
move tests into data_generator
PeaBrane May 18, 2025
0a5c180
example in README
PeaBrane May 18, 2025
3ea2668
package data_generator
PeaBrane May 18, 2025
4e32834
cli
PeaBrane May 18, 2025
daba38d
Merge branch 'main' into rupei/benchmark-tree
PeaBrane May 18, 2025
06678d8
restore accidentally deleted pytest in per-merge
PeaBrane May 18, 2025
04f7c98
actually, need to now install data_generator before the pytests
PeaBrane May 18, 2025
78ee708
fix pytest workflow
PeaBrane May 19, 2025
ac40686
pytest mypy
PeaBrane May 19, 2025
5cbac2a
update README with cli
PeaBrane May 19, 2025
cac7be4
mypy --install-types
PeaBrane May 19, 2025
3b3fa59
make pytest ignore benchmarks
PeaBrane May 20, 2025
8985f58
remove bash -ec
PeaBrane May 20, 2025
253f1f5
Merge branch 'main' into rupei/benchmark-tree
PeaBrane May 23, 2025
a248cc5
short README in benchmarks
PeaBrane May 23, 2025
7658687
minor language cleanups
PeaBrane May 23, 2025
f542709
reference benchmarks in KV tuning guide
PeaBrane May 23, 2025
d627b2b
better writing
PeaBrane May 23, 2025
14b54c5
docstrings
PeaBrane May 23, 2025
5defd12
Update benchmarks/README.md
PeaBrane May 23, 2025
c7ac84c
Update benchmarks/data_generator/README.md
PeaBrane May 23, 2025
c346b8f
improve readability
PeaBrane May 23, 2025
e380c42
more info on hash_ids prefix overlap
PeaBrane May 23, 2025
2490bc1
try adding benchmarks dir to python path
PeaBrane May 27, 2025
508e73c
Merge branch 'main' into rupei/benchmark-tree
PeaBrane May 27, 2025
81170c9
Merge branch 'main' into rupei/benchmark-tree
PeaBrane May 27, 2025
bef882e
pip install reqs before pytest benchmarks
PeaBrane May 27, 2025
577de3d
types-tabulate
PeaBrane May 27, 2025
bbe5793
restore pip install benchmarks
PeaBrane May 27, 2025
44a27bf
Merge remote-tracking branch 'origin/main' into rupei/benchmark-tree
PeaBrane Jun 5, 2025
4 changes: 2 additions & 2 deletions .github/workflows/pre-merge-python.yml
@@ -54,7 +54,7 @@ jobs:
env:
PYTEST_MARKS: "pre_merge or mypy"
run: |
docker run -w /workspace --name ${{ env.CONTAINER_ID }}_pytest ${{ steps.define_image_tag.outputs.image_tag }} pytest --basetemp=/tmp --junitxml=${{ env.PYTEST_XML_FILE }} -m "${{ env.PYTEST_MARKS }}"
docker run -w /workspace --name ${{ env.CONTAINER_ID }}_pytest ${{ steps.define_image_tag.outputs.image_tag }} bash -c "pip install -e /workspace/benchmarks && pytest --basetemp=/tmp --junitxml=${{ env.PYTEST_XML_FILE }} -m \"${{ env.PYTEST_MARKS }}\""
Contributor:
@PeaBrane it looks like you don't have signed commits enabled, so the gitlab PR didn't get triggered. These changes look like they're failing in similar tests on gitlab side because the benchmarks package doesn't get installed, so mypy doesn't know about the import.

The reason for the duplicate tests on the gitlab side is to access a wider pool of GPU runners for GPU testing.

ex: https://gitlab-master.nvidia.com/dl/ai-dynamo/dynamo/-/jobs/175820836

Contributor:
Btw - was the correct fix to pip install in the test step here? Or would it make more sense to install in the Dockerfile itself so it's available to all? CC @nnshah1

Contributor:
  1. Can the benchmarks directory be moved under tests?
  2. Should the benchmark dependencies be added to the Dockerfile? If yes, which container image/stage should they be included in? Or can they be added to requirements.test.txt?

Contributor:
I would recommend adding it to the Dockerfile - as part of dev or CI

- name: Copy test report from test Container
if: always()
run: |
@@ -77,4 +77,4 @@ jobs:
uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
with:
name: Event File
path: ${{ github.event_path }}
path: ${{ github.event_path }}
30 changes: 30 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,30 @@
<!-- # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# Benchmarks

This directory contains benchmarking scripts and tools for performance evaluation.

## Installation

To install the necessary dependencies locally, run:

```bash
pip install -e .
```

Currently, this will install lightweight tools for:
- Analyzing prefix-structured data (`datagen analyze`)
- Synthesizing structured data customizable for testing purposes (`datagen synthesize`)
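
For example, assuming a mooncake-format trace file named `mooncake_trace.jsonl` (see `data_generator/README.md` for the full set of flags):

```bash
# Summarize prefix statistics of an existing trace
datagen analyze --input-file mooncake_trace.jsonl --block-size 512

# Generate a synthetic trace with a similar prefix structure
datagen synthesize --input-file mooncake_trace.jsonl --num-requests 100000
```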
131 changes: 131 additions & 0 deletions benchmarks/data_generator/README.md
@@ -0,0 +1,131 @@
<!-- # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

## Trace File Format

The following tools help analyze and synthesize new data based on the [mooncake trace file format](https://github.com/kvcache-ai/Mooncake/blob/d21da178bae8db9651cf18a76824c084145fc725/mooncake_trace.jsonl). In this format, the first few lines of a trace look like this:

```
{"timestamp": 0, "input_length": 6755, "output_length": 500, "hash_ids": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]}
{"timestamp": 0, "input_length": 7319, "output_length": 490, "hash_ids": [0, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]}
{"timestamp": 3052, "input_length": 7234, "output_length": 794, "hash_ids": [0, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]}
{"timestamp": 3052, "input_length": 2287, "output_length": 316, "hash_ids": [0, 42, 43, 44, 45]}
```

**Hash ID Generation:** Each new hash ID is the next consecutive integer after the last one used. Two `hash_ids` lists that share the same leading integers represent a prefix overlap. To generate these increasing hash IDs from a list of texts, we provide the `texts_to_hashes` function in `hasher.py`.

**Timestamp:** The arrival time (in milliseconds) of the request since the first request, which can be the same for multiple requests arriving simultaneously.

**Block Size and Hash IDs:** In this example, the `block_size` (the page size of the KV cache) is assumed to be 512. The length of the `hash_ids` array is the number of blocks that cover the input (in this example, `input_length / block_size` rounded up).
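
To make the ID scheme concrete, here is a minimal sketch of how consecutive integers can be assigned to content blocks so that shared prefixes map to shared leading IDs (an illustration only, not the actual `texts_to_hashes` implementation in `hasher.py`):

```python
from typing import Dict, List, Tuple


def assign_hash_ids(requests: List[List[str]]) -> List[List[int]]:
    """Map each distinct prefix (tuple of block-sized chunks) to the next unused integer."""
    prefix_to_id: Dict[Tuple[str, ...], int] = {}
    all_ids: List[List[int]] = []
    for chunks in requests:
        ids = []
        for i in range(1, len(chunks) + 1):
            prefix = tuple(chunks[:i])
            if prefix not in prefix_to_id:
                prefix_to_id[prefix] = len(prefix_to_id)  # next consecutive integer
            ids.append(prefix_to_id[prefix])
        all_ids.append(ids)
    return all_ids


# Two requests sharing the first block get the same leading hash ID:
print(assign_hash_ids([["sys", "doc_a"], ["sys", "doc_b"]]))  # [[0, 1], [0, 2]]
```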

## Prefix Analyzer

The Prefix Analyzer provides statistics on a trace file, such as Input Sequence Length (ISL), Output Sequence Length (OSL), and theoretical cache hit rate.
It is useful for understanding the structure and reuse patterns in your dataset.

```bash
datagen analyze --input-file <path_to_trace.jsonl> --block-size <block_size>
```

- `--input-file`: Path to your trace file in jsonl format (default: `mooncake_trace.jsonl`)
- `--block-size`: Block size for prefix calculation (default: 512)

The script will print out summary statistics for ISL, OSL, user prompt lengths, and the theoretical cache hit rate (assuming an infinite cache).
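
As an illustration of the hit-rate metric, one simple way to compute a theoretical hit rate under an infinite cache is to count how many hash blocks were already seen in earlier requests (a sketch of that idea, not necessarily the exact formula the analyzer uses):

```python
import json


def infinite_cache_hit_rate(trace_path: str) -> float:
    """Fraction of hash blocks already cached by an earlier request (infinite cache)."""
    seen: set = set()
    hits = total = 0
    with open(trace_path) as f:
        for line in f:
            hash_ids = json.loads(line)["hash_ids"]
            hits += sum(1 for h in hash_ids if h in seen)
            total += len(hash_ids)
            seen.update(hash_ids)
    return hits / total if total else 0.0


print(infinite_cache_hit_rate("mooncake_trace.jsonl"))
```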

## Synthesizer

The Synthesizer goes a step further:
It builds a prefix tree from the original trace file, extracts prefix statistics, and generates a new synthetic dataset based on these statistics.
You can control various aspects of the synthetic data generation with tunable knobs, such as request rate, context/prompt length multipliers, and the number of tree copies.

This is useful for generating large, realistic synthetic traces for benchmarking or simulation, while preserving the structural properties of the original dataset.

### How to run

```bash
datagen synthesize --input-file <path_to_trace.jsonl> --num-requests <N> [other options...]
```

**Options** (an example invocation follows this list):
- `--input-file`: Path to the input trace file (default: `mooncake_trace.jsonl`)
- `--num-requests`: Number of requests to synthesize (default: 100000)
- `--speedup-ratio`: Factor to speed up request intervals. It effectively divides the synthetic timestamps by this value (default: 1)
- `--prefix-len-multiplier`: Multiplier for prefix lengths (default: 1.0)
- `--prefix-root-multiplier`: Number of times to replicate the core radix tree (default: 1)
- `--prompt-len-multiplier`: Multiplier for leaf path lengths (default: 1.0, use <1 for shorter prompts)
- `--max-isl`: Maximum input sequence length to include in output (default: None, no filtering)
- `--block-size`: Block size for prefilling and decoding (default: 512)
- `--output-file`: Path to the output file (default: auto-generated from input file and options)
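
For instance, the following invocation (with illustrative values) doubles the prefix lengths, replicates the core radix tree twice, and compresses request intervals by a factor of 2:

```bash
datagen synthesize \
  --input-file mooncake_trace.jsonl \
  --num-requests 200000 \
  --speedup-ratio 2 \
  --prefix-len-multiplier 2.0 \
  --prefix-root-multiplier 2 \
  --output-file synthetic_trace.jsonl
```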

### Example

Say we only have these hash lists:

```
[0, 1, 2, (3)]
[0, 1]
[0, 1, 2]
[0, (4), (5)]
```

First, we identify the "core prefix nodes" as [0, 1, 2], since they are visited more than once. The nodes [3, 4, 5] are considered "user prompts", as each appears only once (shown in parentheses above).

If we set the `prefix-len-multiplier` to 2, then the core prefix branches will be stretched, effectively giving:

```
[0, 1, 2, 3, 4, 5, (6)]
[0, 1, 2, 3]
[0, 1, 2, 3, 4, 5]
[0, 1, (7), (8)]
```


Note that the "prompt branches" are not stretched by `prefix-len-multiplier`. They can be separately modified by applying `prompt-len-multiplier`.

Now, if we set `prefix-root-multiplier` to 2, each row has a 50 percent chance of being offset by a large integer, effectively moving it into a new radix tree that matches the statistics of the original one but has a completely different root.

For example, if rows 2 and 4 are offset, we would get:

```
[0, 1, 2, 3, 4, 5, (6)]
[10, 11, 12, 13]
[0, 1, 2, 3, 4, 5]
[10, 11, (14), (15)]
```
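
A minimal sketch of this root-splitting step (the offset value and the uniform choice between copies are illustrative; the actual synthesizer works from its own statistics):

```python
import random
from typing import List


def split_roots(rows: List[List[int]], num_copies: int = 2, offset: int = 1_000_000) -> List[List[int]]:
    """Assign each row to one of `num_copies` trees by shifting its hash ids."""
    out = []
    for row in rows:
        copy_idx = random.randrange(num_copies)  # 50/50 when num_copies == 2
        out.append([h + copy_idx * offset for h in row])
    return out


rows = [[0, 1, 2, 3, 4, 5, 6], [0, 1, 2, 3], [0, 1, 2, 3, 4, 5], [0, 1, 7, 8]]
print(split_roots(rows))
```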

### Implementation details

The generation algorithm, simplified, is as follows (a code sketch follows the list):

- Store the hash ids in a directed tree structure (prefix tree)
- Each directed edge stores a `weight` indicating how many times the edge is traversed, which is needed to compute transition probabilities.
- Contract unary paths (chains) in the tree so that it takes a radix-tree form: every node that is an only child is merged into its parent. As a consequence, each node needs to store a `length` attribute indicating the compressed length (1 if no compression). The prefix length multiplier (`--prefix-len-multiplier`) scales this compressed length (rounded to the nearest integer), effectively increasing the length of each radix node.
- Identify every leaf node that is visited only once and prune it from the tree, as it is very unlikely to be part of the core radix tree. In other words, we do not need to store nodes that are part of the actual user prompts.
- At this stage, each node has (possibly zero) transition probabilities to a child prefix node, to a "user prompt" node, and to a "termination" node. Use these probabilities to sample a path in the core radix tree, then append new hash ids to the path, corresponding to a user prompt whose length is sampled from the dataset. The root multiplier (`--prefix-root-multiplier`) effectively duplicates the entire radix tree the specified number of times, each copy with a new set of hash ids, creating more diverse request patterns.
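
A minimal sketch of the core data structure and sampling loop (a simplification under the assumptions above; path contraction, leaf pruning, and prompt-length sampling are omitted):

```python
import random
from typing import Dict, List


class Node:
    """Prefix-tree node; `weight` counts edge traversals, `length` the contracted chain length."""

    def __init__(self, length: int = 1) -> None:
        self.children: Dict[int, "Node"] = {}
        self.weight = 0
        self.length = length
        self.terminations = 0  # number of requests that ended at this node


def build_tree(rows: List[List[int]]) -> Node:
    root = Node(length=0)
    for row in rows:
        node = root
        for hash_id in row:
            node = node.children.setdefault(hash_id, Node())
            node.weight += 1  # traversal count -> transition probabilities
        node.terminations += 1
    return root


def sample_prefix(root: Node) -> List[int]:
    """Walk the tree, choosing a child (or stopping) proportionally to the observed weights."""
    path: List[int] = []
    node = root
    while node.children:
        ids = list(node.children)
        weights = [node.children[i].weight for i in ids] + [node.terminations]
        choice = random.choices(ids + [None], weights=weights)[0]
        if choice is None:
            break
        node = node.children[choice]
        path.append(choice)
    return path


tree = build_tree([[0, 1, 2, 3], [0, 1], [0, 1, 2], [0, 4, 5]])
print(sample_prefix(tree))
```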

## Testing

To test for "correctness", or faithfulness to the original trace statistics, one can run
```bash
datagen synthesize \
    --input-file mooncake_trace.jsonl \
    --num-requests 500000
```
and compare the synthetic ISL statistics (mean, median, std) to the original ISL statistics, which one can obtain by running
```bash
datagen analyze --input-file mooncake_trace.jsonl
```
I find this to be the most "robust" end-to-end test. It is important to sample a large number of requests (e.g., hundreds of thousands) to ensure the statistics are meaningful, due to the law of large numbers. In particular, the mean statistics (such as mean ISL) should be well preserved in the synthetic data. However, the standard deviation statistics—especially for ISL—are not expected to match exactly, since the synthesizer does not capture the correlation between context length and prompt length present in the original data.
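
For the comparison itself, a small helper along these lines can compute ISL summary statistics straight from the jsonl files (an illustrative script, not part of the package; the synthetic file name depends on your `--output-file` choice):

```python
import json
import statistics


def isl_stats(trace_path: str) -> dict:
    with open(trace_path) as f:
        isl = [json.loads(line)["input_length"] for line in f]
    return {
        "mean": statistics.mean(isl),
        "median": statistics.median(isl),
        "std": statistics.stdev(isl),
    }


print(isl_stats("mooncake_trace.jsonl"))
print(isl_stats("synthetic_trace.jsonl"))  # hypothetical synthesizer output path
```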
20 changes: 20 additions & 0 deletions benchmarks/data_generator/__init__.py
@@ -0,0 +1,20 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from data_generator.cli import main as cli_main


def main():
cli_main()
52 changes: 52 additions & 0 deletions benchmarks/data_generator/cli.py
@@ -0,0 +1,52 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import sys


def main():
parser = argparse.ArgumentParser(
description="Data generation and analysis tools for benchmarking",
prog="datagen",
)

# Add subparsers for commands
subparsers = parser.add_subparsers(dest="command", help="Command to run")

# Create the parser for the "analyze" command
subparsers.add_parser("analyze", help="Analyze data")

# Create the parser for the "synthesize" command
subparsers.add_parser("synthesize", help="Synthesize data")

args, remaining = parser.parse_known_args()

if args.command == "analyze":
# Import and run the analyzer main
from data_generator import prefix_analyzer

sys.argv = [sys.argv[0]] + remaining
prefix_analyzer.main()
elif args.command == "synthesize":
# Import and run the synthesizer main
from data_generator import synthesizer

sys.argv = [sys.argv[0]] + remaining
synthesizer.main()


if __name__ == "__main__":
main()