feat: data synthesizer based on prefix statistics #1087
Conversation
Co-authored-by: Neelay Shah <neelays@nvidia.com> Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
Walkthrough

This update introduces a comprehensive benchmarking and data synthesis toolkit under the `benchmarks` directory.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as CLI (datagen)
    participant Analyzer
    participant Synthesizer
    participant FileSystem
    User->>CLI: Run "datagen analyze <input>"
    CLI->>Analyzer: Pass arguments
    Analyzer->>FileSystem: Load dataset
    Analyzer->>Analyzer: Analyze prefixes, cache hits
    Analyzer->>User: Print statistics
    User->>CLI: Run "datagen synthesize <input> [options]"
    CLI->>Synthesizer: Pass arguments
    Synthesizer->>FileSystem: Load dataset
    Synthesizer->>Synthesizer: Build radix tree, sample paths
    Synthesizer->>FileSystem: Write synthetic dataset
    Synthesizer->>User: Print summary
```
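The dispatch flow above can be sketched as a minimal argparse-based CLI. The `analyze` and `synthesize` subcommand names come from the diagram; the handler wiring, option names (`--num-requests`), and return values below are illustrative assumptions, not the actual contents of `cli.py`:

```python
import argparse


def analyze(input_file: str) -> str:
    # Placeholder for the Analyzer: load the dataset and compute
    # prefix / cache-hit statistics (hypothetical stand-in).
    return f"analyzed {input_file}"


def synthesize(input_file: str, num_requests: int) -> str:
    # Placeholder for the Synthesizer: build a radix tree, sample paths,
    # and write the synthetic dataset (hypothetical stand-in).
    return f"synthesized {num_requests} requests from {input_file}"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="datagen")
    sub = parser.add_subparsers(dest="command", required=True)

    p_analyze = sub.add_parser("analyze", help="print prefix/cache-hit statistics")
    p_analyze.add_argument("input")

    p_synth = sub.add_parser("synthesize", help="generate a synthetic dataset")
    p_synth.add_argument("input")
    p_synth.add_argument("--num-requests", type=int, default=1000)
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    if args.command == "analyze":
        return analyze(args.input)
    return synthesize(args.input, args.num_requests)
```

A subparser-per-command layout like this keeps each tool's options isolated while sharing one console entry point.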
Actionable comments posted: 15
🔭 Outside diff range comments (1)
benchmarks/data_generator/logging.py (1)
1-56: 🛠️ Refactor suggestion — Module name conflicts with Python's built-in `logging` module.

The filename `logging.py` conflicts with Python's standard library `logging` module, which could cause import confusion. Consider renaming to something more specific like `statistics.py`, `metrics_utils.py`, or `stats_logger.py`.

The function implementation itself is well-designed, with proper type hints, documentation, and statistical calculations.
🧹 Nitpick comments (12)
benchmarks/README.md (1)
1-14: Consider using standard markdown format for the license header.

The HTML comment wrapper for the license header is unusual in markdown files. Consider either removing the comment wrapper entirely or using standard markdown syntax.

The documentation content is clear and concise.
.github/workflows/pre-merge-python.yml (1)
80-80: Add missing newline at end of file.

Static analysis detected a missing newline character at the end of the file.

```diff
-      path: ${{ github.event_path }}
+      path: ${{ github.event_path }}
+
```

🧰 Tools
🪛 YAMLlint (1.37.1)
[error] 80-80: no new line character at the end of file
(new-line-at-end-of-file)
benchmarks/data_generator/protocols.py (1)
20-22: LGTM: Clean protocol constants with good documentation.

The use of negative integers for special nodes is a solid design choice that avoids conflicts with real node IDs. Consider adding type annotations for better code clarity:

```diff
-SUPER_ROOT = -1  # Dummy node preceding all real nodes; not an actual data root
-CACHE_END = -2   # Special node indicating end of a path
-END_NODE = -3    # Special node indicating to skip leaf sampling
+SUPER_ROOT: int = -1  # Dummy node preceding all real nodes; not an actual data root
+CACHE_END: int = -2   # Special node indicating end of a path
+END_NODE: int = -3    # Special node indicating to skip leaf sampling
```

benchmarks/pyproject.toml (1)
42-49: Consider moving pytest-mypy to development dependencies.

The `pytest-mypy` package is typically used during development/testing and might be better placed in an optional dev dependencies group rather than in the required dependencies.

Consider restructuring like this:

```diff
 dependencies = [
     "networkx",
     "pandas",
     "tabulate",
     "types-tabulate",
     "transformers",
-    "pytest-mypy",
 ]
+
+[project.optional-dependencies]
+dev = [
+    "pytest-mypy",
+]
```

benchmarks/data_generator/sampler.py (2)
38-47: Remove unnecessary `else` branch.

`np.random.rand()` is returned only when `rng` is `None`, so the `else` is redundant and flagged by Pylint R1705.

```diff
-    if rng is not None:
-        return data[np.searchsorted(cdf, rng.random())]
-    else:
-        return data[np.searchsorted(cdf, np.random.rand())]
+    rnd = rng.random() if rng is not None else np.random.rand()
+    return data[np.searchsorted(cdf, rnd)]
```
🪛 Pylint (3.3.7)
[refactor] 44-47: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
58-64: Make the random seed configurable.

Hard-coding `default_rng(0)` makes every `EmpiricalSampler` deterministic and highly correlated. Accept an optional `seed` or `Generator` instead.
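A minimal sketch of the suggested shape, assuming an empirical CDF built with `np.cumsum` and inverse-transform sampling via `np.searchsorted`. The constructor signature and internals are illustrative, not the repo's actual `EmpiricalSampler`:

```python
import numpy as np


class EmpiricalSampler:
    """Sample from observed data via its empirical CDF.

    Sketch only: accepts a caller-provided Generator or seed instead of
    hard-coding default_rng(0), per the review suggestion.
    """

    def __init__(self, data, rng=None, seed=None):
        self.data = np.asarray(data)
        # Uniform weights over observations -> empirical CDF in [0, 1].
        counts = np.ones(len(self.data))
        self.cdf = np.cumsum(counts) / counts.sum()
        # Prefer an explicit Generator; fall back to a seed, then to
        # fresh OS entropy (seed=None) so instances are uncorrelated.
        self.rng = rng if rng is not None else np.random.default_rng(seed)

    def sample(self):
        # Inverse-transform sampling: map a uniform draw through the CDF.
        return self.data[np.searchsorted(self.cdf, self.rng.random())]
```

Passing the same seed to two samplers reproduces identical sequences, which keeps tests deterministic without coupling all production instances.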
20-25: Specify a code-block language to satisfy markdownlint and enable syntax highlighting.

````diff
-```
+```json
````

(Apply similarly to all fenced blocks.)
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
20-20: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
114-116: Minor grammar fix, singular agreement: “each node need to store” → “each node needs to store”.
🧰 Tools
🪛 LanguageTool
[grammar] ~114-~114: “Node” is a singular noun. It appears that the verb form is incorrect.
Context: ...the parent. As a consequence, each node need to store an attributelengthto indic...(PCT_SINGULAR_NOUN_PLURAL_VERB_AGREEMENT)
[uncategorized] ~116-~116: “the” seems less likely than “they”.
Context: ...o sample a path in the core radix tree, the append the path with new hash ids corre...(AI_HYDRA_LEO_CP_THE_THEY)
benchmarks/data_generator/prefix_analyzer.py (2)
60-67: Docstring & return type mismatch.

Docstring says “Tuple”, but the type annotation is `dict[str, list]`. Update one of them for consistency.

44-50: Prefer the `logging` module.

benchmarks/data_generator/synthesizer.py (2)
81-94: Improve assertion error messages for better debugging.

The assertion error messages could be more descriptive, to help users understand what went wrong.

```diff
-assert (
-    isinstance(self.num_copies, int) and self.num_copies >= 1
-), "num_copies must be an integer greater than or equal to 1"
+assert (
+    isinstance(self.num_copies, int) and self.num_copies >= 1
+), f"num_copies must be an integer >= 1, got {self.num_copies} (type: {type(self.num_copies)})"

-assert (
-    isinstance(self.speedup_ratio, float) and self.speedup_ratio > 0
-), "speedup_ratio must be a positive float"
+assert (
+    isinstance(self.speedup_ratio, float) and self.speedup_ratio > 0
+), f"speedup_ratio must be a positive float, got {self.speedup_ratio} (type: {type(self.speedup_ratio)})"

-assert (
-    isinstance(self.prefix_len_multiplier, float)
-    and self.prefix_len_multiplier > 0
-), "context_len_multiplier must be a positive float"
+assert (
+    isinstance(self.prefix_len_multiplier, float)
+    and self.prefix_len_multiplier > 0
+), f"prefix_len_multiplier must be a positive float, got {self.prefix_len_multiplier} (type: {type(self.prefix_len_multiplier)})"

-assert (
-    isinstance(self.prompt_len_multiplier, float)
-    and self.prompt_len_multiplier > 0
-), "prompt_len_multiplier must be a positive float"
+assert (
+    isinstance(self.prompt_len_multiplier, float)
+    and self.prompt_len_multiplier > 0
+), f"prompt_len_multiplier must be a positive float, got {self.prompt_len_multiplier} (type: {type(self.prompt_len_multiplier)})"
```

184-184: Add a better error message for the input length validation.

The assertion should provide more context about what values caused the failure.

```diff
-assert np.all(0 < input_lens_mod) and np.all(input_lens_mod <= self.block_size)
+invalid_low = input_lens_mod[input_lens_mod <= 0]
+invalid_high = input_lens_mod[input_lens_mod > self.block_size]
+assert len(invalid_low) == 0 and len(invalid_high) == 0, (
+    f"Invalid input lengths found: {len(invalid_low)} values <= 0, "
+    f"{len(invalid_high)} values > block_size ({self.block_size})"
+)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- .github/workflows/pre-merge-python.yml (2 hunks)
- benchmarks/README.md (1 hunk)
- benchmarks/data_generator/README.md (1 hunk)
- benchmarks/data_generator/__init__.py (1 hunk)
- benchmarks/data_generator/cli.py (1 hunk)
- benchmarks/data_generator/graph_utils.py (1 hunk)
- benchmarks/data_generator/hasher.py (1 hunk)
- benchmarks/data_generator/logging.py (1 hunk)
- benchmarks/data_generator/prefix_analyzer.py (1 hunk)
- benchmarks/data_generator/protocols.py (1 hunk)
- benchmarks/data_generator/sampler.py (1 hunk)
- benchmarks/data_generator/synthesizer.py (1 hunk)
- benchmarks/data_generator/tests/test_hasher.py (1 hunk)
- benchmarks/data_generator/tests/test_sampler.py (1 hunk)
- benchmarks/data_generator/tests/test_synthesizer.py (1 hunk)
- benchmarks/pyproject.toml (1 hunk)
- docs/guides/kv_router_perf_tuning.md (1 hunk)
🧰 Additional context used
🧬 Code Graph Analysis (5)
benchmarks/data_generator/__init__.py (3)
- benchmarks/data_generator/synthesizer.py (1): main (350-455)
- benchmarks/data_generator/cli.py (1): main (20-48)
- benchmarks/data_generator/prefix_analyzer.py (1): main (156-183)

benchmarks/data_generator/tests/test_sampler.py (1)
- benchmarks/data_generator/sampler.py (2): EmpiricalSampler (50-69), sample (66-69)

benchmarks/data_generator/cli.py (2)
- benchmarks/data_generator/synthesizer.py (1): main (350-455)
- benchmarks/data_generator/prefix_analyzer.py (1): main (156-183)

benchmarks/data_generator/tests/test_hasher.py (1)
- benchmarks/data_generator/hasher.py (1): texts_to_hashes (21-74)

benchmarks/data_generator/graph_utils.py (1)
- benchmarks/data_generator/sampler.py (1): get_cdf (26-28)
🪛 YAMLlint (1.37.1)
.github/workflows/pre-merge-python.yml
[error] 80-80: no new line character at the end of file
(new-line-at-end-of-file)
🪛 LanguageTool
benchmarks/data_generator/README.md
[uncategorized] ~42-~42: Loose punctuation mark.
Context: ...-size <block_size> ``` - --input-file: Path to your trace file in jsonl format...
(UNLIKELY_OPENING_PUNCTUATION)
[typographical] ~97-~97: The conjunction “so that” does not have a comma in front.
Context: ... of being incremented by a large integer, so that they will be effectively separated into...
(SO_THAT_UNNECESSARY_COMMA)
[style] ~97-~97: Consider using a different adverb to strengthen your wording.
Context: ...tistics of the original one, but having completely different roots. For example, if rows ...
(COMPLETELY_ENTIRELY)
[grammar] ~114-~114: “Node” is a singular noun. It appears that the verb form is incorrect.
Context: ...the parent. As a consequence, each node need to store an attribute length to indic...
(PCT_SINGULAR_NOUN_PLURAL_VERB_AGREEMENT)
[uncategorized] ~116-~116: “the” seems less likely than “they”.
Context: ...o sample a path in the core radix tree, the append the path with new hash ids corre...
(AI_HYDRA_LEO_CP_THE_THEY)
[style] ~131-~131: To reduce wordiness, try specifying a number or using “many” or “numerous” instead.
Context: ...-to-end test. It is important to sample a large number of requests (e.g., hundreds of thousands) ...
(LARGE_NUMBER_OF)
[misspelling] ~131-~131: This word is normally spelled with a hyphen.
Context: ...statistics (such as mean ISL) should be well preserved in the synthetic data. However, the sta...
(EN_COMPOUNDS_WELL_PRESERVED)
🪛 markdownlint-cli2 (0.17.2)
benchmarks/data_generator/README.md
20-20: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
76-76: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
87-87: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
101-101: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
121-121: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
127-127: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
🪛 Pylint (3.3.7)
benchmarks/data_generator/prefix_analyzer.py
[refactor] 22-22: Too few public methods (1/2)
(R0903)
benchmarks/data_generator/sampler.py
[refactor] 44-47: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 50-50: Too few public methods (1/2)
(R0903)
benchmarks/data_generator/tests/test_synthesizer.py
[refactor] 40-40: Too many arguments (6/5)
(R0913)
[refactor] 40-40: Too many positional arguments (6/5)
(R0917)
benchmarks/data_generator/synthesizer.py
[refactor] 32-32: Too many instance attributes (12/7)
(R0902)
[refactor] 33-33: Too many arguments (7/5)
(R0913)
[refactor] 33-33: Too many positional arguments (7/5)
(R0917)
[refactor] 33-33: Too many local variables (29/15)
(R0914)
[refactor] 33-33: Too many branches (15/12)
(R0912)
[refactor] 33-33: Too many statements (67/50)
(R0915)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (9)
benchmarks/data_generator/__init__.py (1)
16-21: Clean package entry point implementation.

The implementation correctly provides a package-level entry point that delegates to the CLI main function. This follows standard Python packaging patterns.
.github/workflows/pre-merge-python.yml (1)
57-57: Installation step is necessary for testing the new package.

The addition of `pip install -e /workspace/benchmarks` before running pytest is correct and necessary to ensure the new benchmarks package is available during testing.

docs/guides/kv_router_perf_tuning.md (1)
68-69: LGTM: Excellent documentation enhancement.

The added paragraph effectively introduces users to the available analysis tools while setting expectations for future automatic tuning capabilities. This provides valuable context for the current manual tuning requirements.
benchmarks/data_generator/tests/test_sampler.py (1)
22-46: LGTM: Well-designed statistical test.

The test effectively validates the EmpiricalSampler's distribution behavior:
- Uses appropriate sample size (1000) for statistical significance
- Reasonable tolerance range (300-400) for ~333 expected occurrences
- Properly validates absence of unexpected values
- Clear test structure and assertions
The statistical bounds are appropriate for catching significant distribution deviations while allowing for normal sampling variance.
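The test's structure can be sketched with a stand-in sampler (plain `numpy` uniform choice rather than the repo's `EmpiricalSampler`, and a deliberately wider tolerance band than the 300-400 used in the actual test, so the sketch is robust to any seed):

```python
import numpy as np


def test_uniform_sampling_distribution():
    # Stand-in for EmpiricalSampler: draw 1000 samples uniformly
    # from three known values with a fixed seed.
    rng = np.random.default_rng(0)
    values = [10, 20, 30]
    samples = rng.choice(values, size=1000)

    # Each value should appear roughly 1000/3 ≈ 333 times; the wide
    # 250-420 band tolerates sampling variance while catching real skew.
    for v in values:
        count = int((samples == v).sum())
        assert 250 <= count <= 420, f"value {v} appeared {count} times"

    # No unexpected values should ever be produced.
    assert set(np.unique(samples)) == set(values)
```

Pinning the seed makes the statistical bound deterministic, so the test never flakes in CI while still failing loudly on a genuinely broken distribution.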
benchmarks/data_generator/tests/test_hasher.py (2)
22-42: Well-designed test fixture.

The dummy tokenizer fixture is properly constructed with appropriate normalization, pre-tokenization, and decoding components. The simple vocabulary makes the test predictable and easy to debug.
45-54: Comprehensive test case for block hashing.

The test effectively validates the rolling hash behavior with different text combinations, ensuring that the same prefix ("a b c d") produces consistent hash IDs while different continuations produce different chains.
benchmarks/data_generator/hasher.py (1)
21-74: Solid rolling hash implementation.

The function correctly implements rolling hash computation with efficient batch tokenization. The hash-to-integer mapping ensures deterministic and consistent hash IDs across different runs.
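The rolling-hash idea can be illustrated independently of the repo's `texts_to_hashes` (which tokenizes real text). Everything below, including the block size, the sha256 chaining, and the `_hash_to_id` mapping, is an assumed sketch of the technique, not the actual `hasher.py`:

```python
import hashlib

# Shared digest -> small-integer mapping, so identical blocks get
# identical ids across calls (deterministic within one process).
_hash_to_id: dict = {}


def rolling_block_hashes(tokens, block_size=4):
    """Hash fixed-size token blocks, chaining each block's digest with
    the previous one: identical prefixes yield identical id chains,
    and the first differing block diverges for all blocks after it."""
    ids = []
    prev = b""
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        # Chain: this digest depends on the block AND the previous digest.
        digest = hashlib.sha256(prev + " ".join(map(str, block)).encode()).digest()
        if digest not in _hash_to_id:
            _hash_to_id[digest] = len(_hash_to_id)
        ids.append(_hash_to_id[digest])
        prev = digest
    return ids


# Same prefix block -> same leading id; different continuation -> divergence.
a = rolling_block_hashes(list("abcdwxyz"))
b = rolling_block_hashes(list("abcdqrst"))
assert a[0] == b[0] and a[1] != b[1]
```

This prefix-preserving property is what lets the analyzer detect shared prefixes (cache hits) purely from hash-id sequences.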
benchmarks/pyproject.toml (1)
16-82: Well-structured project configuration.

The pyproject.toml file is comprehensive with appropriate metadata, dependencies, and tool configurations. The CLI entry point and package setup will enable proper installation and usage of the benchmarking tools.
benchmarks/data_generator/tests/test_synthesizer.py (1)
75-96: Comprehensive graph structure validation.

The test effectively validates the synthesizer's graph construction by checking node relationships and attributes. The test data covers various scenarios with different hash sequences.
```diff
       PYTEST_MARKS: "pre_merge or mypy"
     run: |
-      docker run -w /workspace --name ${{ env.CONTAINER_ID }}_pytest ${{ steps.define_image_tag.outputs.image_tag }} pytest --basetemp=/tmp --junitxml=${{ env.PYTEST_XML_FILE }} -m "${{ env.PYTEST_MARKS }}"
+      docker run -w /workspace --name ${{ env.CONTAINER_ID }}_pytest ${{ steps.define_image_tag.outputs.image_tag }} bash -c "pip install -e /workspace/benchmarks && pytest --basetemp=/tmp --junitxml=${{ env.PYTEST_XML_FILE }} -m \"${{ env.PYTEST_MARKS }}\""
```
@PeaBrane it looks like you don't have signed commits enabled, so the gitlab PR didn't get triggered. These changes look like they're failing in similar tests on gitlab side because the benchmarks package doesn't get installed, so mypy doesn't know about the import.
The reason for duplicate tests on gitlab side is to access wider pool of GPU runners for GPU testing.
ex: https://gitlab-master.nvidia.com/dl/ai-dynamo/dynamo/-/jobs/175820836
Btw - was the correct fix to pip install in the test step here? Or would it make more sense to install in the Dockerfile itself so it's available to all? CC @nnshah1
- Can the benchmarks directory be moved under tests?
- Should the benchmark dependencies be added to the Dockerfile? If yes, which container image/stage should they be included in? Or can they be added to requirements.test.txt?
I would recommend adding it to the docker file - as part of dev or ci
Overview:
The scope of this PR is well described in the committed README
Introduces two extra Python dependencies: `networkx` and `pandas`.

Open discussion:

- `benchmarks` directory with `dynamo`?

Summary by CodeRabbit
New Features
Documentation
Bug Fixes
Tests
Chores