feat: data synthesizer based on prefix statistics #1087
Conversation
Co-authored-by: Neelay Shah <neelays@nvidia.com> Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
Walkthrough

This update introduces a comprehensive benchmarking and data synthesis toolkit under the `benchmarks` directory.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as CLI (datagen)
    participant Analyzer
    participant Synthesizer
    participant FileSystem
    User->>CLI: Run "datagen analyze <input>"
    CLI->>Analyzer: Pass arguments
    Analyzer->>FileSystem: Load dataset
    Analyzer->>Analyzer: Analyze prefixes, cache hits
    Analyzer->>User: Print statistics
    User->>CLI: Run "datagen synthesize <input> [options]"
    CLI->>Synthesizer: Pass arguments
    Synthesizer->>FileSystem: Load dataset
    Synthesizer->>Synthesizer: Build radix tree, sample paths
    Synthesizer->>FileSystem: Write synthetic dataset
    Synthesizer->>User: Print summary
```
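The dispatch flow above can be sketched as a minimal argparse-based CLI. The `analyze` and `synthesize` subcommand names come from the diagram; the handler wiring, option names (`--num-requests`), and return values below are illustrative assumptions, not the actual contents of `cli.py`:

```python
import argparse


def analyze(input_file: str) -> str:
    # Placeholder for the Analyzer: load the dataset and compute
    # prefix / cache-hit statistics (hypothetical stand-in).
    return f"analyzed {input_file}"


def synthesize(input_file: str, num_requests: int) -> str:
    # Placeholder for the Synthesizer: build a radix tree, sample paths,
    # and write the synthetic dataset (hypothetical stand-in).
    return f"synthesized {num_requests} requests from {input_file}"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="datagen")
    sub = parser.add_subparsers(dest="command", required=True)

    p_analyze = sub.add_parser("analyze", help="print prefix/cache-hit statistics")
    p_analyze.add_argument("input")

    p_synth = sub.add_parser("synthesize", help="generate a synthetic dataset")
    p_synth.add_argument("input")
    p_synth.add_argument("--num-requests", type=int, default=1000)
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    if args.command == "analyze":
        return analyze(args.input)
    return synthesize(args.input, args.num_requests)
```

A subparser-per-command layout like this keeps each tool's options isolated while sharing one console entry point.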
Actionable comments posted: 15
🔭 Outside diff range comments (1)
benchmarks/data_generator/logging.py (1)
1-56: 🛠️ Refactor suggestion — Module name conflicts with Python's built-in `logging` module.

The filename `logging.py` conflicts with Python's standard library `logging` module, which could cause import confusion. Consider renaming to something more specific like `statistics.py`, `metrics_utils.py`, or `stats_logger.py`.

The function implementation itself is well-designed, with proper type hints, documentation, and statistical calculations.
🧹 Nitpick comments (12)
benchmarks/README.md (1)
1-14: Consider using standard markdown format for the license header.

The HTML comment wrapper for the license header is unusual in markdown files. Consider either removing the comment wrapper entirely or using standard markdown syntax.

The documentation content is clear and concise.
.github/workflows/pre-merge-python.yml (1)
80-80: Add missing newline at end of file.

Static analysis detected a missing newline character at the end of the file.

```diff
-      path: ${{ github.event_path }}
+      path: ${{ github.event_path }}
+
```

🧰 Tools
🪛 YAMLlint (1.37.1)
[error] 80-80: no new line character at the end of file
(new-line-at-end-of-file)
benchmarks/data_generator/protocols.py (1)
20-22: LGTM: Clean protocol constants with good documentation.

The use of negative integers for special nodes is a solid design choice that avoids conflicts with real node IDs. Consider adding type annotations for better code clarity:

```diff
-SUPER_ROOT = -1  # Dummy node preceding all real nodes; not an actual data root
-CACHE_END = -2   # Special node indicating end of a path
-END_NODE = -3    # Special node indicating to skip leaf sampling
+SUPER_ROOT: int = -1  # Dummy node preceding all real nodes; not an actual data root
+CACHE_END: int = -2   # Special node indicating end of a path
+END_NODE: int = -3    # Special node indicating to skip leaf sampling
```

benchmarks/pyproject.toml (1)
42-49: Consider moving pytest-mypy to development dependencies.

The `pytest-mypy` package is typically used during development/testing and might be better placed in an optional dev dependencies group rather than in the required dependencies.

Consider restructuring like this:

```diff
 dependencies = [
     "networkx",
     "pandas",
     "tabulate",
     "types-tabulate",
     "transformers",
-    "pytest-mypy",
 ]
+
+[project.optional-dependencies]
+dev = [
+    "pytest-mypy",
+]
```

benchmarks/data_generator/sampler.py (2)
38-47: Remove unnecessary `else` branch.

`np.random.rand()` is returned only when `rng` is `None`, so the `else` is redundant and flagged by Pylint R1705.

```diff
-    if rng is not None:
-        return data[np.searchsorted(cdf, rng.random())]
-    else:
-        return data[np.searchsorted(cdf, np.random.rand())]
+    rnd = rng.random() if rng is not None else np.random.rand()
+    return data[np.searchsorted(cdf, rnd)]
```
🪛 Pylint (3.3.7)
[refactor] 44-47: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
58-64: Make the random seed configurable.

Hard-coding `default_rng(0)` makes every `EmpiricalSampler` deterministic and highly correlated. Accept an optional `seed` or `Generator` instead.
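A minimal sketch of the suggested shape, assuming an empirical CDF built with `np.cumsum` and inverse-transform sampling via `np.searchsorted`. The constructor signature and internals are illustrative, not the repo's actual `EmpiricalSampler`:

```python
import numpy as np


class EmpiricalSampler:
    """Sample from observed data via its empirical CDF.

    Sketch only: accepts a caller-provided Generator or seed instead of
    hard-coding default_rng(0), per the review suggestion.
    """

    def __init__(self, data, rng=None, seed=None):
        self.data = np.asarray(data)
        # Uniform weights over observations -> empirical CDF in [0, 1].
        counts = np.ones(len(self.data))
        self.cdf = np.cumsum(counts) / counts.sum()
        # Prefer an explicit Generator; fall back to a seed, then to
        # fresh OS entropy (seed=None) so instances are uncorrelated.
        self.rng = rng if rng is not None else np.random.default_rng(seed)

    def sample(self):
        # Inverse-transform sampling: map a uniform draw through the CDF.
        return self.data[np.searchsorted(self.cdf, self.rng.random())]
```

Passing the same seed to two samplers reproduces identical sequences, which keeps tests deterministic without coupling all production instances.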
20-25: Specify a code-block language to satisfy markdownlint and enable syntax highlighting.

````diff
-```
+```json
````

(Apply similarly to all fenced blocks.)
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
20-20: Fenced code blocks should have a language specified
null(MD040, fenced-code-language)
114-116: Minor grammar fix, singular agreement: “each node need to store” → “each node needs to store”.
🧰 Tools
🪛 LanguageTool
[grammar] ~114-~114: “Node” is a singular noun. It appears that the verb form is incorrect.
Context: ...the parent. As a consequence, each node need to store an attributelengthto indic...(PCT_SINGULAR_NOUN_PLURAL_VERB_AGREEMENT)
[uncategorized] ~116-~116: “the” seems less likely than “they”.
Context: ...o sample a path in the core radix tree, the append the path with new hash ids corre...(AI_HYDRA_LEO_CP_THE_THEY)
benchmarks/data_generator/prefix_analyzer.py (2)
60-67: Docstring & return type mismatch.

Docstring says “Tuple”, but the type annotation is `dict[str, list]`. Update one of them for consistency.

44-50: Prefer the `logging` module.

benchmarks/data_generator/synthesizer.py (2)
81-94: Improve assertion error messages for better debugging.

The assertion error messages could be more descriptive, to help users understand what went wrong.

```diff
-assert (
-    isinstance(self.num_copies, int) and self.num_copies >= 1
-), "num_copies must be an integer greater than or equal to 1"
+assert (
+    isinstance(self.num_copies, int) and self.num_copies >= 1
+), f"num_copies must be an integer >= 1, got {self.num_copies} (type: {type(self.num_copies)})"

-assert (
-    isinstance(self.speedup_ratio, float) and self.speedup_ratio > 0
-), "speedup_ratio must be a positive float"
+assert (
+    isinstance(self.speedup_ratio, float) and self.speedup_ratio > 0
+), f"speedup_ratio must be a positive float, got {self.speedup_ratio} (type: {type(self.speedup_ratio)})"

-assert (
-    isinstance(self.prefix_len_multiplier, float)
-    and self.prefix_len_multiplier > 0
-), "context_len_multiplier must be a positive float"
+assert (
+    isinstance(self.prefix_len_multiplier, float)
+    and self.prefix_len_multiplier > 0
+), f"prefix_len_multiplier must be a positive float, got {self.prefix_len_multiplier} (type: {type(self.prefix_len_multiplier)})"

-assert (
-    isinstance(self.prompt_len_multiplier, float)
-    and self.prompt_len_multiplier > 0
-), "prompt_len_multiplier must be a positive float"
+assert (
+    isinstance(self.prompt_len_multiplier, float)
+    and self.prompt_len_multiplier > 0
+), f"prompt_len_multiplier must be a positive float, got {self.prompt_len_multiplier} (type: {type(self.prompt_len_multiplier)})"
```

184-184: Add a better error message for the input length validation.

The assertion should provide more context about what values caused the failure.

```diff
-assert np.all(0 < input_lens_mod) and np.all(input_lens_mod <= self.block_size)
+invalid_low = input_lens_mod[input_lens_mod <= 0]
+invalid_high = input_lens_mod[input_lens_mod > self.block_size]
+assert len(invalid_low) == 0 and len(invalid_high) == 0, (
+    f"Invalid input lengths found: {len(invalid_low)} values <= 0, "
+    f"{len(invalid_high)} values > block_size ({self.block_size})"
+)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- .github/workflows/pre-merge-python.yml (2 hunks)
- benchmarks/README.md (1 hunk)
- benchmarks/data_generator/README.md (1 hunk)
- benchmarks/data_generator/__init__.py (1 hunk)
- benchmarks/data_generator/cli.py (1 hunk)
- benchmarks/data_generator/graph_utils.py (1 hunk)
- benchmarks/data_generator/hasher.py (1 hunk)
- benchmarks/data_generator/logging.py (1 hunk)
- benchmarks/data_generator/prefix_analyzer.py (1 hunk)
- benchmarks/data_generator/protocols.py (1 hunk)
- benchmarks/data_generator/sampler.py (1 hunk)
- benchmarks/data_generator/synthesizer.py (1 hunk)
- benchmarks/data_generator/tests/test_hasher.py (1 hunk)
- benchmarks/data_generator/tests/test_sampler.py (1 hunk)
- benchmarks/data_generator/tests/test_synthesizer.py (1 hunk)
- benchmarks/pyproject.toml (1 hunk)
- docs/guides/kv_router_perf_tuning.md (1 hunk)
🧰 Additional context used
🧬 Code Graph Analysis (5)
benchmarks/data_generator/__init__.py (3)
- benchmarks/data_generator/synthesizer.py (1): main (350-455)
- benchmarks/data_generator/cli.py (1): main (20-48)
- benchmarks/data_generator/prefix_analyzer.py (1): main (156-183)

benchmarks/data_generator/tests/test_sampler.py (1)
- benchmarks/data_generator/sampler.py (2): EmpiricalSampler (50-69), sample (66-69)

benchmarks/data_generator/cli.py (2)
- benchmarks/data_generator/synthesizer.py (1): main (350-455)
- benchmarks/data_generator/prefix_analyzer.py (1): main (156-183)

benchmarks/data_generator/tests/test_hasher.py (1)
- benchmarks/data_generator/hasher.py (1): texts_to_hashes (21-74)

benchmarks/data_generator/graph_utils.py (1)
- benchmarks/data_generator/sampler.py (1): get_cdf (26-28)
🪛 YAMLlint (1.37.1)
.github/workflows/pre-merge-python.yml
[error] 80-80: no new line character at the end of file
(new-line-at-end-of-file)
🪛 LanguageTool
benchmarks/data_generator/README.md
[uncategorized] ~42-~42: Loose punctuation mark.
Context: ...-size <block_size> ``` - --input-file: Path to your trace file in jsonl format...
(UNLIKELY_OPENING_PUNCTUATION)
[typographical] ~97-~97: The conjunction “so that” does not have a comma in front.
Context: ... of being incremented by a large integer, so that they will be effectively separated into...
(SO_THAT_UNNECESSARY_COMMA)
[style] ~97-~97: Consider using a different adverb to strengthen your wording.
Context: ...tistics of the original one, but having completely different roots. For example, if rows ...
(COMPLETELY_ENTIRELY)
[grammar] ~114-~114: “Node” is a singular noun. It appears that the verb form is incorrect.
Context: ...the parent. As a consequence, each node need to store an attribute length to indic...
(PCT_SINGULAR_NOUN_PLURAL_VERB_AGREEMENT)
[uncategorized] ~116-~116: “the” seems less likely than “they”.
Context: ...o sample a path in the core radix tree, the append the path with new hash ids corre...
(AI_HYDRA_LEO_CP_THE_THEY)
[style] ~131-~131: To reduce wordiness, try specifying a number or using “many” or “numerous” instead.
Context: ...-to-end test. It is important to sample a large number of requests (e.g., hundreds of thousands) ...
(LARGE_NUMBER_OF)
[misspelling] ~131-~131: This word is normally spelled with a hyphen.
Context: ...statistics (such as mean ISL) should be well preserved in the synthetic data. However, the sta...
(EN_COMPOUNDS_WELL_PRESERVED)
🪛 markdownlint-cli2 (0.17.2)
benchmarks/data_generator/README.md
20-20: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
76-76: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
87-87: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
101-101: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
121-121: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
127-127: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
🪛 Pylint (3.3.7)
benchmarks/data_generator/prefix_analyzer.py
[refactor] 22-22: Too few public methods (1/2)
(R0903)
benchmarks/data_generator/sampler.py
[refactor] 44-47: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 50-50: Too few public methods (1/2)
(R0903)
benchmarks/data_generator/tests/test_synthesizer.py
[refactor] 40-40: Too many arguments (6/5)
(R0913)
[refactor] 40-40: Too many positional arguments (6/5)
(R0917)
benchmarks/data_generator/synthesizer.py
[refactor] 32-32: Too many instance attributes (12/7)
(R0902)
[refactor] 33-33: Too many arguments (7/5)
(R0913)
[refactor] 33-33: Too many positional arguments (7/5)
(R0917)
[refactor] 33-33: Too many local variables (29/15)
(R0914)
[refactor] 33-33: Too many branches (15/12)
(R0912)
[refactor] 33-33: Too many statements (67/50)
(R0915)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (9)
benchmarks/data_generator/__init__.py (1)
16-21: Clean package entry point implementation.

The implementation correctly provides a package-level entry point that delegates to the CLI main function. This follows standard Python packaging patterns.
.github/workflows/pre-merge-python.yml (1)
57-57: Installation step is necessary for testing the new package.

The addition of `pip install -e /workspace/benchmarks` before running pytest is correct and necessary to ensure the new benchmarks package is available during testing.

docs/guides/kv_router_perf_tuning.md (1)
68-69: LGTM: Excellent documentation enhancement.

The added paragraph effectively introduces users to the available analysis tools while setting expectations for future automatic tuning capabilities. This provides valuable context for the current manual tuning requirements.
benchmarks/data_generator/tests/test_sampler.py (1)
22-46: LGTM: Well-designed statistical test.

The test effectively validates the EmpiricalSampler's distribution behavior:
- Uses appropriate sample size (1000) for statistical significance
- Reasonable tolerance range (300-400) for ~333 expected occurrences
- Properly validates absence of unexpected values
- Clear test structure and assertions
The statistical bounds are appropriate for catching significant distribution deviations while allowing for normal sampling variance.
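The test's structure can be sketched with a stand-in sampler (plain `numpy` uniform choice rather than the repo's `EmpiricalSampler`, and a deliberately wider tolerance band than the 300-400 used in the actual test, so the sketch is robust to any seed):

```python
import numpy as np


def test_uniform_sampling_distribution():
    # Stand-in for EmpiricalSampler: draw 1000 samples uniformly
    # from three known values with a fixed seed.
    rng = np.random.default_rng(0)
    values = [10, 20, 30]
    samples = rng.choice(values, size=1000)

    # Each value should appear roughly 1000/3 ≈ 333 times; the wide
    # 250-420 band tolerates sampling variance while catching real skew.
    for v in values:
        count = int((samples == v).sum())
        assert 250 <= count <= 420, f"value {v} appeared {count} times"

    # No unexpected values should ever be produced.
    assert set(np.unique(samples)) == set(values)
```

Pinning the seed makes the statistical bound deterministic, so the test never flakes in CI while still failing loudly on a genuinely broken distribution.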
benchmarks/data_generator/tests/test_hasher.py (2)
22-42: Well-designed test fixture.

The dummy tokenizer fixture is properly constructed with appropriate normalization, pre-tokenization, and decoding components. The simple vocabulary makes the test predictable and easy to debug.
45-54: Comprehensive test case for block hashing.

The test effectively validates the rolling hash behavior with different text combinations, ensuring that the same prefix ("a b c d") produces consistent hash IDs while different continuations produce different chains.
benchmarks/data_generator/hasher.py (1)
21-74: Solid rolling hash implementation.

The function correctly implements rolling hash computation with efficient batch tokenization. The hash-to-integer mapping ensures deterministic and consistent hash IDs across different runs.
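The rolling-hash idea can be illustrated independently of the repo's `texts_to_hashes` (which tokenizes real text). Everything below, including the block size, the sha256 chaining, and the `_hash_to_id` mapping, is an assumed sketch of the technique, not the actual `hasher.py`:

```python
import hashlib

# Shared digest -> small-integer mapping, so identical blocks get
# identical ids across calls (deterministic within one process).
_hash_to_id: dict = {}


def rolling_block_hashes(tokens, block_size=4):
    """Hash fixed-size token blocks, chaining each block's digest with
    the previous one: identical prefixes yield identical id chains,
    and the first differing block diverges for all blocks after it."""
    ids = []
    prev = b""
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        # Chain: this digest depends on the block AND the previous digest.
        digest = hashlib.sha256(prev + " ".join(map(str, block)).encode()).digest()
        if digest not in _hash_to_id:
            _hash_to_id[digest] = len(_hash_to_id)
        ids.append(_hash_to_id[digest])
        prev = digest
    return ids


# Same prefix block -> same leading id; different continuation -> divergence.
a = rolling_block_hashes(list("abcdwxyz"))
b = rolling_block_hashes(list("abcdqrst"))
assert a[0] == b[0] and a[1] != b[1]
```

This prefix-preserving property is what lets the analyzer detect shared prefixes (cache hits) purely from hash-id sequences.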
benchmarks/pyproject.toml (1)
16-82: Well-structured project configuration.

The pyproject.toml file is comprehensive with appropriate metadata, dependencies, and tool configurations. The CLI entry point and package setup will enable proper installation and usage of the benchmarking tools.
benchmarks/data_generator/tests/test_synthesizer.py (1)
75-96: Comprehensive graph structure validation.

The test effectively validates the synthesizer's graph construction by checking node relationships and attributes. The test data covers various scenarios with different hash sequences.
```diff
       PYTEST_MARKS: "pre_merge or mypy"
     run: |
-      docker run -w /workspace --name ${{ env.CONTAINER_ID }}_pytest ${{ steps.define_image_tag.outputs.image_tag }} pytest --basetemp=/tmp --junitxml=${{ env.PYTEST_XML_FILE }} -m "${{ env.PYTEST_MARKS }}"
+      docker run -w /workspace --name ${{ env.CONTAINER_ID }}_pytest ${{ steps.define_image_tag.outputs.image_tag }} bash -c "pip install -e /workspace/benchmarks && pytest --basetemp=/tmp --junitxml=${{ env.PYTEST_XML_FILE }} -m \"${{ env.PYTEST_MARKS }}\""
```
@PeaBrane it looks like you don't have signed commits enabled, so the gitlab PR didn't get triggered. These changes look like they're failing in similar tests on gitlab side because the benchmarks package doesn't get installed, so mypy doesn't know about the import.
The reason for duplicate tests on gitlab side is to access wider pool of GPU runners for GPU testing.
ex: https://gitlab-master.nvidia.com/dl/ai-dynamo/dynamo/-/jobs/175820836
Btw - was the correct fix to pip install in the test step here? Or would it make more sense to install in the Dockerfile itself so it's available to all? CC @nnshah1
- Can the benchmarks directory be moved under tests?
- Should the benchmark dependencies be added to the Dockerfile? If yes, which container image/stage should they be included in? Or can they be added to requirements.test.txt?
I would recommend adding it to the docker file - as part of dev or ci
Overview:
The scope of this PR is well described in the committed README
Introduces two extra Python dependencies: `networkx` and `pandas`.

Open discussion:

- `benchmarks` directory with `dynamo`?

Summary by CodeRabbit
New Features
Documentation
Bug Fixes
Tests
Chores