feat: generate random texts from hashes using lorem ipsum #1458
Conversation
Walkthrough

The updates introduce a new example script for synthesizing requests from Mooncake traces, enhance documentation with a quickstart and workflow guide, and add bidirectional conversion between text and hash IDs in the data generator. The synthesizer's constructor is refactored, and new tests validate hash-to-text conversion with a real tokenizer.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant ExampleScript
    participant Synthesizer
    participant Hasher
    participant Tokenizer
    User->>ExampleScript: Run example.py
    ExampleScript->>Synthesizer: Initialize with Mooncake trace and parameters
    Synthesizer->>Synthesizer: Generate synthetic requests (hash IDs)
    ExampleScript->>Hasher: hashes_to_texts(tokenizer, hash_ids, input_lengths)
    Hasher->>Tokenizer: Tokenize and detokenize random words per hash ID
    Hasher->>ExampleScript: Return generated texts
    ExampleScript->>File: Write synthesized requests as JSONL
```
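A minimal sketch of this workflow is shown below. The module paths, the `Synthesizer` constructor arguments, and the `synthesize()` method name are assumptions for illustration; only `hashes_to_texts(tokenizer, hash_ids, input_lengths)` is taken from the diagram above.

```python
# Hypothetical sketch of the example.py flow; names marked "assumed" are not
# confirmed by the PR and may differ from the actual API.
import json

from transformers import AutoTokenizer

from data_generator.hasher import hashes_to_texts      # path assumed
from data_generator.synthesizer import Synthesizer     # path assumed

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate synthetic requests (hash IDs) from a Mooncake trace.
synthesizer = Synthesizer("mooncake_trace.jsonl", block_size=512)  # args assumed
requests = synthesizer.synthesize()                                # method assumed

# Convert hash IDs back into natural-looking text of the requested lengths.
hash_ids = [r["hash_ids"] for r in requests]
input_lengths = [r["input_length"] for r in requests]
texts = hashes_to_texts(tokenizer, hash_ids, input_lengths)

# Write the synthesized requests as JSONL.
with open("synthesized_requests.jsonl", "w") as f:
    for text in texts:
        f.write(json.dumps({"text": text}) + "\n")
```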
Actionable comments posted: 5
🔭 Outside diff range comments (1)
benchmarks/pyproject.toml (1)
42-50: 🛠️ Refactor suggestion

Unpinned dependency may break reproducible builds
"lorem_text"is added without a version specifier. Each CI run will now pick up whatever the latest version is on PyPI, which can introduce silent breakages or licensing surprises.- "lorem_text", + # Pin to the newest known-good version to keep builds reproducible + "lorem_text>=0.1,<0.2",Consider pinning to a compatible upper bound or including it in a
constraints.txt.
🧹 Nitpick comments (5)
benchmarks/data_generator/README.md (1)
29-36: Nice clarification – consider mentioning determinism caveat

Good addition explaining the round-trip workflow. One tiny nit: random Lorem-Ipsum sampling makes the process non-deterministic; a short sentence about setting `PYTHONHASHSEED`/`random.seed` for reproducible text may help advanced users.
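For example, a note along these lines could be added to the README. Whether the generator draws from Python's `random`, NumPy's global RNG, or both is an assumption; the hasher snippets elsewhere in this review use `np.random.choice`, so NumPy seeding is the part most likely to matter.

```python
# Reproducibility sketch: seed the RNGs before generating texts.
# Seeding numpy covers np.random.choice used in hasher.py; seeding random is
# included in case the lorem_text package relies on it (assumption).
import random

import numpy as np

random.seed(0)
np.random.seed(0)
```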
benchmarks/data_generator/tests/test_hasher.py (2)

16-18: Unused imports & missing reproducibility seed

`math` is used later, but `random` is imported without a fixed seed. Unseeded randomness makes the test non-reproducible across failures/debug sessions.

```diff
-import random
+import random
+
+random.seed(0)  # ensure deterministic test data
```

Also check that `random` remains used; otherwise remove it.
65-100: Minor: tighten assertion message for easier triage

If this loop fails, knowing the mismatched lengths is helpful, but printing the first few tokens can aid debugging.

```diff
-        assert (
-            actual_length == expected_length
-        ), f"Entry {i}: expected length {expected_length}, got {actual_length}"
+        assert actual_length == expected_length, (
+            f"Entry {i}: expected {expected_length}, got {actual_length}. "
+            f"Sample tokens: {tokens[:10]}"
+        )
```

benchmarks/data_generator/hasher.py (2)
23-25: Consider lazy initialization for lorem text generation.

Generating 20 paragraphs of Lorem Ipsum text at module import time could impact import performance. Consider moving this initialization to a lazy loading pattern or the first call to `hashes_to_texts`.

Apply this diff to implement lazy initialization:

```diff
-# Generate 20 paragraphs of Lorem Ipsum
-lorem_text = lorem.paragraphs(20)
-words = np.array(list(set(re.findall(r"\b[a-zA-Z]+\b", lorem_text))))
+# Lazy initialization for lorem words
+_words = None
+
+def _get_lorem_words():
+    global _words
+    if _words is None:
+        lorem_text = lorem.paragraphs(20)
+        _words = np.array(list(set(re.findall(r"\b[a-zA-Z]+\b", lorem_text))))
+    return _words
```

Then update line 145 in `hashes_to_texts` to use:

```diff
-sampled_words = np.random.choice(words, size=current_block_size)
+sampled_words = np.random.choice(_get_lorem_words(), size=current_block_size)
```
91-160: Consider refactoring to reduce function complexity.

The function has 20 local variables, exceeding the recommended limit of 15. Consider extracting helper functions to improve readability and maintainability.

Extract the token generation logic into a helper function:

```python
def _generate_token_block(tokenizer, hash_id, block_size, hash_to_tokens_cache):
    """Generate or retrieve token block for a given hash_id."""
    if hash_id in hash_to_tokens_cache:
        return hash_to_tokens_cache[hash_id]

    # Generate new random array
    sampled_words = np.random.choice(_get_lorem_words(), size=block_size)
    sampled_text = " ".join(sampled_words)
    tokens = tokenizer.encode(sampled_text, add_special_tokens=False)
    token_array = np.array(tokens[:block_size], dtype=np.int32)

    if tokenizer.bos_token_id is not None:
        token_array[0] = tokenizer.bos_token_id

    hash_to_tokens_cache[hash_id] = token_array
    return token_array
```

🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 91-91: Too many local variables (20/15)
(R0914)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- benchmarks/data_generator/README.md (1 hunks)
- benchmarks/data_generator/hasher.py (2 hunks)
- benchmarks/data_generator/prefix_analyzer.py (1 hunks)
- benchmarks/data_generator/synthesizer.py (1 hunks)
- benchmarks/data_generator/tests/test_hasher.py (3 hunks)
- benchmarks/pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
benchmarks/data_generator/synthesizer.py (1)
benchmarks/data_generator/logging_utils.py (1)
calculate_and_print_statistics(23-55)
benchmarks/data_generator/prefix_analyzer.py (1)
benchmarks/data_generator/logging_utils.py (1)
calculate_and_print_statistics(23-55)
🪛 Pylint (3.3.7)
benchmarks/data_generator/hasher.py
[refactor] 91-91: Too many local variables (20/15)
(R0914)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
🔇 Additional comments (2)
benchmarks/data_generator/tests/test_hasher.py (1)
48-51: External-model download will slow / break CI
`AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")` triggers a ~1 GB download and requires network connectivity. Typical CI environments are offline or have tight timeouts.

Options:

- Mark the fixture as slow / optional:

  ```python
  import pytest

  @pytest.fixture(scope="module")
  @pytest.mark.slow
  def deepseek_tokenizer():
      ...
  ```

- Skip if `HF_HUB_OFFLINE=1` or no internet (a minimal sketch of this guard follows after this list).
- Use a lightweight HF tokenizer such as `"gpt2"` (≈ 5 MB) that still exercises the same code path.

Failing to guard this will likely break the pipeline.
benchmarks/data_generator/hasher.py (1)
28-51: LGTM!

The enhanced tokenizer parameter handling provides good flexibility by accepting either a tokenizer object or a string name.
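A rough illustration of the pattern being praised; the helper name and placement are hypothetical, and the actual handling in `hasher.py` may differ.

```python
# Hypothetical sketch: accept either a tokenizer object or a model-name string.
from transformers import AutoTokenizer


def _resolve_tokenizer(tokenizer):
    """Return a tokenizer instance, loading it by name if a string is given."""
    if isinstance(tokenizer, str):
        return AutoTokenizer.from_pretrained(tokenizer)
    return tokenizer
```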
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
Actionable comments posted: 3
♻️ Duplicate comments (1)
benchmarks/data_generator/hasher.py (1)
118-123: ⚠️ Potential issue

Constraint check is still inverted – it raises on the safe case and lets real violations slip through. Same concern was raised in a previous review but the guard was not fixed. The requirement is “block capacity must not exceed `input_len`” → `len(hash_ids) * block_size` should be ≤ `input_len`.

```diff
-    # Verify constraint: len(hash_ids) * block_size <= input_len
-    if len(hash_ids) * block_size < input_len:
+    # Verify constraint: len(hash_ids) * block_size <= input_len
+    if len(hash_ids) * block_size > input_len:
         raise ValueError(
             f"Constraint violation: len(hash_ids) * block_size ({len(hash_ids) * block_size}) > input_len ({input_len})"
         )
```

Without this fix the function silently produces truncated texts for the real overflow case.
🧹 Nitpick comments (1)
benchmarks/data_generator/hasher.py (1)
24-25: Minor: sampling pool built from `set(…)` flattens word frequency.

Converting to a `set` discards natural word repetition, producing uniformly random words instead of a lorem-like distribution. If more natural output is preferred, keep duplicates:

```diff
-words = np.array(list(set(re.findall(r"\b[a-zA-Z]+\b", lorem_text)))))
+words = np.array(re.findall(r"\b[a-zA-Z]+\b", lorem_text))
```

Purely stylistic; feel free to ignore if uniform randomness is intentional.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- benchmarks/data_generator/hasher.py (2 hunks)
- benchmarks/data_generator/synthesizer.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- benchmarks/data_generator/synthesizer.py
🧰 Additional context used
🪛 Pylint (3.3.7)
benchmarks/data_generator/hasher.py
[refactor] 91-91: Too many local variables (20/15)
(R0914)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
Actionable comments posted: 2
♻️ Duplicate comments (3)
benchmarks/data_generator/hasher.py (3)
125-128: ⚠️ Potential issue

Constraint check is inverted – raises on valid input, lets violations pass. The guard must trigger when the hash capacity exceeds the allowed `input_len`.

```diff
-if len(hash_ids) * block_size < input_len:
+if len(hash_ids) * block_size > input_len:
```
143-147: 🛠️ Refactor suggestion

Rigid length assertion breaks when the same hash ID appears with different residual block sizes. That case is legitimate for the last block of a sequence. Either slice, or key the cache by `(hash_id, block_size)` (a sketch of the latter follows below). Minimal non-breaking fix:

```diff
-        assert len(existing_array) == current_block_size, ...
-        token_array = existing_array
+        if len(existing_array) != current_block_size:
+            token_array = existing_array[:current_block_size]
+        else:
+            token_array = existing_array
```
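A minimal sketch of the alternative mentioned above, caching by `(hash_id, block_size)`. The cache and variable names mirror the review snippets; the helper itself is a hypothetical refactor, not code from the PR.

```python
# Hypothetical: key the cache by (hash_id, block_size) so the same hash ID can
# coexist with different residual block sizes.
import numpy as np


def _get_token_block(tokenizer, hash_id, current_block_size, cache, words):
    key = (hash_id, current_block_size)
    if key in cache:
        return cache[key]

    # Sample lorem words, tokenize, and trim to the requested block size.
    sampled_text = " ".join(np.random.choice(words, size=current_block_size))
    tokens = tokenizer.encode(sampled_text, add_special_tokens=False)
    token_array = np.array(tokens[:current_block_size], dtype=np.int32)

    cache[key] = token_array
    return token_array
```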
150-156: ⚠️ Potential issue

`token_array` can be shorter than `current_block_size`, leading to an `IndexError` on the BOS assignment and length drift. Pad when necessary before converting to `np.ndarray`; since `words` holds word strings rather than token IDs, obtain the padding by encoding extra sampled words:

```diff
-    tokens = tokenizer.encode(sampled_text, add_special_tokens=False)
-    token_array = np.array(tokens[:current_block_size], dtype=np.int32)
+    tokens = tokenizer.encode(sampled_text, add_special_tokens=False)
+    while len(tokens) < current_block_size:
+        extra_text = " ".join(np.random.choice(words, size=current_block_size - len(tokens)))
+        tokens.extend(tokenizer.encode(extra_text, add_special_tokens=False))
+    token_array = np.array(tokens[:current_block_size], dtype=np.int32)
```
🧹 Nitpick comments (3)
benchmarks/README.md (1)
31-31: Grammar fix – “information is provided”, not “are”.
`information` is an uncountable noun, so use the singular verb.

```diff
-Detailed information are provided in the `data_generator` directory.
+Detailed information is provided in the `data_generator` directory.
```

🧰 Tools
🪛 LanguageTool
[uncategorized] ~31-~31: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...tagen synthesize) Detailed information are provided in thedata_generator` direct...(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
benchmarks/data_generator/example.py (1)
62-66: Reuse the `output_file` variable for consistency.

```diff
-with open("synthesized_requests.jsonl", "w") as f:
+with open(output_file, "w") as f:
```

benchmarks/data_generator/hasher.py (1)
121-122: Caching inside the function defeats determinism & reuse.
Every call rebuilds `_hash_id_to_tokens`, so the same `hash_id` maps to different text across calls.
If reproducibility matters, promote the cache to module scope or accept an explicit cache parameter.
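A minimal sketch of the module-scope option; the cache and helper names are hypothetical and only illustrate the idea, not the actual `hasher.py` layout.

```python
# Hypothetical module-scope cache so repeated calls to hashes_to_texts map the
# same hash_id to the same token block across calls.
_HASH_ID_TO_TOKENS: dict = {}


def _cached_token_block(hash_id, make_block):
    """Return the cached block for hash_id, computing it once via make_block()."""
    if hash_id not in _HASH_ID_TO_TOKENS:
        _HASH_ID_TO_TOKENS[hash_id] = make_block()
    return _HASH_ID_TO_TOKENS[hash_id]
```

Alternatively, the cache could be accepted as an explicit parameter so callers control reuse and reproducibility.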
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- benchmarks/README.md (1 hunks)
- benchmarks/data_generator/README.md (2 hunks)
- benchmarks/data_generator/example.py (1 hunks)
- benchmarks/data_generator/hasher.py (2 hunks)
- benchmarks/data_generator/synthesizer.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- benchmarks/data_generator/README.md
- benchmarks/data_generator/synthesizer.py
🧰 Additional context used
🪛 Pylint (3.3.7)
benchmarks/data_generator/hasher.py
[refactor] 96-96: Too many local variables (20/15)
(R0914)
🪛 LanguageTool
benchmarks/README.md
[uncategorized] ~31-~31: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...tagen synthesize) Detailed information are provided in the data_generator` direct...
(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Test - vllm
tedzhouhk
left a comment
not very familiar with the codebase, approve for unblock
Overview:

Complements `texts_to_hashes`.
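A minimal sketch of how the two directions might fit together, assuming `texts_to_hashes` and `hashes_to_texts` live in `data_generator.hasher` and take roughly the arguments shown in this review; the exact signatures are assumptions.

```python
# Hypothetical round trip: texts -> block hash IDs -> lorem-ipsum texts of the
# same token lengths. Module path and signatures are assumed, not confirmed.
from transformers import AutoTokenizer

from data_generator.hasher import hashes_to_texts, texts_to_hashes

tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 16

texts = ["The quick brown fox jumps over the lazy dog."]
input_lens = [len(tokenizer.encode(t, add_special_tokens=False)) for t in texts]

# Forward direction: text -> hash IDs per block.
hash_ids = texts_to_hashes(tokenizer, texts, block_size=block_size)

# Inverse direction: hash IDs -> random lorem text with the requested lengths.
regenerated = hashes_to_texts(tokenizer, hash_ids, input_lens, block_size=block_size)
print(regenerated[0])
```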
Summary by CodeRabbit

- New Features
- Documentation
- Bug Fixes
- Tests
- Refactor