
Conversation

Contributor

@PeaBrane PeaBrane commented Jun 10, 2025

Overview:

Complements `texts_to_hashes` with the reverse conversion (hash IDs back to text).

Summary by CodeRabbit

  • New Features

    • Added a script to generate synthetic requests from Mooncake trace files, including conversion of hash IDs back to random text using a tokenizer.
    • Introduced a function for reconstructing text from hash IDs, supporting both tokenizer instances and model names.
  • Documentation

    • Enhanced documentation with a Quickstart guide, detailed workflow, and notes on bidirectional conversion between texts and hash IDs.
    • Clarified installation instructions and referenced future benchmarking scripts for core components.
  • Bug Fixes

    • Improved tokenizer handling by supporting both string model names and tokenizer instances.
  • Tests

    • Added comprehensive tests to verify accurate conversion from hash IDs to text, ensuring token length consistency with various tokenizers.
  • Refactor

    • Renamed parameters in the synthesizer for clarity and updated import statements for better module organization.


coderabbitai bot commented Jun 10, 2025

Walkthrough

The updates introduce a new example script for synthesizing requests from Mooncake traces, enhance documentation with a quickstart and workflow guide, and add bidirectional conversion between text and hash IDs in the data generator. The synthesizer's constructor is refactored, and new tests validate hash-to-text conversion with a real tokenizer.
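As a toy illustration of the reverse direction (the real API lives in benchmarks/data_generator/hasher.py; the function name, word pool, and block size below are simplified assumptions, not the PR's implementation): each hash ID is mapped to a cached block of random Lorem-style words, so repeated hash IDs reproduce shared text prefixes.

```python
import random

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "elit"]

def hashes_to_texts_toy(hash_ids, block_size, cache=None):
    """Map each hash ID to a stable random word block (simplified sketch)."""
    cache = {} if cache is None else cache
    blocks = []
    for hid in hash_ids:
        if hid not in cache:
            rng = random.Random(hid)  # seed by hash ID: same ID -> same words
            cache[hid] = " ".join(rng.choice(WORDS) for _ in range(block_size))
        blocks.append(cache[hid])
    return " ".join(blocks)
```

Two requests that share a hash-ID prefix then share a text prefix, which is the property prefix-cache benchmarks need from the synthesized data.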

Changes

| File(s) | Change Summary |
| --- | --- |
| benchmarks/data_generator/README.md, benchmarks/README.md | Documentation updated: added quickstart, clarified workflows, and installation notes. |
| benchmarks/data_generator/hasher.py | Added `hashes_to_texts` function; enhanced tokenizer handling in `texts_to_hashes`. |
| benchmarks/data_generator/example.py | New script for synthesizing and converting requests from Mooncake traces. |
| benchmarks/data_generator/synthesizer.py | Refactored `Synthesizer` constructor parameter: `num_copies` → `prefix_root_multiplier`; import fix. |
| benchmarks/data_generator/prefix_analyzer.py | Import statement updated for logging utility. |
| benchmarks/data_generator/tests/test_hasher.py | Added fixture and test for `hashes_to_texts` using the DeepSeek tokenizer. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ExampleScript
    participant Synthesizer
    participant Hasher
    participant Tokenizer

    User->>ExampleScript: Run example.py
    ExampleScript->>Synthesizer: Initialize with Mooncake trace and parameters
    Synthesizer->>Synthesizer: Generate synthetic requests (hash IDs)
    ExampleScript->>Hasher: hashes_to_texts(tokenizer, hash_ids, input_lengths)
    Hasher->>Tokenizer: Tokenize and detokenize random words per hash ID
    Hasher->>ExampleScript: Return generated texts
    ExampleScript->>File: Write synthesized requests as JSONL

Poem

In the moonlit night, a script was born,
To hash and unhash, from dusk till morn.
With tokens and traces, requests take flight,
Synthesizer hops, making data just right.
Now hashes find voices, in Lorem delight—
A rabbit’s work, both clever and light!
🐇✨


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad53d0 and 8f2dd93.

📒 Files selected for processing (1)
  • benchmarks/data_generator/prefix_analyzer.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • benchmarks/data_generator/prefix_analyzer.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🔭 Outside diff range comments (1)
benchmarks/pyproject.toml (1)

42-50: 🛠️ Refactor suggestion

Unpinned dependency may break reproducible builds

"lorem_text" is added without a version specifier. Each CI run will now pick up whatever the latest version is on PyPI, which can introduce silent breakages or licensing surprises.

-    "lorem_text",
+    # Pin to the newest known-good version to keep builds reproducible
+    "lorem_text>=0.1,<0.2",

Consider pinning to a compatible upper bound or including it in a constraints.txt.

🧹 Nitpick comments (5)
benchmarks/data_generator/README.md (1)

29-36: Nice clarification – consider mentioning determinism caveat

Good addition explaining the round-trip workflow. One tiny nit: random Lorem-Ipsum sampling makes the process non-deterministic; a short sentence about setting PYTHONHASHSEED/random.seed for reproducible text may help advanced users.

benchmarks/data_generator/tests/test_hasher.py (2)

16-18: Unused imports & missing reproducibility seed

math is used later, but random is imported without a fixed seed.
Unseeded randomness makes the test non-reproducible across failures/debug sessions.

-import random
+import random
+
+random.seed(0)  # ensure deterministic test data

Also check that random remains used; otherwise remove it.


65-100: Minor: tighten assertion message for easier triage

If this loop fails, knowing the mismatched lengths is helpful but printing the first few tokens can aid debugging.

-        assert (
-            actual_length == expected_length
-        ), f"Entry {i}: expected length {expected_length}, got {actual_length}"
+        assert actual_length == expected_length, (
+            f"Entry {i}: expected {expected_length}, got {actual_length}. "
+            f"Sample tokens: {tokens[:10]}"
+        )
benchmarks/data_generator/hasher.py (2)

23-25: Consider lazy initialization for lorem text generation.

Generating 20 paragraphs of Lorem Ipsum text at module import time could impact import performance. Consider moving this initialization to a lazy loading pattern or the first call to hashes_to_texts.

Apply this diff to implement lazy initialization:

-# Generate 20 paragraphs of Lorem Ipsum
-lorem_text = lorem.paragraphs(20)
-words = np.array(list(set(re.findall(r"\b[a-zA-Z]+\b", lorem_text))))
+# Lazy initialization for lorem words
+_words = None
+
+def _get_lorem_words():
+    global _words
+    if _words is None:
+        lorem_text = lorem.paragraphs(20)
+        _words = np.array(list(set(re.findall(r"\b[a-zA-Z]+\b", lorem_text))))
+    return _words

Then update line 145 in hashes_to_texts to use:

-sampled_words = np.random.choice(words, size=current_block_size)
+sampled_words = np.random.choice(_get_lorem_words(), size=current_block_size)

91-160: Consider refactoring to reduce function complexity.

The function has 20 local variables, exceeding the recommended limit of 15. Consider extracting helper functions to improve readability and maintainability.

Extract the token generation logic into a helper function:

def _generate_token_block(tokenizer, hash_id, block_size, hash_to_tokens_cache):
    """Generate or retrieve token block for a given hash_id."""
    if hash_id in hash_to_tokens_cache:
        return hash_to_tokens_cache[hash_id]
    
    # Generate new random array
    sampled_words = np.random.choice(_get_lorem_words(), size=block_size)
    sampled_text = " ".join(sampled_words)
    tokens = tokenizer.encode(sampled_text, add_special_tokens=False)
    token_array = np.array(tokens[:block_size], dtype=np.int32)
    
    if tokenizer.bos_token_id is not None:
        token_array[0] = tokenizer.bos_token_id
    
    hash_to_tokens_cache[hash_id] = token_array
    return token_array
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 91-91: Too many local variables (20/15)

(R0914)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3c85cfd and 1f92c6e.

📒 Files selected for processing (6)
  • benchmarks/data_generator/README.md (1 hunks)
  • benchmarks/data_generator/hasher.py (2 hunks)
  • benchmarks/data_generator/prefix_analyzer.py (1 hunks)
  • benchmarks/data_generator/synthesizer.py (1 hunks)
  • benchmarks/data_generator/tests/test_hasher.py (3 hunks)
  • benchmarks/pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
benchmarks/data_generator/synthesizer.py (1)
benchmarks/data_generator/logging_utils.py (1)
  • calculate_and_print_statistics (23-55)
benchmarks/data_generator/prefix_analyzer.py (1)
benchmarks/data_generator/logging_utils.py (1)
  • calculate_and_print_statistics (23-55)
🪛 Pylint (3.3.7)
benchmarks/data_generator/hasher.py

[refactor] 91-91: Too many local variables (20/15)

(R0914)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (2)
benchmarks/data_generator/tests/test_hasher.py (1)

48-51: External-model download will slow / break CI

AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base") triggers a download from the Hugging Face Hub (the tokenizer files, not the full model weights, but it still requires network connectivity).
Typical CI environments are offline or have tight timeouts.

Options:

  1. Mark the tests that use the fixture as slow / optional (note: pytest marks applied directly to fixtures have no effect, so mark the consuming test instead):

import pytest

@pytest.fixture(scope="module")
def deepseek_tokenizer():
    ...

@pytest.mark.slow
def test_hashes_to_texts(deepseek_tokenizer):
    ...

  2. Skip if HF_HUB_OFFLINE=1 is set or no internet connection is available.

  3. Use a lightweight HF tokenizer such as "gpt2" (≈ 5 MB) that still exercises the same code path.

Failing to guard this will likely break the pipeline.
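For option 2, a minimal guard could look like the following (the helper name is illustrative; `HF_HUB_OFFLINE` is the standard Hugging Face Hub environment switch):

```python
import os

def hub_offline() -> bool:
    # Hugging Face Hub honors HF_HUB_OFFLINE=1 to forbid all network access
    return os.environ.get("HF_HUB_OFFLINE", "0") == "1"

# Hypothetical usage on the fixture's consumers:
# @pytest.mark.skipif(hub_offline(), reason="tokenizer download needs network")
# def test_hashes_to_texts(deepseek_tokenizer): ...
```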

benchmarks/data_generator/hasher.py (1)

28-51: LGTM!

The enhanced tokenizer parameter handling provides good flexibility by accepting either a tokenizer object or a string name.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Yan Ru Pei <yanrpei@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
benchmarks/data_generator/hasher.py (1)

118-123: ⚠️ Potential issue

Constraint check is still inverted – it raises on the safe case and lets real violations slip through.

Same concern was raised in a previous review but the guard was not fixed.
The requirement is "block capacity must not exceed `input_len`": `len(hash_ids) * block_size` should be ≤ `input_len`.

-        # Verify constraint: len(hash_ids) * block_size <= input_len
-        if len(hash_ids) * block_size < input_len:
+        # Verify constraint: len(hash_ids) * block_size <= input_len
+        if len(hash_ids) * block_size > input_len:
             raise ValueError(
                 f"Constraint violation: len(hash_ids) * block_size ({len(hash_ids) * block_size}) > input_len ({input_len})"
             )

Without this fix the function silently produces truncated texts for the real overflow case.
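A standalone sketch of the corrected guard (the function name is hypothetical; the condition matches the suggested diff):

```python
def check_block_capacity(hash_ids, block_size, input_len):
    # Capacity consumed by the hash blocks must not exceed the input length;
    # raise only on a real violation, not on the safe (<=) case.
    capacity = len(hash_ids) * block_size
    if capacity > input_len:
        raise ValueError(
            f"Constraint violation: len(hash_ids) * block_size "
            f"({capacity}) > input_len ({input_len})"
        )
```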

🧹 Nitpick comments (1)
benchmarks/data_generator/hasher.py (1)

24-25: Minor: sampling pool built from set(…) flattens word frequency.

Converting to a set discards natural word repetition, producing uniformly random words instead of lorem-like distribution.
If more natural output is preferred, keep duplicates:

-words = np.array(list(set(re.findall(r"\b[a-zA-Z]+\b", lorem_text))))
+words = np.array(re.findall(r"\b[a-zA-Z]+\b", lorem_text))

Purely stylistic; feel free to ignore if uniform randomness is intentional.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f92c6e and 3946c3d.

📒 Files selected for processing (2)
  • benchmarks/data_generator/hasher.py (2 hunks)
  • benchmarks/data_generator/synthesizer.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • benchmarks/data_generator/synthesizer.py
🧰 Additional context used
🪛 Pylint (3.3.7)
benchmarks/data_generator/hasher.py

[refactor] 91-91: Too many local variables (20/15)

(R0914)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm

@PeaBrane PeaBrane enabled auto-merge (squash) June 10, 2025 20:07
@PeaBrane PeaBrane disabled auto-merge June 10, 2025 20:08
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
benchmarks/data_generator/hasher.py (3)

125-128: ⚠️ Potential issue

Constraint check is inverted – raises on valid input, lets violations pass.

The guard must trigger when the hash capacity exceeds the allowed input_len.

-if len(hash_ids) * block_size < input_len:
+if len(hash_ids) * block_size > input_len:

143-147: 🛠️ Refactor suggestion

Rigid length assertion breaks when the same hash ID appears with different residual block sizes, which is legitimate for the last block of a sequence. Either slice the cached array or key the cache by (hash_id, block_size).

Minimal non-breaking fix:

-                assert len(existing_array) == current_block_size, ...
-                token_array = existing_array
+                if len(existing_array) != current_block_size:
+                    token_array = existing_array[:current_block_size]
+                else:
+                    token_array = existing_array
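Alternatively, keying the cache by (hash_id, block_size) keeps full and residual blocks in separate entries; a toy sketch under that assumption (names and vocab size hypothetical):

```python
import random

def block_tokens(cache, hash_id, block_size, vocab_size=32000):
    # Cache keyed by (hash_id, block_size): a shorter residual final block
    # gets its own entry instead of colliding with the full-size block.
    key = (hash_id, block_size)
    if key not in cache:
        rng = random.Random(hash_id)  # seed by hash_id for stable output
        cache[key] = [rng.randrange(vocab_size) for _ in range(block_size)]
    return cache[key]
```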

150-156: ⚠️ Potential issue

token_array can be shorter than current_block_size, leading to IndexError on the BOS assignment and length drift.

Pad when necessary before converting to np.ndarray.

-                tokens = tokenizer.encode(sampled_text, add_special_tokens=False)
-                token_array = np.array(tokens[:current_block_size], dtype=np.int32)
+                tokens = tokenizer.encode(sampled_text, add_special_tokens=False)
+                while len(tokens) < current_block_size:
+                    extra_word = str(np.random.choice(words))
+                    tokens.extend(tokenizer.encode(" " + extra_word, add_special_tokens=False))
+                token_array = np.array(tokens[:current_block_size], dtype=np.int32)
🧹 Nitpick comments (3)
benchmarks/README.md (1)

31-31: Grammar fix – “information is provided”, not “are”.
information is an uncountable noun, so use the singular verb.

-Detailed information are provided in the `data_generator` directory.
+Detailed information is provided in the `data_generator` directory.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~31-~31: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...tagen synthesize) Detailed information are provided in the data_generator` direct...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)

benchmarks/data_generator/example.py (1)

62-66: Reuse the output_file variable for consistency.

-with open("synthesized_requests.jsonl", "w") as f:
+with open(output_file, "w") as f:
benchmarks/data_generator/hasher.py (1)

121-122: Caching inside the function defeats determinism & reuse.
Every call rebuilds _hash_id_to_tokens, so the same hash_id maps to different text across calls.
If reproducibility matters, promote the cache to module scope or accept an explicit cache parameter.
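One way to make the mapping stable across calls is a module-scope cache with hash-ID-seeded generation; a sketch under those assumptions, not the PR's implementation:

```python
import random

_HASH_TOKEN_CACHE = {}

def tokens_for_hash(hash_id, block_size, vocab_size=32000):
    # Module-scope cache plus per-ID seeding: the same hash_id yields the
    # same token block in every call, making synthesized texts reproducible.
    if hash_id not in _HASH_TOKEN_CACHE:
        rng = random.Random(hash_id)
        _HASH_TOKEN_CACHE[hash_id] = [
            rng.randrange(vocab_size) for _ in range(block_size)
        ]
    return _HASH_TOKEN_CACHE[hash_id]
```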

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3946c3d and 46cf9a6.

📒 Files selected for processing (5)
  • benchmarks/README.md (1 hunks)
  • benchmarks/data_generator/README.md (2 hunks)
  • benchmarks/data_generator/example.py (1 hunks)
  • benchmarks/data_generator/hasher.py (2 hunks)
  • benchmarks/data_generator/synthesizer.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • benchmarks/data_generator/README.md
  • benchmarks/data_generator/synthesizer.py
🧰 Additional context used
🪛 Pylint (3.3.7)
benchmarks/data_generator/hasher.py

[refactor] 96-96: Too many local variables (20/15)

(R0914)

🪛 LanguageTool
benchmarks/README.md

[uncategorized] ~31-~31: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...tagen synthesize) Detailed information are provided in the data_generator` direct...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm

@tedzhouhk tedzhouhk left a comment

not very familiar with the codebase, approve for unblock

@PeaBrane PeaBrane disabled auto-merge June 11, 2025 02:56
@PeaBrane PeaBrane merged commit 1f59718 into main Jun 11, 2025
10 checks passed
@PeaBrane PeaBrane deleted the rupei/hashes-to-texts branch June 11, 2025 05:05
