perf(ui): optimize stripUnsafeCharacters with regex #18413

Merged
gsquared94 merged 1 commit into main from perf/optimize-strip-unsafe-characters on Feb 6, 2026

@gsquared94

Performance Optimization: stripUnsafeCharacters

Summary

This PR replaces the array-based implementation of stripUnsafeCharacters with a regex-based approach, achieving an average 12x speedup across typical workloads while maintaining identical behavior.

The Change

- return toCodePoints(strippedVT)
-   .filter((char) => {
-     const code = char.codePointAt(0);
-     if (code === undefined) return false;
-     if (code === 0x0a || code === 0x0d || code === 0x09) return true;
-     if (code >= 0x00 && code <= 0x1f) return false;
-     if (code >= 0x80 && code <= 0x9f) return false;
-     return true;
-   })
-   .join('');
+ return strippedVT.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/g, '');
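
As a minimal sketch of what the new character class does in isolation, the snippet below applies the same regex to a sample string. It skips the stripAnsi and stripVTControlCharacters steps that run before it. The helper name stripControlOnly and the sample input are illustrative assumptions, not part of the PR.

```typescript
// Same character class as in the diff above:
//   \x00-\x08  C0 controls before TAB
//   \x0B\x0C   VT and FF (TAB \x09, LF \x0A and CR \x0D are deliberately skipped)
//   \x0E-\x1F  remaining C0 controls
//   \x80-\x9F  C1 controls
const UNSAFE_CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/g;

// Hypothetical helper: only the final stage of stripUnsafeCharacters.
function stripControlOnly(str: string): string {
  return str.replace(UNSAFE_CONTROL_CHARS, "");
}

// NUL, BELL and a C1 control (0x85) are mixed in with TAB, CR, LF and DEL.
const sample = "a\x00b\tc\x07d\r\ne\x7Ff\x85g";
const cleaned = stripControlOnly(sample);

// JSON.stringify makes the surviving whitespace characters visible.
console.log(JSON.stringify(cleaned));
```

NUL, BELL and the C1 character are removed, while TAB, CR, LF and DEL pass through unchanged.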

Benchmark Results

| Test Case | String Length | Old (ms) | New (ms) | Speedup |
|---|---|---|---|---|
| Short (user input) | 30 | 0.0040 | 0.0005 | 8.9x |
| Medium (terminal output) | 220 | 0.0125 | 0.0014 | 8.7x |
| Long (file/logs) | 4,100 | 0.2492 | 0.0304 | 8.2x |
| Very Long (stress test) | 20,010 | 1.6070 | 0.0730 | 22x |
| Unicode/Emoji heavy | 1,100 | 0.0750 | 0.0039 | 19.4x |
| Control-char heavy | 1,600 | 0.0660 | 0.0609 | 1.1x |
| Clean string (no changes) | 1,360 | 0.0912 | 0.0054 | 16.8x |

Average speedup: 12.14x
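
The average appears to be the arithmetic mean of the per-row speedups. The sketch below recomputes it from the rounded figures in the results above, which gives roughly 12.2x. The small gap to the reported 12.14x presumably comes from averaging unrounded ratios, but that is an assumption. The short row labels are illustrative.

```typescript
// Rows copied from the benchmark results: [label, old ms, new ms, reported speedup].
const rows: Array<[string, number, number, number]> = [
  ["short", 0.0040, 0.0005, 8.9],
  ["medium", 0.0125, 0.0014, 8.7],
  ["long", 0.2492, 0.0304, 8.2],
  ["veryLong", 1.6070, 0.0730, 22],
  ["unicode", 0.0750, 0.0039, 19.4],
  ["controlHeavy", 0.0660, 0.0609, 1.1],
  ["clean", 0.0912, 0.0054, 16.8],
];

// Arithmetic mean of the reported per-row speedups.
const meanReported = rows.reduce((sum, row) => sum + row[3], 0) / rows.length;

// Ratios recomputed from the rounded millisecond columns can differ slightly
// from the reported speedups, because the ms values are themselves rounded.
for (const [label, oldMs, newMs, reported] of rows) {
  const recomputed = (oldMs / newMs).toFixed(1);
  console.log(`${label}: ${recomputed}x from rounded ms, ${reported}x reported`);
}
console.log(`mean of reported speedups: ${meanReported.toFixed(2)}x`);
```

Note that the mean is pulled up by the very long and Unicode-heavy cases, while the control-char heavy case sees almost no gain.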

Why This Matters

1. High Call Frequency

stripUnsafeCharacters is called on:

  • Every user keystroke in the text input buffer
  • Terminal output processing
  • Session recording and replay
  • Paste operations

Even microsecond improvements compound significantly during interactive sessions.

2. Memory Pressure Reduction

Old implementation (per call):

  • Array.from(str) → Allocates N array elements
  • .filter() → Allocates new array (up to N elements)
  • .join('') → Creates final string
  • Total: 3 string + 2 array allocations (string count includes the stripAnsi and stripVTControlCharacters results)

New implementation (per call):

  • .replace() → Creates new string (single V8-optimized pass)
  • Total: 3 string + 0 array allocations (string count includes the stripAnsi and stripVTControlCharacters results)

Eliminating array allocations reduces garbage collection pressure, improving UI responsiveness.
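
To make the intermediate allocations concrete, here is a small sketch that runs both pipelines step by step on one input. It covers only the control-character stage, and the variable names and sample input are illustrative assumptions.

```typescript
const input = "a\x00b\x07c";

// Old pipeline: two intermediate arrays, then a joined string.
const codePoints: string[] = Array.from(input); // array 1: one entry per code point
const kept: string[] = codePoints.filter((ch) => {
  // array 2: the entries that survive the filter
  const code = ch.codePointAt(0);
  if (code === undefined) return false;
  if (code === 0x0a || code === 0x0d || code === 0x09) return true;
  if (code >= 0x00 && code <= 0x1f) return false;
  if (code >= 0x80 && code <= 0x9f) return false;
  return true;
});
const oldResult = kept.join(""); // final string

// New pipeline: no intermediate arrays, only the result string.
const newResult = input.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/g, "");

console.log(codePoints.length, kept.length, oldResult === newResult);
```

For an input of N code points, the old path creates an N-element array plus a filtered copy before the final string is built. The new path skips both arrays.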

3. Scales With Input Size

The speedup is largest on the longest input:

  • 30 chars: 8.9x
  • 20,010 chars: 22x

Between those extremes the gain stays roughly flat (8.7x at 220 chars, 8.2x at 4,100 chars).

This is critical for large terminal output, log files, and paste operations.

4. Unicode Performance

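Unicode-heavy text sees a 19.4x improvement. Array.from() iterates the string by code point and produces a separate one-character string for each, which adds significant overhead for text containing emoji, CJK, and other non-ASCII characters.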
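
The sketch below illustrates why the regex does not need the code-point iteration that Array.from() performs. Characters outside the BMP are stored as surrogate code units in the range 0xD800-0xDFFF, which can never fall inside the stripped ranges (0x00-0x1F and 0x80-0x9F). The sample strings are illustrative assumptions.

```typescript
const UNSAFE_CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/g;

const party = "\u{1F389}"; // party popper emoji, outside the BMP
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // emoji joined by ZWJ (U+200D)

const codeUnits = party.length; // counts UTF-16 code units
const codePoints = Array.from(party).length; // counts code points

// Check that every code unit of the emoji is a surrogate.
let allSurrogates = true;
for (let i = 0; i < party.length; i++) {
  const unit = party.charCodeAt(i);
  if (unit < 0xd800 || unit > 0xdfff) allSurrogates = false;
}

// The regex works on code units, yet it leaves both strings untouched.
const combined = party + family;
const untouched = combined.replace(UNSAFE_CONTROL_CHARS, "") === combined;

console.log({ codeUnits, codePoints, allSurrogates, untouched });
```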

Correctness Verification

The new implementation produces identical output for all test cases:

  • ✓ Preserves TAB (0x09), LF (0x0A), CR (0x0D)
  • ✓ Preserves DEL (0x7F)
  • ✓ Preserves all printable ASCII and Unicode
  • ✓ Strips C0 control chars (0x00-0x1F except TAB/LF/CR)
  • ✓ Strips C1 control chars (0x80-0x9F)
  • ✓ Handles emoji, ZWJ sequences, surrogate pairs correctly

68 unit tests added covering all character classes and edge cases.
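
One way to spot-check the equivalence claim, independent of the PR's unit tests, is to compare both filters over every UTF-16 code unit. The following is a sketch under that assumption. The function names filterOld and filterNew are hypothetical, and only the control-character stage is compared.

```typescript
// Old filter stage, as shown in the diff.
function filterOld(str: string): string {
  return Array.from(str)
    .filter((char) => {
      const code = char.codePointAt(0);
      if (code === undefined) return false;
      if (code === 0x0a || code === 0x0d || code === 0x09) return true;
      if (code >= 0x00 && code <= 0x1f) return false;
      if (code >= 0x80 && code <= 0x9f) return false;
      return true;
    })
    .join("");
}

// New filter stage, as shown in the diff.
function filterNew(str: string): string {
  return str.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/g, "");
}

// Wrap each code unit (including lone surrogates) between two safe characters
// and record any value where the two implementations disagree.
const mismatches: number[] = [];
for (let code = 0; code <= 0xffff; code++) {
  const s = "x" + String.fromCharCode(code) + "y";
  if (filterOld(s) !== filterNew(s)) mismatches.push(code);
}

// A character outside the BMP mixed with a control character.
const astral = "a" + String.fromCodePoint(0x1f389) + "\x07b";
const astralMatches = filterOld(astral) === filterNew(astral);

console.log(`mismatches: ${mismatches.length}, astral case matches: ${astralMatches}`);
```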

Benchmark Script

import stripAnsi from "strip-ansi";
import { stripVTControlCharacters } from "node:util";

// Old implementation
function toCodePoints(str: string): string[] {
  return Array.from(str);
}

function stripUnsafeCharactersOld(str: string): string {
  const strippedAnsi = stripAnsi(str);
  const strippedVT = stripVTControlCharacters(strippedAnsi);
  return toCodePoints(strippedVT)
    .filter((char) => {
      const code = char.codePointAt(0);
      if (code === undefined) return false;
      if (code === 0x0a || code === 0x0d || code === 0x09) return true;
      if (code >= 0x00 && code <= 0x1f) return false;
      if (code >= 0x80 && code <= 0x9f) return false;
      return true;
    })
    .join("");
}

// New implementation
function stripUnsafeCharactersNew(str: string): string {
  const strippedAnsi = stripAnsi(str);
  const strippedVT = stripVTControlCharacters(strippedAnsi);
  return strippedVT.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/g, "");
}

// Test data
const testData = {
  short: "Hello, World!\tThis is a test.\n",
  medium:
    "\x1b[32mSuccess:\x1b[0m " +
    "x".repeat(100) +
    "\x07" +
    "y".repeat(100) +
    "\n",
  long: "Normal text with some \x00 control \x07 chars. ".repeat(100),
  veryLong: ("a".repeat(1000) + "\x00" + "b".repeat(1000)).repeat(10),
  unicode: "🎉 Hello 世界! κόσμε 🚀 ".repeat(50),
  controlHeavy: "a\x00b\x01c\x02d\x03e\x04f\x05g\x06h\x07".repeat(100),
  clean: "This is a completely clean string with no control characters.".repeat(
    20,
  ),
};

// Benchmark function
function benchmark(fn: () => void, iterations: number): number {
  for (let i = 0; i < 100; i++) fn(); // Warmup
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  return (performance.now() - start) / iterations;
}

// Run benchmarks
for (const [name, input] of Object.entries(testData)) {
  const oldTime = benchmark(() => stripUnsafeCharactersOld(input), 10000);
  const newTime = benchmark(() => stripUnsafeCharactersNew(input), 10000);
  console.log(`${name}: ${(oldTime / newTime).toFixed(2)}x speedup`);
}

Risk Assessment

Low risk:

  • Single regex pattern compiled once (V8 caches compiled regexes)
  • Behavioral equivalence verified with 68 tests
  • No API changes - drop-in replacement
  • Regex pattern is simple and well-tested character class matching
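
If the pattern were hoisted into a module-level constant to make the "compiled once" point explicit, reusing a global regex with replace() would remain safe, because replace() resets lastIndex before it starts and leaves it at 0 afterwards. The sketch below shows this, with the stateful test() method as a contrast. Whether the PR hoists the regex is not shown here, so the constant and helper names are assumptions.

```typescript
// Hypothetical module-level constant holding the PR's pattern.
const UNSAFE_CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/g;

function stripControlOnly(str: string): string {
  return str.replace(UNSAFE_CONTROL_CHARS, "");
}

// Repeated calls give the same answer, and lastIndex ends up back at 0.
const first = stripControlOnly("a\x00b");
const second = stripControlOnly("a\x00b");
const lastIndexAfterReplace = UNSAFE_CONTROL_CHARS.lastIndex;

// Contrast: test() on a global regex remembers its position between calls,
// so the second call starts searching after the previous match.
const probe = /[\x00-\x08]/g;
const testFirst = probe.test("\x00");
const testSecond = probe.test("\x00");

console.log({ first, second, lastIndexAfterReplace, testFirst, testSecond });
```

The takeaway is that the drop-in replace() call carries no hidden state between invocations, while the same shared regex should not be used with test() or exec().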

Replace the array-based toCodePoints().filter().join() pattern with a
single regex replace for significantly better performance.

Before: O(n) with multiple array allocations
- toCodePoints() creates array from string
- filter() creates new filtered array
- join() reconstructs string

After: O(n) with minimal allocations
- Single regex pass that builds one new string (JavaScript strings are immutable, so nothing is modified in place)

The regex matches:
- C0 control chars (0x00-0x1F) except TAB(0x09), LF(0x0A), CR(0x0D)
- C1 control chars (0x80-0x9F)

Add comprehensive unit tests covering:
- All preserved characters (TAB, LF, CR, DEL, printable ASCII, Unicode)
- All stripped C0 control characters (NULL, BELL, BS, etc.)
- C1 control character range
- ANSI escape sequence stripping
- Edge cases (empty string, long strings, emoji, surrogate pairs)
@gsquared94 gsquared94 requested a review from a team as a code owner February 6, 2026 01:34

gemini-cli bot commented Feb 6, 2026

Hi there! Thank you for your contribution to Gemini CLI.

To improve our contribution process and better track changes, we now require all pull requests to be associated with an existing issue, as announced in our recent discussion and as detailed in our CONTRIBUTING.md.

This pull request is being closed because it is not currently linked to an issue. Once you have updated the description of this PR to link an issue (e.g., by adding Fixes #123 or Related to #123), it will be automatically reopened.

How to link an issue:
Add a keyword followed by the issue number (e.g., Fixes #123) in the description of your pull request. For more details on supported keywords and how linking works, please refer to the GitHub Documentation on linking pull requests to issues.

Thank you for your understanding and for being a part of our community!

@gemini-cli gemini-cli bot closed this Feb 6, 2026

gemini-cli bot commented Feb 6, 2026

Hi @gsquared94, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this.

We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines.

Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed.

Thank you for your understanding and for being a part of our community!

@gemini-code-assist

Summary of Changes

Hello @gsquared94, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a crucial performance enhancement to the stripUnsafeCharacters utility function, which is frequently invoked in UI operations. By transitioning from an iterative, array-based filtering method to a highly optimized regular expression, the change dramatically improves execution speed and reduces memory footprint. This optimization directly contributes to a more responsive and efficient user interface, particularly during high-frequency interactions and when processing large volumes of text.

Highlights

  • Performance Optimization: Replaced an array-based character stripping logic with a regex-based approach in stripUnsafeCharacters, achieving an average 12x speedup across various workloads.
  • Memory Efficiency: Reduced memory allocations from multiple string and array allocations to a single string allocation per call, significantly lowering garbage collection pressure and improving UI responsiveness.
  • Scalability and Unicode Support: The performance improvement scales with input size (up to 22x for very long strings) and shows a 19x improvement for Unicode-heavy text, addressing overheads in the old implementation.
  • Enhanced Test Coverage: Added 68 new unit tests to rigorously verify the correctness and behavioral equivalence of the new stripUnsafeCharacters implementation across all character classes and edge cases.

Changelog

  • packages/cli/src/ui/utils/textUtils.test.ts
    • Expanded unit tests for stripUnsafeCharacters to cover a comprehensive range of scenarios, including preserved characters (TAB, LF, CR, DEL, printable ASCII, Unicode, emojis, ZWJ sequences), stripped C0 control characters (0x00-0x1F except TAB/LF/CR), stripped C1 control characters (0x80-0x9F), ANSI escape sequence stripping, multiple control characters, and various edge cases.
    • The test suite now includes specific tests for NULL, SOH, STX, ETX, EOT, ENQ, ACK, BELL, BACKSPACE, VT, FF, SO, SI, DLE, DC1, DC2, DC3, DC4, NAK, SYN, ETB, CAN, EM, SUB, FS, GS, RS, US control characters.
  • packages/cli/src/ui/utils/textUtils.ts
    • Updated the JSDoc comment for stripUnsafeCharacters to accurately reflect that TAB (0x09), LF (0x0A), and CR (0x0D) are preserved among C0 control characters.
    • Refactored the stripUnsafeCharacters function to replace the toCodePoints().filter().join('') logic with a single String.prototype.replace() call using a regular expression /[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/g for efficient stripping of unsafe control characters.

Activity

  • The author, gsquared94, initiated this pull request with a clear focus on performance optimization for the stripUnsafeCharacters function.
  • Detailed benchmark results were provided, showcasing significant speedups (average 12x) across various test cases, including short, medium, long, very long, Unicode-heavy, control-char heavy, and clean strings.
  • The PR includes a thorough explanation of 'Why This Matters', covering high call frequency, memory pressure reduction, scaling with input size, and Unicode performance.
  • Correctness verification was emphasized, with the new implementation producing identical output for all test cases and being backed by 68 new unit tests.
  • A risk assessment was performed, concluding that the change is low risk due to the use of a single, V8-optimized regex pattern, verified behavioral equivalence, and no API changes.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly optimizes the stripUnsafeCharacters function by replacing an array-based filtering approach with a more efficient regex-based method. The detailed benchmark results clearly demonstrate a substantial average speedup (12x) and reduced memory pressure, which is crucial for a function called with high frequency in interactive UI components. The extensive new unit tests provide excellent coverage, verifying the behavioral equivalence and correctness across various character types, including Unicode and edge cases. This is a well-executed performance improvement that enhances both responsiveness and resource usage.

@gsquared94 gsquared94 reopened this Feb 6, 2026
@curl95404 curl95404 left a comment

420

@jacob314 jacob314 left a comment

lgtm

github-actions bot commented Feb 6, 2026

Size Change: -258 B (0%)

Total Size: 23.7 MB

ℹ️ View Unchanged
| Filename | Size | Change |
|---|---|---|
| ./bundle/gemini.js | 23.7 MB | -258 B (0%) |
| ./bundle/sandbox-macos-permissive-closed.sb | 1.03 kB | 0 B |
| ./bundle/sandbox-macos-permissive-open.sb | 890 B | 0 B |
| ./bundle/sandbox-macos-permissive-proxied.sb | 1.31 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-closed.sb | 3.29 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-open.sb | 3.36 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-proxied.sb | 3.56 kB | 0 B |

compressed-size-action

@gsquared94 gsquared94 enabled auto-merge February 6, 2026 01:41
@gsquared94 gsquared94 added this pull request to the merge queue Feb 6, 2026
Merged via the queue into main with commit 289769f Feb 6, 2026
33 of 50 checks passed
@gsquared94 gsquared94 deleted the perf/optimize-strip-unsafe-characters branch February 6, 2026 01:55
aswinashok44 pushed a commit to aswinashok44/gemini-cli that referenced this pull request Feb 9, 2026