
Add Unicode hardening to markdown sanitization functions#14795

Merged

pelikhan merged 4 commits into main from copilot/harden-sanitization-functions on Feb 10, 2026

Add Unicode hardening to markdown sanitization functions#14795
pelikhan merged 4 commits intomainfrom
copilot/harden-sanitization-functions

Conversation


Copilot AI commented Feb 10, 2026

Markdown sanitization lacked protection against Unicode-based attacks: zero-width characters for hidden content, bidirectional overrides for visual spoofing, and full-width ASCII for filter bypass.

Changes

  • Added hardenUnicodeText() in sanitize_content_core.cjs

    • NFC normalization before processing
    • Zero-width character removal: \u200B-\u200D, \u2060, \uFEFF
    • Bidirectional override removal: \u202A-\u202E, \u2066-\u2069
    • Full-width ASCII conversion: \uFF01-\uFF5E mapped to \u0021-\u007E
  • Integrated into sanitization pipeline

    • Applied in sanitizeContentCore(), sanitizeContent(), sanitizeLabelContent()
    • Runs before ANSI and control character removal
    • Automatic coverage for sanitizeIncomingText() via core function
  • Test coverage

    • 220 tests across transformation types, combined attacks, edge cases
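Taken together, the steps above can be sketched as a single function. This is a minimal sketch mirroring the documented pipeline, not necessarily the exact merged code:

```javascript
// Sketch of the four-step hardening pipeline described above.
// Mirrors the documented steps; not the exact merged implementation.
function hardenUnicodeText(text) {
  // Step 1: NFC normalization for a canonical representation
  let result = text.normalize("NFC");
  // Step 2: strip zero-width characters that can hide content
  result = result.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");
  // Step 3: remove bidirectional override/isolate controls
  result = result.replace(/[\u202A-\u202E\u2066-\u2069]/g, "");
  // Step 4: map full-width ASCII (U+FF01-U+FF5E) onto U+0021-U+007E
  result = result.replace(/[\uFF01-\uFF5E]/g, ch =>
    String.fromCharCode(ch.charCodeAt(0) - 0xfee0)
  );
  return result;
}
```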

Example

// Before: vulnerable to Unicode attacks
sanitizeContent("filename\u202E.txt.exe");  // RTL override hides .exe
// → "filename\u202E.txt.exe" (override retained; renders reversed, looks like .txt)

// After: attacks neutralized
sanitizeContent("filename\u202E.txt.exe");
// → "filename.txt.exe" (override removed, displays correctly)

// Full-width bypass prevented
sanitizeContent("\uFF21\uFF22\uFF23");  // Full-width ABC
// → "ABC" (converted to standard ASCII)

Attack vectors addressed:

  • Visual spoofing via RTL overrides (Trojan Source)
  • Hidden content via zero-width characters
  • Filter evasion via full-width lookalikes
  • Encoding inconsistencies via decomposed Unicode


Copilot AI and others added 2 commits February 10, 2026 15:21
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Apply hardening transformation to sanitization functions" to "Add Unicode hardening to markdown sanitization functions" on Feb 10, 2026
@pelikhan pelikhan marked this pull request as ready for review February 10, 2026 15:33
Copilot AI requested a review from pelikhan February 10, 2026 15:33
Copilot AI review requested due to automatic review settings February 10, 2026 15:33
@pelikhan pelikhan merged commit 01056b8 into main Feb 10, 2026
53 of 54 checks passed
@pelikhan pelikhan deleted the copilot/harden-sanitization-functions branch February 10, 2026 15:35

Copilot AI left a comment


Pull request overview

This pull request adds Unicode hardening to markdown sanitization functions to protect against Unicode-based attacks including visual spoofing, hidden content, and filter bypass. The implementation introduces a new hardenUnicodeText() function that performs NFC normalization, removes zero-width characters, strips bidirectional override controls, and converts full-width ASCII to standard ASCII.

Changes:

  • Implemented hardenUnicodeText() in sanitize_content_core.cjs with four-step hardening pipeline
  • Integrated Unicode hardening into sanitizeContentCore(), sanitizeContent(), and sanitizeLabelContent()
  • Added comprehensive test coverage (220 tests) across transformation types, combined attacks, and edge cases

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
actions/setup/js/sanitize_content_core.cjs Implements hardenUnicodeText() function with NFC normalization, zero-width removal, bidirectional control removal, and full-width ASCII conversion; integrates into sanitizeContentCore()
actions/setup/js/sanitize_content.cjs Imports and applies hardenUnicodeText() early in the sanitization pipeline for the custom mention filtering path
actions/setup/js/sanitize_label_content.cjs Integrates hardenUnicodeText() into label sanitization before ANSI and control character removal
actions/setup/js/sanitize_content.test.cjs Adds 220 tests covering zero-width removal, NFC normalization, full-width conversion, directional overrides, combined attacks, and edge cases
actions/setup/js/sanitize_label_content.test.cjs Adds Unicode hardening tests for label content including zero-width characters, full-width ASCII, directional overrides, NFC normalization, and emoji preservation
Comments suppressed due to low confidence (1)

actions/setup/js/sanitize_content_core.cjs:525

  • There's a subtle issue with the order of operations in hardenUnicodeText(). NFC normalization happens BEFORE full-width ASCII conversion, which means full-width characters won't be normalized with their combining marks.

For example, with input "\uFF21\u0301" (full-width A + combining acute):

  • Step 1 (NFC): No change, because \uFF21 is not recognized as a base character for composition
  • Steps 2-3: No change (no relevant characters)
  • Step 4: "\uFF21" → "A", resulting in "A\u0301"
  • Final: "A\u0301" (not composed)

If NFC normalization ran AFTER full-width conversion, "A\u0301" would compose to "Á" (U+00C1).

Consider reordering to:

  1. Full-width conversion (convert to standard chars first)
  2. Zero-width removal
  3. Directional override removal
  4. NFC normalization (compose after all chars are standard)

This would ensure that full-width base characters properly compose with their combining marks.

  // Step 1: Normalize Unicode to canonical composition (NFC)
  // This ensures consistent character representation across different encodings
  result = result.normalize("NFC");

  // Step 2: Strip invisible zero-width characters that can hide content
  // These include: zero-width space, zero-width non-joiner, zero-width joiner,
  // word joiner, and byte order mark
  result = result.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");

  // Step 3: Remove bidirectional text override controls
  // These can be used to reverse text direction and create visual spoofs
  result = result.replace(/[\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069]/g, "");

  // Step 4: Convert full-width ASCII characters to standard ASCII
  // Full-width characters (U+FF01-FF5E) can be used to bypass filters
  result = result.replace(/[\uFF01-\uFF5E]/g, char => {
    const code = char.charCodeAt(0);
    // Map full-width to half-width by subtracting offset
    const standardCode = code - 0xfee0;
    return String.fromCharCode(standardCode);
  });

  return result;
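For comparison, the reordering proposed in the comment above could be sketched as follows. The name `hardenUnicodeTextReordered` is illustrative, and this is not code from the PR:

```javascript
// Reordered sketch: NFC runs last, so a converted full-width base
// character can compose with a following combining mark.
// Illustrative only; not the merged implementation.
function hardenUnicodeTextReordered(text) {
  let result = text
    // Full-width ASCII to standard ASCII first
    .replace(/[\uFF01-\uFF5E]/g, ch =>
      String.fromCharCode(ch.charCodeAt(0) - 0xfee0))
    // Zero-width characters
    .replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "")
    // Bidirectional controls
    .replace(/[\u202A-\u202E\u2066-\u2069]/g, "");
  // NFC last: "\uFF21\u0301" becomes "A\u0301", which composes to "\u00C1"
  return result.normalize("NFC");
}
```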


const standardCode = code - 0xfee0;
return String.fromCharCode(standardCode);
});


Copilot AI Feb 10, 2026


The full-width conversion range [\uFF01-\uFF5E] excludes U+FF00. However, U+FF00 is actually undefined in Unicode. The actual full-width/ideographic space is U+3000, not U+FF00. If conversion of full-width spaces is desired for security purposes, consider adding:

// Convert ideographic/full-width space to ASCII space
result = result.replace(/\u3000/g, " ");

This would prevent attackers from using U+3000 to bypass space-based filters. However, this might be omitted intentionally to preserve certain CJK formatting semantics.

This issue also appears on line 503 of the same file.

Suggested change
// Step 5: Convert ideographic/full-width space (U+3000) to ASCII space
// This prevents use of U+3000 to bypass space-based filters
result = result.replace(/\u3000/g, " ");

// Step 2: Strip invisible zero-width characters that can hide content
// These include: zero-width space, zero-width non-joiner, zero-width joiner,
// word joiner, and byte order mark
result = result.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");

Copilot AI Feb 10, 2026


The removal of zero-width joiner (U+200D) in line 510 may break complex emoji sequences that legitimately use ZWJ to combine multiple emoji. For example, family emoji (👨‍👩‍👧‍👦), profession emoji (👨‍⚕️), and flag sequences use ZWJ.

While the test at line 1304-1309 checks basic emoji preservation, it doesn't test ZWJ emoji sequences. Consider either:

  1. Adding an exception for ZWJ when it appears between emoji characters, or
  2. Accepting this tradeoff and documenting that complex ZWJ emoji may be broken as a security measure

If option 1 is chosen, you could detect emoji by checking if adjacent characters are in emoji Unicode ranges (U+1F300-1F9FF, U+2600-26FF, etc.) before deciding whether to remove ZWJ.
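Option 1 could be sketched roughly like this. The helper name is hypothetical, and the sketch uses ES2018 Unicode property escapes rather than hand-maintained code-point ranges:

```javascript
// Sketch of option 1: drop ZWJ (U+200D) unless it joins two
// pictographic characters, so family/profession emoji survive.
// Hypothetical helper; uses Unicode property escapes (ES2018+).
const PICTO = /\p{Extended_Pictographic}/u;

function stripUnsafeZwj(text) {
  const chars = Array.from(text); // iterate by code point, not code unit
  return chars
    .filter((ch, i) => {
      if (ch !== "\u200D") return true;
      // Look at neighbours, skipping a variation selector (U+FE0F)
      const prev = chars[i - 1] === "\uFE0F" ? chars[i - 2] : chars[i - 1];
      const next = chars[i + 1];
      return prev !== undefined && next !== undefined &&
             PICTO.test(prev) && PICTO.test(next);
    })
    .join("");
}
```

A tradeoff of this sketch is that it trusts any pictographic-flanked ZWJ, so it would also preserve ZWJ inside attacker-supplied emoji runs; whether that matters depends on the threat model.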

Comment on lines +1304 to +1309
it("should not affect emoji", () => {
const input = "Hello 👋 World 🌍";
const result = sanitizeContent(input);
expect(result).toContain("👋");
expect(result).toContain("🌍");
});

Copilot AI Feb 10, 2026


The emoji preservation test should include examples of ZWJ (zero-width joiner) emoji sequences, since U+200D is removed by the Unicode hardening function at line 510 of sanitize_content_core.cjs. Examples to test:

  • Family emoji: "👨‍👩‍👧‍👦" (uses ZWJ between each member)
  • Profession emoji: "👨‍⚕️" (man + ZWJ + medical symbol)
  • Flag sequences that use ZWJ

This would verify whether the intended behavior is to break these sequences (as a security tradeoff) or whether the implementation needs adjustment.

Comment on lines +119 to +122
it("should preserve emoji in labels", () => {
expect(sanitizeLabelContent("🐛 bug")).toBe("🐛 bug");
expect(sanitizeLabelContent("✨ enhancement")).toBe("✨ enhancement");
});

Copilot AI Feb 10, 2026


Similar to the sanitize_content.test.cjs tests, the emoji preservation test should include examples of ZWJ (zero-width joiner) emoji sequences to verify whether they are intentionally broken or should be preserved. Since U+200D is removed by hardenUnicodeText(), complex emoji like "👨‍👩‍👧‍👦" (family) or "👨‍⚕️" (profession) will be broken into their component parts.
