
Add Unicode hardening to markdown sanitization functions#14795

Merged

pelikhan merged 4 commits into main from copilot/harden-sanitization-functions on Feb 10, 2026

Add Unicode hardening to markdown sanitization functions#14795
pelikhan merged 4 commits intomainfrom
copilot/harden-sanitization-functions

Conversation


Copilot AI commented Feb 10, 2026

Markdown sanitization lacked protection against Unicode-based attacks: zero-width characters for hidden content, bidirectional overrides for visual spoofing, and full-width ASCII for filter bypass.

Changes

  • Added hardenUnicodeText() in sanitize_content_core.cjs

    • NFC normalization before processing
    • Zero-width character removal: \u200B-\u200D, \u2060, \uFEFF
    • Bidirectional override removal: \u202A-\u202E, \u2066-\u2069
    • Full-width ASCII conversion: \uFF01-\uFF5E mapped to \u0021-\u007E
  • Integrated into sanitization pipeline

    • Applied in sanitizeContentCore(), sanitizeContent(), sanitizeLabelContent()
    • Runs before ANSI and control character removal
    • Automatic coverage for sanitizeIncomingText() via core function
  • Test coverage

    • 220 tests across transformation types, combined attacks, edge cases
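Taken together, the steps above can be sketched as a single function. This is a minimal sketch mirroring the documented pipeline, not necessarily the exact merged code:

```javascript
// Sketch of the four-step hardening pipeline described above.
// Mirrors the documented steps; not the exact merged implementation.
function hardenUnicodeText(text) {
  // Step 1: NFC normalization for a canonical representation
  let result = text.normalize("NFC");
  // Step 2: strip zero-width characters that can hide content
  result = result.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");
  // Step 3: remove bidirectional override/isolate controls
  result = result.replace(/[\u202A-\u202E\u2066-\u2069]/g, "");
  // Step 4: map full-width ASCII (U+FF01-U+FF5E) onto U+0021-U+007E
  result = result.replace(/[\uFF01-\uFF5E]/g, ch =>
    String.fromCharCode(ch.charCodeAt(0) - 0xfee0)
  );
  return result;
}
```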

Example

// Before: vulnerable to Unicode attacks
sanitizeContent("filename\u202E.txt.exe");  // RTL override hides .exe
// → "filename\u202E.txt.exe" (override retained; renders reversed, looks like .txt)

// After: attacks neutralized
sanitizeContent("filename\u202E.txt.exe");
// → "filename.txt.exe" (override removed, displays correctly)

// Full-width bypass prevented
sanitizeContent("\uFF21\uFF22\uFF23");  // Full-width ABC
// → "ABC" (converted to standard ASCII)

Attack vectors addressed:

  • Visual spoofing via RTL overrides (Trojan Source)
  • Hidden content via zero-width characters
  • Filter evasion via full-width lookalikes
  • Encoding inconsistencies via decomposed Unicode


Copilot AI and others added 2 commits February 10, 2026 15:21
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Apply hardening transformation to sanitization functions" to "Add Unicode hardening to markdown sanitization functions" on Feb 10, 2026
@pelikhan pelikhan marked this pull request as ready for review February 10, 2026 15:33
Copilot AI requested a review from pelikhan February 10, 2026 15:33
Copilot AI review requested due to automatic review settings February 10, 2026 15:33
@pelikhan pelikhan merged commit 01056b8 into main Feb 10, 2026
53 of 54 checks passed
@pelikhan pelikhan deleted the copilot/harden-sanitization-functions branch February 10, 2026 15:35

Copilot AI left a comment


Pull request overview

This pull request adds Unicode hardening to markdown sanitization functions to protect against Unicode-based attacks including visual spoofing, hidden content, and filter bypass. The implementation introduces a new hardenUnicodeText() function that performs NFC normalization, removes zero-width characters, strips bidirectional override controls, and converts full-width ASCII to standard ASCII.

Changes:

  • Implemented hardenUnicodeText() in sanitize_content_core.cjs with four-step hardening pipeline
  • Integrated Unicode hardening into sanitizeContentCore(), sanitizeContent(), and sanitizeLabelContent()
  • Added comprehensive test coverage (220 tests) across transformation types, combined attacks, and edge cases

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
actions/setup/js/sanitize_content_core.cjs Implements hardenUnicodeText() function with NFC normalization, zero-width removal, bidirectional control removal, and full-width ASCII conversion; integrates into sanitizeContentCore()
actions/setup/js/sanitize_content.cjs Imports and applies hardenUnicodeText() early in the sanitization pipeline for the custom mention filtering path
actions/setup/js/sanitize_label_content.cjs Integrates hardenUnicodeText() into label sanitization before ANSI and control character removal
actions/setup/js/sanitize_content.test.cjs Adds 220 tests covering zero-width removal, NFC normalization, full-width conversion, directional overrides, combined attacks, and edge cases
actions/setup/js/sanitize_label_content.test.cjs Adds Unicode hardening tests for label content including zero-width characters, full-width ASCII, directional overrides, NFC normalization, and emoji preservation
Comments suppressed due to low confidence (1)

actions/setup/js/sanitize_content_core.cjs:525

  • There's a subtle issue with the order of operations in hardenUnicodeText(). NFC normalization happens BEFORE full-width ASCII conversion, which means full-width characters won't be normalized with their combining marks.

For example, with input "\uFF21\u0301" (full-width A + combining acute):

  • Step 1 (NFC): No change, because \uFF21 is not recognized as a base character for composition
  • Steps 2-3: No change (no relevant characters)
  • Step 4: "\uFF21" → "A", resulting in "A\u0301"
  • Final: "A\u0301" (not composed)

If NFC normalization ran AFTER full-width conversion, "A\u0301" would compose to "Á" (U+00C1).

Consider reordering to:

  1. Full-width conversion (convert to standard chars first)
  2. Zero-width removal
  3. Directional override removal
  4. NFC normalization (compose after all chars are standard)

This would ensure that full-width base characters properly compose with their combining marks.

  // Step 1: Normalize Unicode to canonical composition (NFC)
  // This ensures consistent character representation across different encodings
  result = result.normalize("NFC");

  // Step 2: Strip invisible zero-width characters that can hide content
  // These include: zero-width space, zero-width non-joiner, zero-width joiner,
  // word joiner, and byte order mark
  result = result.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");

  // Step 3: Remove bidirectional text override controls
  // These can be used to reverse text direction and create visual spoofs
  result = result.replace(/[\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069]/g, "");

  // Step 4: Convert full-width ASCII characters to standard ASCII
  // Full-width characters (U+FF01-FF5E) can be used to bypass filters
  result = result.replace(/[\uFF01-\uFF5E]/g, char => {
    const code = char.charCodeAt(0);
    // Map full-width to half-width by subtracting offset
    const standardCode = code - 0xfee0;
    return String.fromCharCode(standardCode);
  });

  return result;
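For comparison, the reordering proposed in the comment above could be sketched as follows. The name `hardenUnicodeTextReordered` is illustrative, and this is not code from the PR:

```javascript
// Reordered sketch: NFC runs last, so a converted full-width base
// character can compose with a following combining mark.
// Illustrative only; not the merged implementation.
function hardenUnicodeTextReordered(text) {
  let result = text
    // Full-width ASCII to standard ASCII first
    .replace(/[\uFF01-\uFF5E]/g, ch =>
      String.fromCharCode(ch.charCodeAt(0) - 0xfee0))
    // Zero-width characters
    .replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "")
    // Bidirectional controls
    .replace(/[\u202A-\u202E\u2066-\u2069]/g, "");
  // NFC last: "\uFF21\u0301" becomes "A\u0301", which composes to "\u00C1"
  return result.normalize("NFC");
}
```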


const standardCode = code - 0xfee0;
return String.fromCharCode(standardCode);
});


Copilot AI Feb 10, 2026


The full-width conversion range [\uFF01-\uFF5E] excludes U+FF00. However, U+FF00 is actually undefined in Unicode. The actual full-width/ideographic space is U+3000, not U+FF00. If conversion of full-width spaces is desired for security purposes, consider adding:

// Convert ideographic/full-width space to ASCII space
result = result.replace(/\u3000/g, " ");

This would prevent attackers from using U+3000 to bypass space-based filters. However, this might be omitted intentionally to preserve certain CJK formatting semantics.

This issue also appears on line 503 of the same file.

Suggested change
// Step 5: Convert ideographic/full-width space (U+3000) to ASCII space
// This prevents use of U+3000 to bypass space-based filters
result = result.replace(/\u3000/g, " ");

// Step 2: Strip invisible zero-width characters that can hide content
// These include: zero-width space, zero-width non-joiner, zero-width joiner,
// word joiner, and byte order mark
result = result.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");

Copilot AI Feb 10, 2026


The removal of zero-width joiner (U+200D) in line 510 may break complex emoji sequences that legitimately use ZWJ to combine multiple emoji. For example, family emoji (👨‍👩‍👧‍👦), profession emoji (👨‍⚕️), and flag sequences use ZWJ.

While the test at line 1304-1309 checks basic emoji preservation, it doesn't test ZWJ emoji sequences. Consider either:

  1. Adding an exception for ZWJ when it appears between emoji characters, or
  2. Accepting this tradeoff and documenting that complex ZWJ emoji may be broken as a security measure

If option 1 is chosen, you could detect emoji by checking if adjacent characters are in emoji Unicode ranges (U+1F300-1F9FF, U+2600-26FF, etc.) before deciding whether to remove ZWJ.
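Option 1 could be sketched roughly like this. The helper name is hypothetical, and the sketch uses ES2018 Unicode property escapes rather than hand-maintained code-point ranges:

```javascript
// Sketch of option 1: drop ZWJ (U+200D) unless it joins two
// pictographic characters, so family/profession emoji survive.
// Hypothetical helper; uses Unicode property escapes (ES2018+).
const PICTO = /\p{Extended_Pictographic}/u;

function stripUnsafeZwj(text) {
  const chars = Array.from(text); // iterate by code point, not code unit
  return chars
    .filter((ch, i) => {
      if (ch !== "\u200D") return true;
      // Look at neighbours, skipping a variation selector (U+FE0F)
      const prev = chars[i - 1] === "\uFE0F" ? chars[i - 2] : chars[i - 1];
      const next = chars[i + 1];
      return prev !== undefined && next !== undefined &&
             PICTO.test(prev) && PICTO.test(next);
    })
    .join("");
}
```

A tradeoff of this sketch is that it trusts any pictographic-flanked ZWJ, so it would also preserve ZWJ inside attacker-supplied emoji runs; whether that matters depends on the threat model.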

Comment on lines +1304 to +1309
it("should not affect emoji", () => {
const input = "Hello 👋 World 🌍";
const result = sanitizeContent(input);
expect(result).toContain("👋");
expect(result).toContain("🌍");
});

Copilot AI Feb 10, 2026


The emoji preservation test should include examples of ZWJ (zero-width joiner) emoji sequences, since U+200D is removed by the Unicode hardening function at line 510 of sanitize_content_core.cjs. Examples to test:

  • Family emoji: "👨‍👩‍👧‍👦" (uses ZWJ between each member)
  • Profession emoji: "👨‍⚕️" (man + ZWJ + medical symbol)
  • Flag sequences that use ZWJ

This would verify whether the intended behavior is to break these sequences (as a security tradeoff) or whether the implementation needs adjustment.

Comment on lines +119 to +122
it("should preserve emoji in labels", () => {
expect(sanitizeLabelContent("🐛 bug")).toBe("🐛 bug");
expect(sanitizeLabelContent("✨ enhancement")).toBe("✨ enhancement");
});

Copilot AI Feb 10, 2026


Similar to the sanitize_content.test.cjs tests, the emoji preservation test should include examples of ZWJ (zero-width joiner) emoji sequences to verify whether they are intentionally broken or should be preserved. Since U+200D is removed by hardenUnicodeText(), complex emoji like "👨‍👩‍👧‍👦" (family) or "👨‍⚕️" (profession) will be broken into their component parts.
