Add Unicode hardening to markdown sanitization functions#14795
Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Pull request overview
This pull request adds Unicode hardening to markdown sanitization functions to protect against Unicode-based attacks including visual spoofing, hidden content, and filter bypass. The implementation introduces a new hardenUnicodeText() function that performs NFC normalization, removes zero-width characters, strips bidirectional override controls, and converts full-width ASCII to standard ASCII.
Changes:
- Implemented `hardenUnicodeText()` in `sanitize_content_core.cjs` with a four-step hardening pipeline
- Integrated Unicode hardening into `sanitizeContentCore()`, `sanitizeContent()`, and `sanitizeLabelContent()`
- Added comprehensive test coverage (220 tests) across transformation types, combined attacks, and edge cases
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| actions/setup/js/sanitize_content_core.cjs | Implements hardenUnicodeText() function with NFC normalization, zero-width removal, bidirectional control removal, and full-width ASCII conversion; integrates into sanitizeContentCore() |
| actions/setup/js/sanitize_content.cjs | Imports and applies hardenUnicodeText() early in the sanitization pipeline for the custom mention filtering path |
| actions/setup/js/sanitize_label_content.cjs | Integrates hardenUnicodeText() into label sanitization before ANSI and control character removal |
| actions/setup/js/sanitize_content.test.cjs | Adds 220 tests covering zero-width removal, NFC normalization, full-width conversion, directional overrides, combined attacks, and edge cases |
| actions/setup/js/sanitize_label_content.test.cjs | Adds Unicode hardening tests for label content including zero-width characters, full-width ASCII, directional overrides, NFC normalization, and emoji preservation |
Comments suppressed due to low confidence (1)
actions/setup/js/sanitize_content_core.cjs:525
There's a subtle issue with the order of operations in `hardenUnicodeText()`: NFC normalization happens BEFORE full-width ASCII conversion, which means full-width characters won't compose with their combining marks.
For example, with input "\uFF21\u0301" (full-width A + combining acute):
- Step 1 (NFC): No change, because \uFF21 is not recognized as a base character for composition
- Steps 2-3: No change (no relevant characters)
- Step 4: "\uFF21" → "A", resulting in "A\u0301"
- Final: "A\u0301" (not composed)
If NFC normalization ran AFTER full-width conversion, "A\u0301" would compose to "Á" (U+00C1).
Consider reordering to:
- Full-width conversion (convert to standard chars first)
- Zero-width removal
- Directional override removal
- NFC normalization (compose after all chars are standard)
This would ensure that full-width base characters properly compose with their combining marks.
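The proposed ordering can be sketched as follows (this is a sketch of the review's suggestion, not the merged code; the function name is hypothetical):

```javascript
// Sketch of the suggested reordering: convert to standard characters first,
// strip invisibles, then compose last so converted bases pick up their marks.
function hardenUnicodeTextReordered(text) {
  let result = text;
  // Step 1: Full-width ASCII conversion (U+FF01-FF5E → U+0021-007E)
  result = result.replace(/[\uFF01-\uFF5E]/g, char =>
    String.fromCharCode(char.charCodeAt(0) - 0xfee0)
  );
  // Step 2: Strip zero-width characters
  result = result.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");
  // Step 3: Remove bidirectional override controls
  result = result.replace(/[\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069]/g, "");
  // Step 4: NFC normalization last, so "A" + combining acute composes
  return result.normalize("NFC");
}

// Full-width A + combining acute now composes to U+00C1 ("Á"):
console.log(hardenUnicodeTextReordered("\uFF21\u0301") === "\u00C1"); // true
```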
```js
// Step 1: Normalize Unicode to canonical composition (NFC)
// This ensures consistent character representation across different encodings
result = result.normalize("NFC");

// Step 2: Strip invisible zero-width characters that can hide content
// These include: zero-width space, zero-width non-joiner, zero-width joiner,
// word joiner, and byte order mark
result = result.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");

// Step 3: Remove bidirectional text override controls
// These can be used to reverse text direction and create visual spoofs
result = result.replace(/[\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069]/g, "");

// Step 4: Convert full-width ASCII characters to standard ASCII
// Full-width characters (U+FF01-FF5E) can be used to bypass filters
result = result.replace(/[\uFF01-\uFF5E]/g, char => {
  const code = char.charCodeAt(0);
  // Map full-width to half-width by subtracting offset
  const standardCode = code - 0xfee0;
  return String.fromCharCode(standardCode);
});

return result;
```
```js
  const standardCode = code - 0xfee0;
  return String.fromCharCode(standardCode);
});
```
The full-width conversion range [\uFF01-\uFF5E] excludes U+FF00. However, U+FF00 is actually undefined in Unicode. The actual full-width/ideographic space is U+3000, not U+FF00. If conversion of full-width spaces is desired for security purposes, consider adding:
```js
// Convert ideographic/full-width space to ASCII space
result = result.replace(/\u3000/g, " ");
```

This would prevent attackers from using U+3000 to bypass space-based filters. However, this might be omitted intentionally to preserve certain CJK formatting semantics.
This issue also appears on line 503 of the same file.
```js
// Step 5: Convert ideographic/full-width space (U+3000) to ASCII space
// This prevents use of U+3000 to bypass space-based filters
result = result.replace(/\u3000/g, " ");
```
```js
// Step 2: Strip invisible zero-width characters that can hide content
// These include: zero-width space, zero-width non-joiner, zero-width joiner,
// word joiner, and byte order mark
result = result.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");
```
The removal of zero-width joiner (U+200D) in line 510 may break complex emoji sequences that legitimately use ZWJ to combine multiple emoji. For example, family emoji (👨👩👧👦), profession emoji (👨⚕️), and flag sequences use ZWJ.
While the test at line 1304-1309 checks basic emoji preservation, it doesn't test ZWJ emoji sequences. Consider either:
- Adding an exception for ZWJ when it appears between emoji characters, or
- Accepting this tradeoff and documenting that complex ZWJ emoji may be broken as a security measure
If option 1 is chosen, you could detect emoji by checking if adjacent characters are in emoji Unicode ranges (U+1F300-1F9FF, U+2600-26FF, etc.) before deciding whether to remove ZWJ.
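A minimal sketch of option 1 (hypothetical helper name; assumes ES2018 lookbehind and `Extended_Pictographic` property-escape support, rather than hand-rolled range checks) could look like:

```javascript
// Hypothetical sketch, not part of this PR: remove U+200D only when it is
// NOT joining two pictographic code points, so ZWJ emoji sequences survive.
function stripZeroWidthKeepingEmojiZWJ(text) {
  // Unconditionally remove the other zero-width characters
  const result = text.replace(/[\u200B\u200C\u2060\uFEFF]/g, "");
  // Drop ZWJ unless both neighbors are Extended_Pictographic (emoji-like)
  return result.replace(
    /(?<!\p{Extended_Pictographic})\u200D|\u200D(?!\p{Extended_Pictographic})/gu,
    ""
  );
}

// A ZWJ hidden in plain text is stripped; family emoji keeps its joiners:
console.log(stripZeroWidthKeepingEmojiZWJ("ad\u200Dmin")); // "admin"
console.log(
  stripZeroWidthKeepingEmojiZWJ("👨\u200D👩\u200D👧\u200D👦") ===
    "👨\u200D👩\u200D👧\u200D👦"
); // true
```

The lookarounds don't consume characters, so consecutive joiners in a long sequence (family emoji) are each evaluated against their real neighbors.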
```js
it("should not affect emoji", () => {
  const input = "Hello 👋 World 🌍";
  const result = sanitizeContent(input);
  expect(result).toContain("👋");
  expect(result).toContain("🌍");
});
```
The emoji preservation test should include examples of ZWJ (zero-width joiner) emoji sequences, since U+200D is removed by the Unicode hardening function at line 510 of sanitize_content_core.cjs. Examples to test:
- Family emoji: "👨👩👧👦" (uses ZWJ between each member)
- Profession emoji: "👨⚕️" (man + ZWJ + medical symbol)
- Flag sequences that use ZWJ
This would verify whether the intended behavior is to break these sequences (as a security tradeoff) or whether the implementation needs adjustment.
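A stand-in check using only the documented zero-width regex (a hypothetical local copy, since a real test would import `sanitizeContent` from `sanitize_content.cjs`) shows what the current behavior would be:

```javascript
// Stand-in for the hardening's zero-width step (Step 2 of the pipeline):
const stripZeroWidth = s => s.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");

const family = "👨\u200D👩\u200D👧\u200D👦"; // man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const result = stripZeroWidth(family);

console.log(result.includes("\u200D")); // false: the joiners are stripped
console.log(result === "👨👩👧👦");      // true: sequence decomposes into parts
```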
```js
it("should preserve emoji in labels", () => {
  expect(sanitizeLabelContent("🐛 bug")).toBe("🐛 bug");
  expect(sanitizeLabelContent("✨ enhancement")).toBe("✨ enhancement");
});
```
Similar to the sanitize_content.test.cjs tests, the emoji preservation test should include examples of ZWJ (zero-width joiner) emoji sequences to verify whether they are intentionally broken or should be preserved. Since U+200D is removed by hardenUnicodeText(), complex emoji like "👨👩👧👦" (family) or "👨⚕️" (profession) will be broken into their component parts.
Markdown sanitization lacked protection against Unicode-based attacks: zero-width characters for hidden content, bidirectional overrides for visual spoofing, and full-width ASCII for filter bypass.
Changes
Added `hardenUnicodeText()` in `sanitize_content_core.cjs`:
- Zero-width character removal: `\u200B`-`\u200D`, `\u2060`, `\uFEFF`
- Bidirectional override removal: `\u202A`-`\u202E`, `\u2066`-`\u2069`
- Full-width ASCII conversion: `\uFF01`-`\uFF5E` → `\u0021`-`\u007E`

Integrated into sanitization pipeline:
- `sanitizeContentCore()`, `sanitizeContent()`, `sanitizeLabelContent()`
- `sanitizeIncomingText()` via the core function

Test coverage
Example
Attack vectors addressed:
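As an illustrative sketch of how the four documented steps neutralize each vector (the function body below reproduces the steps described above; the real implementation lives in `actions/setup/js/sanitize_content_core.cjs`):

```javascript
// Reproduction of the documented four-step pipeline, for illustration only
function hardenUnicodeText(text) {
  let result = text.normalize("NFC");                                // NFC normalization
  result = result.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");  // zero-width removal
  result = result.replace(
    /[\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069]/g, "" // bidi override removal
  );
  result = result.replace(/[\uFF01-\uFF5E]/g, c =>                   // full-width → ASCII
    String.fromCharCode(c.charCodeAt(0) - 0xfee0)
  );
  return result;
}

// Hidden content via zero-width characters:
console.log(hardenUnicodeText("pass\u200Bword")); // "password"
// Visual spoofing via a right-to-left override:
console.log(hardenUnicodeText("file\u202Etxt.exe")); // "filetxt.exe"
// Filter bypass via full-width ASCII look-alikes:
console.log(hardenUnicodeText("\uFF49\uFF47\uFF4E\uFF4F\uFF52\uFF45")); // "ignore"
```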