Handle invalid Unicode code point in the string literals #853

jevancc · 2020-10-12T18:25:59Z

This Pull Request fixes/closes #778

It changes the following:

Handle invalid Unicode code points in string literals
Fix the unexpected behavior when handling hexadecimal and Unicode escape sequences

The lexer no longer panics when processing a string literal that contains an invalid Unicode code point, e.g. "\uD83D". It replaces every invalid code point with the fixed replacement character "�" ("\uFFFD"). For example, the string "ABC\uD83D123" becomes "ABC�123" after processing. The replacement is not reversible, so the following example does not behave as expected:

let x = '\uD834'; // '�', invalid code point
let y = '\uDD1E'; // '�', invalid code point
let z = x + y; // '��', the expected result should be '𝄞'

This can only be solved once string data is stored as bytes instead of chars, because a Rust char cannot represent a lone surrogate such as \uD834.
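To illustrate why the storage format matters, here is a minimal sketch using only the Rust standard library. It is not Boa's code: the helper names `decode_lossy`, `concat_after_decoding` and `concat_code_units` are hypothetical, and storing strings as `u16` code units is an assumption about one possible future representation.

```rust
/// Decode UTF-16 code units into a `String`, substituting U+FFFD for
/// every unpaired surrogate. Hypothetical helper, not part of Boa.
fn decode_lossy(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

/// Decode each operand first, then concatenate. A lone surrogate in
/// either operand has already been replaced, so a pair can never re-form.
fn concat_after_decoding(a: &[u16], b: &[u16]) -> String {
    let mut s = decode_lossy(a);
    s.push_str(&decode_lossy(b));
    s
}

/// Concatenate the raw code units first and decode only at the end, so a
/// high surrogate from `a` can pair with a low surrogate from `b`.
fn concat_code_units(a: &[u16], b: &[u16]) -> String {
    let mut units = a.to_vec();
    units.extend_from_slice(b);
    decode_lossy(&units)
}
```

With `x = [0xD834]` and `y = [0xDD1E]`, `concat_after_decoding` yields two replacement characters, which matches the behavior described above, while `concat_code_units` yields the single character U+1D11E.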

This PR also fixes the unexpected behavior when handling hexadecimal and Unicode escape sequences, including:

'\xZZ'             // now throws a syntax error
'\uD834\u{DD1E}'   // correctly make it a 4-byte character
'\u00AA\n'         // correctly handle a Unicode escape sequence followed by another escape character
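The three cases above can be reproduced with a small standalone sketch. This is not the lexer code from this PR: `unescape`, `take_hex`, `parse_hex` and `push_code_point` are hypothetical names, only a few escape kinds are handled, and collecting UTF-16 code units before decoding is an assumption made so that a surrogate pair written as two escapes can combine.

```rust
/// Parse `digits` as hexadecimal, rejecting anything that is not purely
/// hex digits (for example the "ZZ" in '\xZZ').
fn parse_hex(digits: &str) -> Result<u32, String> {
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err(format!("invalid hexadecimal escape: {:?}", digits));
    }
    u32::from_str_radix(digits, 16).map_err(|e| e.to_string())
}

/// Consume exactly `n` hex digits from the front of `rest`.
fn take_hex(rest: &mut &str, n: usize) -> Result<u32, String> {
    let s = *rest;
    let digits = s.get(..n).ok_or("escape sequence is too short")?;
    let value = parse_hex(digits)?;
    *rest = &s[n..];
    Ok(value)
}

/// Append a code point as one or two UTF-16 code units.
fn push_code_point(units: &mut Vec<u16>, cp: u32) -> Result<(), String> {
    match cp {
        0..=0xFFFF => units.push(cp as u16),
        0x1_0000..=0x10_FFFF => {
            let v = cp - 0x1_0000;
            units.push(0xD800 + (v >> 10) as u16);
            units.push(0xDC00 + (v & 0x3FF) as u16);
        }
        _ => return Err(format!("code point out of range: {:#X}", cp)),
    }
    Ok(())
}

/// Unescape the body of a string literal (without its quotes).
/// Simplified: handles \n, \xHH, \uHHHH and \u{H...}; any other escaped
/// character stands for itself.
fn unescape(body: &str) -> Result<String, String> {
    let mut units: Vec<u16> = Vec::new();
    let mut rest = body;
    while let Some(c) = rest.chars().next() {
        rest = &rest[c.len_utf8()..];
        if c != '\\' {
            units.extend_from_slice(c.encode_utf16(&mut [0; 2]));
            continue;
        }
        let kind = rest.chars().next().ok_or("unterminated escape")?;
        rest = &rest[kind.len_utf8()..];
        let cp = match kind {
            'n' => 0x0A,
            'x' => take_hex(&mut rest, 2)?,
            'u' if rest.starts_with('{') => {
                let end = rest.find('}').ok_or("missing '}' in escape")?;
                let value = parse_hex(&rest[1..end])?;
                rest = &rest[end + 1..];
                value
            }
            'u' => take_hex(&mut rest, 4)?,
            other => other as u32,
        };
        push_code_point(&mut units, cp)?;
    }
    // Unpaired surrogates become U+FFFD only at this final step.
    Ok(String::from_utf16_lossy(&units))
}
```

Here `unescape(r"\xZZ")` returns an error, `unescape(r"\uD834\u{DD1E}")` returns the single character U+1D11E, and `unescape(r"\u00AA\n")` returns "ª" followed by a newline.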

codecov · 2020-10-12T18:34:52Z

Codecov Report

Merging #853 into master will increase coverage by 0.07%.
The diff coverage is 44.44%.

@@            Coverage Diff             @@
##           master     #853      +/-   ##
==========================================
+ Coverage   59.24%   59.31%   +0.07%     
==========================================
  Files         157      157              
  Lines       10037    10071      +34     
==========================================
+ Hits         5946     5974      +28     
- Misses       4091     4097       +6

Impacted Files	Coverage Δ
boa/src/syntax/lexer/string.rs	`38.37% <43.18%> (+3.58%)`	⬆️
boa/src/syntax/lexer/cursor.rs	`60.68% <100.00%> (-2.03%)`	⬇️
boa/src/builtins/object/mod.rs	`64.86% <0.00%> (+3.18%)`	⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 210a9ec...4e61f11.

boa/src/syntax/lexer/string.rs

Razican

Looks good to me :) At some point in the future we will need to switch to UTF-16 code-unit lexing and string handling.

jevancc added 4 commits October 12, 2020 00:12

Remove redundant code

8b918a4

Handle invalid Unicode code point

f3a47ed

Fix reading Unicode char with more than 1 utf16 bytes

e9c4e9f

Fix error messages

4e61f11

HalidOdat added the bug and lexer labels Oct 13, 2020

HalidOdat added this to the v0.11.0 milestone Oct 13, 2020

HalidOdat reviewed Oct 13, 2020

boa/src/syntax/lexer/string.rs (resolved)

boa/src/syntax/lexer/string.rs (resolved)

HalidOdat approved these changes Oct 13, 2020

HalidOdat requested review from Lan2u, Razican, jasonwilliams and RageKnify October 13, 2020 17:52

Razican approved these changes Oct 14, 2020

HalidOdat merged commit de7202d into boa-dev:master Oct 14, 2020
