Handle invalid Unicode code point in the string literals #853

Merged 4 commits on Oct 14, 2020

Conversation

Contributor

@jevancc jevancc commented Oct 12, 2020

This Pull Request fixes/closes #778

It changes the following:

  • Handle invalid Unicode code point in the string literals
  • Fix the unexpected behavior when handling hexadecimal and Unicode escape sequences

The lexer no longer panics when a string literal contains an invalid Unicode code point (an unpaired surrogate), e.g. "\uD83D". Each invalid code point is replaced with the replacement character "�" ("\uFFFD"). For example, the string "ABC\uD83D123" becomes "ABC�123" after processing. The replacement is not reversible, so the following example does not behave as expected:

let x = '\uD834'; // '�', invalid code point
let y = '\uDD1E'; // '�', invalid code point
let z = x + y; // '��', the expected result should be '𝄞'

This issue can only be solved when we have the string data stored in bytes instead of chars.
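
The limitation can be illustrated with a minimal Rust sketch. It is not the lexer code from this PR: the helper names `code_unit_to_char_lossy` and `concat_lossy` are hypothetical, and the sketch only assumes that each escape is converted to a Rust `char` on its own. A `char` cannot hold a surrogate, so each half of the pair is replaced before the two can be joined.

```rust
/// Hypothetical helper: convert one `\uXXXX` value to a `char`.
/// Surrogates (0xD800..=0xDFFF) are not valid `char` values, so they
/// fall back to U+FFFD, the replacement character.
fn code_unit_to_char_lossy(unit: u16) -> char {
    char::from_u32(u32::from(unit)).unwrap_or(char::REPLACEMENT_CHARACTER)
}

/// Hypothetical helper: concatenate two values that were converted
/// independently, as `x + y` does in the JavaScript example.
fn concat_lossy(first: u16, second: u16) -> String {
    let mut out = String::new();
    out.push(code_unit_to_char_lossy(first));
    out.push(code_unit_to_char_lossy(second));
    out
}

fn main() {
    // Each surrogate half is replaced on its own, so the pair is lost.
    assert_eq!(concat_lossy(0xD834, 0xDD1E), "\u{FFFD}\u{FFFD}");
    // Ordinary code units are unaffected.
    assert_eq!(code_unit_to_char_lossy(0x0041), 'A');
}
```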

This PR also fixes the unexpected behavior when handling hexadecimal and Unicode escape sequences, including:

'\xZZ'             // now throws a syntax error
'\uD834\u{DD1E}'   // now correctly combined into a single 4-byte character
'\u00AA\n'         // correctly handles a Unicode escape sequence followed by another escape sequence
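
The three cases above can be sketched with a small stand-alone decoder. This is an illustration under stated assumptions, not Boa's implementation: `decode_escapes` is a hypothetical function, it supports only `\xHH`, `\uXXXX` and `\n` (the braced `\u{...}` form is omitted for brevity), and it reports errors as plain strings. It collects UTF-16 code units first and decodes them at the end, which lets adjacent surrogate escapes combine while unpaired ones become U+FFFD.

```rust
/// Hypothetical decoder for the body of a string literal.
fn decode_escapes(src: &str) -> Result<String, String> {
    let chars: Vec<char> = src.chars().collect();
    let mut units: Vec<u16> = Vec::new(); // UTF-16 code units collected so far
    let mut i = 0;
    while i < chars.len() {
        if chars[i] != '\\' {
            // Plain character: store it as one or two code units.
            let mut buf = [0u16; 2];
            units.extend_from_slice(chars[i].encode_utf16(&mut buf));
            i += 1;
            continue;
        }
        // Number of hex digits expected after the escape letter.
        let digits = match chars.get(i + 1).copied() {
            Some('x') => 2,
            Some('u') => 4,
            Some('n') => {
                units.push(0x000A);
                i += 2;
                continue;
            }
            _ => return Err(format!("unsupported escape at index {}", i)),
        };
        let end = i + 2 + digits;
        if end > chars.len() {
            return Err(format!("truncated escape at index {}", i));
        }
        let hex: String = chars[i + 2..end].iter().collect();
        // `from_str_radix` alone would accept a leading `+`, so check first.
        if !hex.chars().all(|c| c.is_ascii_hexdigit()) {
            return Err(format!("invalid hex digits `{}` at index {}", hex, i));
        }
        let unit = u16::from_str_radix(&hex, 16).map_err(|e| e.to_string())?;
        units.push(unit);
        i = end;
    }
    // Valid surrogate pairs combine; unpaired surrogates become U+FFFD.
    Ok(char::decode_utf16(units)
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect())
}

fn main() {
    assert!(decode_escapes(r"\xZZ").is_err());
    assert_eq!(decode_escapes(r"\uD834\uDD1E").unwrap(), "\u{1D11E}");
    assert_eq!(decode_escapes(r"\u00AA\n").unwrap(), "\u{AA}\n");
    assert_eq!(decode_escapes(r"ABC\uD83D123").unwrap(), "ABC\u{FFFD}123");
}
```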

@codecov

codecov bot commented Oct 12, 2020

Codecov Report

Merging #853 into master will increase coverage by 0.07%.
The diff coverage is 44.44%.

@@            Coverage Diff             @@
##           master     #853      +/-   ##
==========================================
+ Coverage   59.24%   59.31%   +0.07%     
==========================================
  Files         157      157              
  Lines       10037    10071      +34     
==========================================
+ Hits         5946     5974      +28     
- Misses       4091     4097       +6     
| Impacted Files | Coverage Δ |
|---|---|
| boa/src/syntax/lexer/string.rs | 38.37% <43.18%> (+3.58%) ⬆️ |
| boa/src/syntax/lexer/cursor.rs | 60.68% <100.00%> (-2.03%) ⬇️ |
| boa/src/builtins/object/mod.rs | 64.86% <0.00%> (+3.18%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 210a9ec...4e61f11.

@HalidOdat HalidOdat added bug Something isn't working lexer Issues surrounding the lexer labels Oct 13, 2020
@HalidOdat HalidOdat added this to the v0.11.0 milestone Oct 13, 2020
Member

@Razican Razican left a comment

Looks good to me :) At some point in the future we will need to switch to UTF-16 code-unit lexing and string handling.
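
One way to picture the UTF-16 approach mentioned here is a string type that keeps raw code units and only converts lossily for display. The sketch below is an assumption for illustration, not a design from this PR or from Boa: the `JsStr` type and its methods are hypothetical names.

```rust
/// Hypothetical string type that stores raw UTF-16 code units, so an
/// unpaired surrogate survives until it is displayed.
#[derive(Debug, Clone, PartialEq)]
struct JsStr(Vec<u16>);

impl JsStr {
    fn from_units(units: &[u16]) -> Self {
        JsStr(units.to_vec())
    }

    /// Concatenation works on code units, so two surrogate halves can
    /// still meet and form a valid pair.
    fn concat(&self, other: &JsStr) -> JsStr {
        let mut units = self.0.clone();
        units.extend_from_slice(&other.0);
        JsStr(units)
    }

    /// Lossy conversion happens only here, at the display boundary.
    fn to_string_lossy(&self) -> String {
        String::from_utf16_lossy(&self.0)
    }
}

fn main() {
    let x = JsStr::from_units(&[0xD834]);
    let y = JsStr::from_units(&[0xDD1E]);
    // On its own, each half still displays as the replacement character.
    assert_eq!(x.to_string_lossy(), "\u{FFFD}");
    // Joined, the halves form U+1D11E (the musical G clef).
    assert_eq!(x.concat(&y).to_string_lossy(), "\u{1D11E}");
}
```

With this representation the `x + y` example from the description would yield the expected character, at the cost of converting whenever a Rust `String` is needed.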

@HalidOdat HalidOdat merged commit de7202d into boa-dev:master Oct 14, 2020
Development

Successfully merging this pull request may close these issues.

Panic on invalid utf16 instead of an Error
3 participants