Fix tokenizing Unicode escape sequence in string literal #826

jevancc · 2020-10-08T18:56:53Z

This Pull Request fixes/closes #808.

It changes the following:

Add InnerIter::peek_char() and store the peeked char in the InnerIter instance instead of in Cursor
Fix invalid Unicode code point in lexer test

The issue is caused by a bug in Cursor::fill_bytes. When Cursor::fill_bytes was called after Cursor::peek, the underlying iterator had already advanced past the peeked char, but that char was never written to the buffer. This PR introduces a new method, InnerIter::peek_char, and stores the peeked char in InnerIter, so that InnerIter::fill_bytes can write the peeked char to the input buffer before reading further.
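To illustrate the idea, here is a minimal sketch of an iterator that keeps its own peeked value and drains it first when filling a buffer. It is not the code from this PR: `PeekingBytes` and `next_byte` are hypothetical names, and the sketch works on raw bytes rather than chars to stay short.

```rust
use std::io::{self, Bytes, Error, ErrorKind, Read};

/// Hypothetical, simplified stand-in for the lexer's inner iterator.
/// It works on raw bytes; the real lexer peeks chars.
pub struct PeekingBytes<R> {
    iter: Bytes<R>,
    /// `None` means nothing is peeked; `Some(None)` means end of input was peeked.
    peeked: Option<Option<u8>>,
}

impl<R: Read> PeekingBytes<R> {
    pub fn new(inner: R) -> Self {
        Self { iter: inner.bytes(), peeked: None }
    }

    /// Returns the next byte without consuming it.
    pub fn peek(&mut self) -> io::Result<Option<u8>> {
        if let Some(value) = self.peeked {
            return Ok(value);
        }
        let next = self.iter.next().transpose()?;
        self.peeked = Some(next);
        Ok(next)
    }

    /// Consumes the peeked byte if there is one, otherwise reads a fresh byte.
    pub fn next_byte(&mut self) -> io::Result<Option<u8>> {
        match self.peeked.take() {
            Some(value) => Ok(value),
            None => self.iter.next().transpose(),
        }
    }

    /// Fills `buf` completely. Because it goes through `next_byte`,
    /// a previously peeked byte lands in the buffer instead of being lost.
    pub fn fill_bytes(&mut self, buf: &mut [u8]) -> io::Result<()> {
        for slot in buf.iter_mut() {
            *slot = self.next_byte()?.ok_or_else(|| {
                Error::new(ErrorKind::UnexpectedEof, "unexpected end of input")
            })?;
        }
        Ok(())
    }
}
```

Because the peeked value lives in the same struct that implements `fill_bytes`, there is no way to fill the buffer while skipping the peeked byte, which is the failure mode described above.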

The test syntax::lexer::tests::codepoint_with_no_braces is updated in this PR. Since \uD83D is a lone surrogate code point, the test panics when decode_utf16 tries to decode it. Handling of invalid code points should be fixed in a separate issue/PR.
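For reference, a small sketch of how `std::char::decode_utf16` treats a lone surrogate: it yields an `Err` for that code unit, so a panic would only come from unwrapping the result. The helper name `decode_units` is hypothetical, and this is not how the lexer currently handles the case.

```rust
use std::char::{decode_utf16, DecodeUtf16Error};

/// Decodes UTF-16 code units into a `String`, returning the first
/// decoding error (an unpaired surrogate) instead of panicking.
pub fn decode_units(units: &[u16]) -> Result<String, DecodeUtf16Error> {
    decode_utf16(units.iter().copied()).collect()
}
```

With this shape, `decode_units(&[0xD83D])` returns an error whose `unpaired_surrogate()` is `0xD83D`, while a complete pair such as `[0xD83D, 0xDE00]` decodes to a single char.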

codecov · 2020-10-08T19:07:58Z

Codecov Report

Merging #826 into master will increase coverage by 0.02%.
The diff coverage is 81.25%.

@@            Coverage Diff             @@
##           master     #826      +/-   ##
==========================================
+ Coverage   59.23%   59.26%   +0.02%     
==========================================
  Files         157      157              
  Lines       10034    10035       +1     
==========================================
+ Hits         5944     5947       +3     
+ Misses       4090     4088       -2

Impacted Files	Coverage Δ
boa/src/syntax/lexer/cursor.rs	`62.71% <81.25%> (+2.02%)`	⬆️

Last update 470dbb4...a12e8dc.

HalidOdat

Looks good! Check my comment on how we might improve this :)

HalidOdat · 2020-10-08T22:20:27Z

boa/src/syntax/lexer/cursor.rs

+                let mut buf = [0u8; 4];
+                chr.encode_utf8(&mut buf);
+                Ok(Some(buf[0]))


Suggested change

-                let mut buf = [0u8; 4];
-                chr.encode_utf8(&mut buf);
-                Ok(Some(buf[0]))
+                Ok(Some(chr as u8))

Couldn't we just cast to u8, since we already check that it's ASCII? This should be faster than calling encode_utf8.
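A quick sketch of why the two forms agree for ASCII input: an ASCII char encodes to a single UTF-8 byte equal to its code point, so the cast yields the same value. Both function names here are hypothetical and only wrap the two variants from the diff for comparison.

```rust
/// Variant from the original diff: encode to UTF-8 and take the first byte.
pub fn ascii_byte_via_encode(chr: char) -> Option<u8> {
    if chr.is_ascii() {
        let mut buf = [0u8; 4];
        chr.encode_utf8(&mut buf);
        Some(buf[0])
    } else {
        None
    }
}

/// Suggested variant: cast directly, which is safe only after the ASCII check.
pub fn ascii_byte_via_cast(chr: char) -> Option<u8> {
    if chr.is_ascii() {
        Some(chr as u8)
    } else {
        None
    }
}
```

The ASCII check matters: without it, `chr as u8` would silently truncate a non-ASCII code point to its low byte.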

Sounds great! I've made this change.

Besides, I can look into the invalid code point issue. Is there already an issue/PR for it?

> Besides, I can look into the invalid code point issue. Is there already an issue/PR for it?

There is an issue, #778, but no PR to fix it yet (also, nobody is assigned, if you would like to take it).

HalidOdat

Looks good to me! :)

Razican

It seems there are no regressions in the benchmarks, so this looks pretty good! Thanks!

jevancc added 3 commits October 8, 2020 11:16

Add peek_char to InnerIter

6951bfa

Fix invalid code point in lexer test

71bd4e6

rustfmt

16d8124

jevancc marked this pull request as ready for review October 8, 2020 19:19

HalidOdat reviewed Oct 8, 2020

HalidOdat requested review from Lan2u, Razican and RageKnify October 8, 2020 22:26

HalidOdat added the bug and lexer labels Oct 8, 2020

HalidOdat added this to the v0.11.0 milestone Oct 8, 2020

Cast ascii char directly to u8

a12e8dc

HalidOdat approved these changes Oct 8, 2020

Razican mentioned this pull request Oct 9, 2020

Syntax error when using Unicode escape sequence with curly brackets in a string #827

Closed

Razican approved these changes Oct 9, 2020

Razican merged commit 01dbf8b into boa-dev:master Oct 9, 2020

jevancc deleted the fix/808_unicode_escape_str branch October 9, 2020 16:11
