
Do not skip control characters embedded in malformed UTF-8 characters in comments #1059

Merged
merged 13 commits on Jul 15, 2024

Conversation

udif
Contributor

@udif udif commented Jul 14, 2024

This PR is a partial fix for #1054.
Although SystemVerilog source files are plain text, some code bases write comments in languages other than English.
Slang expects source files to be UTF-8 encoded.
If a different encoding is used, a comment can be misinterpreted as an illegal UTF-8 sequence, which effectively skips control characters such as newlines.
For this to work, it may also be necessary to add -Wno-invalid-source-encoding, since slang will abort processing a file once the maximum number of lexer errors is reached (16 by default).
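To illustrate the failure mode described above, here is a minimal sketch (not slang's actual decoder; `utf8SeqLen` and `naiveSkip` are invented names): a Latin-1 byte such as 0xE9 ("é") looks like a 3-byte UTF-8 lead byte, so a scanner that trusts the claimed length will consume the following bytes, including a newline, as bogus continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Expected sequence length implied by a UTF-8 lead byte
// (0 = continuation or invalid lead byte).
static int utf8SeqLen(unsigned char b) {
    if (b < 0x80) return 1;            // ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// A naive scanner that trusts the lead byte's claimed length: given
// Latin-1 "é\n", the 0xE9 byte claims 3 bytes, so the newline at the
// next position is swallowed as a (bogus) continuation byte.
static std::size_t naiveSkip(const std::string& s, std::size_t i) {
    int len = utf8SeqLen(static_cast<unsigned char>(s[i]));
    return i + (len ? len : 1);
}
```

A robust scanner must instead validate continuation bytes and resynchronize at the first invalid byte, which is why the real fix keeps control characters visible to the comment scanner.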


codecov bot commented Jul 14, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.70%. Comparing base (4dee9aa) to head (4061a9e).
Report is 2 commits behind head on master.


@@            Coverage Diff             @@
##           master    #1059      +/-   ##
==========================================
- Coverage   94.71%   94.70%   -0.01%     
==========================================
  Files         191      191              
  Lines       47664    47669       +5     
==========================================
+ Hits        45144    45147       +3     
- Misses       2520     2522       +2     
Files Coverage Δ
source/parsing/Lexer.cpp 99.47% <66.66%> (-0.21%) ⬇️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Owner

@MikePopoloski MikePopoloski left a comment


As you mention, I think you probably also want to add a new command line option that doesn't count invalid UTF-8 sequences as errors. Otherwise the lexer is going to early-out regardless of whether you hide the warning or not. Or maybe the InvalidUTF8Seq warning should simply never count toward the lexer error count, since if the user suppresses it, they'll be pretty surprised to see the occurrences add up and cause a real error anyway.

@@ -194,6 +194,16 @@ constexpr const char* utf8Decode(const char* b, uint32_t* c, int* e, int& comput
*e |= (uc(b[3])) >> 6;
*e ^= 0x2a; // top two bits of each tail byte correct?
*e >>= shifte[len];
// For normal path, this should not be checked
Owner


The comment above describes this function as branchless; I think we should keep it that way. You can put this handling in Lexer::scanUTF8Char instead.

Contributor Author


Added.

@udif
Contributor Author

udif commented Jul 15, 2024

As you mention, I think you probably want to also add a new command line option that doesn't count invalid UTF8 sequences as errors.

At the moment I settled for using --max-lexer-errors=1000000, as I wasn't sure what other cases also trigger lexer errors.
Going through the Lexer error cases now, I got this list:

  1. Unicode BOM.
  2. Embedded nulls in the source buffer.
  3. Other non-printable ASCII characters (< 128) not used by the lexer.
  4. UTF-8 in code.
  5. Embedded null in a string literal.
  6. Embedded null in a line comment.
  7. Embedded null in a block comment.
  8. And of course, illegal UTF-8 in comments (our case).

A cleaner way to deal with this would be to check whether the invalid-source-encoding warning has been turned off with -Wno-invalid-source-encoding and, in that case, not increment the lexer error count.
The problem is that Lexer::scanUTF8Char is also used by Lexer::lexToken and Lexer::lexStringLiteral, so instead we would have to do this check explicitly in Lexer::scanLineComment and Lexer::scanBlockComment and undo the error increment.

I looked into it, and from a quick search, the Lexer class is too low-level to have access to the DiagnosticsEngine::getSeverity() method. In addition, this solution would give -Wno-invalid-source-encoding an additional side effect beyond disabling the warning.

One way to solve this is to add an explicit lexer option to keep these errors from being counted.
What do you think?
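The option floated above might look something like the following sketch. To be clear, `LexerOptionsSketch`, `countInvalidEncodingErrors`, and `shouldBailOut` are invented names for illustration; this is not slang's real LexerOptions API, and it is not the approach that was ultimately merged.

```cpp
#include <cstdint>

// Hypothetical sketch of the proposed opt-out; names are illustrative only.
struct LexerOptionsSketch {
    std::uint32_t maxErrors = 16;            // mirrors --max-lexer-errors
    bool countInvalidEncodingErrors = true;  // proposed new knob
};

// Decide whether the lexer should stop scanning a file: encoding
// diagnostics only count toward the limit if the option says so.
static bool shouldBailOut(const LexerOptionsSketch& opts,
                          std::uint32_t hardErrors,
                          std::uint32_t encodingErrors) {
    std::uint32_t counted = hardErrors;
    if (opts.countInvalidEncodingErrors)
        counted += encodingErrors;
    return counted >= opts.maxErrors;
}
```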

@MikePopoloski
Owner

As a deliberate design decision, slang doesn't change internal behavior based on whether a particular warning is disabled; filtering of warning output happens at the very end of the pipeline.

I think the right thing to do here is to just not count these cases as errors; any time we would issue a warning instead of a hard error, we shouldn't count it as an error in the lexer, since the user can suppress those warnings.

@@ -194,6 +194,7 @@ constexpr const char* utf8Decode(const char* b, uint32_t* c, int* e, int& comput
*e |= (uc(b[3])) >> 6;
*e ^= 0x2a; // top two bits of each tail byte correct?
*e >>= shifte[len];
// For normal path, this should not be checked
Owner


Looks like you left this partial comment here.

Contributor Author


Yes, this was left by mistake. I'll remove it.

@udif
Contributor Author

udif commented Jul 15, 2024

I think the right thing to do here is to just not count these cases as errors

OK, the next revision backs out the errorCount increment from scanUTF8Char, but only when it was called from line or block comments. I also added a test to make sure this works; the test is conveniently pushed before the change, so it can be shown to fail on the existing code by checking out 0d21f7f .
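The merged behavior can be sketched with a toy model (`MiniLexer` is illustrative only; the real logic lives in slang's Lexer::scanLineComment and Lexer::scanBlockComment): the shared decode helper still bumps the error count, and the comment path backs that increment out so a suppressed warning can't accumulate into a hard "too many errors" stop.

```cpp
#include <cassert>

// Toy model of the merged fix; not slang's actual Lexer class.
struct MiniLexer {
    int errorCount = 0;  // counts toward the --max-lexer-errors limit
    int warnings = 0;    // diagnostics issued (possibly suppressed later)

    // Shared decode helper, also used for tokens and string literals,
    // where an invalid sequence really is an error. Returns false and
    // records a diagnostic when the sequence is invalid.
    bool scanUTF8Char(bool valid) {
        if (!valid) {
            ++errorCount;
            ++warnings;
        }
        return valid;
    }

    // Comment path: keep the warning, but undo the error increment.
    void scanCommentChar(bool valid) {
        int before = errorCount;
        if (!scanUTF8Char(valid))
            errorCount = before;
    }
};
```

The save-and-restore shape keeps the shared helper untouched for the token and string-literal paths, where invalid UTF-8 remains a real error.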

@MikePopoloski MikePopoloski merged commit b5b8359 into MikePopoloski:master Jul 15, 2024
15 checks passed