
Do not skip control characters embedded in malformed UTF-8 characters in comments #1059

Merged
merged 13 commits on Jul 15, 2024

Conversation

udif
Contributor

@udif udif commented Jul 14, 2024

This PR is a partial fix for #1054.
Although SystemVerilog source files are plain text, some code bases write comments in languages other than English.
Slang expects source files to be UTF-8 encoded.
If a different encoding is used, a comment can be misinterpreted as an illegal UTF-8 sequence, which effectively skips control characters such as newlines.
For this to work, it may also be necessary to add -Wno-invalid-source-encoding, since slang will abort processing a file once the maximum number of lexer errors is reached (16 by default).
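To illustrate the failure mode described above, here is a minimal sketch (not slang's actual decoder; `utf8SeqLen` and `naiveSkip` are invented names): a Latin-1 byte such as 0xE9 ("é") looks like a 3-byte UTF-8 lead byte, so a scanner that trusts the claimed length will consume the following bytes, including a newline, as bogus continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Expected sequence length implied by a UTF-8 lead byte
// (0 = continuation or invalid lead byte).
static int utf8SeqLen(unsigned char b) {
    if (b < 0x80) return 1;            // ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// A naive scanner that trusts the lead byte's claimed length: given
// Latin-1 "é\n", the 0xE9 byte claims 3 bytes, so the newline at the
// next position is swallowed as a (bogus) continuation byte.
static std::size_t naiveSkip(const std::string& s, std::size_t i) {
    int len = utf8SeqLen(static_cast<unsigned char>(s[i]));
    return i + (len ? len : 1);
}
```

A robust scanner must instead validate continuation bytes and resynchronize at the first invalid byte, which is why the real fix keeps control characters visible to the comment scanner.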


codecov bot commented Jul 14, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.70%. Comparing base (4dee9aa) to head (4061a9e).
Report is 2 commits behind head on master.


@@            Coverage Diff             @@
##           master    #1059      +/-   ##
==========================================
- Coverage   94.71%   94.70%   -0.01%     
==========================================
  Files         191      191              
  Lines       47664    47669       +5     
==========================================
+ Hits        45144    45147       +3     
- Misses       2520     2522       +2     
Files Coverage Δ
source/parsing/Lexer.cpp 99.47% <66.66%> (-0.21%) ⬇️


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Owner

@MikePopoloski MikePopoloski left a comment


As you mention, I think you probably also want to add a new command line option that doesn't count invalid UTF-8 sequences as errors. Otherwise the lexer is going to early-out regardless of whether you hide the warning or not. Or maybe the InvalidUTF8Seq warning should simply never count toward the lexer error count, since if the user suppresses it, they'll be pretty surprised to see the occurrences add up and cause a real error anyway.

@@ -194,6 +194,16 @@ constexpr const char* utf8Decode(const char* b, uint32_t* c, int* e, int& comput
*e |= (uc(b[3])) >> 6;
*e ^= 0x2a; // top two bits of each tail byte correct?
*e >>= shifte[len];
// For normal path, this should not be checked
Owner


The comment above describes this function as branchless; I think we should keep it that way. You can put this handling in Lexer::scanUTF8Char instead.

Contributor Author


Added.

@udif
Contributor Author

udif commented Jul 15, 2024

As you mention, I think you probably want to also add a new command line option that doesn't count invalid UTF8 sequences as errors.

At the moment I settled for using --max-lexer-errors=1000000, as I wasn't sure what other cases also trigger lexer errors.
Going through the Lexer error cases now, I got this list:

  1. Unicode BOM.
  2. Embedded nulls in the source buffer.
  3. Other non-printable ASCII characters (< 128) not used by the lexer.
  4. UTF-8 in code.
  5. Embedded null in a string literal.
  6. Embedded null in a line comment.
  7. Embedded null in a block comment.
  8. And of course, illegal UTF-8 in comments (our case).

A cleaner way to deal with this would be to check whether the invalid-source-encoding warning has been turned off with -Wno-invalid-source-encoding and, in that case, not increment the lexer error count.
The problem is that Lexer::scanUTF8Char is also used by Lexer::lexToken and Lexer::lexStringLiteral, so instead we would have to do this check explicitly in Lexer::scanLineComment and Lexer::scanBlockComment and undo the error increment.

I looked into it, and from a quick search, the Lexer class is too low-level to have access to the DiagnosticsEngine::getSeverity() method. In addition, this solution would give -Wno-invalid-source-encoding an additional side effect beyond disabling the warning.

One way to solve this is to add an explicit lexer option to keep these errors from being counted.
What do you think?
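The option floated above might look something like the following sketch. To be clear, `LexerOptionsSketch`, `countInvalidEncodingErrors`, and `shouldBailOut` are invented names for illustration; this is not slang's real LexerOptions API, and it is not the approach that was ultimately merged.

```cpp
#include <cstdint>

// Hypothetical sketch of the proposed opt-out; names are illustrative only.
struct LexerOptionsSketch {
    std::uint32_t maxErrors = 16;            // mirrors --max-lexer-errors
    bool countInvalidEncodingErrors = true;  // proposed new knob
};

// Decide whether the lexer should stop scanning a file: encoding
// diagnostics only count toward the limit if the option says so.
static bool shouldBailOut(const LexerOptionsSketch& opts,
                          std::uint32_t hardErrors,
                          std::uint32_t encodingErrors) {
    std::uint32_t counted = hardErrors;
    if (opts.countInvalidEncodingErrors)
        counted += encodingErrors;
    return counted >= opts.maxErrors;
}
```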

@MikePopoloski
Owner

As a deliberate design decision, slang doesn't change internal behavior based on whether a particular warning is disabled; filtering of warning output happens at the very end of the pipeline.

I think the right thing to do here is to just not count these cases as errors; any time we would issue a warning instead of a hard error, we shouldn't count it as an error in the lexer, since the user can suppress those warnings.

@@ -194,6 +194,7 @@ constexpr const char* utf8Decode(const char* b, uint32_t* c, int* e, int& comput
*e |= (uc(b[3])) >> 6;
*e ^= 0x2a; // top two bits of each tail byte correct?
*e >>= shifte[len];
// For normal path, this should not be checked
Owner


Looks like you left this partial comment here.

Contributor Author


Yes, this was left by mistake. I'll remove it.

@udif
Contributor Author

udif commented Jul 15, 2024

I think the right thing to do here is to just not count these cases as errors

OK, the next revision backs out the errorCount increment from scanUTF8Char, but only when it was called from line or block comments. I also added a test to make sure this works; the test is conveniently pushed before the change, so it can be shown to fail on the existing code by checking out 0d21f7f .
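The merged behavior can be sketched with a toy model (`MiniLexer` is illustrative only; the real logic lives in slang's Lexer::scanLineComment and Lexer::scanBlockComment): the shared decode helper still bumps the error count, and the comment path backs that increment out so a suppressed warning can't accumulate into a hard "too many errors" stop.

```cpp
#include <cassert>

// Toy model of the merged fix; not slang's actual Lexer class.
struct MiniLexer {
    int errorCount = 0;  // counts toward the --max-lexer-errors limit
    int warnings = 0;    // diagnostics issued (possibly suppressed later)

    // Shared decode helper, also used for tokens and string literals,
    // where an invalid sequence really is an error. Returns false and
    // records a diagnostic when the sequence is invalid.
    bool scanUTF8Char(bool valid) {
        if (!valid) {
            ++errorCount;
            ++warnings;
        }
        return valid;
    }

    // Comment path: keep the warning, but undo the error increment.
    void scanCommentChar(bool valid) {
        int before = errorCount;
        if (!scanUTF8Char(valid))
            errorCount = before;
    }
};
```

The save-and-restore shape keeps the shared helper untouched for the token and string-literal paths, where invalid UTF-8 remains a real error.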

@MikePopoloski MikePopoloski merged commit b5b8359 into MikePopoloski:master Jul 15, 2024
15 checks passed