Do not skip control characters embedded in malformed UTF-8 characters in comments #1059
Embedding a newline in a malformed UTF8 comment.
…a control character.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1059 +/- ##
==========================================
- Coverage 94.71% 94.70% -0.01%
==========================================
Files 191 191
Lines 47664 47669 +5
==========================================
+ Hits 45144 45147 +3
- Misses 2520 2522 +2
As you mention, I think you probably also want to add a new command line option that doesn't count invalid UTF8 sequences as errors. Otherwise the lexer is going to early-out regardless of whether you hide the warning or not. Or maybe the InvalidUTF8Seq warning should simply never count towards the lexer error count, since users who suppress it would be pretty surprised to see the suppressed warnings add up and cause a real error anyway.
include/slang/text/CharInfo.h
Outdated
@@ -194,6 +194,16 @@ constexpr const char* utf8Decode(const char* b, uint32_t* c, int* e, int& comput
*e |= (uc(b[3])) >> 6;
*e ^= 0x2a; // top two bits of each tail byte correct?
*e >>= shifte[len];
// For normal path, this should not be checked
The comment above describes this function as branchless; I think we should keep it that way. You can put this handling in Lexer::scanUTF8Char instead.
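One way to picture this suggestion is a small helper that runs only on the error path, after the branchless decoder has already reported a malformed sequence. The sketch below is an illustration under assumptions, not the code from this PR: the name `safeSkipLength` and its signature are hypothetical, and it only shows how a caller such as `Lexer::scanUTF8Char` could avoid stepping over an ASCII control character.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of slang): called only after the decoder has
// flagged an error, so the decoder itself stays branchless.
// `bytes` starts at the offending lead byte; `claimedLength` is the sequence
// length that the lead byte announces.
// Returns how many bytes may be skipped without passing a control character.
constexpr std::size_t safeSkipLength(std::string_view bytes, std::size_t claimedLength) {
    std::size_t limit = claimedLength < bytes.size() ? claimedLength : bytes.size();
    for (std::size_t i = 1; i < limit; ++i) {
        unsigned char ch = static_cast<unsigned char>(bytes[i]);
        // Stop in front of ASCII control characters (newline, tab, DEL, ...)
        // so the caller still gets to process them normally.
        if (ch < 0x20 || ch == 0x7F)
            return i;
    }
    return limit;
}
```

Because the extra loop runs only for sequences that have already failed validation, well-formed input pays nothing for it.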
Added.
At the moment I settled for using
A cleaner way to deal with it is to check if I looked into it and from the quick search I did, the One way to solve this is to add an explicit lexer option to disable these errors from being counted.
As a deliberate design decision, slang doesn't change internal behavior based on whether a particular warning is disabled; filtering of warning output happens at the very end of the pipeline. I think the right thing to do here is to just not count these cases as errors: any time we would issue a warning instead of a hard error, we shouldn't then count it as an error in the lexer, since the user can suppress those warnings.
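The counting rule proposed here can be sketched as follows. This is a hypothetical illustration, not slang's actual lexer code: `Severity`, `ErrorBudget`, and the exact comparison against the limit are all assumptions made for the example.

```cpp
#include <cstdint>

// Hypothetical severity levels for diagnostics issued while lexing.
enum class Severity { Warning, Error };

// Hypothetical error budget: only hard errors consume it, so diagnostics
// the user is able to suppress can never add up to an abort.
class ErrorBudget {
public:
    explicit ErrorBudget(std::uint32_t limit) : maxErrors(limit) {}

    // Record a diagnostic; warnings are deliberately not counted.
    void report(Severity severity) {
        if (severity == Severity::Error)
            ++errorCount;
    }

    // True once the number of hard errors has reached the limit.
    bool exhausted() const { return errorCount >= maxErrors; }

    std::uint32_t errors() const { return errorCount; }

private:
    std::uint32_t maxErrors;
    std::uint32_t errorCount = 0;
};
```

With this rule, whether a warning is later shown or hidden has no effect on when lexing stops, which keeps the filtering at the end of the pipeline.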
include/slang/text/CharInfo.h
Outdated
@@ -194,6 +194,7 @@ constexpr const char* utf8Decode(const char* b, uint32_t* c, int* e, int& comput
*e |= (uc(b[3])) >> 6;
*e ^= 0x2a; // top two bits of each tail byte correct?
*e >>= shifte[len];
// For normal path, this should not be checked
Looks like you left this partial comment here.
Yes, this was left by mistake. I'll remove it.
and will not cause compilation abort
OK, next revision backs out the
This PR is a partial fix for #1054.
SystemVerilog source files are plain ASCII, but some code bases write their comments in other languages.
Slang expects comments to be encoded in UTF-8.
If a different encoding is used, bytes in a comment can be misinterpreted as a malformed UTF-8 sequence, and the lexer then effectively skips any control character, such as a newline, that falls inside that sequence.
For this to work, it may be necessary to add the -Wno-invalid-source-encoding option, since slang aborts processing of a file once the maximum number of lexer errors is reached (the default is 16).
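To make the failure mode concrete, here is a minimal, self-contained sketch of how a newline can be swallowed. It is not slang's lexer: `claimedLength` and `countNewlinesNaive` are hypothetical names, and the scanner is deliberately naive in that it trusts the length announced by the lead byte without validating the bytes that follow.

```cpp
#include <cstddef>
#include <string_view>

// Sequence length that a UTF-8 lead byte announces (1 for ASCII and for
// bytes that are not valid lead bytes).
constexpr std::size_t claimedLength(unsigned char lead) {
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical naive scanner: counts the newlines it sees, but advances by
// the announced length without checking the continuation bytes.
constexpr std::size_t countNewlinesNaive(std::string_view text) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size();) {
        unsigned char ch = static_cast<unsigned char>(text[i]);
        if (ch == '\n')
            ++count;
        i += claimedLength(ch);
    }
    return count;
}

// Reference count: every '\n' byte actually present in the text.
constexpr std::size_t countNewlinesRaw(std::string_view text) {
    std::size_t count = 0;
    for (char ch : text)
        if (ch == '\n')
            ++count;
    return count;
}
```

For example, in a Latin-1 encoded comment ending in `é` (byte `0xE9`), that byte looks like the lead of a three-byte UTF-8 sequence. The naive scanner then skips the newline that ends the comment together with the first byte of the next line, so the following line of code is treated as part of the comment.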