[swift]improve multiline nested comment lexer rule for swift2&swift3&swift5 #4184

Open
hollowrider opened this issue Aug 1, 2024 · 2 comments

@hollowrider

The Block_comment lexer rule can't handle input such as /*/**/. It leads to unexpected behavior. The current Block_comment rule is:

Block_comment: '/*' (Block_comment | .)*? '*/' -> channel(HIDDEN);

To fix that, I made a small change to the Block_comment rule:

Block_comment: '/*' (Block_comment | '/' ~'*'|~'/')*? '*/' -> channel(HIDDEN);

This rule refuses a bare /* inside a Block_comment and matches nested comments correctly. I found the same defect in the swift2, swift3, and swift5 lexer files, and it may also exist in other grammar files that allow nested multiline comments.

The Swift code that triggers the problem is below:

/*/**/
let _: [Any] = [
    0, 1, 1.0, 1.0e+1, 1e+1, true,
    "Hello, world!", "Hello, \(1)!", "Hello, \(1.0e+1)!", "Hello, \(Int.max)!",
    (nil == nil)
]
/** 
another comment
*/
if 10 < 20{
	if 10 < 20{
	}
}

With the original Block_comment rule, it is tokenized like this:

[@0,0:228='/*/**/\r\n\r\nlet _: [Int] = []\r\nlet _ = [1, 2, 3]\r\nlet _: [Any] = [\r\n    0, 1, 1.0, 1.0e+1, 1e+1, true,\r\n    "Hello, world!", "Hello, \(1)!", "Hello, \(1.0e+1)!", "Hello, \(Int.max)!",\r\n    (nil == nil)\r\n]\r\n/** \r\nanother comment\r\n*/',<Block_comment>,channel=1,1:0]

After fixing this defect, the lexer produces the tokens below, and the parser then throws an exception as expected.

[@0,0:0='/',<'/'>,1:0]
[@1,1:1='*',<'*'>,1:1]
[@2,2:5='/**/',<Block_comment>,channel=1,1:2]
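
The effect of the changed rule can be reproduced outside ANTLR with a small hand-written matcher. This is only a sketch in Python under stated assumptions: the function name match_block_comment is hypothetical, the loop is a simplified, deterministic reading of the three alternatives in the rule, and it is not how ANTLR's lexer is actually implemented. The sample string is a shortened version of the Swift input above.

```python
def match_block_comment(text, pos):
    """Return the index just past a block comment starting at pos, or None.

    Simplified reading of:
        '/*' (Block_comment | '/' ~'*' | ~'/')*? '*/'
    """
    if not text.startswith("/*", pos):
        return None
    i = pos + 2
    while i < len(text):
        if text.startswith("*/", i):
            # Closing delimiter: the comment ends here.
            return i + 2
        if text.startswith("/*", i):
            # Alternative 1: a nested Block_comment, which must close itself.
            end = match_block_comment(text, i)
            if end is None:
                return None
            i = end
        elif text[i] == "/":
            # Alternative 2: '/' followed by one character that is not '*'.
            if i + 1 >= len(text):
                return None
            i += 2
        else:
            # Alternative 3: any single character other than '/'.
            i += 1
    # End of input reached without a closing '*/'.
    return None


sample = "/*/**/\nlet x = 1\n/** \nanother comment\n*/\nif 10 < 20 {\n}\n"
print(match_block_comment(sample, 0))  # None: the outer /* is never closed
print(match_block_comment(sample, 2))  # 6: '/**/' covers offsets 2 to 5
```

Because no comment can be matched at offset 0, a lexer has to fall back to single-character tokens for / and *, and only then finds the complete /**/ comment at offsets 2 to 5, which agrees with the token list above.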

@msagca
Contributor

msagca commented Aug 1, 2024

Hi @hollowrider,

Formal syntax rules associated with comments in the documentation are as follows:

comment → // comment-text line-break
multiline-comment → /* multiline-comment-text */
comment-text → comment-text-item comment-text?
comment-text-item → Any Unicode scalar value except U+000A or U+000D
multiline-comment-text → multiline-comment-text-item multiline-comment-text?
multiline-comment-text-item → multiline-comment
multiline-comment-text-item → comment-text-item
multiline-comment-text-item → Any Unicode scalar value except /* or */

Doesn't your input violate these rules since it contains an unmatched /*? It is not a nested comment because it's not a comment since it's not terminated by */. Maybe I'm interpreting the syntax rules wrong.
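
The unmatched /* can be checked mechanically with a depth counter that follows the last rule above, where /* and */ are never plain comment text. This is a minimal sketch in Python, not code from the grammar repository: the function name unclosed_comment_depth is hypothetical, and it deliberately ignores string literals and // line comments, which a real lexer would have to handle.

```python
def unclosed_comment_depth(text):
    """Return how many block comments are still open at the end of text."""
    depth = 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "/*":
            # Every /* opens a comment, whether top-level or nested.
            depth += 1
            i += 2
        elif pair == "*/" and depth > 0:
            # Every */ closes the innermost open comment.
            depth -= 1
            i += 2
        else:
            i += 1
    return depth


print(unclosed_comment_depth("/**/"))    # 0: properly terminated
print(unclosed_comment_depth("/*/**/"))  # 1: the first /* is never closed
```

A result greater than zero means the input contains a /* without a matching */, which is the case for /*/**/.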

@hollowrider
Author

@msagca Thanks for your comment.
Exactly, /*/**/ violates these rules. However, the problem I ran into is that when a user supplies a Swift file containing a syntax mistake like this, the parser gives an unexpected output.
The Swift input below contains /*/**/, and it should definitely raise an error because it violates the rules you listed. However, when I parse this file, no errors are reported.

/*/**/
let _: [Any] = [
    0, 1, 1.0, 1.0e+1, 1e+1, true,
    "Hello, world!", "Hello, \(1)!", "Hello, \(1.0e+1)!", "Hello, \(Int.max)!",
    (nil == nil)
]
/** 
another comment
*/
if 10 < 20{
	if 10 < 20{
	}
}

If you run grun with the -tokens option on this file, you can see the reason. The lexer recognizes everything from line 1 through line 9 as a single Block_comment (called multiline-comment in the Swift book). The lexer token output is below:

[@0,0:188='/*/**/\r\nlet _: [Any] = [\r\n    0, 1, 1.0, 1.0e+1, 1e+1, true,\r\n    "Hello, world!", "Hello, \(1)!", "Hello, \(1.0e+1)!", "Hello, \(Int.max)!",\r\n    (nil == nil)\r\n]\r\n/** \r\nanother comment\r\n*/',<Block_comment>,channel=1,1:0]

This isn't what I expect. To fix that, I suggest changing the Block_comment rule as shown below. The changed lexer tokenizes the leading / and * separately from the multiline comment that follows, so the parser raises an error.

Block_comment: '/*' (Block_comment | '/' ~'*'|~'/')*? '*/' -> channel(HIDDEN);

Here is the lexer output after changing the rule:

[@0,0:0='/',<'/'>,1:0]
[@1,1:1='*',<'*'>,1:1]
[@2,2:5='/**/',<Block_comment>,channel=1,1:2]

To be honest, I'm not an experienced ANTLR grammar writer, but I wanted to share the problem I ran into and improve the g4 file. Do you think this could work?
