Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escapes sequence recognition failure in character sets #1537

Closed
renatahodovan opened this issue Dec 20, 2016 · 32 comments
Closed

Escapes sequence recognition failure in character sets #1537

renatahodovan opened this issue Dec 20, 2016 · 32 comments

Comments

@renatahodovan
Copy link
Contributor

After the 4.6 release, the behaviour of escape sequences in the lexer's character sets changed compared to 4.5.3. I'm not sure whether they are bugs or features but I think it's worth to mention:

  1. [\[] this worked in 4.5.3 but in 4.6 it's an invalid escape sequence (instead [[] can be used)
  2. [\v] vertical tabs worked in 4.5.3 but they are invalid in 4.6
  3. In 4.5.3, arbitrary characters could be escaped like [\j\k\l] but now they are invalid.
  4. [d-a]: although its a bit weird, but was valid in 4.5.3, but it's not in 4.6

If these changes are expected, then I'm going to create some PR-s in grammars-v4, since several of them are failing now.

@parrt
Copy link
Member

parrt commented Dec 20, 2016

Yeah, basically it's saying now that the escape is unnecessary. I didn't realize the PR made it an error rather than warning. sorry about that. Please do create a PR for grammars-v4! :)

@renatahodovan
Copy link
Contributor Author

Does it mean that vertical tabs are not allowed anymore (i.e., they should be removed)? And a second thing, grammars-v4 still uses 4.5.3. Will it be bumped up?

@parrt
Copy link
Member

parrt commented Dec 20, 2016

Hmm...well, \v can be done with \u000B so let's leave it. it's rare. Yeah, we should bump to 4.6

@KvanTTT
Copy link
Member

KvanTTT commented Dec 21, 2016

It seems to me that this pull request #1517 brought such behavior. In my opinion these cases are not clear. So, @renatahodovan make pull request to grammars if you want :)

@parrt
Copy link
Member

parrt commented Dec 22, 2016

@renatahodovan we could add \v if you want. Also let's drop from error to warning on [\[] type stuff. You ok with that @KvanTTT ?

@renatahodovan
Copy link
Contributor Author

@parrt yes, I think it would be useful for the sake of backward compatibility. Thanks!

@KvanTTT
Copy link
Member

KvanTTT commented Dec 22, 2016

Yes, warning instead of error is a good idea. I'll try to fix these issues at the beginning of January.

@parrt
Copy link
Member

parrt commented Jan 6, 2017

Any pull request for this coming? If not, I can do it.

@KvanTTT
Copy link
Member

KvanTTT commented Jan 6, 2017

Not ready for now :(

@Nulleye
Copy link

Nulleye commented Jan 11, 2017

Just to help on testing.
This:

ID : [a-zA-Z&/][a-zA-Z0-9\.@\-_\*/><]*;

fails on 4.6 but not on 4.5.3

@KvanTTT
Copy link
Member

KvanTTT commented Jan 11, 2017

@Nulleye thank you for feedback!

@Nulleye
Copy link

Nulleye commented Jan 11, 2017

Sorry, I realized that the web form has eaten some backslash chars.
To be accurate you have to put a backslash before the . - and *

@parrt
Copy link
Member

parrt commented Jan 11, 2017

@Nulleye i added the tick marks to make it appear as code. thanks!

@sharwell
Copy link
Member

@Nulleye You don't need to escape the . and * characters inside a character set. Also, the - should appear last if it's not part of a range, like this:

ID : [a-zA-Z&/][a-zA-Z0-9.@_*/><-]*;

@parrt
Copy link
Member

parrt commented Feb 24, 2017

@KvanTTT can you take a look at this now that we've pulled in the unicode 32 stuff?

@parrt parrt added this to the 4.7 milestone Feb 24, 2017
@KvanTTT
Copy link
Member

KvanTTT commented Feb 24, 2017

OK, I'll take a look.

@KvanTTT
Copy link
Member

KvanTTT commented Feb 24, 2017

Hmm...well, \v can be done with \u000B so let's leave it. it's rare.

I think we should not support it because of inconsistency.

In 4.5.3 we can not write \v char inside quotes: '\v' but can write in square brackets [\v].

Since 4.6 version the \v char is completely restricted.

@parrt
Copy link
Member

parrt commented Feb 24, 2017

Ok, since 4.6 will not allow, let's keep it as-is. no \v anywhere. Use \u000B.

@KvanTTT
Copy link
Member

KvanTTT commented Feb 24, 2017

Moreover, I suggest just to close this issue.

In 4.5.3 version we can not use backslash inside quote literals.
Fo example: '\[' throws a error, [\[] does not throw a error.

Since 4.6 version we can not use backslash inside square-bracket blocks too.

So, since ANTLR 4.6 escape chars processing is consistent. See also testValidEscapeSequences test.

@parrt
Copy link
Member

parrt commented Feb 24, 2017

Ok, will close. Can we change the errors to be warnings (I think you made them errors) though that broke all the grammars-v4 grammars?

@KvanTTT
Copy link
Member

KvanTTT commented Feb 24, 2017

Warnings for square-bracket block or for both syntax? It's easier to update our grammars :)

@parrt
Copy link
Member

parrt commented Feb 24, 2017

Well people over-escaped previously which caused lots of failures to build as it became an error rather than warning (or was previously just ignored). Could be lots of old 4.5.3 grammars out there that did \x inappropriately. Let's just make the stuff you added as errors into warnings if it makes sense like ignoring \ in \x if x is not a valid escape.

parrt added a commit that referenced this issue Mar 1, 2017
parrt added a commit to parrt/antlr4 that referenced this issue Mar 2, 2017
@KvanTTT
Copy link
Member

KvanTTT commented Mar 2, 2017

@parrt what should we do with such char sets?

  • CHARSET_WITHOUT_START: [-z]
  • CHARSET_WITHOUT_END: [a-]

Possible solutions:

  • Consider that the first and last bounds are zero and infinity respectively, i. e
    [\u0000-z] and [a-\uFFFF].
  • Add a new error "incorrect char set" for such cases.

I think the second choice is better.

@parrt
Copy link
Member

parrt commented Mar 2, 2017

yeah, an error is best choice. can we reuse a previous error?

@KvanTTT
Copy link
Member

KvanTTT commented Mar 2, 2017

I'm afraid but I didn't find a corresponding type for such error.

@parrt
Copy link
Member

parrt commented Mar 2, 2017

dang. ok, maybe create another error maybe error code 165 INVALID_SET?

@KvanTTT
Copy link
Member

KvanTTT commented Mar 2, 2017

I agree. I'll fix it tomorrow.

@mwpowellhtx
Copy link

Hello, along similar lines, I've got the following escape sequences to consider:

strLit = ( "'" { charValue } "'" ) | ( '"' { charValue } '"' )
charValue = hexEscape | octEscape | charEscape | /[^\0\n\\]/
hexEscape = '\' ( "x" | "X" ) hexDigit hexDigit
octEscape = '\' octalDigit octalDigit octalDigit
charEscape = '\' ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | '\' | "'" | '"' )
quote = "'" | '"'

I know how I might approach it using something like Boost.Spirit.Qi, with:

hex_esc %= no_case["\\x"] >> uint_parser<unsigned char, 16, 2, 2>{};
oct_esc %= '\\' >> uint_parser<unsigned char, 8, 3, 3>{};
// The last bit in this phrase is literally, "Or Any Characters Not in the Sequence".
char_val %= hex_esc | oct_esc | char_esc | ~char_("\0\n\\");
str_lit %= ("'" >> *(char_val - "'") >> "'")
    | ('"' >> *(char_val - '"') >> '"')
    ;

And the escape sequences:

struct escapes_t : qi::symbols<char, char> {
    escapes_t() {
        this->add("\\a", '\a')
            ("\\b", '\b')
            ("\\f", '\f')
            ("\\n", '\n')
            ("\\r", '\r')
            ("\\t", '\t')
            ("\\v", '\v')
            ("\\\\", '\\')
            ("\\'", '\'')
            ("\\\"", '"')
            ;
    }
} char_esc;

Curious how that might flow in ANTLR4 targeting C#.

@andreasabel
Copy link

@parrt wrote:

Well people over-escaped previously which caused lots of failures to build as it became an error rather than warning (or was previously just ignored). Could be lots of old 4.5.3 grammars out there that did \x inappropriately. Let's just make the stuff you added as errors into warnings if it makes sense like ignoring \ in \x if x is not a valid escape.

It seem that this is not what has been implemented in the end. I tried this on 4.9:

lexer grammar Issue1537;

CHAR : '\''   -> more, mode(CHARMODE);

mode CHARMODE;
CHARANY     :  ~[\'\\] -> more, mode(CHAREND);  // extra escaping of '

mode CHAREND;
CHARENDC     :  '\''  -> type(CHAR), mode(DEFAULT_MODE);

This makes ANTLR crash:

$ java org.antlr.v4.Tool Issue1537.g4
warning(156): Issue1537.g4:7:16: invalid escape sequence \'
Exception in thread "main" java.lang.RuntimeException: set is empty
	at org.antlr.v4.runtime.misc.IntervalSet.getMaxElement(IntervalSet.java:421)
	at org.antlr.v4.runtime.atn.ATNSerializer.serialize(ATNSerializer.java:169)
	at org.antlr.v4.runtime.atn.ATNSerializer.getSerialized(ATNSerializer.java:601)
	at org.antlr.v4.Tool.generateInterpreterData(Tool.java:745)
	at org.antlr.v4.Tool.processNonCombinedGrammar(Tool.java:400)
	at org.antlr.v4.Tool.process(Tool.java:369)
	at org.antlr.v4.Tool.processGrammarsOnCommandLine(Tool.java:328)
	at org.antlr.v4.Tool.main(Tool.java:172)

See the discussion at BNFC/bnfc#329.

Should I open a new antlr4 issue for this?

@parrt
Copy link
Member

parrt commented Jan 1, 2021

hi. at this point I'm not doing a lot of fixes but it seems reasonable to prevent the tool from failing given \' inside of a character set. I'm just focused on other things now.

@andreasabel
Copy link

No hurry.
Software development is an eternal cycle of: new featues - new regressions - bug fixes, with epicycles at bug fixes - regressions introduced by bug fixes - regressions introduced by regression fixes etc.
I take your answer as a "yes".

@KvanTTT
Copy link
Member

KvanTTT commented Dec 30, 2021

Now ANTLR reports error(156): Test.g4:8:16: invalid escape sequence \' without crash because now it's error instead of a warning (it's fixed by 7f07af8). Now it's eventually fixed I guess :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants