Escapes sequence recognition failure in character sets #1537

renatahodovan · 2016-12-20T16:34:45Z

After the 4.6 release, the behaviour of escape sequences in the lexer's character sets changed compared to 4.5.3. I'm not sure whether these are bugs or features, but I think they are worth mentioning:

[\[] this worked in 4.5.3 but in 4.6 it's an invalid escape sequence (instead [[] can be used)
[\v] vertical tabs worked in 4.5.3 but they are invalid in 4.6
In 4.5.3, arbitrary characters could be escaped like [\j\k\l] but now they are invalid.
[d-a]: although it's a bit weird, it was valid in 4.5.3, but it's not in 4.6

If these changes are expected, then I'm going to create some PR-s in grammars-v4, since several of them are failing now.
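As a rough way to find such spots in existing grammars, the four cases above can be checked with a small script. This is only a minimal Python sketch, not ANTLR's actual logic: the name `check_charset` and the table of escapes assumed to be valid (`ASSUMED_VALID_ESCAPES`) are illustrative assumptions, and forms such as `\u{...}` are not handled.

```python
import re

# Assumption: escapes accepted inside [...]. Not taken from ANTLR's source.
ASSUMED_VALID_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
                         "\\": "\\", "]": "]", "-": "-"}


def check_charset(body, valid_escapes=ASSUMED_VALID_ESCAPES):
    """Return a list of problems for the text between '[' and ']'."""
    problems = []
    items = []  # (decoded char, is_unescaped_dash)
    i = 0
    while i < len(body):
        c = body[i]
        if c != "\\":
            items.append((c, c == "-"))
            i += 1
            continue
        rest = body[i + 1:]
        uni = re.match(r"u([0-9a-fA-F]{4})", rest)
        if uni:
            # \uXXXX escape: one code point
            items.append((chr(int(uni.group(1), 16)), False))
            i += 6
        elif rest[:1] in valid_escapes:
            items.append((valid_escapes[rest[0]], False))
            i += 2
        else:
            problems.append("invalid escape sequence \\" + rest[:1])
            i += 2
    # Walk the decoded items and look for reversed ranges such as d-a.
    k = 0
    while k < len(items):
        if k + 2 < len(items) and items[k + 1][1]:
            lo, hi = items[k][0], items[k + 2][0]
            if lo > hi:
                problems.append(f"reversed range {lo}-{hi}")
            k += 3
        else:
            k += 1
    return problems


for body in [r"\[", r"\v", r"\j\k\l", "d-a", "a-z"]:
    print(f"[{body}]", check_charset(body))
```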

parrt · 2016-12-20T18:47:40Z

Yeah, basically it's saying now that the escape is unnecessary. I didn't realize the PR made it an error rather than warning. sorry about that. Please do create a PR for grammars-v4! :)

renatahodovan · 2016-12-20T18:55:15Z

Does it mean that vertical tabs are not allowed anymore (i.e., they should be removed)? And a second thing, grammars-v4 still uses 4.5.3. Will it be bumped up?

parrt · 2016-12-20T18:56:58Z

Hmm...well, \v can be done with \u000B so let's leave it. it's rare. Yeah, we should bump to 4.6

KvanTTT · 2016-12-21T09:23:14Z

It seems to me that pull request #1517 introduced this behavior. In my opinion these cases are not clear-cut. So, @renatahodovan, make a pull request to grammars-v4 if you want :)

parrt · 2016-12-22T17:27:46Z

@renatahodovan we could add \v if you want. Also let's drop from error to warning on [\[] type stuff. You ok with that @KvanTTT ?

renatahodovan · 2016-12-22T21:51:26Z

@parrt yes, I think it would be useful for the sake of backward compatibility. Thanks!

KvanTTT · 2016-12-22T22:46:14Z

Yes, warning instead of error is a good idea. I'll try to fix these issues at the beginning of January.

parrt · 2017-01-06T14:16:32Z

Any pull request for this coming? If not, I can do it.

KvanTTT · 2017-01-06T19:21:08Z

Not ready for now :(

Nulleye · 2017-01-11T09:54:02Z

Just to help on testing.
This:

ID : [a-zA-Z&/][a-zA-Z0-9\.@\-_\*/><]*;

fails on 4.6 but not on 4.5.3

KvanTTT · 2017-01-11T10:09:47Z

@Nulleye thank you for feedback!

Nulleye · 2017-01-11T11:07:27Z

Sorry, I realized that the web form has eaten some backslash chars.
To be accurate you have to put a backslash before the . - and *

parrt · 2017-01-11T17:39:49Z

@Nulleye i added the tick marks to make it appear as code. thanks!

sharwell · 2017-01-11T19:31:07Z

@Nulleye You don't need to escape the . and * characters inside a character set. Also, the - should appear last if it's not part of a range, like this:

ID : [a-zA-Z&/][a-zA-Z0-9.@_*/><-]*;
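The same cleanup can be applied mechanically when migrating many old grammars. The following is a minimal Python sketch under stated assumptions: `simplify_charset` is a hypothetical helper, it assumes 4.5.3 treated `\x` as a plain `x` for unknown escapes, it replaces `\v` with `\u000B` as suggested earlier in this thread, and it does not handle `\-` used as a range endpoint or `\p{...}` forms.

```python
# Assumption: escapes that must keep their backslash inside [...].
KEEP_ESCAPED = set("nrtbfu\\]")


def simplify_charset(body):
    """Drop unneeded backslashes and move a literal '-' to the end."""
    out = []
    literal_dash = False
    i = 0
    while i < len(body):
        c = body[i]
        if c == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt == "-":
                literal_dash = True          # re-emitted once, at the end
            elif nxt == "v":
                out.append("\\u000B")        # vertical tab as a Unicode escape
            elif nxt in KEEP_ESCAPED:
                out.append(c + nxt)          # keep the escape as written
            else:
                out.append(nxt)              # over-escaped: drop the backslash
            i += 2
        else:
            out.append(c)
            i += 1
    return "".join(out) + ("-" if literal_dash else "")


print(simplify_charset(r"a-zA-Z0-9\.@\-_\*/><"))
```

Run on the body of the second set from the failing rule, it yields the form suggested above.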

parrt · 2017-02-24T01:31:28Z

@KvanTTT can you take a look at this now that we've pulled in the unicode 32 stuff?

KvanTTT · 2017-02-24T15:44:26Z

OK, I'll take a look.

KvanTTT · 2017-02-24T22:32:26Z

> Hmm...well, \v can be done with \u000B so let's leave it. it's rare.

I think we should not support it because of inconsistency.

In 4.5.3 we cannot write the \v char inside quotes ('\v'), but we can write it in square brackets ([\v]).

Since version 4.6, the \v char is disallowed everywhere.

parrt · 2017-02-24T22:34:17Z

Ok, since 4.6 will not allow, let's keep it as-is. no \v anywhere. Use \u000B.

KvanTTT · 2017-02-24T22:45:36Z

Moreover, I suggest just closing this issue.

In version 4.5.3 we cannot use such backslash escapes inside quoted literals.
For example: '\[' throws an error, while [\[] does not.

Since version 4.6 we cannot use them inside square-bracket blocks either.

So, since ANTLR 4.6, escape char processing is consistent. See also the testValidEscapeSequences test.

parrt · 2017-02-24T22:46:47Z

Ok, will close. Can we change the errors to warnings, though (I think you made them errors)? That change broke all the grammars-v4 grammars.

KvanTTT · 2017-02-24T22:49:50Z

Warnings for square-bracket blocks only, or for both syntaxes? It's easier to update our grammars :)

parrt · 2017-02-24T23:02:29Z

Well people over-escaped previously which caused lots of failures to build as it became an error rather than warning (or was previously just ignored). Could be lots of old 4.5.3 grammars out there that did \x inappropriately. Let's just make the stuff you added as errors into warnings if it makes sense like ignoring \ in \x if x is not a valid escape.

error-> warnings. Fixes #1537

…. This is related to antlr#1537. All tool errors pass now.

KvanTTT · 2017-03-02T19:48:54Z

@parrt what should we do with such char sets?

CHARSET_WITHOUT_START: [-z]
CHARSET_WITHOUT_END: [a-]

Possible solutions:

1. Consider that the first and last bounds are zero and infinity respectively, i.e. [\u0000-z] and [a-\uFFFF].
2. Add a new error "incorrect char set" for such cases.

I think the second choice is better.

parrt · 2017-03-02T20:56:09Z

yeah, an error is best choice. can we reuse a previous error?

KvanTTT · 2017-03-02T22:17:19Z

I'm afraid I didn't find an existing error type that fits such an error.

parrt · 2017-03-02T22:45:51Z

dang. ok, maybe create another error, say error code 165 INVALID_SET?

KvanTTT · 2017-03-02T22:52:36Z

I agree. I'll fix it tomorrow.

mwpowellhtx · 2019-02-01T05:16:58Z

Hello, along similar lines, I've got the following escape sequences to consider:

strLit = ( "'" { charValue } "'" ) | ( '"' { charValue } '"' )
charValue = hexEscape | octEscape | charEscape | /[^\0\n\\]/
hexEscape = '\' ( "x" | "X" ) hexDigit hexDigit
octEscape = '\' octalDigit octalDigit octalDigit
charEscape = '\' ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | '\' | "'" | '"' )
quote = "'" | '"'

I know how I might approach it using something like Boost.Spirit.Qi, with:

hex_esc %= no_case["\\x"] >> uint_parser<unsigned char, 16, 2, 2>{};
oct_esc %= '\\' >> uint_parser<unsigned char, 8, 3, 3>{};
// The last bit in this phrase is literally, "Or Any Characters Not in the Sequence".
char_val %= hex_esc | oct_esc | char_esc | ~char_("\0\n\\");
str_lit %= ("'" >> *(char_val - "'") >> "'")
    | ('"' >> *(char_val - '"') >> '"')
    ;

And the escape sequences:

struct escapes_t : qi::symbols<char, char> {
    escapes_t() {
        this->add("\\a", '\a')
            ("\\b", '\b')
            ("\\f", '\f')
            ("\\n", '\n')
            ("\\r", '\r')
            ("\\t", '\t')
            ("\\v", '\v')
            ("\\\\", '\\')
            ("\\'", '\'')
            ("\\\"", '"')
            ;
    }
} char_esc;

Curious how that might flow in ANTLR4 targeting C#.
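This is not an ANTLR grammar, but as a target-neutral reference for what the EBNF above accepts, the escape rules can be sketched as a small Python decoder. The names `CHAR_VALUE`, `CHAR_ESCAPES` and `decode_str_lit` are illustrative assumptions, and the opening quote is excluded from plain characters in the same way the Qi code subtracts it.

```python
import re

# Single-character escapes from the charEscape rule.
CHAR_ESCAPES = {"a": "\a", "b": "\b", "f": "\f", "n": "\n", "r": "\r",
                "t": "\t", "v": "\v", "\\": "\\", "'": "'", '"': '"'}

# One alternative per branch of charValue: hexEscape | octEscape | charEscape | plain.
CHAR_VALUE = re.compile(
    r"""\\[xX](?P<hex>[0-9a-fA-F]{2})
      | \\(?P<oct>[0-7]{3})
      | \\(?P<esc>[abfnrtv\\'"])
      | (?P<plain>[^\0\n\\])""",
    re.VERBOSE,
)


def decode_str_lit(text):
    """Decode a quoted literal such as "a\\n" into the string it denotes."""
    if len(text) < 2 or text[0] not in "'\"" or text[-1] != text[0]:
        raise ValueError("not a quoted string literal")
    quote, body = text[0], text[1:-1]
    out = []
    pos = 0
    while pos < len(body):
        m = CHAR_VALUE.match(body, pos)
        # An unescaped quote of the same kind may not appear in the body.
        if m is None or m.group("plain") == quote:
            raise ValueError(f"bad character at offset {pos}")
        if m.group("hex"):
            out.append(chr(int(m.group("hex"), 16)))
        elif m.group("oct"):
            out.append(chr(int(m.group("oct"), 8)))
        elif m.group("esc"):
            out.append(CHAR_ESCAPES[m.group("esc")])
        else:
            out.append(m.group("plain"))
        pos = m.end()
    return "".join(out)


print(repr(decode_str_lit(r'"a\n\x41\101"')))
```

A decoder like this can serve as an oracle when testing whatever lexer rule ends up being written for the target.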

andreasabel · 2021-01-01T17:39:28Z

@parrt wrote:

Well people over-escaped previously which caused lots of failures to build as it became an error rather than warning (or was previously just ignored). Could be lots of old 4.5.3 grammars out there that did \x inappropriately. Let's just make the stuff you added as errors into warnings if it makes sense like ignoring \ in \x if x is not a valid escape.

It seems that this is not what has been implemented in the end. I tried this on 4.9:

lexer grammar Issue1537;

CHAR : '\''   -> more, mode(CHARMODE);

mode CHARMODE;
CHARANY     :  ~[\'\\] -> more, mode(CHAREND);  // extra escaping of '

mode CHAREND;
CHARENDC     :  '\''  -> type(CHAR), mode(DEFAULT_MODE);

This makes ANTLR crash:

$ java org.antlr.v4.Tool Issue1537.g4
warning(156): Issue1537.g4:7:16: invalid escape sequence \'
Exception in thread "main" java.lang.RuntimeException: set is empty
	at org.antlr.v4.runtime.misc.IntervalSet.getMaxElement(IntervalSet.java:421)
	at org.antlr.v4.runtime.atn.ATNSerializer.serialize(ATNSerializer.java:169)
	at org.antlr.v4.runtime.atn.ATNSerializer.getSerialized(ATNSerializer.java:601)
	at org.antlr.v4.Tool.generateInterpreterData(Tool.java:745)
	at org.antlr.v4.Tool.processNonCombinedGrammar(Tool.java:400)
	at org.antlr.v4.Tool.process(Tool.java:369)
	at org.antlr.v4.Tool.processGrammarsOnCommandLine(Tool.java:328)
	at org.antlr.v4.Tool.main(Tool.java:172)

See the discussion at BNFC/bnfc#329.

Should I open a new antlr4 issue for this?

parrt · 2021-01-01T17:58:12Z

hi. at this point I'm not doing a lot of fixes but it seems reasonable to prevent the tool from failing given \' inside of a character set. I'm just focused on other things now.

andreasabel · 2021-01-01T18:12:00Z

No hurry.
Software development is an eternal cycle of: new features - new regressions - bug fixes, with epicycles at bug fixes - regressions introduced by bug fixes - regressions introduced by regression fixes etc.
I take your answer as a "yes".

KvanTTT · 2021-12-30T13:33:22Z

ANTLR now reports error(156): Test.g4:8:16: invalid escape sequence \' without crashing, because it's now an error instead of a warning (fixed by 7f07af8). So it's finally fixed, I guess :)

parrt added the type:improvement label Dec 22, 2016

KvanTTT mentioned this issue Feb 21, 2017

New extended Unicode escape \u{10ABCD} to support Unicode literals > U+FFFF #1633

Merged

parrt added this to the 4.7 milestone Feb 24, 2017

parrt added comp:tool error-handling labels Feb 24, 2017

KvanTTT mentioned this issue Feb 25, 2017

Useless escape in charset warning #1696

Closed

parrt closed this as completed in 0708496 Mar 1, 2017

parrt added a commit that referenced this issue Mar 1, 2017

Merge pull request #1709 from parrt/fix-1537

ed4e358

error-> warnings. Fixes #1537

parrt added a commit to parrt/antlr4 that referenced this issue Mar 2, 2017

we left invalid escapes in string literals which was causing an error…

d9ae13f

…. This is related to antlr#1537. All tool errors pass now.

garyelephant mentioned this issue Mar 30, 2018

configParse对正则表达式支持不完善 apache/seatunnel#95

Closed

kaby76 mentioned this issue Dec 31, 2020

ANTLR backend: Spurious escaping of single quotes in character sets BNFC/bnfc#329

Closed

andreasabel mentioned this issue Jan 1, 2021

Over-escaping in char sets makes ANTLR 4.9 crash #3024

Closed
