Escape bad words during grammar generation #3451

Merged (3 commits) on Jan 5, 2022


Member

KvanTTT commented Jan 1, 2022

fixes #1070

Deprecate USE_OF_BAD_WORD

Ok, I'm converting this to a pull request since we've already started reviewing and discussing.

Member Author

KvanTTT commented Jan 1, 2022

I didn't want to make a draft because of failed checks (I haven't completed Go). But actually, it doesn't matter for now.

…r -> for_ but RULE_for)

Deprecate USE_OF_BAD_WORD
Member Author

KvanTTT commented Jan 3, 2022

> To get an updated, combined list, the easiest would be to use trxgrep and look for anything that ends in an underscore. But, that would be a merged list for all targets.

I realized it's ok to get a list even for all targets. I'll create a big test with all these words and run it on different runtimes. The runtimes that fail will show which targets each word actually applies to :)

Contributor

kaby76 commented Jan 3, 2022

Repeating what I said over in the closed Issue, here is the aggregated bad word list (with underscores) derived from symbols in grammars-v4.

all_underscore.txt

KvanTTT marked this pull request as ready for review January 3, 2022 16:33
Member Author

KvanTTT commented Jan 3, 2022

@parrt I think it makes sense to review and merge it now since there are a lot of fixes already. I'll add the other reserved words provided by @kaby76 and fix potential bugs in follow-up pull requests (all tests pass now).

I decided to use originalName for a name from the grammar and escapedName for a name in the runtime.

Member

parrt commented Jan 3, 2022

> I decided to use originalName for a name from the grammar and escapedName for a name in the runtime.

I always worry about making the smallest possible change, but I guess this makes sense to highlight the difference now.

Just so I'm clear, the list of reserved words will always be specific to a target right? I don't think it makes sense to have a general list that is always escaped.

I will try to review this branch today

Member

parrt commented Jan 3, 2022

Looking really good! Thanks for your efforts here. I just have a number of tiny comments and then the one issue about naming. I think we want to leave <r.name> in the code generation templates as the original name and keep it consistent, so the field name in support code should also be the original name. We then have name and escapedName or whatever in the code generation templates, and a field if we need it that way.

parrt added this to the 4.10 milestone Jan 3, 2022
Member Author

KvanTTT commented Jan 3, 2022

> Just so I'm clear, the list of reserved words will always be specific to a target right? I don't think it makes sense to have a general list that is always escaped.

Generally yes, but I'm trying to figure out if common words exist (for instance, it looks like the word rule is common for all targets, but I haven't found out why it is so yet).

Member

parrt commented Jan 3, 2022

> Generally yes, but I'm trying to figure out if common words exist (for instance, it looks like the word rule is common for all targets, but I haven't found out why it is so yet).

Yeah, that is really strange. There must be a quirk of code generation that I/we were trying to overcome. I say we try it without those weird reserved words and see what happens.

Member Author

KvanTTT commented Jan 3, 2022

> I think we want to leave <r.name> in the code generation templates as the original name and keep it consistent, so the field name in support code should also be the original name. We then have name and escapedName or whatever in the code generation templates, and a field if we need it that way.

Do you mean escapedName should always be the escaped name, independent of the class's package (tool or runtime)? originalName should be eliminated since name is always the original name. That's good for consistency and clarity, but it may cause more template changes, which, as I understand, you'd like to avoid (but I can try, maybe it's not as massive as I think). Also, templates use classes both from the tool package (Rule) and from the codegen package (RuleFunction).

Member

parrt commented Jan 3, 2022

> Do you mean escapedName should always be the escaped name, independent of the class's package (tool or runtime)?

I would imagine that the simplest is that name is always the original name and escapedName is either the original name, or if necessary for that code gen target, the escaped name. In other words, the code generation templates would always refer to name when they know for a fact there's no collision, such as in a string "<r.name>". But, if there is a chance that the name could cause a problem in the generated code, we would always refer to it as <r.escapedName> even if the name is not actually escaped. Does that make sense?

> originalName should be eliminated since name is always the original name.

Yep!
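
The convention described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: RuleSketch and escape are hypothetical names, not the tool's actual classes, the reserved-word set is a tiny illustrative subset, and the trailing-underscore escaping follows the "for -> for_" example in this pull request's commit title.

```java
import java.util.Set;

public class EscapedNameSketch {
    /** Hypothetical helper: append "_" only when the target reserves the name. */
    public static String escape(String name, Set<String> targetReservedWords) {
        return targetReservedWords.contains(name) ? name + "_" : name;
    }

    /** Hypothetical stand-in for a grammar rule; not ANTLR's real Rule class. */
    public static final class RuleSketch {
        public final String name;        // always the name as written in the grammar
        public final String escapedName; // safe to emit as an identifier for this target

        public RuleSketch(String name, Set<String> targetReservedWords) {
            this.name = name;
            this.escapedName = escape(name, targetReservedWords);
        }
    }

    public static void main(String[] args) {
        // Illustrative subset only; the real per-target lists live in the tool.
        Set<String> javaLikeReserved = Set.of("for", "boolean", "class");

        RuleSketch forRule = new RuleSketch("for", javaLikeReserved);
        RuleSketch exprRule = new RuleSketch("expr", javaLikeReserved);

        // A template would use name inside string literals and escapedName for identifiers.
        System.out.println(forRule.name + " -> " + forRule.escapedName);
        System.out.println(exprRule.name + " -> " + exprRule.escapedName);
    }
}
```

Note that escapedName is populated even when no escaping is needed, so a template can refer to it unconditionally wherever a collision is possible.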

Member Author

KvanTTT commented Jan 3, 2022

> But, if there is a chance that the name could cause a problem in the generated code, we would always refer to it as <r.escapedName> even if the name is not actually escaped. Does that make sense?

Yes, it sounds reasonable and I like that. The only thing I was afraid of is massive template changes. Ok, I will try to implement it.

Contributor

kaby76 commented Jan 3, 2022

I'm also trying to envision how the changed code would work.

Let's suppose Foobar.g4 contains the grammar "grammar Foobar; boolean: 'true' | 'false';", with a pom.xml for the Antlr4 Maven Plugin that sets the entry point to "boolean". This grammar is valid for the CSharp target but invalid for the Java target.

  • The grammar works for C#. I can generate a working C# parser, and there would be no compilation errors. The Antlr tool would work and the compilation of the generated files and existing code would work fine.
  • This code will not work for Java. If I try to do this for the Java target (java -jar C:/Users/Kenne/Downloads/antlr-4.9.3-complete.jar -Dlanguage=Java -encoding utf-8 Foobar.g4), I cannot because the Antlr tool will emit error(134): Foobar.g4:2:0: symbol boolean conflicts with generated code in target language or runtime.

Up to now, if I wanted this to be "target agnostic", I'd have to change quite a bit. In the .g4 file, I would change boolean to boolean_; in the pom.xml, I would change <entryPoint>boolean</entryPoint> to <entryPoint>boolean_</entryPoint>. If I had written a listener or visitor, I would have to change that too, because the renamed symbol would result in new generated code declarations.

With this PR change, if I didn't make any changes to the grammar, would I have to change anything in the driver, visitors or listeners I wrote in C#? Is the entry point in C# still parser.boolean();?

For Java, it would all be new code because the Antlr tool wouldn't have even generated any Java code prior to the PR. What would the entry point be? parser.boolean() works fine for C# but not for Java, right? Are you planning on a string-named entry point, e.g., parser.EntryPoint("boolean") instead of parser.boolean()?

Member

parrt commented Jan 5, 2022

Fantastic work @KvanTTT. Beautiful. I just have that one nit about the naming of the superclass chunk... what do you think?

Member Author

KvanTTT commented Jan 5, 2022

Also, the Dart tests are failing. But it looks like a problem on CI:

Error: Could not resolve the package 'antlr4' in 'package:antlr4/antlr4.dart'.

It works fine locally.

Member Author

KvanTTT commented Jan 5, 2022

> As I noted here and here, I am working out a reserved word list per target.

Thank you for your work, but now I think it's enough to have a global bad words list, because I've already included them in a big test that I'm going to add in the next pull request. So, there is no need to split them by runtime.

Ok, reserved word lists per target are also useful, because with them I can update the reserved words in the ANTLR tool and probably remove words that are not actually reserved.

parrt merged commit 09e917e into antlr:master Jan 5, 2022
KvanTTT deleted the runtime-bad-words-escaping branch January 5, 2022 20:35
Contributor

kaby76 commented Jan 5, 2022

Here are "bad words" for C#, Java, Dart, JavaScript. Working out C++, Go, Python3, PHP.

final-list.javascript.txt
final-list.csharp.txt
final-list.dart.txt
final-list.java.txt

Again, this uses a custom Antlr 4.9.3 tool with a minimal bad word list. Each symbol (lexer or parser) is placed in a tiny grammar, trgen generates a parser, the parser is built, and a parse is run, in order to test from start to end what is a bad word.
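
The first step of that per-symbol check, producing a tiny grammar around one candidate word, might look like the following Java sketch. It is an assumption-laden illustration, not the script that was actually used: TinyGrammarSketch and tinyGrammarFor are hypothetical names, and the grammar shape is modeled on the Foobar examples in this thread. The only ANTLR rule relied on is that names starting with an uppercase letter are lexer rules and names starting with a lowercase letter are parser rules.

```java
public class TinyGrammarSketch {
    /**
     * Hypothetical helper: wraps one candidate symbol in a minimal grammar.
     * Uppercase-initial symbols become a lexer rule referenced by parser rule x;
     * lowercase-initial symbols become the parser rule itself.
     */
    public static String tinyGrammarFor(String symbol) {
        if (symbol.isEmpty()) {
            throw new IllegalArgumentException("symbol must not be empty");
        }
        if (Character.isUpperCase(symbol.charAt(0))) {
            return "grammar Foobar; x : " + symbol + "; " + symbol + ": 'xxx';";
        }
        return "grammar Foobar; " + symbol + " : 'xxx';";
    }

    public static void main(String[] args) {
        // Each grammar would then be fed to the tool, built, and run for one target.
        System.out.println(tinyGrammarFor("Accept"));
        System.out.println(tinyGrammarFor("boolean"));
    }
}
```

Generating one grammar per word keeps failures isolated, so a build error can be attributed to a single symbol.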

Contributor

kaby76 commented Jan 8, 2022

Attached are the last two "bad word" lists, for Cpp and Go.

final-list.cpp.txt
final-list.go.txt

I did a check to see if this PR fixes Issue #1070, as it claims to. In fact, there are many "bad words" that are not listed in the tables at

protected static final HashSet<String> reservedWords = new HashSet<>(Arrays.asList(

and that are not mangled in the generated files.

Checking the simple grammar "grammar Foobar; x : Accept; Accept: 'xxx';" against the latest source for the Antlr tool, this works fine for CSharp, but not for Go: "go build" fails. There is no name mangling for Accept.

This change does not fix "symbol conflicts": I will still have to change a "target agnostic" grammar for Go that declares a lexer rule for "Accept".

Member

parrt commented Jan 9, 2022

Ok, @KvanTTT: care to update the word lists?

Member Author

KvanTTT commented Jan 9, 2022

Sure, I'll update the word lists, but a bit later. I've encountered yet another symbol-conflict problem and I'm thinking about how to resolve these in a better way.

Also, I think there is another way to extract bad words. They can be extracted analytically from the runtime code, not only empirically from incorrectly generated code. For instance, terminal is a bad word in the Dart runtime because it conflicts with the default listener method enterTerminal:

  /// The default implementation does nothing.
  @override
  void enterEveryRule(ParserRuleContext ctx) {}

  /// The default implementation does nothing.
  @override
  void exitEveryRule(ParserRuleContext ctx) {}

  /// The default implementation does nothing.
  @override
  void visitTerminal(TerminalNode node) {}

  /// The default implementation does nothing.
  @override
  void visitErrorNode(ErrorNode node) {}

But other problem words, such as everyRule and errorNode, can be extracted the same way. This approach works for other runtimes as well.
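
A minimal Java sketch of this analytic extraction follows. It assumes that generated listener and visitor methods are formed by prefixing the capitalized rule name with enter, exit, or visit, so stripping the prefix from a base-class method name yields a rule name that would collide. ListenerNameSketch and its methods are hypothetical names, and the input is just the four method names quoted above.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ListenerNameSketch {
    private static final String[] PREFIXES = {"enter", "exit", "visit"};

    /** Returns the rule name whose generated method would be methodName, or null. */
    public static String ruleNameFor(String methodName) {
        for (String prefix : PREFIXES) {
            if (methodName.startsWith(prefix) && methodName.length() > prefix.length()) {
                String rest = methodName.substring(prefix.length());
                // Only treat it as a generated-style name if a capital follows the prefix.
                if (Character.isUpperCase(rest.charAt(0))) {
                    return Character.toLowerCase(rest.charAt(0)) + rest.substring(1);
                }
            }
        }
        return null;
    }

    /** Collects the distinct candidate rule names, sorted alphabetically. */
    public static Set<String> candidates(List<String> methodNames) {
        Set<String> result = new TreeSet<>();
        for (String methodName : methodNames) {
            String rule = ruleNameFor(methodName);
            if (rule != null) {
                result.add(rule);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> baseMethods = List.of(
            "enterEveryRule", "exitEveryRule", "visitTerminal", "visitErrorNode");
        System.out.println(candidates(baseMethods)); // prints [errorNode, everyRule, terminal]
    }
}
```

The result is only a candidate list; each word would still need an end-to-end check per target before being treated as reserved.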

Contributor

kaby76 commented Jan 9, 2022

To compute the "bad word" list analytically, one would have to look at the .stg files (since the name that is generated is some string in the .stg file (e.g. Enter<lname; format="cap">), along with computed strings by the Tool), the Antlr runtimes (all public methods, fields, etc.), and any other libraries that one commonly uses. For the Cpp target, we include stddef.h, which defines NULL. Any preprocessor symbol could turn the generated code on its head, especially in the first table in the generated source code for the parser that enumerates the lexer grammar symbols. One can't anticipate all the symbol conflicts at tool invocation time because the generated code will exist in an environment that one cannot predict.

Member Author

KvanTTT commented Jan 9, 2022

Probably we should combine both approaches to extract the most general bad words list.

Contributor

kaby76 commented Jan 9, 2022

Sure. An update to the script is here.
try-symbol.txt

But this illustrates why I've been an advocate of parameterizing templates, including the attributes and values themselves. This is exactly why I wrote trgen to allow the CI tester in grammars-v4 to specify the templates rather than just rely on hardwired templates, but it doesn't go nearly far enough. Of course, it goes against "strict model-view separation".

Member Author

KvanTTT commented Jan 9, 2022

@kaby76 please note that the following words are reserved in ANTLR itself, so it's not necessary to add them to the bad words list:

  • catch
  • finally
  • fragment
  • grammar
  • import
  • options
  • throws

Also, ANTLR reports RESERVED_RULE_NAME or similar errors for the following words, so they shouldn't be added either:

  • EOF
  • MORE
  • SKIP
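
As a small illustration of the filtering described above, here is a Java sketch that drops both groups of words from a candidate list. The two sets are copied from the lists in this comment; BadWordFilterSketch, dropAntlrReserved, and the sample candidates are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class BadWordFilterSketch {
    // Words that are already keywords of the ANTLR grammar language.
    public static final Set<String> ANTLR_KEYWORDS = Set.of(
        "catch", "finally", "fragment", "grammar", "import", "options", "throws");

    // Names for which ANTLR already reports RESERVED_RULE_NAME or a similar error.
    public static final Set<String> ANTLR_REJECTED_NAMES = Set.of("EOF", "MORE", "SKIP");

    /** Keeps only candidates that the ANTLR tool does not already reject, in input order. */
    public static List<String> dropAntlrReserved(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String word : candidates) {
            if (!ANTLR_KEYWORDS.contains(word) && !ANTLR_REJECTED_NAMES.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample input is illustrative only.
        List<String> candidates = List.of("catch", "class", "EOF", "match", "import");
        System.out.println(dropAntlrReserved(candidates)); // prints [class, match]
    }
}
```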

Member Author

KvanTTT commented Feb 5, 2022

> Here are "bad words" for C#, Java, Dart, JavaScript. Working out C++, Go, Python3, PHP.

Hi! Do you have updates for Python and PHP?

Contributor

kaby76 commented Feb 7, 2022

> Hi! Do you have updates for Python and PHP?

Sorry Ivan, I hadn't prioritized working on this. I can work out the PHP keywords the next couple of days. Then, I'll check what was done for Python. (I've been working out a machine learning scraper for grammars from documents in Trash.)

Member Author

KvanTTT commented Feb 7, 2022

Ok, no worries, no rush. I'm just checking the status.

parrt mentioned this pull request Feb 20, 2022
Contributor

kaby76 commented Feb 26, 2022

In determining the reserved words in the PHP target, I tested the tool and target against the current list of reserved words in PHPTarget.java with a special-built tool that removes all reserved word checks for PHP. In other words, I emptied out the list, then built and ran the tool against the latest "dev" branch code to observe any problems. Note that I left the double-underscore symbols alone, even though one cannot use them in a grammar, i.e., you cannot define a TOKEN_REF or RULE_REF with a leading underscore character.

It turns out almost all the so-called "reserved words" do not cause any problem whatsoever. Attached is a grammar that I used to check the list and the generated PHP code.
Test.txt
TestParser.txt
TestListener.txt
TestVisitor.txt

I am using PHP 7.4.3 (cli) (built: Nov 25 2021 23:16:22) ( NTS ). I have not tested this against PHP 8. There are no stated minimum requirements in the online documentation, but Composer for antlr/antlr4-php-runtime states it is for "PHP 7 and 8.0 runtime for ANTLR 4". Really, properly released software should state a minimum and maximum supported version and be tested against both in the CI build.

I will continue the experimental check using a version of grammars-v4 modified to remove the trailing underscores in symbol names. (I can do this quickly using Trash trrename.)

Member

parrt commented Feb 26, 2022

Ok, please let us know if we can simplify the list of naughty words. Wow, PHP is very permissive, I see.

Contributor

kaby76 commented Feb 26, 2022

Yes, the PHP naughty word list can be shortened. I'm testing the "_" word list in grammars-v4, but I have a feeling it will be a small list. I agree... PHP is very robust here.

Member

parrt commented Feb 26, 2022

Ok, please let me know here #3460 if you would like to make a PR to reduce things; maybe coordinate with @KvanTTT

Contributor

kaby76 commented Mar 2, 2022

I performed the additional tests for symbol conflicts for the PHP target. Here are the results and discussion.

Reserved words for PHP

CLASS
addContextToParseTree
addParseListener
catch
consume
createErrorNode
createTerminalNode
dumpDFA
enterOuterAlt
enterRecursionRule
enterRule
exitRule
finally
getATNWithBypassAlts
getBuildParseTree
getContext
getCurrentRuleName
getCurrentToken
getDFAStrings
getErrorHandler
getExpectedTokens
getExpectedTokensWithinCurrentRule
getInputStream
getInvokingContext
getNumberOfSyntaxErrors
getParseListeners
getPrecedence
getRuleIndex
getRuleInvocationStack
getSourceName
getTokenFactory
getTokenStream
inContext
isExpectedToken
isMatchedEOF
isTrace
match
matchWildcard
notifyErrorListeners
parserRule
precpred
pushNewRecursionContext
removeParseListener
removeParseListeners
reset
setBuildParseTree
setContext
setErrorHandler
setInputStream
setTokenFactory
setTokenStream
setTrace
triggerEnterRuleEvent
triggerExitRuleEvent
unrollRecursionContexts

Discussion

  1. PHP doesn't seem to tell me in one swoop all duplicated definitions. This was a problem in testing, forcing me to edit ONE non-terminal (aka "parser rule") at a time, build, and test. PHP seems to output one error, then quit. So, if there are multiple problems with the grammar, finding and correcting all problems is time consuming.
  2. There is only one terminal (aka, "lexer rule") that results in a symbol conflict: CLASS. There may be others.
  3. Most, but not all, methods in the parser class result in a symbol conflict. E.g., if the rule "match:'a';" is defined, then it will result in a symbol conflict. What this means is that this list is fragile: what is a "bad word" will depend on the runtime. If someone modifies the parser class with a new method, that method name will need to be tested and possibly entered in the "bad word" list. I haven't researched why "context" is a valid parser rule name when there is a method in the parser class by that name as well. There are other names like this.
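
Since point 3 ties the list to the public surface of the parser class, one way to regenerate candidates after a runtime change is to enumerate that surface mechanically. The Java sketch below does this with reflection on a hypothetical stand-in class (ParserStandIn); it does not inspect the PHP runtime, and because most but not all method names actually conflict, its output is only a list of names to test.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Set;
import java.util.TreeSet;

public class PublicMethodNamesSketch {
    /** Hypothetical stand-in for a runtime parser base class. */
    public static class ParserStandIn {
        public void match() {}
        public void consume() {}
        public void reset() {}
        private void helper() {} // private, so it cannot collide with generated code
    }

    /** Collects public method names declared by type and its superclasses (excluding Object). */
    public static Set<String> publicMethodNames(Class<?> type) {
        Set<String> names = new TreeSet<>();
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            for (Method method : c.getDeclaredMethods()) {
                if (Modifier.isPublic(method.getModifiers()) && !method.isSynthetic()) {
                    names.add(method.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(publicMethodNames(ParserStandIn.class)); // prints [consume, match, reset]
    }
}
```

Running such a scan in CI against the real runtime class would flag newly added methods, which could then be checked with the one-rule-at-a-time test described in point 1.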

Attached is a grammar file with the additional conflicts (i.e., the commented out rules).

Test.txt

I would prefer not to make the changes to the code. There are other PRs I have to complete, and I have a lot of work to do on Trash. Thanks.

parrt added a commit that referenced this pull request Apr 10, 2022
* Get rid of reflection in CodeGenerator

* Rename TargetType -> Language

* Remove TargetType enum, use String instead as it was before

Create CodeGenerator only one time during grammar processing, refactor code

* Add default branch to appendEscapedCodePoint for unofficial targets (Kotlin)

* Remove getVersion() overrides from Targets since they return the same value

* Remove getLanguage() overrides from Targets since common implementation returns correct value

* [again] don't use "quiet" option for mvn tests...hard to figure out what's wrong when failed.

* normalize targets to 80 char strings for ATN serialization, except Java which needs big strings for efficiency.

* Update actions.md

fixed a small typo

* Rename `CodeGenerator.createCodeGenerator` to `CodeGenerator.create`

* Replace constants on string literals in `appendEscapedCodePoint`

* Restore API of Target

getLanguage(): protected -> public as it was before

appendUnicodeEscapedCodePoint(int codePoint, StringBuilder sb, boolean escape): protected -> private (it's a new helper method, no need for API now)

Added comment for appendUnicodeEscapedCodePoint

* Introduce caseInsensitive lexer rule option, fixes #3436

* don't ahead of time compile for DART. See 8ca8804#commitcomment-62642779

* Simplify test rig related to timeouts (#3445)

* remove all -q quiet mvn options to see output on CI servers.

* run the various unit test classes in parallel rather than each individual test method, all except for Swift at the moment: `-Dparallel=classes -DthreadCount=4`

* use bigger machine at circleci

* No more test groups like parser1, parser2.

* simplify Swift like the other tests

* fix whitespace issues

* use 4.10 not 4.9.4

* improve releasing antlr doc

* Add Support For Swift Package Manager (#3132)

* Add Swift Package Manager Support

* Swift Package Dynamic

* 【fix】【test】Fix run process path

Co-authored-by: Terence Parr <parrt@cs.usfca.edu>

* use src 11 for tool, but 8 for plugin/runtime (#3450)

* use src 11 for tool, but 8 for plugin/runtime/runtime-tests.
* use 11 in CI builds

* cpp/cmake: Fix library install directories (#3447)

This installs DLLs in bin directory instead of lib.

* Python local import fixes (#3232)

* Fixed pygrun relative import issue

* Added name to contributors.txt

Co-authored-by: Terence Parr <parrt@cs.usfca.edu>

* Update javadoc to 8 and 11 (#3454)

* no need for plugin in runtime, always gen svg from dot for javadoc, gen 1.8 not 1.7 doc for runtime. Gen 11 for tool.

* tweak doc for 1.8 runtime.  Test rig should gen 1.8 not 1.7

* [Go] Fix (*BitSet).equals (#3455)

* set tool version for testing

* oops reversion tool version as it's not sync'd with runtime and not time to release yet.

* Remove unused variable from generated code (#3459)

* [C++] Fix bugs in UnbufferedCharStream (#3420)

* Escape bad words during grammar generation (#3451)

* Escape reserved words during grammar generation, fixes #1070 (for -> for_ but RULE_for)

Deprecate USE_OF_BAD_WORD

* Make name and escapedName consistent across tool and codegen classes

Fix other pull request notes

* Rename NamedActionChunk to SymbolRefChunk

* try out windows runners

* rename workflow

* Update windows.yml

Fix cmd line issue

* fix maven issue on windows

* use jdk 11

* remove arch arg

* display Github status for windows

* try testing python3 on windows

* try new run for python3 windows

* try new run for python3 windows (again)

* try new run for python3 windows (again2)

* try new run for python3 windows (again3)

* try new run for python3 windows (again4)

* try new run for python3 windows (again5)

* try new run for python3 windows

* try new run for python3 windows

* try new run for python3 windows

* ugh i give up. python won't install on github actions.

* Update windows.yml

try python 3

* Update windows.yml

* Update run-tests-python3.cmd

* Update run-tests-python3.cmd

* Create run-tests-python2.cmd

* Update windows.yml

* Update run-tests-python2.cmd

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Create run-tests-javascript.cmd

* Update run-tests-javascript.cmd

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Create run-tests-csharp.cmd

* Update windows.yml

* fix warnings in C# CI

* Update windows.yml

* Update windows.yml

* Create run-tests-dart.cmd

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update run-tests-dart.cmd

* Update run-tests-dart.cmd

* Update run-tests-dart.cmd

* Update run-tests-dart.cmd

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Create run-tests-go.cmd

* Update windows.yml

* Update windows.yml

* Update windows.yml

* GitHub action php (#3474)

* Update windows.yml

* Create run-tests-php.cmd

* Update run-tests-php.cmd

* Update run-tests-php.cmd

* Update run-tests-php.cmd

* Update run-tests-php.cmd

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update run-tests-php.cmd

* Update windows.yml

* Cleanup ci (#3476)

* Delete .appveyor directory

* Delete .travis directory

* Improve CI concurrency (#3477)

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Optimize toArray

replace toArray(new T[size]) with toArray(new T[0]) for better performance

https://shipilev.net/blog/2016/arrays-wisdom-ancients/#_conclusion

* add contributor

* resolve conflicts

* fix-maven-concurrency (#3479)

* fix-maven-concurrency

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update run-tests-python2.cmd

* Update run-tests-python3.cmd

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update windows.yml

* Update run-tests-php.cmd

* Update windows.yml

* Update run-tests-dart.cmd

* Update run-tests-csharp.cmd

* Update run-tests-go.cmd

* Update run-tests-java.cmd

* Update run-tests-javascript.cmd

* Update run-tests-php.cmd

* Update run-tests-python2.cmd

* Update run-tests-python3.cmd

* increase Windows CI concurrency for all targets except Dart

* Preserve line separators for input runtime tests data (#3483)

* Preserve line separators for input data in runtime tests, fix test data

Refactor and improve performance of BaseRuntimeTest

* Add LineSeparator (\n, \r\n) tests

* Set up .gitattributes for LineSeparator_LF.txt (eol=lf) and LineSeparator_CRLF.txt (eol=crlf)

* Restore `\n` for all input in runtime tests, add extra LexerExec tests (LineSeparatorLf, LineSeparatorCrLf)

* Add generated LargeLexer test, remove LargeLexer.txt descriptor

* tweak name to be GeneratedLexerDescriptors

* [JavaScript] Migrate from jest to jasmine

* [C++] Fix Windows min/max macro collision

* [C++] Update cmake README.md to C++17

* remove unnecessary comparisons.

* Add useful function writeSerializedATNIntegerHistogram for writing out information concerning how many of each integer value appear in a serialized ATN.

* fix  comment indicating what goes in the serialized ATN.

* move writeSerializedATNIntegerHistogram out of runtime.

* follow guidelines

* Fix .interp file parsing test for the Java runtime.

Also includes separating the generation of the .interp file from writing it out so that we can use both independently.

* Delete files no longer needed. Should have been part of #3520

* [C++] Optimizations and cleanups and const correctness, oh my

* [C++] Optimize LL1Analyzer

* [C++] Fix missing virtual destructors

* Remove not used PROTECTED, PUBLIC, PRIVATE tokens from ANTLRLexer.g

* Remove ANTLR 3 stuff from ANTLR grammars, deprecate ANTLR 3 errors

* Remove not used imaginary tokens from ANTLRParser.g

* Fix misprints in grammars

* ATN serialized data: remove shifting by 2, remove UUID; fix #3515

Regenerate XPathLexer files

* Disable native runtime tests (see #3521)

* Implement Java-specific ATN data optimization (+-2 shift)

* [C++] Remove now unused antlrcpp::Guid

* pull new branch diagram from master

* use dev not master branch for CI github

* update doc from master

* add back missing author

* [C++] Fix const correctness in ATN and DFA

* keep getSerializedATNSegmentLimit at max int

* Fixes #3259 make InErrorRecoveryMode public for go

* Change code gen template to capitalize InErrorRecoveryMode

* [C++] Improve multithreaded performance, fix TSAN error, and fix profiling ATN simulator setup bug

* Get rid of unnecessary allocations and calculations in SerializedATN

* Get rid of excess char escaping in generated files, decrease size of output files

Fix creation of excess fragments for Dart, Cpp, PHP runtimes

* Swift: fix binary serialization and use instead of JSON

* Fix targetCharValueEscape, make them final and static

* [C++] Cleanup ATNDeserializer and remove related deprecated methods from ATNSimulator

* Fix for #3557 (getting "go test" to work again).

* Convert Python2/3 to use int arrays not strings for ATN encodings (#3561)

* Convert Python2/3 to use int arrays not strings for ATN encodings. Also make target indicate int vs string.

* rename and reverse ATNSerializedAsInts

* add override

* remove unneeded method

* [C++] Drastically improve multi-threaded performance (#3550)

Thanks guys. A major advancement.

* [C++] Remove duplicate includes and remove unused includes (#3563)

* [C++] Lazily deserialize ATN in generated code (#3562)

* [Docs] Update Swift Docs (#3458)

* Add Swift Package Manager Support

* Swift Package Dynamic

* 【fix】【test】Fix run process path

* [Docs] [Swift] update link, remove expired descriptions

Co-authored-by: Terence Parr <parrt@cs.usfca.edu>

* Ascii only ATN serialization (#3566)

* go back to generating pure ascii ATN serializations to avoid issues where target compilers might assume ascii vs utf-8.

* forgot I had to change php on previous ATN serialization tweak.

* change how we escapeChar() per target.

* oops; gotta use escapeChar method

* rm unneeded case

* add @Override

* use ints not chars for C# (#3567)

* use ints not chars for C#

* oops. remove 'quotes'

* regen from XPathLexer.g4

* simplify ATN with bypass alts mechanism in Java.

* Change string to int[] for serialized ATN for C#; removed unneeded `use System` from XPathLexer.g4; regen that grammar.

* [C++] Use camel case name in generated lexers and parsers (#3565)

* Change string to int array for serialized ATN for JavaScript (#3568)

* perf: Add default implementation for Visit in ParseTreeVisitor.  (#3569)

* perf: Add default implementation for Visit in ParseTreeVisitor.

Reference: https://github.com/antlr/antlr4/blob/ad29539cd2e94b2599e0281515f6cbb420d29f38/runtime/Java/src/org/antlr/v4/runtime/tree/AbstractParseTreeVisitor.java#L18

* doc: add contributor

* Don't use utf decoding...these are just ints (#3573)

* [Go] Cleanup and fix ATN deserialization verification (#3574)

* [C++] Force generated static data type name to titlecase (#3572)

* Use int array not string for ATN in Swift (#3575)

* [C++] Fix generated Lexer static data constructor (#3576)

* Use int array not string for ATN in Dart (#3578)

* Fix PHP codegen to support int ATN serialization (#3579)

* Update listener documentation to satisfy the discussion about improving exception handling: #3162

* tweak

* [C++] Remove unused LexerATNSimulator::match_calls (#3570)

* [C++] Remove unused LexerATNSimulator::match_calls

* Remove match_calls from other targets

* [Java] Preserve serialized ATN version 3 compatibility (#3583)

* add jcking to the contributors list

* Update releasing-antlr.md

* [C++] Avoid using dynamic_cast where possible by using hand rolled RTTI (#3584)

* Revert "[Java] Preserve serialized ATN version 3 compatibility (#3583)"

This reverts commit 01bc811.

* [C++] Add ANTLR4CPP_PUBLIC attributes to various symbols (#3588)

* Update editorconfig for c++ (#3586)

* Make it easier to contribute: Add c++ configuration for .editorconfig.

Using the observed style with 2 indentation spaces.

Signed-off-by: Henner Zeller <hzeller@google.com>

* Add hzeller to contributors.txt

Signed-off-by: Henner Zeller <hzeller@google.com>

* Fix code style and typing to support PHP 8 (#3582)

* [Go] Port locking algorithm from C++ to Go (#3571)

* Use linux DCO not our old contributors certificate of origin

* [C++] Fix bugs in SemanticContext (#3595)

* [Go] Do not export Array2DHashSet which is an implementation detail (#3597)

* Revert "Use linux DCO not our old contributors certificate of origin"

This reverts commit b0f8551.

* Use signed ints for ATN serialization not uint16, except for java (#3591)

* refactor serialize so we don't need comments

* more cleanup during refactor

* store language in serializer obj

* A lexer rule token type should never be -1 (EOF). 0 is fragment but then must be > 0.

* Go uses int not uint16 for ATN now. java/go/python3 pass

* remove checks for 0xFFFF in Go.

* C++ uint16_t to int for ATN.

* add mac php dir; fix type on accept() for generated code to be mixed.

* Add test from @KvanTTT. This PR fixes #3555 for non-Java targets.

* cleanup and add big lexer from #3546

* increase mvn mem size to 2G

* increase mvn mem size to 8G

* turn off the big ATN lexer test as we have memory issues during testing.

* Fixes #3592

* Revert "C++ uint16_t to int for ATN."

This reverts commit 4d2ebbf.

# Conflicts:
#	runtime/Cpp/runtime/src/atn/ATNSerializer.cpp
#	runtime/Cpp/runtime/src/tree/xpath/XPathLexer.cpp

* C++ uint16_t to int32_t for ATN.

* rm unnecessary include file, updating project file. get rid of the 0xFFFF does in the C++ deserialization

* rm refs to 0xFFFF in swift

* javascript tests were running as Node...added to ignore list.

* don't distinguish between 16 and 32 bit char sets in serialization; Python2/3 updated to work with this change.

* update C++ to deserialize only 32-bit sets

* 0xFFFF -> -1 for C++ target.

* get other targets to use 32-bit sets in serialization. tests pass locally.

* refactor to reduce code size

* add comment

* oops. comment out call to writeSerializedATNIntegerHistogram(). I wonder if this is why it ran out of memory during testing?

* all but Java, Node, PHP, Go work now for the huge lexer file; I have set them to ignore. Note that the swift target takes over a minute to lex it. I've turned off Node because it does not seem to terminate, though it could terminate eventually.

* all but Java, Node, PHP, Go work now for the huge lexer file; I have set them to ignore. Note that the swift target takes over a minute to lex it. I've turned off Node because it does not seem to terminate, though it could terminate eventually.

* Turn off this big lexer because we get memory errors during continuous integration

* Intermediate commit where I have shuffled around all of the -1 flipping and bumping by two. Work still needs to be done because the token stream rewriter stuff fails, and I assume the other decoding for human-readability testing doesn't work

* convert decode to use int[]; remove dead code. don't use serializeAsChar stuff. more tests pass.

* more tests passing. simplify. When copying atn, must run ATN through serializer to set some state flags.

* 0xFFFD+ are not valid char

* clean up. tests passing now

* huge clean up. Got Java working with 32-bit ATNs! Still working on cleanup but I want to run the tests

* Cleanup the hack I did earlier; everything still seems to work

* Use linux DCO not our old contributors certificate of origin

* remove bump-by-2 code

* clean up per @KvanTTT. Can't test locally on this box. Will see what CI says.

* tweak comment

* Revert "Use linux DCO not our old contributors certificate of origin"

This reverts commit b0f8551.

* see if C++ works in CI for huge ATN

* Use linux DCO not our old contributors certificate of origin (#3598)

* Use linux DCO not our old contributors certificate of origin

* Revert "Use linux DCO not our old contributors certificate of origin"

This reverts commit b0f8551.

* use linux DCO

* use linux DCO

* Use linux DCO not our old contributors certificate of origin

* update release documentation

Signed-off-by: Terence Parr <parrt@antlr.org>

* Equivalent of #3537

* clean up setup

* clean up doc version

* [Swift] improvements to equality functions (#3302)

* fix default equality

* equality cases

* optional unwrapping

* [Swift] Use for in loops (#3303)

* common for in loops

* reversed loop

* drop first loop

* for in with default BitSet

* [Go] Fix symbol collision in generated lexers and parsers (#3603)

* [C++] Refactor and optimize SemanticContext (#3594)

* [C++] Devirtualize hand rolled RTTI for performance (#3609)

* [C++] Add T::is for type hierarchy checks and remove some dynamic_cast (#3612)

* [C++] Avoid copying statically generated serialized ATNs (#3613)

* [C++] Refactor PredictionContext and yet more performance improvements (#3608)

* [C++] Cleanup DFA, DFAState, LexerAction, and yet more performance improvements (#3615)

* fix dependabot issues

* [Swift] use stdlib (single pass) (#3602)

* this was added to the stdlib in Swift 5

* &>> is defined as lhs >> (rhs % lhs.bitwidth)

* the stdlib has these

* reduce loops

* use indices

* append(contentsOf:)

* Array literal init works for sets too!

* inline and remove bit query functions

* more optional handling (#3605)

* [C++] Minor improvements to PredictionContext (#3616)

* use php runtime dev branch to test dev

* update doc to be more explicit about the interaction between lexer actions and semantic predicates; Fixes #3611. Fixes #3606.

Signed-off-by: Terence Parr <parrt@antlr.org>

* Refactor js runtime in preparation of future improvements

* refactor, 1 file per class, use import, use module semantics, use webpack 5, use eslint

* all tests pass

* simplifications and alignment with standard js idioms

* simplifications and alignment with standard js idioms

* support reading legacy ATN

* support both module and non-module imports

* fix failing tests

* fix failing tests

* No longer necessary to generate sets or single atom transitions that are bigger than 16 bits. (#3620)

* Updated getting started with Cpp documentation. (#3628)

Included specific examples of using ANTLR4_TAG and ANTLR4_ZIP_REPOSITORY in the sample CMakeLists file.

* [C++] Free ATNConfig lookup set in readonly ATNConfigSet (#3630)

* [C++] Implement configurable PredictionContextMergeCache (#3627)

* Allow to choose to switch off building tests in C++ (#3624)

The new cmake option ANTLR_BUILD_CPP_TESTS is on by default
(so the behavior is as before), but it provides a way to
switch the tests off if they are not needed.

The C++ tests pull in an external dependency (googletests),
which might conflict if ANTLR is used as a subproject in
another cmake project.

Signed-off-by: Henner Zeller <h.zeller@acm.org>

* Fix NPE for undefined label, fix #2788

* An interval ought to be a value

Interval was a pointer to 2 Ints
it ought to be just 2 Ints, which is smaller and more semantically correct,
with no need for a cache.

However, this technically breaks metadata and AnyObject conformance but people shouldn't be relying on those for an Interval.

* [C++] Remove more dynamic_cast usage

* [C++] Introduce version macros

* add license prefix

* Prep 4.10 (#3599)

* Tweak doc

* Swift was referring to hardcoded version

* Start version update script.

* add files to update

* clean up setup

* clean up setup

* clean up setup

* don't need file

* don't need file

* Fixes #3600.  add instructions and associated code necessary to build the xpath lexers.

* clean up version nums

* php8

* php8

* php8

* php8

* php8

* php8

* php8

* php8

* tweak doc

* ok, i give up. php won't bump up to v8

* tweak doc

* version number bumped to 4.10 in runtime.

* Change the doc for releasing and update to use latest ST 4.3.2

* fix dart version to 4.10.0

* cmd files cannot use the export bash command.

* try fixing php ci again

* working on deploy

Signed-off-by: Terence Parr <parrt@antlr.org>

* php8 always install.

* set js to 4.10.0 not 4.10

* turn off apt update for php circleci

* try w/o cimg/php

* try setting branch

* ok i give up

* tweak

* update docs for release.

* php8 circleci

* use 3.5.3 antlr

* use 3.5.3-SNAPSHOT antlr

* use full 3.5.3 antlr

* [Swift] reduce Optionals in APIs (#3621)

* ParserRuleContext.children

see comment in removeLastChild

* TokenStream.getText

* Parser._parseListeners

this might require changes to the code templates?

* ATN {various}

* make computeReachSet return empty, not nil

* overrides refine optionality

* BufferedTokenStream getHiddenTokensTo{Left, Right} return empty not nil

* Update Swift.stg

* avoid breakage by adding overload of `getText` in extension

* tweak to kick off build

Signed-off-by: Terence Parr <parrt@antlr.org>

* try parallelism: 4 circleci

* Revert "[Swift] reduce Optionals in APIs (#3621)"

This reverts commit b5ccba0.

* tweaks to doc

* Improve the deploy script and tweak the released doc.

* use 4.10 not Snapshot for scripts

Co-authored-by: Ivan Kochurkin <kvanttt@gmail.com>
Co-authored-by: Alexandr <60813335+Alex-Andrv@users.noreply.github.com>
Co-authored-by: 100mango <100mango@users.noreply.github.com>
Co-authored-by: Biswapriyo Nath <nathbappai@gmail.com>
Co-authored-by: Benjamin Spiegel <bspiegel11@gmail.com>
Co-authored-by: Justin King <jcking@google.com>
Co-authored-by: Eric Vergnaud <eric.vergnaud@wanadoo.fr>
Co-authored-by: Harry Chan <harry.chan@codersatlas.com>
Co-authored-by: Ken Domino <kenneth.domino@domemtech.com>
Co-authored-by: chenquan <chenquan.dev@gmail.com>
Co-authored-by: Marcos Passos <marcospassos@users.noreply.github.com>
Co-authored-by: Henner Zeller <h.zeller@acm.org>
Co-authored-by: Dante Broggi <34220985+Dante-Broggi@users.noreply.github.com>
Co-authored-by: chris-miner <94078897+chris-miner@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Have ANTLR4 prevent conflict with user rule names by behind-the-scenes renaming of its own variables