Performance tips #1540
GitHub is not a place for performance and other questions.
Try to reduce the number of rules. For example, you can use a token instead of the rule operationType : 'query' ; But I'm not sure that you'll get a significant speedup with such small optimizations. At first glance this grammar looks good.
Within Java you can use the following code:

String code = readFile(args[0]);
ANTLRInputStream codeStream = new ANTLRInputStream(code);
SeparatedLexer lexer = new SeparatedLexer(codeStream);
// Start lexer benchmark
List<? extends Token> tokens = lexer.getAllTokens();
// End lexer benchmark
ListTokenSource tokensSource = new ListTokenSource(tokens);
CommonTokenStream tokensStream = new CommonTokenStream(tokensSource);
SeparatedParser parser = new SeparatedParser(tokensStream);
// Start parser benchmark
ParserRuleContext ast = parser.rule1();
// End parser benchmark
String stringTree = ast.toStringTree(parser);
System.out.print("Tree " + stringTree); |
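(For a Go target, a roughly equivalent split benchmark can be written with the standard testing package. This is only a sketch under assumptions: SeparatedLexer, SeparatedParser, the entry rule Rule1, the sample query, and the runtime import path are all stand-ins for whatever your generated code actually uses.)

package parsing

import (
	"testing"

	"github.com/antlr4-go/antlr/v4" // import path varies by ANTLR version
)

const code = `{ me { name } }` // hypothetical sample query

// Lexing only.
func BenchmarkLexerOnly(b *testing.B) {
	for i := 0; i < b.N; i++ {
		lexer := NewSeparatedLexer(antlr.NewInputStream(code))
		lexer.GetAllTokens()
	}
}

// Lexing + parsing; subtract BenchmarkLexerOnly to isolate the parser.
func BenchmarkLexerAndParser(b *testing.B) {
	for i := 0; i < b.N; i++ {
		lexer := NewSeparatedLexer(antlr.NewInputStream(code))
		parser := NewSeparatedParser(antlr.NewCommonTokenStream(lexer, 0))
		parser.Rule1()
	}
}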
Apologies for asking a performance question here; I was not aware of that policy. And thanks for the example of benchmarking the lexer and parser separately. |
I found that lexing is taking most of the time. Benchmark with only the lexer for ANTLR:

With both lexer and parser:

I think it is a well-known issue that lexing takes most of the time. |
@KvanTTT I personally think that it is totally appropriate to ask this type of performance question here. This does not look like the type of question that could be answered by reading an FAQ. Ashish has reproducible code sitting in a public repository. |
@KvanTTT I agree with @millergarym 100% here. Not only are performance issues bugs, Google Groups is a useless piece of crap in which it's impossible to not lose formatting of the information @ashishnegi so meticulously put into tables. So either accept GitHub issues like these, or use something that's not utterly broken as a forum (Discourse works) |
@ashishnegi Can you try the following to rule out some more things?
|
@sharwell thanks for looking into this.
Here is the result of running |
I "benchmarked" one query Only lexing on
and with lexing and parsing :
If someone can proof read the cpp benchmark, it would be a double check. |
It could be as simple as a hash function for ATNConfigSet that is not optimal for our usage. @sharwell had to put in a murmur hash to get Java speed. |
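(For context: ANTLR's MurmurHash is an incremental MurmurHash3 variant. A hedged sketch of the update/finish pattern, using MurmurHash3 constants but not the exact ANTLR code:)

package main

import "fmt"

// murmurUpdate mixes one 32-bit word into the running hash.
func murmurUpdate(hash, value uint32) uint32 {
	const c1, c2 = 0xCC9E2D51, 0x1B873593
	k := value * c1
	k = (k << 15) | (k >> 17) // rotate left 15
	k *= c2
	hash ^= k
	hash = (hash << 13) | (hash >> 19) // rotate left 13
	return hash*5 + 0xE6546B64
}

// murmurFinish applies the avalanche finalizer.
func murmurFinish(hash uint32, numWords int) uint32 {
	hash ^= uint32(numWords * 4)
	hash ^= hash >> 16
	hash *= 0x85EBCA6B
	hash ^= hash >> 13
	hash *= 0xC2B2AE35
	hash ^= hash >> 16
	return hash
}

func main() {
	h := uint32(0) // seed
	for _, w := range []uint32{42, 7, 1} { // e.g. state, alt, context ids
		h = murmurUpdate(h, w)
	}
	fmt.Printf("%08x\n", murmurFinish(h, 3))
}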
@parrt That would explain a case of slow behavior on the first pass. If the lexer remains slow on the exact same input after the first pass, it suggests that edges are getting dropped from the DFA somehow. Of course this always happens in cases where a semantic predicate is crossed, but there are no semantic predicates in the example grammar. |
Hmm...yeah, I'd have to compare the go/java runtime code. |
@sharwell (cc: @parrt, @pboyer),
Sam, I really need to pay more attention to your comments. Your comment above only made sense to me this morning. That was after spending more time than I should have improving the performance by 50% with murmur hash and then hitting a wall.
Before going into what I found, here is the result.

old = current antlr4 Go runtime (non murmur hash)
new = minor change to generated lexer (still no murmur)

benchmark         old ns/op     new ns/op     delta
BenchmarkLexer-4  99298         3188          -96.79%

benchmark         old allocs    new allocs    delta
BenchmarkLexer-4  457           22            -95.19%

benchmark         old bytes     new bytes     delta
BenchmarkLexer-4  16132         1200          -92.56%

The Go runtime is a naive port of the Java runtime. It is a good start, but in many places it leaves a bit to be desired. The hashing code was an example of this: the port basically used String() string (the equivalent of Java's String toString()) and then hashed the string. The result was a large number of memory allocs.

The "real" issue in the lexer is that static fields in the Java are ported to non-static fields in the Go, and initialization is done per new lexer. This is somewhat understandable, as Go doesn't have an explicit static keyword. To achieve static semantics in Go, the fields need to be at the package level and initialized in a func init() { .. } (the equivalent of a Java static initializer).

This will be a minor change in the template. I'll tidy up my murmur hash code, make this change, and hopefully turn it into a PR next week. It would be nice to look for all similar issues in the template and runtime. I'll probably open a new issue for this, consolidating this and #1705.

Cheers
Gary |
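(A minimal sketch of the fix Gary describes, with hypothetical names rather than the actual generated code: move immutable tables from per-instance fields to package-level variables initialized once in func init().)

package main

import "fmt"

// Naive port: every constructor call rebuilds the same immutable table,
// costing allocations for each new lexer instance.
type slowLexer struct {
	symbolicNames []string
}

func newSlowLexer() *slowLexer {
	return &slowLexer{
		symbolicNames: []string{"", "QUERY", "NAME", "LBRACE", "RBRACE"},
	}
}

// Static semantics: the table lives at package level and is built exactly
// once, like a Java static initializer.
var symbolicNames []string

func init() {
	symbolicNames = []string{"", "QUERY", "NAME", "LBRACE", "RBRACE"}
}

type fastLexer struct{} // shares the package-level table

func main() {
	_ = newSlowLexer()            // allocates the table yet again
	_ = fastLexer{}               // reuses the shared table
	fmt.Println(symbolicNames[1]) // QUERY
}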
Uh, wow!
|
@millergarym Do you have this in a branch? How can I help? I'll have some free time this weekend and it would be good to take a look. |
@pboyer Just saw your message. I don't have time right now, so I actually haven't tested this. It should work, as it is the changes I made to the generated code. |
@ashishnegi can you try latest HEAD? Just incorporated Go speed improvements. |
@parrt I benchmarked again on
Numbers are now within a factor of 5. This is definitely a very good improvement. 👍 Do I need to try any other branch? |
It seems that the Go runtime is still 1~2x slower than Java. I did some tests with the C grammar and a macro-expanded source file from the Lua project (~200KB). Both the Go version and the Java version just walk the C file. Noticeably, the Go version spent over 40% of its time on memory management. This is probably related to the binary-trees benchmark: the Go runtime eagerly scans all objects and pointers to free some garbage, but barely achieves anything in this case. Turning off GC is no good, either. Maybe some kind of memory pool? [1, 2] Among the 10 most expensive functions, 7 are related to GC:
In the cumulative list, GC work accounts for over 40% of execution time:
|
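(To illustrate the memory-pool idea floated above — purely a sketch with made-up types, not ANTLR runtime code — frequently allocated, short-lived objects can be recycled through sync.Pool so the collector has less to scan:)

package main

import (
	"fmt"
	"sync"
)

// atnConfig is a stand-in for a hot, frequently allocated type.
type atnConfig struct {
	state, alt int
}

var configPool = sync.Pool{
	New: func() interface{} { return new(atnConfig) },
}

func main() {
	c := configPool.Get().(*atnConfig) // reuse a freed object or allocate
	c.state, c.alt = 42, 1
	fmt.Println(c.state, c.alt)
	*c = atnConfig{}  // reset before returning to the pool
	configPool.Put(c) // make it available for reuse
}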
@wdscxsj Can you please provide some more details?
If possible, can you please include the profile listing for the hot spots?
I'm not sure if |
@millergarym Thanks for your reply! Please see this repo: https://github.com/wdscxsj/antlrperfcomp. |
@wdscxsj A couple of things to note.
What use case are you testing for? |
@millergarym Thanks again for the quick response. A little background may be helpful. I'm using ANTLR in a private project that does some on-and-off code analysis and conversion. The original Java implementation is to be ported to Go. We (my colleagues and I) found that the Go version was consistently slower than the original Java one. For example, for a middle-sized input, the lexing + parsing time (but dominantly lexing, as observed by ashishnegi) was 10.0s in Java and 27.9s in Go (or 125.2s for Go with ANTLR 4.6). Profiling indicates that the Go version spends too much time managing the memory. The memory consumption is significantly lower, but we just want it to run as fast. And by "1~2x slower", I meant "2 to 3 times as slow" as Java. Have you tried to compare both runtimes' performance? Don't you think Go should run at roughly the same (if not a bit higher) speed as Java? |
@wdscxsj If Go is 2-3x slower on a single run, then my guess would be that it would be even slower on a benchmark, because of Java's JIT. I agree: in general, if Go is slower than Java, it is likely an implementation issue (see http://benchmarksgame.alioth.debian.org/u64q/go.html). Can you please add the antlr jars (complete and source) you used to the repo? When I do another round of performance work in the future, this will be a valuable baseline. Cheers |
@millergarym The repo has been updated. Thanks. |
@millergarym I suppose this issue is (at least partially) caused by Go's GC design: golang/go#23044. There is discussion about adding a By the way, @sharwell's optimized Java version is lightning fast. Thank you very much! |
* Fixes #1154 * Generally helps catch trailing syntax errors * Performance-neutral relative to previous grammar * Recommended by antlr4 devs, can help performance in some cases * See antlr/antlr4#1540
I ran the original benchmark against the current version of Antlr just to see how far it has improved since then:
Antlr is now down to being only ~4 times slower than the handwritten parser. |
Cool. @jcking is working on the Go target as we speak. He just squeezed a big improvement out of the C++ target for 4.10.1. |
@ashishnegi Please recheck your performance using the dev branch of the Go runtime - you should see big improvements. I just submitted PRs that correct most of the performance issues, and I will be doing more to improve it over time. |
The problem was a number of bugs in the runtime, which are now fixed. The runtime no longer generates millions of allocations for a start :) |
If you folks can confirm, I'll close :) |
Confirmed. With https://github.com/wdscxsj/antlrperfcomp, the Go dev runtime now runs faster than the Java runtime. On an old laptop the time cost is 1.3s against 2.1s. I need to try some other data later, but this is a huge performance boost indeed! Hats off to @jimidle for the wonderful work. Update: It seems the Java runtime may scale better on a multicore system. The previous result was from a '16 ultrabook running Windows 10. On a 32-core Linux server, the Go version (with profiling turned off) is still slower than the Java version, and |
That's good to hear - we are getting somewhere with the go runtime :). I will look at your example grammar in case there is any low hanging fruit that I can find. |
I believe that there is a problem somewhere that is causing a lot of allocations of SingletonPredictionContext in closure checking... I need a little time to investigate that. GOMAXPROCS should not affect things per se, but it can indicate other problems in the code, of course. Let's not close this just yet. |
By fixing #2016, I may have found why all the SingletonPredictionContexts are being created, but I will need a little time to work through it and improve it. |
Should I leave this one open @jimidle or start a new one? |
I think we should leave this open, because I might have found the reason we are not quite there (though it is already a vast improvement, of course).
|
That is one of the interesting threads to follow. Any updates? |
Yes. I have solved all the algorithm issues. Well- to medium-well-formed grammars are now fast in Go. Poorly constructed grammars such as the MySQL grammar are still slower than Java, but I know why and am working on solving the memory pressure on the GC. I probably need this week to solve that issue. It's not a difficult solution, but the algorithm wasn't implemented in a way that allows easy allocation tracking, so I have had to write some analytic tools.
With that solved, I can finish some structural reconfiguration that will make quite a bit more impact on well-formed grammars. After that, it will be CPU work to finish it off.
Then finally, documentation and any remaining bugs. I'm leaving this open until I can review all the different parts of this thread and make sure all is good. The dev branch is very much better than before, but I intend that the Go runtime will be the most performant runtime.
|
@wdscxsj I have come back to this to test the idea that the Go runtime is somehow not able to scale when there are multiple cores. This seems to be a red herring: it is basically just a function of context switching etc. If I lock your driver to a single thread and then run it under hyperfine, we can see that the user and system time are affected by the context switching. So I am going to ask @parrt to close this now. However, there will still be performance improvements related to interface use and memory use (especially escapes to heap, I hope) down the line. The allocation of ATNConfigs and the CPU spent in adaptivePredict and execATN take all the time, and the GC has to track that. However, the GC runs on separate threads anyway, so it might take CPU time from the system and will cause some stopping to do the collection; but if execution times are small, then it doesn't really matter. Also, the benchmark here does not measure the difference between pre-cache-warmup and post-cache-warmup runs. So, let's close this, as I don't think there are now any performance issues in this thread that I have not fixed. For people's interest, here are the results using hyperfine, which is much more sophisticated than say
You can see here that though the wall clock appears to be longer without
If we play with garbage collection:
So, garbage collection is not really a burden, even at these fairly high parse times. The C grammar as written by Sam is fairly well constructed, but I am sure that there are lots of improvements that could be made. I will try the parse in SLL mode. Using a parameter range with hyperfine, we get this:
Which tells us:
Summary
'GOMAXPROCS=6 ./antlrperfcomp test/input.c' ran
1.00 ± 0.02 times faster than 'GOMAXPROCS=5 ./antlrperfcomp test/input.c'
1.02 ± 0.02 times faster than 'GOMAXPROCS=9 ./antlrperfcomp test/input.c'
1.02 ± 0.03 times faster than 'GOMAXPROCS=7 ./antlrperfcomp test/input.c'
1.03 ± 0.02 times faster than 'GOMAXPROCS=8 ./antlrperfcomp test/input.c'
1.03 ± 0.04 times faster than 'GOMAXPROCS=4 ./antlrperfcomp test/input.c'
1.04 ± 0.02 times faster than 'GOMAXPROCS=13 ./antlrperfcomp test/input.c'
1.04 ± 0.02 times faster than 'GOMAXPROCS=11 ./antlrperfcomp test/input.c'
1.04 ± 0.03 times faster than 'GOMAXPROCS=10 ./antlrperfcomp test/input.c'
1.04 ± 0.02 times faster than 'GOMAXPROCS=20 ./antlrperfcomp test/input.c'
1.04 ± 0.04 times faster than 'GOMAXPROCS=12 ./antlrperfcomp test/input.c'
1.04 ± 0.02 times faster than 'GOMAXPROCS=22 ./antlrperfcomp test/input.c'
1.04 ± 0.03 times faster than 'GOMAXPROCS=23 ./antlrperfcomp test/input.c'
1.04 ± 0.03 times faster than 'GOMAXPROCS=16 ./antlrperfcomp test/input.c'
1.04 ± 0.03 times faster than 'GOMAXPROCS=21 ./antlrperfcomp test/input.c'
1.04 ± 0.03 times faster than 'GOMAXPROCS=19 ./antlrperfcomp test/input.c'
1.05 ± 0.03 times faster than 'GOMAXPROCS=24 ./antlrperfcomp test/input.c'
1.05 ± 0.02 times faster than 'GOMAXPROCS=18 ./antlrperfcomp test/input.c'
1.05 ± 0.02 times faster than 'GOMAXPROCS=17 ./antlrperfcomp test/input.c'
1.05 ± 0.02 times faster than 'GOMAXPROCS=15 ./antlrperfcomp test/input.c'
1.05 ± 0.03 times faster than 'GOMAXPROCS=14 ./antlrperfcomp test/input.c'
1.05 ± 0.04 times faster than 'GOMAXPROCS=25 ./antlrperfcomp test/input.c'
1.05 ± 0.03 times faster than 'GOMAXPROCS=3 ./antlrperfcomp test/input.c'
1.11 ± 0.02 times faster than 'GOMAXPROCS=2 ./antlrperfcomp test/input.c'
1.57 ± 0.35 times faster than 'GOMAXPROCS=1 ./antlrperfcomp test/input.c'

My system would use GOMAXPROCS=12 by default, I think. So, please close this issue @parrt. |
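(For readers reproducing these runs: the same knobs can be set from inside a Go driver instead of through environment variables. A small sketch using only the standard library:)

package main

import (
	"runtime"
	"runtime/debug"
)

func main() {
	runtime.GOMAXPROCS(1)  // equivalent to the GOMAXPROCS=1 variable above
	debug.SetGCPercent(-1) // equivalent to GOGC=off: disables the collector
	// ... run the lexer/parser workload under test here ...
}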
Just as an added bonus, if you switch this driver into SLL mode, then it is quite a bit quicker:
source, err := antlr.NewFileStream("test/input.c")
if err != nil {
fmt.Println("Could not open file", err)
return
}
tokens := antlr.NewCommonTokenStream(parsing.NewCLexer(source), 0)
p := parsing.NewCParser(tokens)
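// Switch prediction from the default full-LL to the faster SLL mode.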
p.GetInterpreter().SetPredictionMode(antlr.PredictionModeSLL)
tree := p.CompilationUnit()
antlr.NewParseTreeWalker().Walk(new(parsing.BaseCListener), tree) |
We are planning to replace our hand-written parser with ANTLR4 in production, for parsing our graph database language spec. The ANTLR grammar looks elegant and precise.
Our language spec is a variant of GraphQL. The user queries that we will have to parse look like:
We started benchmarking from the simplest subset grammar.
Benchmarks:
We expected these numbers to be under 0.05 ms. They are currently around 1.5 ms.
Here are comparisons between the handwritten parser and the ANTLR Go parser over practical queries:
Benchmarks:
ANTLR4 is around 40x slower.
I have also tried SLL parsing. It also did not help.
Q: I did not get any SLL parsing failure over multiple inputs. In what kind of query and grammar can SLL fail, so that we have to move to LL? I am asking this to confirm whether I am doing something wrong.
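(The standard answer is the two-stage strategy sketched below: parse in fast SLL mode first, and retry with full LL only when SLL reports a syntax error. SLL can fail on inputs whose disambiguation needs full parser context; LL then gives the definitive answer. The names SeparatedLexer, SeparatedParser, and Rule1 are assumptions standing in for the generated code, and the runtime import path varies by version.)

package parsing

import (
	"errors"

	"github.com/antlr4-go/antlr/v4"
)

// bailListener records whether any syntax error was reported.
type bailListener struct {
	*antlr.DefaultErrorListener
	failed bool
}

func (l *bailListener) SyntaxError(recognizer antlr.Recognizer, offendingSymbol interface{},
	line, column int, msg string, e antlr.RecognitionException) {
	l.failed = true
}

// parseOnce runs one parse in either SLL or full-LL prediction mode.
func parseOnce(code string, sll bool) (antlr.ParseTree, bool) {
	lexer := NewSeparatedLexer(antlr.NewInputStream(code))
	parser := NewSeparatedParser(antlr.NewCommonTokenStream(lexer, 0))
	listener := &bailListener{DefaultErrorListener: antlr.NewDefaultErrorListener()}
	parser.RemoveErrorListeners()
	parser.AddErrorListener(listener)
	mode := antlr.PredictionModeLL
	if sll {
		mode = antlr.PredictionModeSLL
	}
	parser.GetInterpreter().SetPredictionMode(mode)
	tree := parser.Rule1()
	return tree, !listener.failed
}

// ParseTwoStage tries SLL first and falls back to LL on failure.
func ParseTwoStage(code string) (antlr.ParseTree, error) {
	if tree, ok := parseOnce(code, true); ok {
		return tree, nil // fast path: SLL was sufficient
	}
	if tree, ok := parseOnce(code, false); ok {
		return tree, nil // slow path: needed full-context prediction
	}
	return nil, errors.New("input is invalid under full LL as well")
}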
Can I get some performance tips?
Also, how can we benchmark the lexer and parser separately?