
Performance tips #1540

Open

ashishnegi opened this issue Dec 21, 2016 · 40 comments

@ashishnegi

We are planning to replace our handwritten parser with ANTLR4 in production for parsing our graph database language spec. The ANTLR grammar looks elegant and precise.
Our language spec is a variant of GraphQL. The user queries we will have to parse look like:

// List of directors with whom Tom Hanks has worked
{
  debug(_xid_: m.0bxtg) {
    type.object.name.en
    film.actor.film {
      film.performance.film {
        film.film.directed_by {
          type.object.name.en
        }
      }
    }
  }
}

We started benchmarking with the simplest subset grammar.
Benchmarks:

dgraph/antlr4go/graphqlpm$ gotb -test.run=XXX -benchtime=2s -v
BenchmarkQuery/q1-4                 5000       1221967 ns/op
BenchmarkQuery/q2-4                 2000       1710850 ns/op
BenchmarkQuery/q3-4                 3000       1230518 ns/op
BenchmarkQuery/q4-4                 3000       1568564 ns/op
BenchmarkQuery/q5-4                 2000       1803392 ns/op
PASS
ok      github.com/dgraph-io/dgraph/antlr4go/graphqlpm    22.179s

We expected these numbers to be under 0.05 ms. They are currently around 1.5 ms.

Here is a comparison of the handwritten parser and the ANTLR Go parser on practical queries:

query$ gotb -test.run=XXX -benchtime=2s -v
BenchmarkQueryParse/spielberg:handwitten:-4               100000         46918 ns/op
BenchmarkQueryParse/spielberg:antlr:-4                      2000       1741364 ns/op
BenchmarkQueryParse/tomhanks:handwitten:-4                100000         25982 ns/op
BenchmarkQueryParse/tomhanks:antlr:-4                       2000       1654579 ns/op
BenchmarkQueryParse/nestedquery:handwritten:-4             30000         73053 ns/op
BenchmarkQueryParse/nestedquery:antlr:-4                    1000       3385005 ns/op
PASS
ok      github.com/dgraph-io/dgraph/query    21.922s

ANTLR4 is around 40x slower.

I have also tried SLL parsing; it did not help either.

Question: I did not get any SLL parsing failures across multiple inputs. For what kinds of queries and grammars can SLL fail, forcing a fallback to full LL? I am asking to confirm whether I am doing something wrong.
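(For context: the usual way to deploy SLL is as a two-stage parse. Try SLL with a bail-out error strategy, and rerun with full LL only if SLL reports a syntax error, since SLL can reject a few sentences that full LL accepts. Below is a sketch in Go; the GraphQLPMLexer/GraphQLPMParser names and the Document entry rule are assumptions about the generated code, not code from this issue.)

// Two-stage SLL/LL parse: a sketch. Assumes the ANTLR Go runtime ("antlr")
// and the generated "parser" package.
func parseTwoStage(q string) (tree antlr.Tree) {
	input := antlr.NewInputStream(q)
	lexer := parser.NewGraphQLPMLexer(input)
	tokens := antlr.NewCommonTokenStream(lexer, 0)
	p := parser.NewGraphQLPMParser(tokens)

	// Stage 1: fast SLL prediction; bail (panic) on the first syntax error.
	p.GetInterpreter().SetPredictionMode(antlr.PredictionModeSLL)
	p.SetErrorHandler(antlr.NewBailErrorStrategy())
	defer func() {
		if r := recover(); r != nil {
			// Stage 2: SLL failed; rewind and reparse with full LL.
			tokens.Seek(0)
			p.SetInputStream(tokens)
			p.GetInterpreter().SetPredictionMode(antlr.PredictionModeLL)
			p.SetErrorHandler(antlr.NewDefaultErrorStrategy())
			tree = p.Document()
		}
	}()
	return p.Document()
}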

Can I get some performance tips?
Also, how can we benchmark the lexer and parser separately?

@KvanTTT
Member

KvanTTT commented Dec 21, 2016

GitHub is not the place for performance and other usage questions.

Can I get some performance tips?

Try to reduce the number of rules. For example, you can use the token 'query' directly in the operationDefinition rule instead of the separate operationType rule:

operationType
   : 'query'
   ;

But I'm not sure you'll get a significant speedup from such small optimizations. At first glance the grammar looks good.

Also, how can we benchmark the lexer and parser separately?

In Java, you can use the following code:

import java.util.List;

import org.antlr.v4.runtime.*;

// SeparatedLexer/SeparatedParser stand for your generated lexer and parser,
// and rule1 for your entry rule.
String code = readFile(args[0]);
ANTLRInputStream codeStream = new ANTLRInputStream(code);
SeparatedLexer lexer = new SeparatedLexer(codeStream);

// Start lexer benchmark
List<? extends Token> tokens = lexer.getAllTokens();
// End lexer benchmark

ListTokenSource tokensSource = new ListTokenSource(tokens);
CommonTokenStream tokensStream = new CommonTokenStream(tokensSource);
SeparatedParser parser = new SeparatedParser(tokensStream);

// Start parser benchmark
ParserRuleContext ast = parser.rule1();
// End parser benchmark

String stringTree = ast.toStringTree(parser);
System.out.print("Tree " + stringTree);

@ashishnegi
Author

Apologies for asking a performance question here; I was not aware of that policy.
Is the ANTLR Google Group the only place to ask these kinds of questions?

And thanks for the example of benchmarking the lexer and parser separately.

@ashishnegi
Author

ashishnegi commented Dec 21, 2016

With lexing only:

// Benchmark that exercises only the generated lexer; imports of the ANTLR Go
// runtime ("antlr") and the generated "parser" package are elided.
func runAntlrParser(q string, b *testing.B) {
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        input := antlr.NewInputStream(q)
        lexer := parser.NewGraphQLPMLexer(input)
        // Drain all tokens: a lexer-only benchmark.
        t := lexer.NextToken()
        for t.GetTokenType() != antlr.TokenEOF {
            t = lexer.NextToken()
        }
    }
}

I found that lexing takes most of the time:

Benchmark with only the lexer for ANTLR:

/query$ gotb -test.run=XXX -v  -benchtime=5s
BenchmarkQueryParse/spielberg:handwitten:-4         	  200000	     45724 ns/op
BenchmarkQueryParse/spielberg:antlr:-4              	    5000	   1468218 ns/op
BenchmarkQueryParse/tomhanks:handwitten:-4          	  500000	     28649 ns/op
BenchmarkQueryParse/tomhanks:antlr:-4               	    5000	   1538988 ns/op
BenchmarkQueryParse/nestedquery:handwritten:-4      	  100000	     80210 ns/op
BenchmarkQueryParse/nestedquery:antlr:-4            	    5000	   3029668 ns/op
PASS
ok  	github.com/dgraph-io/dgraph/query	63.546s

With both lexer and parser:

ashishnegi@ashish:~/work/golang/src/github.com/dgraph-io/dgraph/query$ gotb -test.run=XXX -v  -benchtime=5s
BenchmarkQueryParse/spielberg:handwitten:-4         	  300000	     47772 ns/op
BenchmarkQueryParse/spielberg:antlr:-4              	    3000	   1868297 ns/op
BenchmarkQueryParse/tomhanks:handwitten:-4          	  500000	     27980 ns/op
BenchmarkQueryParse/tomhanks:antlr:-4               	    5000	   1616518 ns/op
BenchmarkQueryParse/nestedquery:handwritten:-4      	  100000	     74961 ns/op
BenchmarkQueryParse/nestedquery:antlr:-4            	    2000	   3312977 ns/op
PASS
ok  	github.com/dgraph-io/dgraph/query	58.056s

I think it is a well-known issue that lexing takes most of the time.
Where can I read more about bringing the lexing time down?

@millergarym
Contributor

@KvanTTT I personally think it is entirely appropriate to ask this type of performance question here. This does not look like the type of question that could be answered by reading a FAQ. Ashish has reproducible code sitting in a public repository.
Performance is a bug like any other.

@ghost

ghost commented Dec 21, 2016

@KvanTTT I agree with @millergarym 100% here. Not only are performance issues bugs, but Google Groups is a useless piece of crap in which the formatting @ashishnegi so meticulously put into his tables is inevitably lost.

So either accept GitHub issues like these, or use something that's not utterly broken as a forum (Discourse works).

@sharwell
Member

@ashishnegi Can you try the following to rule out some more things?

  1. Add an explicit EOF to the end of your entry rule (you should always have this):

    document
       : definition EOF
       ;
    
  2. Add the following rule to the very end of your grammar to make sure your lexer handles all input successfully:

    ErrorChar
       : .
       ;
    

@ashishnegi
Author

@sharwell thanks for looking into this.
Here are the benchmarks after applying the changes.

Lexing-only benchmarks for the ANTLR4 Go target, after adding EOF and ErrorChar:
/query$ gotb -test.run=XXX -v  -benchtime=5s
BenchmarkQueryParse/spielberg:handwitten:-4         	  200000	     55461 ns/op
BenchmarkQueryParse/spielberg:antlr:-4              	    3000	   1887333 ns/op
BenchmarkQueryParse/tomhanks:handwitten:-4          	  300000	     32664 ns/op
BenchmarkQueryParse/tomhanks:antlr:-4               	    5000	   2058863 ns/op
BenchmarkQueryParse/nestedquery:handwritten:-4      	  100000	     74588 ns/op
BenchmarkQueryParse/nestedquery:antlr:-4            	    3000	   3573973 ns/op
PASS
ok  	github.com/dgraph-io/dgraph/query	57.434s

Here is the result of running grun GraphQLPM document -trace -diagnostics -tokens -SLL -tree on the Java target, if that helps.

@ashishnegi
Author

ashishnegi commented Dec 22, 2016

I benchmarked one query, nestedquery (which takes most of the time), with the C++ target, and the numbers are quite different:

Lexing only, on the nestedquery query, over 1 million iterations: the average time is ~42 microseconds.
The same query takes on average ~3.3 milliseconds to lex with the Go target.

/graphqlpmcpp$ time ./graphqlexe 

real	0m42.380s
user	0m42.368s
sys	0m0.000s

and with lexing and parsing : ~107 microseconds on average over 1 million iterations.

/graphqlpmcpp$ time ./graphqlexe 

real	1m46.194s
user	1m46.172s
sys	0m0.004s

If someone could proofread the C++ benchmark, that would provide a double check.
Is it possible that the Go target does not yet support caching for lexers and parsers? Actually, it does have caching.

@parrt
Member

parrt commented Dec 22, 2016

It could be as simple as a hash function for ATNConfigSet that is not optimal for our usage. @sharwell had to put in a MurmurHash to get Java up to speed.
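(For reference, a minimal Go sketch of a MurmurHash3-style update/finish scheme of the kind being discussed; illustrative only, not the runtime's actual code:)

package main

import "fmt"

// murmurUpdate mixes one 32-bit word into the running hash (MurmurHash3 body).
func murmurUpdate(hash, word uint32) uint32 {
	k := word * 0xCC9E2D51
	k = (k << 15) | (k >> 17)
	k *= 0x1B873593
	hash ^= k
	hash = (hash << 13) | (hash >> 19)
	return hash*5 + 0xE6546B64
}

// murmurFinish applies the avalanche finalizer over numWords mixed words.
func murmurFinish(hash uint32, numWords int) uint32 {
	hash ^= uint32(numWords * 4)
	hash ^= hash >> 16
	hash *= 0x85EBCA6B
	hash ^= hash >> 13
	hash *= 0xC2B2AE35
	hash ^= hash >> 16
	return hash
}

func main() {
	h := uint32(7) // arbitrary seed
	for _, w := range []uint32{1, 2, 3} { // e.g. state, alt, context ids
		h = murmurUpdate(h, w)
	}
	fmt.Println(murmurFinish(h, 3))
}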

@sharwell
Member

sharwell commented Dec 23, 2016

@parrt That would explain a case of slow behavior on the first pass. If the lexer remains slow on the exact same input after the first pass, it suggests that edges are getting dropped from the DFA somehow. Of course this always happens in cases where a semantic predicate is crossed, but there are no semantic predicates in the example grammar.

@parrt
Member

parrt commented Dec 23, 2016

Hmm...yeah, I'd have to compare the go/java runtime code.

@millergarym
Contributor

@sharwell (cc: @parrt, @pboyer),
Sam, I really need to pay more attention to your comments. Your comment above only made sense to me this morning, after I had spent more time than I should have improving the performance by 50% with MurmurHash and then hitting a wall.

Before going into what I found, here is the result:
old = current ANTLR4 Go runtime (no MurmurHash)
new = minor change to the generated lexer (still no MurmurHash)

benchmark            old ns/op     new ns/op     delta
BenchmarkLexer-4     99298         3188          -96.79%

benchmark            old allocs     new allocs     delta
BenchmarkLexer-4     457            22             -95.19%

benchmark            old bytes     new bytes     delta
BenchmarkLexer-4     16132         1200          -92.56%

The Go runtime is a naive port of the Java runtime. It is a good start, but in many places it leaves a bit to be desired.

The hashing code was an example of this. The port basically used String() string (the equivalent of Java's String toString()) and then hashed the resulting string, producing a large number of memory allocations.
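(A self-contained illustration of that anti-pattern versus hashing the fields directly; the types here are invented for the example:)

package main

import "fmt"

type config struct{ state, alt int }

// Before (the naive port): render the value to a string, then hash the
// string. Every call allocates, which is what showed up in the profiles.
func hashViaString(c config) uint32 {
	return fnv1a([]byte(fmt.Sprintf("%d/%d", c.state, c.alt)))
}

// After: hash the fields directly; no allocation at all.
func hashFields(c config) uint32 {
	h := uint32(2166136261)
	h = (h ^ uint32(c.state)) * 16777619
	h = (h ^ uint32(c.alt)) * 16777619
	return h
}

// fnv1a is a simple byte-slice hash (FNV-1a), standing in for any string hash.
func fnv1a(b []byte) uint32 {
	h := uint32(2166136261)
	for _, c := range b {
		h = (h ^ uint32(c)) * 16777619
	}
	return h
}

func main() {
	c := config{state: 3, alt: 1}
	fmt.Println(hashViaString(c), hashFields(c))
}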

The "real" issue in the lexer is that static fields in the Java code were ported as non-static fields in Go, so initialization is done for every new lexer. This is somewhat understandable, as Go doesn't have an explicit static keyword. To get static semantics in Go, the fields need to live at package level and be initialized in a func init() { ... } (the analogue of a Java static initializer); see the sketch below.
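(A minimal, self-contained sketch of the difference; the names are illustrative, not the actual generated code:)

package main

import "fmt"

// Per-instance initialization (the naive port): every constructor call
// rebuilds the same immutable table, standing in for ATN/DFA construction.
type slowLexer struct{ table []int }

func newSlowLexer() *slowLexer {
	t := make([]int, 1024)
	for i := range t {
		t[i] = i * i
	}
	return &slowLexer{table: t}
}

// Package-level initialization (Java static semantics): built exactly once,
// during package init, and shared by every lexer instance.
var sharedTable = buildTable()

func buildTable() []int {
	t := make([]int, 1024)
	for i := range t {
		t[i] = i * i
	}
	return t
}

type fastLexer struct{ table []int }

func newFastLexer() *fastLexer { return &fastLexer{table: sharedTable} }

func main() {
	fmt.Println(newSlowLexer().table[10], newFastLexer().table[10])
}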

This will be a minor change to the template. I'll tidy up my MurmurHash code, make this change, and hopefully turn it into a PR next week.

It would be nice to look for all similar issues in the template and runtime. I'll probably open a new issue for this, consolidating this one and #1705.

Cheers
Gary


@pboyer
Contributor

pboyer commented Mar 10, 2017

@millergarym Do you have this in a branch? How can I help? I'll have some free time this weekend and it would be good to take a look.

@millergarym
Contributor

@pboyer Just saw your message. I don't have time right now, so I actually haven't tested this. It should work, as it is the same change I made to the generated code:
https://github.com/wxio/antlr4/tree/static2packagevar

@parrt
Member

parrt commented Mar 14, 2017

@ashishnegi can you try the latest HEAD? I just incorporated the Go speed improvements.

@ashishnegi
Author

@parrt I benchmarked again on master of antlr4 and with antlr4-4.7.1-*.jar.

➜  query git:(8ccf5cb) ✗ got -v -bench=QueryParse -run=XXXX    
BenchmarkQueryParse/spielberg:handwitten:-4         	   50000	     51960 ns/op
BenchmarkQueryParse/spielberg:antlr:-4              	   10000	    236041 ns/op
BenchmarkQueryParse/tomhanks:handwitten:-4          	   50000	     32912 ns/op
BenchmarkQueryParse/tomhanks:antlr:-4               	   10000	    148723 ns/op
BenchmarkQueryParse/nestedquery:handwritten:-4      	   20000	     86670 ns/op
BenchmarkQueryParse/nestedquery:antlr:-4            	    5000	    406756 ns/op
PASS
ok  	github.com/dgraph-io/dgraph/query	13.449s

➜  query git:(8ccf5cb) ✗ which antlr4
antlr4: aliased to java -Xmx500M -cp "/usr/local/lib/antlr4-4.7.1-SNAPSHOT-complete.jar:$CLASSPATH" org.antlr.v4.Tool

The numbers are now within a factor of 5. This is definitely a very good improvement. 👍

Do I need to try any other branch?

@wdscxsj

wdscxsj commented Apr 6, 2017

It seems that the Go runtime is still 1-2x slower than Java.

I did some tests with the C grammar and a macro-expanded source file from the Lua project (~200 KB). Both the Go version and the Java version just walk the C file.

Noticeably, the Go version spent over 40% of its time on memory management. This is probably related to the binary-trees benchmark: the Go runtime eagerly scans all objects and pointers to free some garbage, but barely achieves anything in this case. Turning off GC is no good either. Maybe some kind of memory pool? [1, 2]
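(For reference, a minimal sync.Pool sketch of the kind of pooling being suggested; purely illustrative, not ANTLR runtime code:)

package main

import (
	"fmt"
	"sync"
)

// A pool recycles short-lived objects, so the GC has fewer live allocations
// to scan on each cycle.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 4096) },
}

func process(data string) int {
	buf := bufPool.Get().([]byte)[:0] // reuse a pooled buffer
	buf = append(buf, data...)
	n := len(buf)
	bufPool.Put(buf) // hand the buffer back for reuse
	return n
}

func main() {
	fmt.Println(process("hello"))
}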

Among the 10 most expensive functions, 7 are related to GC:

(pprof) top10
4260ms of 6680ms total (63.77%)
Dropped 134 nodes (cum <= 33.40ms)
Showing top 10 nodes out of 137 (cum >= 2790ms)
      flat  flat%   sum%        cum   cum%
    1300ms 19.46% 19.46%     2360ms 35.33%  runtime.scanobject
     960ms 14.37% 33.83%      980ms 14.67%  github.com/antlr/antlr4/runtime/Go/antlr.(*BaseSingletonPredictionContext).hash
     580ms  8.68% 42.51%      580ms  8.68%  runtime.heapBitsForObject
     500ms  7.49% 50.00%      500ms  7.49%  runtime.greyobject
     260ms  3.89% 53.89%      410ms  6.14%  runtime.mallocgc
     150ms  2.25% 56.14%     1560ms 23.35%  github.com/antlr/antlr4/runtime/Go/antlr.(*ParserATNSimulator).closureWork
     150ms  2.25% 58.38%      150ms  2.25%  runtime.casgstatus
     130ms  1.95% 60.33%      380ms  5.69%  runtime.mapassign
     120ms  1.80% 62.13%      120ms  1.80%  runtime.heapBitsSetType
     110ms  1.65% 63.77%     2790ms 41.77%  github.com/antlr/antlr4/runtime/Go/antlr.(*ParserATNSimulator).execATN

In the cumulative list, GC work accounts for over 40% of execution time:

(pprof) top5 -cum
0.13s of 6.68s total ( 1.95%)
Dropped 134 nodes (cum <= 0.03s)
Showing top 5 nodes out of 137 (cum >= 2.69s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%      2.81s 42.07%  github.com/antlr/antlr4/runtime/Go/antlr.(*ParserATNSimulator).AdaptivePredict
     0.11s  1.65%  1.65%      2.79s 41.77%  github.com/antlr/antlr4/runtime/Go/antlr.(*ParserATNSimulator).execATN
     0.02s   0.3%  1.95%      2.79s 41.77%  runtime.systemstack
         0     0%  1.95%      2.76s 41.32%  runtime.startTheWorldWithSema
         0     0%  1.95%      2.69s 40.27%  _/D_/Lab/ANTLR/perftest/go.(*CParser).ExternalDeclaration

@millergarym
Contributor

@wdscxsj can you please provide some more details?

  • Which version of ANTLR?
  • How did you generate the profiles (e.g. benchmark, one-off run, etc.)?
  • Is your test setup in a public repo so it can be repeated / looked into?

If possible, please also include profile listings for the hot spots, e.g.

list AdaptivePredict
list execATN

I'm not sure whether list BaseSingletonPredictionContext.*hash will work; otherwise use list hash and just copy out the BaseSingletonPredictionContext method.

@wdscxsj

wdscxsj commented Apr 6, 2017

@millergarym Thanks for your reply! Please see this repo: https://github.com/wdscxsj/antlrperfcomp.

@millergarym
Contributor

@wdscxsj a couple of things to note:

  • Since the parse is only run once, the assumption is that the use case is some sort of code analysis, not a long-running job.
  • 1-2x? Wouldn't 1x mean Java time = Go time?
  • Running the test only once (i.e. not in a benchmark) is of little or no use for long-running processes: no JIT warm-up in the case of Java, and no DFA caching in either case.

What use case are you testing for?

@wdscxsj

wdscxsj commented Apr 6, 2017

@millergarym Thanks again for the quick response. A little background may be helpful.

I'm using ANTLR in a private project that does some on-and-off code analysis and conversion. The original Java implementation is to be ported to Go.

We (my colleagues and I) found that the Go version was consistently slower than the original Java one. For example, for a middle-sized input, the lexing + parsing time (dominated by lexing, as observed by ashishnegi) was 10.0s in Java and 27.9s in Go (or 125.2s for Go with ANTLR 4.6). Profiling indicates that the Go version spends too much time managing memory. Its memory consumption is significantly lower, but we just want it to run as fast.

And by "1~2x slower", I meant "2 to 3 times as slow" as Java.

Have you tried comparing both runtimes' performance? Don't you think Go should run at roughly the same speed as Java (if not a bit faster)?

@millergarym
Contributor

@wdscxsj if Go is 2-3x slower on a single run, then my guess would be that it would be even slower in a benchmark, because of the JIT. I agree that, in general, if Go is slower than Java it is likely an implementation issue (see http://benchmarksgame.alioth.debian.org/u64q/go.html).

Could you please add the ANTLR jars (complete and source) you used to the repo? When I do another round of performance work in the future, this will be a valuable baseline.

Cheers
Gary

@wdscxsj

wdscxsj commented Apr 7, 2017

@millergarym The repo has been updated. Thanks.

@wdscxsj

wdscxsj commented Apr 12, 2019

@millergarym I suppose this issue is (at least partially) caused by Go's GC design: golang/go#23044. There is discussion there about adding a GOGC_MIN or debug.SetMaxHeap().
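(GOGC_MIN and debug.SetMaxHeap() were proposals in that issue, not shipped APIs. What does exist is the GOGC knob, which can also be set at runtime; a minimal sketch:)

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Raise the GC target percentage to trade memory for fewer collection
	// cycles; equivalent to setting the GOGC environment variable.
	old := debug.SetGCPercent(400)
	fmt.Println("previous GOGC value:", old)
}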

By the way, @sharwell's optimized Java version is lightning fast. Thank you very much!

albert-magyar added a commit to chipsalliance/firrtl that referenced this issue Oct 29, 2019 (later cherry-picked from commit 8f108c1 by the mergify bot on Nov 4 and Nov 22, 2019):
* Fixes #1154
* Tests that #1154 example produces SyntaxErrorsException
* Generally helps catch trailing syntax errors
* Performance-neutral relative to previous grammar
* Recommended by antlr4 devs, can help performance in some cases
* See antlr/antlr4#1540
@aisbergg

I ran the original benchmark against the current version of ANTLR, just to see how far it has improved since then:

BenchmarkQueryParse/spielberg:handwitten:-12         	  180318	      6493 ns/op	    4608 B/op	      74 allocs/op
BenchmarkQueryParse/spielberg:antlr:-12              	   49819	     24021 ns/op	   15768 B/op	     263 allocs/op
BenchmarkQueryParse/tomhanks:handwitten:-12          	  289688	      4097 ns/op	    2608 B/op	      48 allocs/op
BenchmarkQueryParse/tomhanks:antlr:-12               	   81811	     14642 ns/op	   10600 B/op	     186 allocs/op
BenchmarkQueryParse/nestedquery:handwritten:-12      	  112033	     10636 ns/op	    6304 B/op	      98 allocs/op
BenchmarkQueryParse/nestedquery:antlr:-12            	   27948	     42982 ns/op	   28784 B/op	     496 allocs/op

ANTLR is now down to being only ~4x slower than the handwritten parser.

@parrt
Member

parrt commented Apr 25, 2022

Cool. @jcking is working on the Go target as we speak. He just squeezed a big improvement out of the C++ target for 4.10.1.

@jimidle
Collaborator

jimidle commented Aug 23, 2022

@ashishnegi Please recheck your performance using the dev branch of the Go runtime - you should see big improvements. I just submitted PRs that correct most of the performance issues, and I will be doing more to improve it over time.

@jimidle
Collaborator

jimidle commented Aug 23, 2022

@millergarym I suppose this issue is (at least partially) caused by Go's GC design: golang/go#23044. There is discussion about adding a GOGC_MIN or debug.SetMaxHeap().

By the way, @sharwell's optimized Java version is lightning fast. Thank you very much!

The problem was a number of bugs in the runtime, which are now fixed. The runtime no longer generates millions of allocations for a start :)

@parrt parrt added this to the 4.10 milestone Aug 23, 2022
@parrt
Member

parrt commented Aug 23, 2022

If you folks can confirm, I'll close :)

@wdscxsj

wdscxsj commented Aug 23, 2022

Confirmed. With https://github.com/wdscxsj/antlrperfcomp, the Go dev runtime now runs faster than the Java runtime. On an old laptop the time cost is 1.3s against 2.1s. I need to try some other data later, but a huge performance boost indeed! Hats off to @jimidle for the wonderful work.

Update: It seems the Java runtime may scale better on a multicore system. The previous result was from a 2016 ultrabook running Windows 10. On a 32-core Linux server, the Go version (with profiling turned off) is still slower than the Java version, and GOMAXPROCS=1 performs better than the default setting.

@jimidle
Collaborator

jimidle commented Aug 23, 2022

Confirmed. With https://github.com/wdscxsj/antlrperfcomp, the Go dev runtime now runs faster than the Java runtime. On an old laptop the time cost is 1.3s against 2.1s. I need to try some other data later, but a huge performance boost indeed! Hats off to @jimidle for the wonderful work.

That's good to hear - we are getting somewhere with the Go runtime :). I will look at your example grammar in case there is any low-hanging fruit that I can find.

@jimidle
Collaborator

jimidle commented Aug 23, 2022

Confirmed. With https://github.com/wdscxsj/antlrperfcomp, the Go dev runtime now runs faster than the Java runtime. [...]

Update: It seems the Java runtime may scale better on a multicore system. [...] On a 32-core Linux server, the Go version (with profiling turned off) is still slower than the Java version, and GOMAXPROCS=1 performs better than the default setting.

I believe there is a problem somewhere that is causing a lot of allocations of SingletonPredictionContext in closure checking... I need a little time to investigate that. GOMAXPROCS should not affect things per se, but it can of course indicate other problems in the code.

Let's not close this just yet.

@jimidle
Collaborator

jimidle commented Aug 30, 2022

I believe there is a problem somewhere that is causing a lot of allocations of SingletonPredictionContext in closure checking...

Let's not close this just yet.

By fixing #2016, I may have found why all the SingletonPredictionContexts are being created, but I will need a little time to work through it and improve it.

@parrt
Member

parrt commented Aug 30, 2022

Should I leave this one open @jimidle or start a new one?


@iSuslov

iSuslov commented Apr 3, 2023

This is one of the more interesting threads to follow. Any updates?


@jimidle
Collaborator

jimidle commented May 4, 2023

@wdscxsj I have come back to this to test the idea that the Go runtime is somehow not able to scale when there are multiple cores. This seems to be a red herring: it is basically just a function of context switching, etc. If I lock your driver to a single thread and then run it using hyperfine, we can see that the user and system time are affected by the context switching. So I am going to ask @parrt to close this now.

However, there will still be performance improvements down the line related to interface use and memory use (especially escapes to heap, I hope). The allocation of ATNConfigs and the CPU time in AdaptivePredict and execATN take all the time, and the GC has to track those allocations. The GC runs on separate threads anyway, so it might take CPU time from the system and will cause some stopping to do the collection, but if execution times are small then it doesn't really matter. Also, the benchmark here does not measure the difference between pre-cache-warmup and post-cache-warmup performance.

So, let's close this, as I don't think there are any performance issues left in this thread that I have not fixed.

For people's interest, here are the results using hyperfine (which is much more sophisticated than, say, time) with various settings:

Command                                      Mean [s]         Min [s]   Max [s]   Relative
./antlrperfcomp test/input.c                 1.165 ± 0.021    1.143     1.236     1.00

Command                                      Mean [s]         Min [s]   Max [s]   Relative
GOMAXPROCS=1 ./antlrperfcomp test/input.c    1.557 ± 0.038    1.513     1.669     1.00

You can see here that though the wall clock appears to be longer without GOMAXPROCS set, in fact leaving Go to work it out actually results in better throughput. And if we play with GOMAXPROCS:

Command                                      Mean [s]         Min [s]   Max [s]   Relative
GOMAXPROCS=5 ./antlrperfcomp test/input.c    1.123 ± 0.011    1.109     1.155     1.00

Command                                      Mean [s]         Min [s]   Max [s]   Relative
GOMAXPROCS=8 ./antlrperfcomp test/input.c    1.153 ± 0.016    1.132     1.189     1.00

If we play with garbage collection:

Command                                      Mean [s]         Min [s]   Max [s]   Relative
GOGC=off ./antlrperfcomp test/input.c        1.323 ± 0.011    1.314     1.357     1.00

So garbage collection is not really a burden, even at these fairly high parse times. The C grammar is fairly well written as it is (it is Sam's), but I am sure there are lots of improvements that could be made. I will try the parse in SLL mode.

Scanning a range of GOMAXPROCS settings with hyperfine (presumably via its --parameter-scan option), we get this:

Command                                      Mean [s]         Min [s]   Max [s]   Relative
GOMAXPROCS=1 ./antlrperfcomp test/input.c    1.762 ± 0.395    1.579     3.188     1.57 ± 0.35
GOMAXPROCS=2 ./antlrperfcomp test/input.c    1.244 ± 0.019    1.220     1.284     1.11 ± 0.02
GOMAXPROCS=3 ./antlrperfcomp test/input.c    1.184 ± 0.031    1.164     1.293     1.05 ± 0.03
GOMAXPROCS=4 ./antlrperfcomp test/input.c    1.155 ± 0.041    1.119     1.254     1.03 ± 0.04
GOMAXPROCS=5 ./antlrperfcomp test/input.c    1.126 ± 0.011    1.111     1.150     1.00 ± 0.02
GOMAXPROCS=6 ./antlrperfcomp test/input.c    1.123 ± 0.019    1.106     1.176     1.00
GOMAXPROCS=7 ./antlrperfcomp test/input.c    1.149 ± 0.026    1.129     1.227     1.02 ± 0.03
GOMAXPROCS=8 ./antlrperfcomp test/input.c    1.152 ± 0.017    1.132     1.187     1.03 ± 0.02
GOMAXPROCS=9 ./antlrperfcomp test/input.c    1.146 ± 0.011    1.137     1.178     1.02 ± 0.02
GOMAXPROCS=10 ./antlrperfcomp test/input.c   1.166 ± 0.021    1.143     1.214     1.04 ± 0.03
GOMAXPROCS=11 ./antlrperfcomp test/input.c   1.165 ± 0.015    1.148     1.209     1.04 ± 0.02
GOMAXPROCS=12 ./antlrperfcomp test/input.c   1.170 ± 0.037    1.149     1.307     1.04 ± 0.04
GOMAXPROCS=13 ./antlrperfcomp test/input.c   1.163 ± 0.015    1.145     1.195     1.04 ± 0.02
GOMAXPROCS=14 ./antlrperfcomp test/input.c   1.180 ± 0.020    1.158     1.217     1.05 ± 0.03
GOMAXPROCS=15 ./antlrperfcomp test/input.c   1.176 ± 0.015    1.160     1.211     1.05 ± 0.02
GOMAXPROCS=16 ./antlrperfcomp test/input.c   1.171 ± 0.030    1.146     1.281     1.04 ± 0.03
GOMAXPROCS=17 ./antlrperfcomp test/input.c   1.176 ± 0.018    1.149     1.212     1.05 ± 0.02
GOMAXPROCS=18 ./antlrperfcomp test/input.c   1.175 ± 0.015    1.156     1.204     1.05 ± 0.02
GOMAXPROCS=19 ./antlrperfcomp test/input.c   1.172 ± 0.022    1.153     1.226     1.04 ± 0.03
GOMAXPROCS=20 ./antlrperfcomp test/input.c   1.169 ± 0.018    1.152     1.211     1.04 ± 0.02
GOMAXPROCS=21 ./antlrperfcomp test/input.c   1.172 ± 0.025    1.149     1.259     1.04 ± 0.03
GOMAXPROCS=22 ./antlrperfcomp test/input.c   1.171 ± 0.017    1.157     1.225     1.04 ± 0.02
GOMAXPROCS=23 ./antlrperfcomp test/input.c   1.171 ± 0.025    1.151     1.270     1.04 ± 0.03
GOMAXPROCS=24 ./antlrperfcomp test/input.c   1.175 ± 0.021    1.150     1.224     1.05 ± 0.03
GOMAXPROCS=25 ./antlrperfcomp test/input.c   1.181 ± 0.034    1.147     1.271     1.05 ± 0.04

Which tells us:

Summary
  'GOMAXPROCS=6 ./antlrperfcomp test/input.c' ran
    1.00 ± 0.02 times faster than 'GOMAXPROCS=5 ./antlrperfcomp test/input.c'
    1.02 ± 0.02 times faster than 'GOMAXPROCS=9 ./antlrperfcomp test/input.c'
    1.02 ± 0.03 times faster than 'GOMAXPROCS=7 ./antlrperfcomp test/input.c'
    1.03 ± 0.02 times faster than 'GOMAXPROCS=8 ./antlrperfcomp test/input.c'
    1.03 ± 0.04 times faster than 'GOMAXPROCS=4 ./antlrperfcomp test/input.c'
    1.04 ± 0.02 times faster than 'GOMAXPROCS=13 ./antlrperfcomp test/input.c'
    1.04 ± 0.02 times faster than 'GOMAXPROCS=11 ./antlrperfcomp test/input.c'
    1.04 ± 0.03 times faster than 'GOMAXPROCS=10 ./antlrperfcomp test/input.c'
    1.04 ± 0.02 times faster than 'GOMAXPROCS=20 ./antlrperfcomp test/input.c'
    1.04 ± 0.04 times faster than 'GOMAXPROCS=12 ./antlrperfcomp test/input.c'
    1.04 ± 0.02 times faster than 'GOMAXPROCS=22 ./antlrperfcomp test/input.c'
    1.04 ± 0.03 times faster than 'GOMAXPROCS=23 ./antlrperfcomp test/input.c'
    1.04 ± 0.03 times faster than 'GOMAXPROCS=16 ./antlrperfcomp test/input.c'
    1.04 ± 0.03 times faster than 'GOMAXPROCS=21 ./antlrperfcomp test/input.c'
    1.04 ± 0.03 times faster than 'GOMAXPROCS=19 ./antlrperfcomp test/input.c'
    1.05 ± 0.03 times faster than 'GOMAXPROCS=24 ./antlrperfcomp test/input.c'
    1.05 ± 0.02 times faster than 'GOMAXPROCS=18 ./antlrperfcomp test/input.c'
    1.05 ± 0.02 times faster than 'GOMAXPROCS=17 ./antlrperfcomp test/input.c'
    1.05 ± 0.02 times faster than 'GOMAXPROCS=15 ./antlrperfcomp test/input.c'
    1.05 ± 0.03 times faster than 'GOMAXPROCS=14 ./antlrperfcomp test/input.c'
    1.05 ± 0.04 times faster than 'GOMAXPROCS=25 ./antlrperfcomp test/input.c'
    1.05 ± 0.03 times faster than 'GOMAXPROCS=3 ./antlrperfcomp test/input.c'
    1.11 ± 0.02 times faster than 'GOMAXPROCS=2 ./antlrperfcomp test/input.c'
    1.57 ± 0.35 times faster than 'GOMAXPROCS=1 ./antlrperfcomp test/input.c'

My system would use GOMAXPROCS=12 by default I think.

So, please close this issue @parrt.

@jimidle
Collaborator

jimidle commented May 4, 2023

Just as an added bonus: if you switch this driver into SLL mode, it is quite a bit quicker:

Command                                      Mean [ms]        Min [ms]   Max [ms]   Relative
./antlrperfcomp test/input.c                 688.1 ± 11.3     680.1      727.2      1.00
	// The SLL-mode driver; assumes the ANTLR Go runtime ("antlr", the v4
	// module) and "parsing", the generated package for the C grammar.
	source, err := antlr.NewFileStream("test/input.c")
	if err != nil {
		fmt.Println("Could not open file", err)
		return
	}
	tokens := antlr.NewCommonTokenStream(parsing.NewCLexer(source), 0)
	p := parsing.NewCParser(tokens)
	// SLL prediction mode: weaker lookahead context, but markedly faster here.
	p.GetInterpreter().SetPredictionMode(antlr.PredictionModeSLL)
	tree := p.CompilationUnit()
	antlr.NewParseTreeWalker().Walk(new(parsing.BaseCListener), tree)
