
Replace edge representation in DFAState class with a fast hashmap (Java runtime). #2015

Closed
wants to merge 12 commits

Conversation

@mdakin commented Sep 17, 2017

Note: this PR description has been edited to reflect the latest changes.

This change fixes the lexer performance degradation for non-ASCII inputs.

@parrt @sharwell @bhamiltoncx @ahmetaa

Background:
The existing code uses a DFAState[] to represent the edges from a DFAState to other states, and the LexerATNSimulator and ParserATNSimulator classes are responsible for creating this array and for inserting and retrieving its elements.

Lexer edges decide which state to go to next when a character is read:
state --character--> otherState

The current lexer implementation uses a 127-slot array to look up the next state. This gives fast lookups for characters < 127 and a much slower path for all other characters.
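For reference, the fast path currently looks roughly like this (a paraphrase of the LexerATNSimulator fast path, where MIN_DFA_EDGE = 0 and MAX_DFA_EDGE = 127, not the verbatim runtime source):

protected DFAState getExistingTargetState(DFAState s, int t) {
	// Only characters inside the 127-slot window can hit the cached edge array;
	// everything else is treated as a miss and recomputed on the slow path.
	if (s.edges == null || t < MIN_DFA_EDGE || t > MAX_DFA_EDGE) {
		return null;
	}
	return s.edges[t - MIN_DFA_EDGE];
}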

Parser: The parser uses the same DFAState class, but here the edges are keyed by tokens instead of characters (state --token--> otherState), so it uses the same kind of array, sized by maxTokenType.

Problems with the current approach:

  1. The edge representation is not properly encapsulated: it is exposed publicly, and the lifetime of the edges is controlled by the Lexer and Parser clients, resulting in a fragile and error-prone structure.

  2. Because the lexer uses a limited-size lookup table, its performance degrades considerably when the input contains non-ASCII characters. This is not an uncommon case: a lot of code contains non-English comments, names, symbols, addresses, etc. The slowdown is worse than 10x even when only 10% of the input characters are non-ASCII.

  3. The existing lookup tables waste memory by design. Each lexer DFAState object uses ~1 KB on 64-bit systems (a 127-slot array of 8-byte references), and each parser DFAState uses 8*maxTokenType bytes; for complex grammars, maxTokenType can be >100. Most of these tables are sparse. (See the histogram below.)

  4. The existing array-based solution requires non-negative integer keys, so accesses are offset by one to handle -1 as a key, which complicates access, especially because the edges are not properly encapsulated.

New Solution
To represent DFAState transitions, we now use a small, fast <int, DFAState> map and remove the DFAState[].

The proposed map is a very limited hashmap with int keys. It uses open addressing with linear probing and a power-of-two capacity, so the modulo in each lookup reduces to a cheap bit mask. I instrumented the map in a different branch; in most cases (>99%) it finds the object on the first probe.
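A minimal sketch of such a map, with illustrative names (the actual class in the PR differs in details such as growth policy):

import java.util.Arrays;

/** Open addressing with linear probing; capacity is always a power of two. */
final class SimpleIntMap<T> {
	private static final int EMPTY = Integer.MIN_VALUE; // sentinel; allows -1 as a real key
	private int[] keys;
	private Object[] values;
	private int size;

	SimpleIntMap(int capacity) {
		int c = 2;
		while (c < capacity) c <<= 1; // round capacity up to a power of two
		keys = new int[c];
		values = new Object[c];
		Arrays.fill(keys, EMPTY);
	}

	@SuppressWarnings("unchecked")
	public T get(int key) {
		int mask = keys.length - 1;
		for (int slot = key & mask; ; slot = (slot + 1) & mask) { // linear probe
			if (keys[slot] == key) return (T) values[slot];
			if (keys[slot] == EMPTY) return null; // miss
		}
	}

	public void put(int key, T value) {
		if (size >= keys.length / 2) expand(); // keep the table at most half full
		int mask = keys.length - 1;
		int slot = key & mask;
		while (keys[slot] != EMPTY && keys[slot] != key) slot = (slot + 1) & mask;
		if (keys[slot] == EMPTY) size++;
		keys[slot] = key;
		values[slot] = value;
	}

	@SuppressWarnings("unchecked")
	private void expand() {
		SimpleIntMap<T> bigger = new SimpleIntMap<>(keys.length * 2);
		for (int i = 0; i < keys.length; i++) {
			if (keys[i] != EMPTY) bigger.put(keys[i], (T) values[i]);
		}
		keys = bigger.keys;
		values = bigger.values;
	}
}

The power-of-two capacity is what lets the bit mask (key & mask) stand in for key % capacity on every probe.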

Thread safety
ANTLR lexer and parser threads share the same DFAState graph, so access to the graph must be thread safe. During parsing and lexing, reads outnumber writes by a large margin as input is processed, so I opted for the cheap read-write lock trick described here (see item 5), which offers good read performance; some reads may see slightly stale data, but that is not a problem in ANTLR's case. (See @sharwell's comment below.)
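A minimal sketch of that trick (volatile publication plus synchronized writers), with illustrative names; for brevity it copies a java.util.HashMap on every write, whereas the PR's map writes in place and only swaps in a replacement when it must grow:

import java.util.HashMap;
import java.util.Map;

final class EdgeCache<V> {
	// volatile, so an unlocked reader always sees a fully constructed map
	private volatile Map<Integer, V> edges = new HashMap<>();

	public void addEdge(int symbol, V target) {
		synchronized (this) { // writers serialize here
			Map<Integer, V> next = new HashMap<>(edges); // never mutate a published map
			next.put(symbol, target);
			edges = next; // volatile write publishes the new map
		}
	}

	public V getTargetState(int symbol) {
		// No lock: a reader may miss a very recently added edge, which only
		// costs recomputing an edge that would have the same value anyway.
		return edges.get(symbol);
	}
}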

Performance
Performance is an important aspect of this patch: I tried to fix the degradation for non-ASCII input without regressing the fast cases.

  1. I downloaded the source code of some open source projects (Java 8 SDK, Linux kernel, etc.) and, for each language, concatenated the contents of all source files into a single file. You can download these from here.

  2. Tokenized each file and only counted the tokens, to isolate pure tokenizing performance. An example benchmark can be found here; a minimal version is also sketched after this list.

  3. Ran the tests a few times, on two different CPU architectures (AMD Ryzen 1700X and Intel Xeon E5-1650).
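The token-counting loop is essentially the following (a minimal sketch; Java8Lexer and the input path are placeholders for whichever generated lexer and corpus file are measured):

import java.nio.file.Paths;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.Token;

public class TokenizeBench {
	public static void main(String[] args) throws Exception {
		long start = System.nanoTime();
		Java8Lexer lexer = new Java8Lexer(CharStreams.fromPath(Paths.get(args[0])));
		long count = 0;
		for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
			count++; // counting only, so the timing isolates the tokenizer
		}
		System.out.printf("%d tokens in %.1f ms%n", count, (System.nanoTime() - start) / 1e6);
	}
}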

Results (summarized in this spreadsheet: https://docs.google.com/spreadsheets/d/1LqpdXfDz19d0VEItl0T_WF7i1bM9WOj1gs9-IZWCY44):

The new approach keeps performance close to the existing one.

  • When the input is pure ASCII, performance is similar on AMD Ryzen, with up to a ~30% regression on Intel Xeon. Newer Intel architectures and ARM may give different results.

  • For non-ASCII input, it completely removes the performance degradation and runs as fast as on pure ASCII input. Even when less than ~0.5% of the input characters are non-ASCII, the hashmap version beats the array lookup in all measured cases.

Performance numbers are not very stable and change depending on many factors, such as JVM parameters, the active CPU governor, and whether tests are run together or separately.

Also keep in mind that, because of Amdahl's Law, small differences in pure tokenizing performance might not matter much. To test this, I repeated the benchmarks, but this time did slightly more meaningful work with the tokens read, e.g. counting individual token types and token lengths and checking whether tokens contain only spaces. For pure ASCII inputs, the small performance differences between the old and new approaches diminish considerably.

I did not focus on parsing in particular. As with the lexer on pure ASCII input, I do not expect a big regression, because parsing usually involves more work per state transition, diminishing the importance of a few cycles spent finding the next state (as explained above). I welcome more benchmarks focused on parsing as well.

Memory Usage
Because its capacity grows only when needed and its starting size is quite small (2), the hashmap may use memory more efficiently. This is not a big concern for the lexer: even for complex grammars, lexers tend to have <500 DFAStates and <10K edges.

Java8 Lexer:
Total DFA states: 348
Total DFA edges: 5805

The histogram of edge counts over all lexer DFAState nodes (count, cumulative %):
(0..4]: 85 41.67%
(4..8]: 20 53.33%
(8..10]: 17 58.06%
(10..16]: 12 66.39%
(16..64]: 49 93.61%
(64..100]: 20 99.17%
(100..200]: 3 100.00%

Edges on parser nodes tend to be even sparser, and the number of state objects can get much bigger. I did not specifically measure the memory usage of parser state nodes, but if the maximum number of tokens exceeds a certain amount, there could be gains there too. For complex grammars it is not uncommon to have >100 tokens.

@mdakin changed the title from "Replace edge representation in DFAState class with a fast hashmap." to "Replace edge representation in DFAState class with a fast hashmap (Java runtime)." on Oct 6, 2017
mdakin added 2 commits on October 7, 2017:

Slightly improves performance, but makes the get method a bit more complex.

Also make the initial map size 2; we do not care much about map resizes, because the map's access pattern is write once, read many, and it is quite common for a DFA node to have only one edge.
@sharwell (Member) commented Oct 7, 2017

⚠️ @parrt Keep in mind there is an exceptionally high likelihood that this change will break the thread safety guarantees previously in place. Specifically, we previously guaranteed that two independent instances of a generated parser could be used with independent input streams, and each would return correct results. Race conditions in the shared DFA could result in cases where the parsers produce incorrect/unpredictable results and/or corrupt the internal DFA state.

@mdakin (Author) commented Oct 8, 2017

@sharwell
Could you point me to which particular change may break the thread-safety guarantees? I can have a look. I don't think the current method has any inherent thread-safety issues; on the contrary, it should now be easier, because the edge representation is now private to the DFAState class. Last time I checked, all Java runtime unit tests passed, but of course there could be things I have missed. I will have a closer look at the parser bits tomorrow.

@parrt , @sharwell I would appreciate a code review.

@mdakin (Author) commented Oct 8, 2017

As far as I can see, the only changes concerning thread safety in this patch are the places where a new edge is inserted into a DFAState object, e.g. in ParserATNSimulator:

		synchronized (from) {
			if ( from.edges==null ) {
				from.edges = new DFAState[atn.maxTokenType+1+1];
			}

			from.edges[t+1] = to; // connect
		}

and very similar code in LexerATNSimulator. Note, though, that the lexer version locks on the edges array, not on the DFAState object.

The one in DFA is a bit different:

		// synchronization on s0 here is ok. when the DFA is turned into a
		// precedence DFA, s0 will be initialized once and not updated again
		synchronized (s0) {
			// s0.edges is never null for a precedence DFA
			if (precedence >= s0.edges.length) {
				s0.edges = Arrays.copyOf(s0.edges, precedence + 1);
			}

			s0.edges[precedence] = startState;
		}

These are changed into calls to an addEdge method on the DFAState class:

public void addEdge(int symbol, DFAState state) {
    synchronized(this) {
        edges.put(symbol, state);
    }
}

In the current version, the edges array in the DFAState class is created or replaced inside a synchronized block in the caller classes; in the new version it is protected inside the DFAState class.

Because reads of the edges are not protected in either case, both versions are susceptible to race conditions if multiple threads write and read simultaneously. However, the new approach could be more fragile, because the backing arrays of the hashmap may be resized when new elements are inserted. The current approach mostly does not resize the array after initialization (though the one in the DFA class does), so the probability of such a problem is smaller.

It is possible to add extra protection, either in the callers or in the DFAState class. I will check the performance hit it may inflict.

@sharwell (Member) commented Oct 8, 2017

@mdakin The read operation lies on the fast path (the DFA already exists), so care was taken to eliminate that lock. If a race occurs, the unprotected read will end up reading one of the following values:

  • Null (no edge)
  • A non-null, valid edge

A fully protected read would ensure the second always occurs. However, in the rare case that a null is read, the worst that happens is that an edge is recalculated, with the same result it found previously. By allowing the map to resize, references in the array may move around during a write operation, eliminating the previous guarantees.
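In code, the tolerated race is roughly this caller pattern (paraphrasing the simulator's existing fast/slow path split):

DFAState target = getExistingTargetState(s, t); // unlocked read: null or a valid edge
if (target == null) {
	// Either no edge exists yet or we lost a benign race; recomputing
	// produces the same DFAState and re-adds the same edge.
	target = computeTargetState(input, s, t);
}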

About 3-4 years ago, I worked with an ANTLR user who was dealing with intense lock contention during parsing. Solving the problem for this user required that all locks on the fast path be eliminated. However, I believe this particular user is working with my fork, which uses fewer locks, so they would not be directly impacted if this change were merged. The users most impacted by adding a lock on the read operation would be those running concurrent parsing on machines with 32+ cores, with the greatest impact in multi-socket scenarios.

@mdakin (Author) commented Oct 8, 2017

@sharwell
Before offering any alternatives, I have a few questions:

  1. While parsing, do new DFA states and edges continue to be generated? If so, then when a parser object is shared between threads (I guess the reasoning is the parser's large memory footprint?), this could indeed cause problems for a non-thread-safe hashmap.
  2. Is this a problem for lexers as well? That is, once the lexer is generated, can new DFA states/edges still be added while tokenizing?

@mdakin (Author) commented Oct 9, 2017

@sharwell @parrt @ahmetaa

To answer my own question: indeed, new edges are created while the lexer/parser runs. This means the edge representation must support multi-threaded writes and reads, and a non-thread-safe hashmap is problematic. There are a few solutions I can try:

  1. Abandon the hashmap approach for the parser. Token counts are usually not that high, and definitely not unbounded like lexer inputs, so the hashmap was only saving some memory there.

  2. Keep the hashmap in the lexer, but move the map reference into ThreadLocal storage. This will slow things down, but not as much as locking the getters, and it is not susceptible to contention. The additional memory use is (IMO) negligible.

  3. Implement or use a lock-free hashmap, which would offer even better characteristics. But implementing a lock-free hashmap, even a simple one like mine, is tricky, and I have not tried existing ones to gauge the performance impact.

@mdakin (Author) commented Oct 9, 2017

@mike-lischke
Mike, as far as I remember you changed this to an unordered map in the C++ version as well; how did you handle concurrent writers/readers?

@mike-lischke (Member) commented:

@mdakin (Author) commented Oct 9, 2017

@mike-lischke

Thanks, that bit is OK. I was more interested in reads, and I think I found it; reads are also lock protected:

https://github.com/antlr/antlr4/blob/master/runtime/Cpp/runtime/src/atn/LexerATNSimulator.cpp#L188

which is:

static antlrcpp::SingleWriteMultipleReadLock _edgeLock; // Lock for the sparse edge map in DFA states.

The C++ version then possibly has the same issues as my Java version would if I added locking to reads: slower than a lookup array, plus possible lock contention if many threads share the lexer/parser.

Another question: how does SingleWriteMultipleReadLock work? It seems that when a lexer or parser is shared, multiple threads can write.

@mike-lischke (Member) commented Oct 9, 2017

Yes, right, this check makes the entire process slower. There has already been a discussion about slow parsing speed with multiple parsers. But I don't see another way to synchronize access to the static DFA.

And indeed you also found a bug. How could I miss that? The state lock should move to the DFA class; each ATNSimulator instance currently has its own lock, so the access serialization doesn't work as intended. We should write a thread-safe unordered map class (or preferably use an existing one).

@mdakin (Author) commented Oct 9, 2017

@sharwell @mike-lischke @parrt

To add more context to the discussion, I made a very quick test with different locking schemes for protecting reads. Here are the very rough numbers.
Platform: Intel Xeon (I will test AMD Ryzen later). The workload was tokenizing 200 MB of Turkish text with our grammar:

Current: 45.8s
Hashmap, No locking: 3.2s
Hashmap, Threadlocal: 4.5s
Hashmap, ReentrantLock: 6.9s
Hashmap, synchronized block: 7.2s

For the Java 8 lexer, on 80 MB of Java SDK code:
Current: 1.21s
Hashmap, No locking: 1.41s
Hashmap, Threadlocal: 2.01s

I think ThreadLocal offers a good compromise: some regression on pure ASCII input, no lock contention, and the total memory footprint is possibly still smaller than the DFAState[127] edge arrays as long as the thread count stays below a certain number.

@mike-lischke (Member) commented:

How does the thread-local implementation look? What exactly is thread-local? You can't make the DFA thread-local.

@mdakin (Author) commented Oct 9, 2017

@mike-lischke
But you can make the edge map inside the DFAState class thread-local; each thread would then have its own copy of it. That increases the memory footprint, but I don't see a reason it should not work.

@mdakin (Author) commented Oct 9, 2017

Something like this (my crude attempt); I hope I am not missing anything :)

	private final ThreadLocal<SimpleIntMap<DFAState>> localEdges = new ThreadLocal<>();
...
	public DFAState(ATNConfigSet configs) {
		this.configs = configs;
		localEdges.set(new SimpleIntMap<DFAState>(2));
	}

	public void addEdge(int symbol, DFAState state) {
		SimpleIntMap<DFAState> edges = localEdges.get();
		synchronized(this) {
			edges.put(symbol, state);
		}
	}

	public DFAState getTargetState(int symbol) {
		return localEdges.get().get(symbol);
	}

@mike-lischke (Member) commented Oct 9, 2017

Hmm, but then you are no longer sharing the values in the edge map, potentially duplicating a lot of memory (depending on the number of parallel threads). Also, this approach doesn't help with synchronizing access to the DFA, and the state lock also costs time.

@mdakin (Author) commented Oct 9, 2017

@mike-lischke
(I am speaking about the lexer.) For lexers, the number of edges is not big. If you check the numbers I mentioned in the PR description, even for a large and complex grammar like Java 8, the number of states and edges is small:

Java8 Lexer:
Total DFA states: 348
Total DFA edges: 5805

And most of the states have very few edges, so the size of the hashmap is usually (80% of the time) below 20, compared to the default DFAState[127]. So beyond a certain number of threads (~6) it starts using more memory than the current approach, but that should not be a problem, because the number of states is small.

Also, regarding synchronization: it is fine. getTargetState does not use synchronization and is contention-free (threads always read from their own copy), and because the map is thread-local, I think I don't actually need the lock in addEdge either.

For the parser, I guess it is more complicated, because it has more nodes and the size of the DFAState array is bounded by maxToken.

Another thing: I have not used ThreadLocal in a very long time, so I might have missed a limitation, and maybe all of this is moot :) I need to read up on it a bit. (E.g. I just realized I have to initialize the thread-local variable in the first addEdge call, not in the constructor :) )

@mdakin (Author) commented Oct 9, 2017

Sorry for spamming, but I have another question regarding the lexer. Lexers take input streams in their constructor; how exactly are they shared between threads? Tokenizing the same input with multiple threads does not look like a feasible concurrency scheme.

@mike-lischke (Member) commented:

You cannot share lexers; each parser needs its own lexer. What you share is only the static data (the DFA etc.).

@mdakin (Author) commented Oct 9, 2017

@mike-lischke
I am not following; why, then, do we lock anything in LexerATNSimulator if it is always used by a single thread?

@mike-lischke (Member) commented Oct 9, 2017

We are not locking lexers but static data (as I said before), like the DFA and the ATN, which can be shared between multiple parsers, even on different threads.

@mdakin (Author) commented Oct 9, 2017

@mike-lischke
I see, thanks. Probably these (from the generated lexers):

  protected static final DFA[] _decisionToDFA;
  
  public static final ATN _ATN =
    new ATNDeserializer().deserialize(_serializedATN.toCharArray());

I will go ahead and change the lexer DFAState to use ThreadLocal hashmaps, and roll the parser edges back to the array version (though maybe encapsulated in the DFAState class). Hopefully I will update in a few days.

@sharwell (Member) commented:

📝 I think I figured out a way to do this efficiently. I'll take a look tomorrow and see how it plays out.

@mdakin (Author) commented Oct 10, 2017

@sharwell
Curious; I hope it plays out. For the lexer, the only way I can think of to preserve the concurrency guarantees and performance together is to keep the existing array and fall back to a synchronized hashmap only when the input is out of range. (TBH, maybe that is not a bad solution anyway.)

@sharwell (Member) commented:

The three slow lines are the single threaded cases.

@mdakin (Author) commented Oct 14, 2017

@sharwell I made a small change to this patch: it now replaces the edge map only when it is expanding, so getters may see an older version, but that is OK in ANTLR's case, right? Doesn't this kind of scheme solve the problem as well?

@mdakin (Author) commented Oct 14, 2017

@sharwell
My approach has the benefit of creating a new map only when it needs to expand, which uses less memory, but it is also a bit slower, because each rewrite of the map flushes it from the cache; this does not happen very often, though.
On the other hand, I don't care which implementation is used. The existing array lookup behaves terribly when the input is Unicode, and I would appreciate it if you could send your change as a separate patch for inclusion in the next version of ANTLR (at least for lexers).

@parrt
Could you chime in on how we should proceed with this?

@mdakin (Author) commented Oct 15, 2017

@sharwell I am also curious: why do you expand the array on collision rather than only when it needs to grow (e.g. on reaching a threshold, like my map does)? Is it because of performance concerns, i.e. that a hashmap with colliding keys would be slower?

@sharwell (Member) commented:

@mdakin My implementation behaves as a perfect hash (one hash function, and only one item allowed per bucket). There is no advantage to expanding the underlying array prior to a collision: for a hash map with single-bucketing and a perfect hash function, there is no performance penalty even when operating at 100% capacity.
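A minimal sketch of that single-bucket scheme (illustrative names, not the actual HashEdgeMap code): each key gets exactly one candidate slot, and any collision forces a resize until all keys land in distinct slots.

import java.util.Arrays;

final class SingleBucketIntMap<T> {
	private static final int EMPTY = Integer.MIN_VALUE; // sentinel for free slots
	private int[] keys;
	private Object[] values;

	SingleBucketIntMap(int capacity) { // capacity must be a power of two
		keys = new int[capacity];
		values = new Object[capacity];
		Arrays.fill(keys, EMPTY);
	}

	@SuppressWarnings("unchecked")
	public T get(int key) {
		int slot = key & (keys.length - 1);
		return keys[slot] == key ? (T) values[slot] : null; // exactly one probe
	}

	@SuppressWarnings("unchecked")
	public void put(int key, T value) {
		int slot = key & (keys.length - 1);
		if (keys[slot] != EMPTY && keys[slot] != key) {
			// Collision: rehash everything into a table twice as large.
			SingleBucketIntMap<T> bigger = new SingleBucketIntMap<>(keys.length * 2);
			for (int i = 0; i < keys.length; i++) {
				if (keys[i] != EMPTY) bigger.put(keys[i], (T) values[i]);
			}
			bigger.put(key, value);
			keys = bigger.keys;
			values = bigger.values;
			return;
		}
		keys[slot] = key;
		values[slot] = value;
	}
}

Because a lookup touches exactly one slot, there is no probe sequence to degrade, which is why 100% occupancy carries no read penalty.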

@mdakin (Author) commented Oct 16, 2017

@sharwell
The advantage of a non-perfect hashmap is that it tends to use less memory, though, and because of caching effects it can even be faster than the perfect version. Is the benchmark you used available, so I can test as well?

Another question: why did you need to make the key array atomic? There seems to be no reason for it.

But in the end, this is a nice solution; would it be fine to add it as the default in ANTLR?

@mdakin (Author) commented Oct 16, 2017

@sharwell
But in principle I agree that a simple perfect hashmap scheme could be a good fit for ANTLR's case; the edge map is basically an (almost) write-once, read-many map. I wonder whether a simple, fast scrambler method would help or make things worse:

public static int mix(final int x) {
	final int h = x * 0x9E3779B9; // "int phi": 2^32 divided by the golden ratio
	return h ^ (h >>> 16); // unsigned shift so the high bits mix into the low bits
}
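The scrambled slot would then be computed with the usual power-of-two mask, e.g.:

int slot = mix(key) & (keys.length - 1); // keys.length is a power of two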

I will try to add a similar scheme after coming back from holidays.

@ahmetaa
You have quite a bit of experience with perfect hashmaps; any input on this?

@mdakin (Author) commented Oct 16, 2017

@sharwell
BTW, never mind the atomic key array question; it is OK to protect only the key array. You also check whether the keys match after getting the value, so it is fine.

@ahmetaa commented Oct 16, 2017

@mdakin In the perfect and minimal perfect hash functions I have dealt with, the keys were known before the data structures were constructed, and the main motivation was compression. I am not sure whether these can be applied here; I need to understand the discussion better before commenting further.

@parrt (Member) commented Oct 21, 2017

Guys, this is fascinating, but I'm pretty nervous about changing such a fundamental data structure, particularly in light of the potential for threading race conditions. Is there a way we can make this an option? I can see maybe adding an extra field to DFAState, or perhaps allowing a different DFAState subclass to be swapped in.

BTW, this is too big a change for our upcoming 4.7.1 point release, if we decide to do it.

@mike-lischke (Member) commented:

@parrt This patch changes things that certainly need consideration and improvement. I have already done similar changes in the C++ target, and Sam has in his optimized fork. The change is backed by tests and has proven its effectiveness. It would be a shame if it had to wait another year to be included. I vote for inclusion in 4.7.1, even if that means a week's delay, e.g. to write more tests and do another deep review.

@parrt (Member) commented Oct 21, 2017

My main concern would be thread safety I guess. Any thoughts on how we could fold this in as an option in 4.7.1?

@mdakin (Author) commented Oct 24, 2017

@parrt
I will update the PR with a new proposal in a few days (something similar to @sharwell's perfect map) that will hopefully address all concerns.

@mdakin (Author) commented Oct 26, 2017

@sharwell
A quick question: the benchmark results you provided above with HashEdgeMap are against SparseEdgeMap, not the current ANTLR HEAD implementation (DFAState[]), right?

@sharwell (Member) commented:

> benchmark results you provided above with HashEdgeMap is against SparseEdgeMap

Yes, that is correct.

@mdakin (Author) commented Oct 26, 2017

@parrt
For thread safety I used the cheap read-write lock trick described here (see item 5): writes are synchronized, and the map is rewritten only when it needs to expand. Reads are not synchronized, but in the worst case they see a possibly stale yet consistent version of the map.

Otherwise it is still the same simple hashmap with int keys. I also tried a more complex approach, starting with a perfect hashmap similar to @sharwell's (single lookup, no collisions) and falling back to the int map beyond a certain size. However, this gave no benefit on my home PC (AMD Ryzen), though I will double-check on an Intel machine, since the architectures seem to behave quite differently. That version lives in a different branch; I opted for the simpler one.

Access is cheap through a volatile reference, but the nature of the map and the added indirection still inflict a performance penalty on the lexer when the input is pure ASCII (up to 30%), depending on the input and grammar. For mixed input (10% non-ASCII), it is still a >10x speedup.

I abstracted edge access away into a new class; if we know the upper limit of the keys beforehand (e.g. for parsers), we can still add a single-array-based implementation inside it. The class has reasonably thorough unit tests.

How should we proceed? Could you test it as well?

@parrt (Member) commented Oct 27, 2017

Any thoughts on how we could fold this in as an option in 4.7.1?

@mdakin (Author) commented Oct 28, 2017

@parrt
I am not sure it is easy to make it an option. The main problem is that in the current ANTLR HEAD, the lifetime of the edge representation (a raw DFAState[]) is entirely controlled by the other classes (Parser, Lexer, etc.); without abstracting the access away into a separate class (e.g. something like the DFAEdgeCache class I introduced), it would be very ugly.

Would it be possible to make this the default and cut a release candidate, so it can be tried and we can address anything that arises?

Another question: what is the mechanism for adding options to ANTLR (or individually to the parser/lexer classes)?

@mdakin (Author) commented Oct 29, 2017

@parrt
My preference is that we bite the bullet and just use the map version.

@mdakin (Author) commented Oct 29, 2017

@parrt, @sharwell

I updated the PR description to reflect the changes. Thanks for all the discussion and suggestions.

I also tried moving the current array version into the DFAEdgeCache class, but I did not observe any significant performance benefit, so I think keeping it simple is the way to go.

If you have time, I would also appreciate a code review and testing; it is not much code, ~200 lines plus tests.

@parrt (Member) commented Nov 1, 2017

Given that the DFA machinery is the secret sauce, you can imagine why I am hesitant to just give it the old smoke test. ;) It would affect hundreds of thousands of users around the world. As you say, other classes update its fields. Perhaps we could come up with a patch that drops the new classes in on top, so it's less an option and more a monkey patch for people who want to try it?

@mdakin (Author) commented Nov 2, 2017

@parrt
Unfortunately, I don't have any time or energy left for this patch; feel free to close it.

@parrt (Member) commented Nov 2, 2017

OK, I get it. Sorry for the hassle, but you can imagine that I've worked on this for 30 years and I'm hesitant to completely replace something so critical to the correct functioning of the tool. Thank you for all of your contributions nonetheless; this could prove very useful in the future.

@parrt closed this on Nov 2, 2017