Replace edge representation in DFAState class with a fast hashmap (Java runtime). #2015
Conversation
This change fixes lexer performance degradation for non-ASCII inputs.

Background: Existing code uses a DFAState[] to represent edges from the DFAState class to other states, and the LexerATNSimulator and ParserATNSimulator classes were responsible for creating this array and inserting and retrieving elements. Lexer edges decide which state to go to next when a character is read: state --character--> otherState. The current implementation uses a 127-slot array to look up which state to go to. This provides fast lookup for characters < 127 and a much slower path to determine the next state for the rest of the characters.

Parser: The parser also uses the same DFAState class, but in this case edges are defined for tokens instead of characters: state --token--> nextState. So the parser uses the same kind of array, sized by maxTokenType.

Problems with the current approach:
1. The edge representation is not properly encapsulated and is exposed publicly; the lifetime of edges is controlled by the Lexer and Parser clients, resulting in a fragile and error-prone structure.
2. Because the Lexer uses a limited-size lookup table, its performance degrades considerably when it is exposed to input with non-ASCII characters. This is not an uncommon case; a lot of code contains non-English comments, names, symbols, addresses etc.
3. Lookup tables waste memory by design. Each Lexer DFAState object uses ~1KB of memory (on 64-bit systems). Parser DFAState objects use 8*maxTokenType bytes each.
4. The existing array-based solution requires positive integer keys, so it is modified to handle -1 as a key, complicating access, especially because edge access is not properly encapsulated.

New solution: To represent DFAState transitions, we now use a small and fast <int, DFAState> map and remove the DFAState[]. The proposed map is a very limited hashmap with int keys. It uses open addressing with linear probing and a 2^n capacity for fast modulo lookups. I instrumented the map in a different branch, and in most cases (>99%) it finds the object at the first lookup. Because its capacity grows only when needed, it may use memory more efficiently. This is not a big concern for the Lexer, as even for complex grammars lexers tend to have <500 DFAStates and <10K edges.

Java8 Lexer:
Total DFA states: 348
Total DFA edges: 5805
Edge count histogram:
(0..4]: 85 41.67%
(4..8]: 20 53.33%
(8..10]: 17 58.06%
(10..16]: 12 66.39%
(16..64]: 49 93.61%
(64..100]: 20 99.17%
(100..200]: 3 100.00%

Performance:
1. I downloaded the source code of some open source projects (Java8 SDK, Linux kernel etc.). For each language, I copied the content of all source files into a single file.
2. I tokenized each file and only counted tokens, to isolate pure tokenizing performance.
3. I ran the tests a few times, on two different CPU architectures (AMD Ryzen 1700x and Intel Xeon E5-1650).

I summarized the results here: https://docs.google.com/spreadsheets/d/1LqpdXfDz19d0VEItl0T_WF7i1bM9WOj1gs9-IZWCY44

In short:
- The new approach keeps performance very close to the existing approach when input is pure ASCII (same or better on AMD Ryzen, ~30% regression on Intel Xeon).
- For non-ASCII input, it completely removes the performance regression.

One interesting thing to note: even if the input contains less than ~0.5% non-ASCII characters, the hashmap version beats the lookup table in all cases.
Also keep in mind that, because of Amdahl's Law, small differences in pure tokenizing performance might not matter much. To test this, I repeated the performance benchmarks, but this time did slightly more meaningful work with the tokens read, e.g. counted individual token types and token lengths, and checked whether tokens contain only spaces. For pure ASCII inputs, the small performance difference between the old and new approaches diminished considerably. (Data is in the spreadsheet linked above.) I did not particularly focus on parsing, but intuitively, because parsing usually involves more work on each state transition, the importance of a few cycles won should be minimal (as explained above). But I welcome more benchmarks focused on parsing as well.
Slightly improves performance, but makes the get method a bit more complex. Also makes the initial map size 2; we do not care much about map resizes because the map's access pattern is write once, read many. And it is quite common for a DFA node to have only one edge.
|
@sharwell |
As far as I can see, the only changes concerning thread safety in this patch are the places where a new edge is inserted into the DFAState objects, e.g.:
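For reference, the parser-side insertion before this patch looked roughly like the following (a paraphrased sketch, not the verbatim source):

```java
// Paraphrased sketch of the pre-patch parser-side insertion: the simulator
// creates and mutates the DFAState's public edges array itself, holding the
// state's monitor only around the write.
protected DFAState addDFAEdge(DFA dfa, DFAState from, int t, DFAState to) {
    to = addDFAState(dfa, to);
    if (from == null || t < -1 || t > atn.maxTokenType) return to;
    synchronized (from) {
        if (from.edges == null) {
            // room for tokens 1..maxTokenType, plus -1 (EOF) stored at index 0
            from.edges = new DFAState[atn.maxTokenType + 1 + 1];
        }
        from.edges[t + 1] = to; // shift by one so key -1 lands at index 0
    }
    return to;
}
```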
and very similar code in LexerATNSimulator. But note that the Lexer version actually locks on the edges array, not the DFAState object. The one in DFA is a bit different:
These are changed to addEdge method calls on the DFAState class:
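The call site then reduces to something like this (a sketch of the intent; exact names may differ in the final patch):

```java
// Sketch of the encapsulated call site: the map and its synchronization now
// live inside DFAState, so simulators no longer touch edge storage directly.
protected DFAState addDFAEdge(DFA dfa, DFAState from, int t, DFAState to) {
    to = addDFAState(dfa, to);
    if (from != null && t >= -1 && t <= atn.maxTokenType) {
        from.addEdge(t, to); // DFAState owns storage and thread safety
    }
    return to;
}
```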
In the current version, the edges array in the DFAState class is created or replaced in a synchronized block in the caller classes. In the new version it is protected inside the DFAState class. Because reading edges is not protected in either case, both are susceptible to race conditions if multiple threads write and read simultaneously. However, the new approach could be more fragile on reads, because the backing arrays in the hashmap may be resized when new elements are inserted. The current approach mostly does not resize the array after initialization (the one in the DFA class does, though), so the probability of such a problem is smaller. It is possible to add extra protection, either in the callers or in the DFAState class. I will check the performance hit that it may inflict. |
@mdakin The read operation lies on the fast path (DFA already exists), so care was taken to eliminate that lock. The unprotected read will end up reading one of the following values if a race occurs: either null (the write is not yet visible), or the fully written target state.
A fully protected read would ensure the second always occurred. However, in the rare case a null is read, the worst thing that happens is it recalculates an edge with the same result it found previously. By allowing the map to resize, references in the array may move around during a write operation, eliminating the previous guarantees. About 3-4 years ago, I worked with an ANTLR user who was dealing with intense lock contention during parsing. Solving the problem for this user required all locks on the fast path be eliminated. However, I believe this particular user is working with my fork that uses fewer locks, so they would not be directly impacted if this change was merged. The users most impacted by adding a lock on the read operation will be those running concurrent parsing on machines with 32+ cores, with greatest impacts for multi-socket scenarios. |
@sharwell
|
To answer my own question: indeed, new edges are created while the lexer / parser runs. So this means the edge representation must support multi-thread writes and multi-thread reads, and a non-thread-safe hashmap is problematic. There are a few solutions I can try:
or implement / use a lock-free hashmap, which would again offer better characteristics. But implementing a lock-free hashmap, even a simple one like my version, is tricky, and I did not try existing ones to gauge the performance impact. |
@mike-lischke |
Thanks, that part is OK; I was more interested in reads, and I think I found it. Reads are also lock protected:
The C++ version then possibly has the same issues as my Java version (if I add locking to reads): slower than using a lookup array, and possible lock contention if many threads share the lexer/parser. Another question: how does SingleWriteMultipleReadLock work? It seems that when a lexer or parser is shared, multiple threads can write. |
Yes, right, this check makes the entire process slower. There has already been a discussion about slow parsing speed with multiple parsers. But I don't see another way to synchronize access to the static DFA. And indeed you also found a bug. How could I miss that? The state lock should move to the DFA class. Each ATNSimulator instance currently has its own lock, hence the access serialization doesn't work as intended. We should write a thread-safe unsorted map class (or preferably use an existing one). |
@sharwell @mike-lischke @parrt To add more context to the discussion, I made a very quick test with different locking schemes for protecting reads. Here are very rough numbers for the Java8 lexer on 80MB of Java SDK code: Current: 45.8s. I think ThreadLocal offers a good compromise: some regression on pure ASCII input, no lock contention, and the total memory footprint is possibly still smaller than using DFAState[127] edge arrays if the thread count is below a certain number. |
What does the thread-local implementation look like? What exactly is thread local? You can't make the DFA thread local. |
@mike-lischke |
Something like this (my crude attempt), I hope I am not missing anything :)
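The original snippet was not preserved; a minimal reconstruction of the idea might look like this (HashMap stands in for the patch's int-keyed map, and all names are assumptions):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a per-thread edge map inside DFAState: each thread
// gets its own map, so reads never race with writes from other threads; the
// cost is that every thread rediscovers edges on its own.
public class DFAState {
    // withInitial creates each thread's map lazily on first access, so no
    // eager initialization is needed in the constructor
    private final ThreadLocal<Map<Integer, DFAState>> edgeMap =
            ThreadLocal.withInitial(HashMap::new);

    public DFAState getTargetState(int symbol) {
        return edgeMap.get().get(symbol); // thread-private: no locking needed
    }

    public void addEdge(int symbol, DFAState target) {
        edgeMap.get().put(symbol, target);
    }
}
```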
|
Hmm, but then you are no longer sharing the values in the edge map, potentially duplicating a lot of memory (depending on the number of parallel threads). Also, this approach doesn't help with synchronizing access to the DFA. The state lock also costs time. |
@mike-lischke For the Java8 lexer, most of the edge maps have very few elements, so the size of the hashmap is usually (80% of the time) below 20, compared to the default case of using a DFAState[127]. So after a certain number of threads (~6) it starts using more memory than the current approach, but that should not be a problem because the number of states is small. As for synchronization, it is fine: getTargetState does not use synchronization and is contention free (threads always read from their own version), and because the map is thread local, I think I don't actually need the synchronization in addEdge either. For the parser, I guess it is more complicated, because it has more nodes and the size of the DFAState array is bounded by maxToken. Another thing: I have not used ThreadLocal for a very long time, so I might have missed a limitation and maybe all of this is moot :) I need to read up on it. (E.g. I just realized I had to initialize the thread-local variable in the first addEdge call, not in the constructor :) ) |
Sorry for spamming, but I have another question regarding the lexer. Lexers take input streams in their constructor; how exactly are they shared between threads? Tokenizing the same input with multiple threads does not look like a feasible concurrency scheme. |
You cannot share lexers. Each parser needs its own lexer. What you share is only the static data (DFA etc.) |
@mike-lischke |
We are not locking lexers but static data (as I said before), like the DFA and the ATN, which can be shared between multiple parsers, even on different threads. |
@mike-lischke
I will go ahead and change the lexer DFAState to use ThreadLocal hashmaps, and roll back the parser edges to the array version (though maybe encapsulate it in the DFAState class). Hopefully I will update in a few days. |
📝 I think I figured out a way to do this efficiently. I'll take a look tomorrow and see how it plays out. |
@sharwell |
The three slow lines are the single-threaded cases. |
@sharwell I made a small change to this patch: it only replaces the edge map when it is expanding, so getters may see the older version, but is that OK for ANTLR's case? Doesn't this kind of scheme solve the problem as well? |
@sharwell @parrt |
@sharwell I am also curious: why do you expand the array on collision, rather than only when it needs to grow (e.g. when it reaches a threshold, like my map)? Is it because of performance concerns, i.e. that a hashmap with colliding keys would be slower? |
@mdakin My implementation behaves as a perfect hash (one hash function, and only one item allowed in a bucket). There is no advantage to expanding the underlying array prior to a collision. For a hash map with single-bucketing and a perfect hash function, there is no performance penalty even for operating at 100% capacity. |
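A minimal sketch of that scheme (hypothetical class, not the PR's actual code), assuming small keys such as token types and ASCII-range characters, so repeated doubling always separates colliding keys:

```java
// Hypothetical sketch of the single-bucket "perfect hash" map: identity hash,
// at most one entry per bucket, so a lookup is one index plus one compare,
// even at 100% capacity.
final class SingleProbeIntMap<V> {
    private int[] keys = new int[2];         // power-of-two capacity
    private Object[] values = new Object[2]; // null marks an empty slot

    @SuppressWarnings("unchecked")
    V get(int key) {
        int i = key & (keys.length - 1);
        return keys[i] == key && values[i] != null ? (V) values[i] : null;
    }

    void put(int key, V value) {
        int i = key & (keys.length - 1);
        while (values[i] != null && keys[i] != key) {
            grow(); // collision: keep doubling until this key gets its own slot
            i = key & (keys.length - 1);
        }
        keys[i] = key;
        values[i] = value;
    }

    // Double the table until every existing entry lands in its own bucket
    // under the identity hash; retry with a larger table on any collision.
    private void grow() {
        int capacity = keys.length;
        retry:
        while (true) {
            capacity <<= 1;
            int[] k = new int[capacity];
            Object[] v = new Object[capacity];
            for (int j = 0; j < keys.length; j++) {
                if (values[j] == null) continue;
                int i = keys[j] & (capacity - 1);
                if (v[i] != null) continue retry; // still colliding: double again
                k[i] = keys[j];
                v[i] = values[j];
            }
            keys = k;
            values = v;
            return;
        }
    }
}
```

This sketch ignores concurrency; the real implementation also has to publish entries safely to concurrent readers, which is where the atomic key array discussed below comes in.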
@sharwell Another question: why did you need to make the key array atomic? It seems there is no reason for it. But in the end, this is a nice solution; should it be fine to add as the default in ANTLR? |
@sharwell
I will try to add a similar scheme after coming back from holidays. @ahmetaa |
@sharwell |
@mdakin In the perfect and minimal perfect hash functions I dealt with, keys were known before constructing the data structures. The main motivation was compression. I am not sure if these can be applied here. I need to understand the discussion better before making a further comment. |
Guys, this is fascinating, but I'm pretty nervous about changing the fundamental data structure we have, particularly in light of the potential for threading race conditions. Is there a way we can make this an option? I can see maybe adding an extra field to DFAState, or perhaps allowing us to swap in a different DFAState subclass? Btw, this is too big of a change for our point release 4.7.1 coming up, if we decide to do it. |
@parrt This patch changes things that certainly need consideration and improvement. I have already done similar changes in the C++ target, and Sam in his optimized fork. The change has been backed by tests and evidence of its effectiveness. It would be a shame if it had to wait another year to be included. I vote for inclusion in 4.7.1, even if that means e.g. a week's delay to write more tests and do another deep review. |
My main concern would be thread safety I guess. Any thoughts on how we could fold this in as an option in 4.7.1? |
@sharwell |
Yes, that is correct. |
@parrt Otherwise it is still the same simple hashmap with int keys. I also tried a more complex approach, starting with a perfect hashmap similar to @sharwell's (single lookup, no collisions) and falling back to the int map after a certain size. However, this did not give any benefit on my home PC (AMD Ryzen), but I will double check on an Intel machine, as the architectures seem to behave quite differently. That version is in a different branch; I opted for this simpler version. Access is cheap through a volatile reference, but the nature of the map and the added indirection still inflict a performance penalty on the lexer when input is pure ASCII (up to 30%), depending on the input and grammar. For mixed input (10% non-ASCII), it still speeds up >10x. I abstracted away edge access in a new class; if necessary, we can still add a single array-based solution to it for cases where we know the upper limit of keys beforehand (e.g. parsers). The class has reasonably good unit tests. How should we proceed? Could you test it as well? |
Any thoughts on how we could fold this in as an option in 4.7.1? |
@parrt Is it possible to make this the default and cut a release candidate, so it can be tried and we can address anything that arises? Another question: what is the mechanism for adding options to ANTLR (or individually to parser / lexer classes)? |
@parrt |
…ix to get method, some doc updates.
I updated the PR description to reflect the changes. Thanks for all the discussion and suggestions. I also tried moving the current array version into the DFAEdgeCache class, but I did not observe any significant performance benefit, so I think keeping it simple is the way to go. If you have time, I would also appreciate a code review and testing; it is not much code, ~200 lines + tests. |
Given that the DFA stuff is the secret sauce, you can imagine why I am hesitant to just give it the old smoke test. ;) It would affect hundreds of thousands of users around the world. As you say, other classes update its fields. Perhaps we come up with a patch that drops new classes in on top so it's less of an option and more of a monkey patch for people that want to try it? |
@parrt |
Ok, I get it. Sorry for the hassle, but you can imagine that I've worked on this for 30 years and I'm hesitant to completely replace something that is so critical to the correct functioning of the tool. Thank you for all of your contributions nonetheless; this could prove very useful in the future. |
Note: The PR description has been edited to reflect the latest changes.
This change fixes lexer performance degradation for non-ASCII inputs.
@parrt @sharwell @bhamiltoncx @ahmetaa
Background:
Existing code uses a DFAState[] to represent edges from the DFAState class to other states, and the LexerATNSimulator and ParserATNSimulator classes were responsible for creating this array and inserting and retrieving elements.
Lexer edges decide which state to go to next when a character is read:
state --character--> otherState
The current lexer implementation uses a 127-slot array to look up which state to go to. This provides fast lookup for characters < 127 and a much slower path to determine the next state for the rest of the characters.
Parser: The parser also uses the same DFAState class, but in this case edges are defined for tokens instead of characters: state --token--> otherState, so the parser uses the same kind of array, sized by maxTokenType.
Problems with current approach:
The edge representation is not properly encapsulated and is exposed publicly; the lifetime of edges is controlled by the Lexer and Parser clients, resulting in a fragile and error-prone structure.
Because the Lexer uses a limited-size lookup table, its performance degrades considerably when it is exposed to input with non-ASCII characters. This is not an uncommon case; a lot of code contains non-English comments, names, symbols, addresses etc. The slowdown is worse than 10x even if the input contains only 10% non-ASCII characters.
The existing lookup tables waste memory by design. Each Lexer DFAState object uses ~1KB of memory on 64-bit systems (127 slots × 8-byte references). Parser DFAState objects use 8*maxTokenType bytes each. For complex grammars, maxTokenType can be >100. Most of these tables are sparse. (See the histogram below.)
The existing array-based solution requires positive integer keys, so it is modified to handle -1 as a key, complicating access, especially because edge access is not properly encapsulated.
New Solution
To represent DFAState transitions, we now use a small and fast <int, DFAState> map and remove the DFAState[].
The proposed map is a very limited hashmap with int keys. It uses open addressing with linear probing and a 2^n capacity for fast modulo lookups. I instrumented the map in a different branch, and in most cases it finds the object at the first lookup.
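A minimal sketch of such a map (illustrative names and details; the sentinel choice assumes keys are >= -1, which holds for characters and token types including EOF):

```java
import java.util.Arrays;

// Sketch of an int-keyed, open-addressing map with linear probing. Capacity
// stays a power of two, so "key % capacity" reduces to "key & (capacity - 1)".
final class IntEdgeMap<V> {
    private static final int EMPTY = Integer.MIN_VALUE; // sentinel; real keys are >= -1
    private int[] keys;
    private Object[] values;
    private int size;

    IntEdgeMap(int capacity) { // capacity must be a power of two
        keys = new int[capacity];
        values = new Object[capacity];
        Arrays.fill(keys, EMPTY);
    }

    @SuppressWarnings("unchecked")
    V get(int key) {
        int mask = keys.length - 1;
        for (int i = key & mask; ; i = (i + 1) & mask) {
            if (keys[i] == key) return (V) values[i];
            if (keys[i] == EMPTY) return null; // reached an empty slot: absent
        }
    }

    void put(int key, V value) {
        if (size * 2 >= keys.length) expand(); // keep load factor below 1/2
        int mask = keys.length - 1;
        int i = key & mask;
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask; // linear probe to the next slot
        }
        if (keys[i] == EMPTY) size++;
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    private void expand() {
        IntEdgeMap<V> larger = new IntEdgeMap<>(keys.length * 2);
        for (int j = 0; j < keys.length; j++) {
            if (keys[j] != EMPTY) larger.put(keys[j], (V) values[j]);
        }
        keys = larger.keys;
        values = larger.values;
    }
}
```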
Thread safety
ANTLR lexer and parser threads share the same DFAState graph, so graph access must be thread safe. During parsing and lexing, reads outnumber writes by a large margin as input is processed, so I opted for the cheap read-write lock trick described here (see item 5), which offers good read performance; some reads may see slightly stale data, but this is not a problem in ANTLR's case. (See @sharwell's comment below.)
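Roughly, the pattern looks like this (a sketch building on the IntEdgeMap above; needsExpansion() and the copy constructor are assumed helpers, not actual patch API):

```java
// Writers synchronize among themselves and re-publish the map through a
// volatile field; readers only dereference the volatile, so they never block.
public class DFAState {
    private volatile IntEdgeMap<DFAState> edges = new IntEdgeMap<>(2);

    public DFAState getTargetState(int symbol) {
        return edges.get(symbol); // lock-free read of the current snapshot
    }

    public synchronized void addEdge(int symbol, DFAState target) {
        IntEdgeMap<DFAState> current = edges;
        if (current.needsExpansion()) {
            // never resize in place: build the larger map aside, then publish
            IntEdgeMap<DFAState> larger = new IntEdgeMap<>(current);
            larger.put(symbol, target);
            edges = larger; // one volatile write makes it visible to readers
        } else {
            current.put(symbol, target);
            edges = current; // volatile re-publish of the in-place write
        }
    }
}
```

A reader that races with the write may simply not see the new edge yet; as discussed above, the worst case is recomputing an edge that was already found.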
Performance
Performance is an important aspect of this patch; I tried to address the performance degradation without regressing fast cases.
1. I downloaded the source code of some open source projects (Java8 SDK, Linux kernel etc.). For each language, I copied the content of all source files into a single file. You can download these from here.
2. I tokenized each file and only counted tokens, to isolate pure tokenizing performance. An example benchmark can be found here.
3. I ran the tests a few times, on two different CPU architectures (AMD Ryzen 1700x and Intel Xeon E5-1650).
Results:
The new approach keeps performance close to the existing approach. When input is pure ASCII, performance is similar on AMD Ryzen, with up to a ~30% regression on Intel Xeon. Newer Intel architectures and ARM may show different results.
For non-ASCII input, it completely removes the performance regression and works as fast as on pure ASCII input.
Performance numbers are not very stable and change depending on many factors, such as JVM parameters, the active CPU governor, and whether tests are run together or separately.
Also keep in mind that small differences in pure tokenizing performance might not matter much. To test this, I repeated the performance benchmarks, but this time did slightly more meaningful work with the tokens read, e.g. counted individual token types and token lengths, and checked whether tokens contain only spaces. For pure ASCII inputs, the small performance difference between old and new diminishes considerably.
I did not particularly focus on parsing. As with the lexer on pure ASCII input, I do not expect a big regression in performance, because parsing usually involves more work on each state transition, diminishing the importance of the few cycles spent finding the next state (as explained above). I welcome more benchmarks focused on parsing as well.
Memory Usage
Because its capacity grows only when needed and the starting size is quite small (2), the hashmap may use memory more efficiently. This is not a big concern for the Lexer, as even for complex grammars lexers tend to have <500 DFAStates and <10K edges.
Java8 Lexer:
Total DFA states: 348
Total DFA edges: 5805
The histogram of edge counts across all Lexer DFAState nodes:
Edge count histogram:
(0..4]: 85 41.67%
(4..8]: 20 53.33%
(8..10]: 17 58.06%
(10..16]: 12 66.39%
(16..64]: 49 93.61%
(64..100]: 20 99.17%
(100..200]: 3 100.00%
Edges on parser nodes tend to be even sparser, and the number of state objects can get much larger.
I did not specifically measure the memory usage for parser state nodes, but if the maximum number of tokens is larger than a certain amount, there could be gains there. For complex grammars it is not uncommon to have >100 tokens.