GSoC Progress #1
Now there are tests, and they have revealed some bugs, particularly with default attributes.
All 8 tests now pass, and I should probably write some more. I'm wondering if it might be a good idea to add a way to specify that a rule can only apply in certain layers. A use case might be if you had a rule for

Clipping and default attributes usually work, but they seem to occasionally fail, so tomorrow I'll mess with those a bit and also add a few other bits and pieces.
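A hedged sketch of what a per-layer restriction might look like, purely to illustrate the idea; the `Rule` struct, its `layers` field, and `applicableInLayer` are hypothetical names, not the project's actual types:

```cpp
#include <cstdint>

// Hypothetical: each rule carries a bitmask of the layers it may fire in.
struct Rule {
    std::uint32_t layers = ~0u;  // default: the rule applies in every layer
    // ... pattern, weight, output template ...
};

// Checked before pattern matching; a cheap per-(rule, layer) filter.
// Assumes fewer than 32 layers.
inline bool applicableInLayer(const Rule& r, unsigned layer) {
    return (r.layers >> layer) & 1u;
}
```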
I restructured the compiler and wrote some more tests. The eng-spa rules seem to be occasionally failing at agreement and I'm not sure why (corresponding tests work ok). It's also not entirely clear that overriding output patterns is working properly. I'll have to write some tests for that as well.
Profiling yesterday indicated that the most significant time-sink was repeatedly removing the first element of a

The current largest piece is that the match function is being called more than 2 million times on 150 thousand tokens due to the multipass nature of the parser. Once the parser is converted to some version of LR, it looks like possibly as much as 1/4 of the program's runtime will be spent in
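The name of the container in the first comment above was lost to formatting, but the pattern it describes is a classic one. As a hedged illustration only, assuming the structure was something like a `std::vector`, erasing the front element costs O(n) per removal (`consumeSlow` and `consumeFast` are hypothetical names):

```cpp
#include <cstddef>
#include <vector>

// O(n) per removal: erase(begin()) shifts every remaining element left,
// so draining the whole container this way is O(n^2) overall.
void consumeSlow(std::vector<int>& tokens) {
    while (!tokens.empty()) {
        int tok = tokens.front();
        (void)tok;  // ... process tok ...
        tokens.erase(tokens.begin());
    }
}

// A common fix: walk an index forward and leave the container alone.
void consumeFast(const std::vector<int>& tokens) {
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        int tok = tokens[i];
        (void)tok;  // ... process tok ...
    }
}
```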
For unknown reasons, the profiling mentioned in the previous comment gave 1.3s as the running time with ~60 rules on 75k tokens, but every subsequent run has given that time as about 3s, which is roughly level with the existing transfer system. The first attempt to switch from multipass to LR on Thursday and Friday went badly due to an overuse of

I did some optimizing today and got the runtime down to about 2.2s. I think some further gains might be possible, but they would require messing with strings, so I'm not sure if it's worth it.
The LR version is in the

Under LR, the
I've switched to GLR, which should fix the problems from the last comment. Reduce-reduce conflicts are resolved as they were under multipass and LR (by length and weight). Shift-reduce conflicts involve splitting the stack. Unfortunately, this makes the stack a directed graph, which seems to slow things down significantly. I'll post more details when the profiler finishes.
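For readers unfamiliar with GLR, here is a minimal sketch of why stack splitting yields a graph rather than a stack: nodes get shared between branches, so each node can have several predecessors. This is a generic graph-structured-stack illustration, not the project's actual data structure, and all names are hypothetical:

```cpp
#include <memory>
#include <vector>

// A node in a graph-structured stack (GSS): unlike a plain stack frame,
// it can have several predecessors, one per branch sharing this suffix.
struct StackNode {
    int state;
    std::vector<std::shared_ptr<StackNode>> preds;
};

// Shifting pushes a new node on one branch. After a shift-reduce
// conflict, two "top" nodes can end up pointing at the same ancestors,
// which is exactly what turns the stack into a directed graph.
std::shared_ptr<StackNode> shift(std::shared_ptr<StackNode> top, int nextState) {
    auto node = std::make_shared<StackNode>(StackNode{nextState, {}});
    node->preds.push_back(std::move(top));
    return node;
}
```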
As it turns out, the substantial increase in time was due to a value being set wrong and occasionally causing an infinite loop. I'm still not completely certain what was causing it, but it doesn't seem to be happening anymore and the overall runtime is now better than multipass. However, multipass and GLR sometimes apply rules in different orders, which means that the output is different, so some of the test rules may need to be rewritten. It's also possible that this difference in behavior is actually another bug. Either way, writing more tests will hopefully make things clearer.
Most obsolete code has now been deleted or moved to

As it turns out, none of the existing tests had shift-reduce conflicts. A grammar without conflicts will run pretty quickly, but one with conflicts will eat up a lot of memory and run rather slowly due to all the stack splitting. Given that each rule is executed separately on each branch, I'm now pretty sure that having some rules run at output rather than at parse time would substantially speed things up (see the sketch below). Clipping tags currently accounts for close to half of the runtime. It may be possible to make the compiler generate fewer clip instructions, or to have the tokenizer break the input stream into tags.
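A hedged sketch of the "run at output" idea as I read it: during parsing, each branch only records which rule matched, and the rule body runs once, on the surviving branch, when the tree is output. This is an illustration under my own assumptions; `Chunk`, `RuleFn`, and `runOutputRules` are hypothetical names, not the project's API:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical parse-tree node: the parser only records which rule
// built this chunk instead of executing the rule body on every branch.
struct Chunk {
    int ruleId = -1;
    std::vector<Chunk> children;
};

using RuleFn = std::function<void(Chunk&)>;

// At output time the tree belongs to exactly one surviving branch, so
// each rule body runs once per chunk instead of once per stack split.
void runOutputRules(Chunk& c, const std::vector<RuleFn>& rules) {
    for (Chunk& child : c.children)
        runOutputRules(child, rules);  // bottom-up
    if (c.ruleId >= 0)
        rules[static_cast<std::size_t>(c.ruleId)](c);
}
```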
Adding output rules sped things up and saved enough copies that I can now actually run the test data I've been using. Unfortunately, it takes 20 seconds and the memory usage goes as high as 5GB. I'll see if the profiler turns up anything. If not, I think I can get the compiler to generate less bytecode on the input side.
I added a pool allocator for chunks and stack branches that resets on output, which seems to have sped things up a bit (now 15s), and it also substantially decreased the memory usage, which looks like it's now down to about 1GB. It also indicates that, as things are currently set up, there's at least one sentence in my test data that allocates more than 500,000 chunks while parsing.
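A minimal sketch of a reset-on-output pool (arena) allocator, assuming, as the comment suggests, that chunks and stack branches never need to outlive an output flush. The class and method names here are hypothetical, not the project's actual code:

```cpp
#include <cstddef>
#include <vector>

// Bump-pointer pool: allocation is a pointer increment, and freeing
// everything at once is a single reset, which is what makes it cheap
// when all objects share the same lifetime (here: one output cycle).
class ChunkPool {
    std::vector<char> buf;
    std::size_t used = 0;
public:
    explicit ChunkPool(std::size_t bytes) : buf(bytes) {}

    void* allocate(std::size_t n) {
        // Round up so allocations stay aligned for ordinary object types.
        n = (n + alignof(std::max_align_t) - 1) & ~(alignof(std::max_align_t) - 1);
        if (used + n > buf.size()) return nullptr;  // real code would grow or chain
        void* p = buf.data() + used;
        used += n;
        return p;
    }

    // Called after each output: drops every live allocation at once, so
    // peak memory tracks the largest sentence rather than the whole input.
    void reset() { used = 0; }
};
```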
I just noticed something that is either a bug or a really annoying caveat. A set of rules like
With input
I started on an XML compiler today. In the absence of substantial setbacks, I might have it working tomorrow.
I finished most of the TODOs in the XML compiler, but didn't get as far as actually testing it. Things still unfinished:
It compiles the eng-spa files without apparent issue, but for some reason the first transition in the transducer is
Issues with the XML compiler and interpreter:
However, it does appear to be somewhat faster than chunker/interchunk.
This is basically identical to the HFST bug.
One possible solution to that might be to not discard branches when reading a blank: then it would read the comma and both branches would die, but I don't know if there's a more complicated scenario where that would fail.
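A hedged sketch of that proposal, under my reading of it: on a blank, a branch that cannot consume the symbol survives unadvanced instead of dying, and the next unmatchable non-blank symbol then kills every branch together. `Branch`, `StepFn`, and `advance` are hypothetical names, not the actual interpreter's types:

```cpp
#include <functional>
#include <vector>

struct Branch { int state = 0; };

// step(b, sym) tries to advance b on sym and returns false if b dies.
using StepFn = std::function<bool(Branch&, char)>;

std::vector<Branch> advance(std::vector<Branch> branches, char sym,
                            bool isBlank, const StepFn& step) {
    std::vector<Branch> next;
    for (Branch b : branches) {
        Branch saved = b;            // pre-step copy, in case we keep it
        if (step(b, sym)) {
            next.push_back(b);       // branch consumed the symbol
        } else if (isBlank) {
            next.push_back(saved);   // don't prune on a blank; stay put
        }
        // Non-blank failure: the branch is dropped. If no branch can
        // consume a non-blank symbol (e.g. the comma), 'next' ends up
        // empty and all branches die at the same point.
    }
    return next;
}
```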
@ftyers, thoughts on the above issue?
The solution that I've found to the HFST-like bug is to write something like the following:
Since no rule generates

Unfortunately, implementing this solution slows things down considerably (roughly a factor of 4, it looks like).
General TODO list: