Let's booooooost! #814
Could you check why CI is broken?
If it's only a matter of the Rust version, please feel free to bump lalrpop's Rust version for stabilized features.
Thank you!
It's a regex-automata issue that I just reported. Luckily it was fixed in the blink of an eye. rust-lang/regex#1056
Any update?
I triggered CI again.
I force-pushed again. This time it should be fixed. @youknowone
My fault. Force-pushed again.
@nikomatsakis @yannham Could you review this patch?
I'm back from vacation, I'll try to take a look soon!
Did a small amount of local testing on my box, compiling lrgrammar.lalrpop with the HEAD of this vs. just the skip rewrite commit, using /usr/bin/time -v. I got a very minor speed improvement, from 4.03 seconds user time to 3.91 seconds user time, with the lexer rewrite here. Peak RAM usage increased from 12.5MB to 13.8MB. (I was concerned about RAM because of the comment about "exorbitant RAM usage" here.)

On this lalrpop file that I care about, I regressed speed from 1.98 seconds to 2.5 seconds, with peak RAM going up from 11.4MB to 12.8MB.

I also grabbed json.lalrpop, which I think was used by the submitter for the benchmark. I got an improvement from 0.06 seconds user time to 0.03 seconds, with a peak RAM increase from 9.9MB to 11.4MB.

Obviously this is all one box, using time, without any statistical sampling, so I wouldn't read too much into the above numbers. I don't view the RAM increases as "exorbitant", but they are a trade-off to be aware of. I am a little concerned about the variation across different workloads, though. The tiny JSON file got good results, but the percentage win I observed in the much larger lrgrammar.lalrpop was much smaller, and the third file I tested actually regressed. I tried running paguroidea's benchmarking, but I got unrelated compile errors and didn't spend the time to debug them.

The benchmark tests I run in my project show an extreme regression with the lexer rewrite vs. just the whitespace fix:

Full system compile time: [268.80 ms 278.32 ms 288.93 ms]

This is a sort of shockingly huge number, particularly since I expect my code's performance, rather than lalrpop's, to be the major bottleneck here. So I'm wondering if there's some sort of measurement or environment error on my end, but I do seem to be getting similar results consistently.
@dburgener Thanks for testing. I don't have time to dive into these results in the next few days, but I do have some speculations. IIRC, the default strategy of the regex crate is the lazy DFA. If you look through the original implementation, you can see a lot of redundant calculations due to the limitations of the regex API. Anyway, some simple profiling should reveal what's going wrong under the hood.

Could you upload the complete benchmark somewhere?
Sure. I have a lot of skepticism about my initial time-based benchmarks, but here's a branch of my project that uses lalrpop: https://github.com/dburgener/cascade/tree/dburgener/bench-lalrpop-lexer

In the Cargo.toml there, lalrpop/lalrpop#814 is the lalrpop change. Commenting in the 366e87a commit lines exercises the PR with just its first commit, and switching to d1e99cc exercises the full PR.

I haven't dug into the results here much on my end yet (I do intend to in a day or two), so it's possible there's something in my code that's relevant here. But if you modify my Cargo.toml lines to point to your first commit and run the benchmarks, you should be able to compare against the full PR.

That said, I haven't really dug into ruling out environmental differences while the benchmark was running, or done any profiling, so maybe there's something majorly wrong with the way I'm testing it.
@QuarticCat This PR seems to require more discussion. If you don't mind, could you open another PR for the regex pattern to fix #820 first?
Cascade (old): […]

Cascade (`dense`): […]

Cascade (`hybrid`): […]

Obviously, it's the regex compilation that took too much time in your benchmark. Considering that Cascade (`hybrid`) […]

Twitter JSON (`hybrid`): […]

Surprisingly, this one is faster. This time all results are slower than they were initially, so it's even faster than the numbers show. I should have used this script to fix the environment. Anyway, that doesn't affect our conclusion.

I'm a bit bored with this PR. If you want to tweak the regex-automata configuration further, I can grant you access to my forked repo.
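For readers who want to take up that offer, here is a minimal sketch of what tweaking the regex-automata configuration can look like, assuming regex-automata 0.3's hybrid API; the capacity numbers and the pattern are invented for illustration:

```rust
use regex_automata::hybrid::dfa::DFA;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The lazy DFA's cache bounds how many DFA states are kept around;
    // a bigger cache trades memory for fewer state re-computations.
    let dfa = DFA::builder()
        .configure(
            DFA::config()
                .cache_capacity(2 * (1 << 20)) // 2 MiB transition cache
                .minimum_cache_clear_count(Some(3)),
        )
        .build(r"[a-zA-Z_][a-zA-Z0-9_]*")?;
    let cache = dfa.create_cache();
    println!("cache memory usage: {} bytes", cache.memory_usage());
    Ok(())
}
```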
Oh, I forgot to post memory changes. The results below are tested by […]:

Cascade (old): […]

This result is floating between 55MB and 59MB.

Cascade (`hybrid`): […]

It's a little bit larger, presumably because the hybrid version reuses […]
Thanks for the thorough investigation, @QuarticCat. Sorry for not being able to review yet, but backlog accumulates elsewhere... It's not entirely clear to me how the table you show relates to LALRPOP: is each row LALRPOP after this PR, LALRPOP before this PR, or yet another combination that isn't used in this PR nor previously in LALRPOP? If the compilation can be a huge pessimization in some workflows, I would be tempted not to proceed with this PR, especially because if the lexer's performance is of importance in one's use case, you can still use Logos or even roll out a fine-tuned custom lexer if needed, which can be plugged into LALRPOP easily (the complexity is nowhere near that of writing a custom parser, which would indeed defeat the purpose of LALRPOP entirely). What do you think?
The original implementation also needs to compile regexes on the fly. […] More tests & benchmarks are welcome. I'm confident in my work.
"old" -> LALRPOP before this PR |
IMHO, what we really need is good lalrpop benchmarking automation. This likely won't be the last PR we want to understand the performance of. I've been super busy all week, and expect to be recovering and catching up all next week, but the week after, I can start looking at our options for benchmarking automation, with a particular emphasis on applying it in this case. The claimed performance wins here are pretty considerable, so hopefully we can gain some confidence that they are fairly universal rather than workload-dependent and move forward.

I think the other problem here is that while @QuarticCat clearly understands the regex-automata internals and can speak to them, it doesn't seem like any of the maintainers or reviewers have much background there. I started looking at reviewing last week and dipped my toe into the regex-automata docs, which is what motivated me to try more benchmarking, out of concern for memory consumption; but then I got on the benchmarking path and didn't finish going through the automata docs and the code here.

So, all that to say: I'm enthusiastic about this change in general, and would love to do what I can to drive it forward in terms of testing and reviewing, but it'll likely be a couple of weeks until I get sufficient reprieve from my current busyness to devote time to it again.
Somewhat related is https://github.com/orgs/lalrpop/discussions/809
Yes, good point. I had vaguely recalled that post, but forgot to go dig it up when posting. I very much agree with the sentiment there, and we should probably work on that as part (or all) of "improve our benchmarking". I'll take a deeper look when I set aside lalrpop time in a week or two.

The thing that a generic benchmark solution like that lacks is testing the workloads of lalrpop's actual users. I don't think it would take a ton of effort to set up a repo that automates fetching a handful of public repos that use lalrpop and running their benchmarks to compare two pinned lalrpop commits. That doesn't isolate parser performance, of course, unless one of the projects exposes a parser-specific benchmark, but it would be nice to detect ahead of time whether a proposed lalrpop change is likely to noticeably regress actual users.
FYI, I've gotten lalrpop benchmarking merged in https://github.com/rosetta-rs/parse-rosetta-rs. That's certainly not the complete answer for our performance-testing story, but it's a good step. I'll aim to get some time tomorrow to run head-to-head benchmarks with/without the perf part of this PR and see how that affects our results. After that, I'd like to do some benchmarking on various other workloads and give this a thorough review. Thanks for your patience.
Based on one benchmark run, it looks to me like this change cuts our runtime in half on the json parsing test in rosetta. Very good news. I'll aim to do more benchmarking and a code review later this week.
Okay, I've read through the code here, and I think I somewhat have my head around the new implementation, as well as the original regex-crate-based implementation. And the new implementation looks fine. I'm still a little fuzzy on why this is faster. Is the main win that, in the original implementation, regex::find() essentially has to start from the beginning at each search when looking for a longer match, while with the dfa::next_state() approach we can basically browse the input string once?

I'd also like to make sure I understand all the compile-time angles here. I think I understand that both the original regex crate implementation and the final hybrid one, using LazyDFA (at least in the common case), don't need much overhead for initial compilation. The dfa::dense approach initially tried adds pre-compilation into MatcherBuilder::new(), which slows down that function. The reason this was so bad for Cascade is that Cascade (probably wrongly) assumes that FooParser::new() is cheap and calls it in a loop, which caused a bunch of recompilation. Arguably, that's a Cascade performance bug rather than anything about this implementation. But in general, the dfa::dense approach should make next() faster at the expense of new() by precompiling; implementations with a low next()-to-new() ratio (e.g., small input) would see degradation in this case.

I'm still confused by this statement though:

> […]

As far as I can tell, both the regex implementation and the hybrid automata approach will typically use LazyDFA under the hood, doing minimal setup in new() at the trade-off of a longer next(). There are a number of "compile times" in parser-generator land, so I'm not sure if you mean "regex compile time", "parser generation time" (i.e., build.rs), or "user compile time" (i.e., compiling a generated parser). Regardless of which you mean, I'm missing how the latest implementation changes that. If you could help me understand, that would be super helpful. The benchmarking does seem to show an improvement, and the proof is in the pudding in that sense, but I would feel more comfortable that we're not missing some horrendous corner cases if I felt better about understanding what was wrong in the initial implementation and how this differs.

Given the above, I'd be curious to fix the Cascade performance bug of calling new() in a loop and rerun that benchmark against all three cases. I did start looking at more extensive automated benchmarking last week and got a decent start on the harness, but still have a ways to go. I'll see about making more progress there ASAP.
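To make the "start from the beginning at each search" pattern concrete, here is a hedged sketch (not lalrpop's actual code) of longest-match token selection over per-token regexes with the plain regex crate; `longest_token_at` and the two patterns are invented for illustration:

```rust
use regex::Regex;

// Finding the longest token at `pos` with the plain `regex` API means
// running every token's regex from scratch at that position.
fn longest_token_at(token_res: &[Regex], text: &str, pos: usize) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize)> = None; // (token index, match end)
    for (idx, re) in token_res.iter().enumerate() {
        // Each `find_at` call restarts a search over the remaining
        // input, re-reading bytes the previous patterns just read.
        if let Some(m) = re.find_at(text, pos) {
            if m.start() == pos && best.map_or(true, |(_, end)| m.end() > end) {
                best = Some((idx, m.end()));
            }
        }
    }
    best
}

fn main() {
    let tokens = [
        Regex::new(r"[a-z]+").unwrap(),
        Regex::new(r"[0-9]+").unwrap(),
    ];
    // Token 0 ("[a-z]+") wins with the longest match, ending at 3.
    assert_eq!(longest_token_at(&tokens, "abc123", 0), Some((0, 3)));
}
```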
In the old implementation (before this PR), every token is first matched by […]
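For contrast, here is a minimal sketch of the single-pass style this PR moves to, assuming regex-automata 0.3's hybrid (lazy DFA) API; the patterns and token kinds are again invented:

```rust
use regex_automata::{hybrid::dfa::DFA, Anchored, Input};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One lazy DFA over all token patterns; the pattern index doubles
    // as the token kind. Both patterns here are placeholders.
    let dfa = DFA::new_many(&[r"[a-z]+", r"[0-9]+"])?;
    let mut cache = dfa.create_cache();

    let haystack = b"abc123";
    let input = Input::new(haystack).anchored(Anchored::Yes);
    let mut state = dfa.start_state_forward(&mut cache, &input)?;

    // Walk the DFA over the input exactly once, remembering the last
    // match seen. Match states are delayed by one byte, so a match
    // flagged after consuming haystack[i] ends at offset i.
    let mut last_match = None;
    for (i, &b) in haystack.iter().enumerate() {
        state = dfa.next_state(&mut cache, state, b)?;
        if state.is_match() {
            last_match = Some((dfa.match_pattern(&cache, state, 0), i));
        } else if state.is_dead() {
            break;
        }
    }
    if !state.is_dead() {
        // Feed the end-of-input sentinel to flush a match at the end.
        state = dfa.next_eoi_state(&mut cache, state)?;
        if state.is_match() {
            last_match = Some((dfa.match_pattern(&cache, state, 0), haystack.len()));
        }
    }

    // Longest token at position 0: pattern 0 ("[a-z]+") ending at 3.
    println!("{last_match:?}");
    Ok(())
}
```

Because the DFA encodes all token patterns at once, a single walk over the input yields the longest match and its pattern (token kind) together, with no per-pattern rescanning.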
True. But my […]
I mean "user compile time". And this PR doesn't change that. So I used the word "also". |
I just gave a very quick look to the code and tried to grok what's going on at a high level, but FWIW I think this whole PR and the convergence toward the hybrid approach make a lot of sense. It sounds like the hybrid approach should not have corner cases which are obvious pessimizations (such as […]). Modulo code review and technical considerations, my half-educated vote would be to accept the approach and the changes proposed here.
I reran my Cascade benchmark after fixing the performance bug I mentioned above, comparing this branch with just the skip rule commit against the latest commit on the branch, and it's a solid performance improvement in that case:

Full system compile time: [4.2422 ms 4.3042 ms 4.3713 ms]
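For context, the bracketed triple is criterion's [lower-bound estimate upper-bound] report. Below is a sketch of how such a benchmark is typically defined; `compile_system` is a hypothetical stand-in for Cascade's real compilation entry point:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical stand-in: the shape of the measurement is the point,
// not the work being measured.
fn compile_system(src: &str) -> usize {
    src.split_whitespace().count()
}

fn bench_full_compile(c: &mut Criterion) {
    // Build expensive state once, outside the measured closure, to
    // avoid re-measuring new() in a loop -- the bug mentioned above.
    let src = "allow staff read logs;".repeat(1_000);
    c.bench_function("Full system compile time", |b| {
        b.iter(|| compile_system(black_box(&src)))
    });
}

criterion_group!(benches, bench_full_compile);
criterion_main!(benches);
```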
Ah, yes, I see this now, and this makes a ton of sense. Thanks for the explanation.
Ah, I see now what you mean. I had been interpreting "also" to mean "in addition to the other problems with the original implementation" rather than "like the most recent version".
I agree. At this point, we have three benchmarks on this change (parse-rosetta, Cascade, twitter-json). It bothers me slightly that two of those are JSON parsing, so we may not be representing very diverse performance cases, but I don't think that's a good reason to hold this up any longer. The approach makes sense and should be an across-the-board improvement.
This is, to some extent, a fair point, although I do think that if we published a new release that tanked the performance of the built-in lexer in some use case, then "you can write a custom lexer" isn't a very user-friendly response. The bigger thing for me is that I don't think this PR is likely to tank performance in weird corner cases. If it somehow does, we can treat it as a performance bug and investigate and fix it then.
I did a pretty thorough technical review yesterday. I'm no regex-automata expert, but I dug through the docs, and the code here makes sense. I'll go ahead and mark this as approved from me and resolve the merge conflicts. If I don't hear objections from another maintainer, I'll plan on merging it sometime over the weekend.
I've resolved the conflicts via GitHub's merge interface, so there's a merge commit from me on the HEAD of quarticcat/master now. The only merge issues were that the is-terminal PR updated the Rust version past where this PR did, and that the skip rule merged independently, with the remaining commits then modifying it to remove the anchors.
Sounds good to me! Hats off to @QuarticCat for the high-quality contribution and to @dburgener for having put quite a lot of time and energy into reviewing and making sure we understand the tradeoffs involved very precisely 🙂
Merged! Thanks @QuarticCat for the contribution!
regex_syntax::parse() converts our regex strings into the regex's HIR; part of that involves unpacking various metacharacters into lists of symbols. In many cases, this expansion differs depending on whether it should expand into Unicode or not. Prior to #814, we were still outputting Unicode regexes unconditionally, but regex internals seem to have been compiling them away and avoiding errors. The switch in #814 caused these to result in real errors. Follow-up work will be needed to determine why existing tests didn't detect this.
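A small sketch of the expansion difference this message describes, assuming regex-syntax 0.7's ParserBuilder; the pattern is chosen only to show the Unicode/ASCII split:

```rust
use regex_syntax::ParserBuilder;

fn main() {
    // With Unicode on (the default), `\w` unpacks into a large set of
    // Unicode ranges in the HIR.
    let unicode = ParserBuilder::new().build().parse(r"\w").unwrap();
    // With Unicode off, the same `\w` unpacks into ASCII ranges only,
    // so the two HIRs differ.
    let ascii = ParserBuilder::new()
        .unicode(false)
        .build()
        .parse(r"\w")
        .unwrap();
    assert_ne!(unicode, ascii);
}
```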
I'm a developer of paguroidea. After we benchmarked several parsers, we noticed that lalrpop was abnormally slow, so I decided to dig up the reason. It turned out to be a problem in the lexer. This PR aims to improve lexing performance, making it a bit more decent.
First commit

- Change the default `skip` rule from `\s*` to `\s+`. If the match is empty, why do we bother to skip it? This change boosted throughput from 4.5 MBps to 5.5 MBps in our JSON benchmark.
- Allow rewriting the `skip` rule when the "unicode" feature is disabled without duplicating all the other lexer rules. Purely an ergonomic improvement.

Second commit

- Rewrite the lexer using `regex-automata`, which gives us more control to achieve better performance. This change boosted throughput from 5.5 MBps to 138 MBps in our JSON benchmark.

The benchmark code can be found here: https://github.com/SchrodingerZhu/paguroidea/tree/main/benches/json. I tested it under the `qc-temp` branch. Here are some of the final results: […]

EDIT: In our benchmark, we didn't turn off the unicode feature, although it's not required. This is because, due to lalrpop limitations, if we disable the unicode feature, we have to rewrite the skip rule. And if we rewrite the skip rule, we have to duplicate all other lexer rules. The first commit solves this problem.
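To illustrate the first commit's `\s*` to `\s+` change, here is a small sketch using the regex crate; the anchored patterns are invented to mimic a lexer's skip rule applied at the current position:

```rust
use regex::Regex;

fn main() {
    let star = Regex::new(r"^\s*").unwrap();
    let plus = Regex::new(r"^\s+").unwrap();
    // `\s*` succeeds with a zero-length match even when there is no
    // whitespace at all, so a lexer that applies the skip rule before
    // every token pays for a useless empty match each time.
    assert_eq!(star.find("abc").map(|m| m.range()), Some(0..0));
    // `\s+` simply fails, letting the lexer move straight to the token.
    assert!(plus.find("abc").is_none());
}
```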