
TokenStream manipulations are 1000x too slow #65080

Closed
dtolnay opened this issue Oct 4, 2019 · 15 comments
Labels
A-macros Area: All kinds of macros (custom derive, macro_rules!, proc macros, ..) C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@dtolnay
Member

dtolnay commented Oct 4, 2019

Context: illicitonion/num_enum#14
Switching a proc macro from being token-based to operating on strings with just a final conversion from string to TokenStream can be a 100x improvement in compile time. If we care that people continue to write macros using tokens, the performance needs to be better.

I minimized the slow part of the num_enum macro to this benchmark:
https://github.com/alexcrichton/proc-macro2/tree/12bac84dd8d090d2987a57b747c7ae7bbeb8a3d0/benches/bench-libproc-macro
On my machine the string implementation takes 8ms and the token implementation takes 25721ms.

I know that there is a proc macro server that these calls end up talking to, but I wouldn't expect this huge of a factor from that. If the server calls are the only thing making this slow, is there maybe a way we could buffer operations in memory to defer and batch the server work?
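The buffering idea can be sketched in ordinary Rust. This is a hedged illustration only: `Op`, `BatchedStream`, and the `String` "server state" are hypothetical stand-ins, not the real proc-macro bridge API.

```rust
// Hypothetical sketch: buffer token operations client-side and apply them to
// the "server" in one flush, instead of one server call per appended token.
enum Op {
    Push(char),
}

struct BatchedStream {
    pending: Vec<Op>,   // operations not yet sent to the "server"
    committed: String,  // stands in for the server-side stream state
    round_trips: usize, // how many "server calls" we made
}

impl BatchedStream {
    fn new() -> Self {
        BatchedStream { pending: Vec::new(), committed: String::new(), round_trips: 0 }
    }

    // Appending only records the operation locally: no server call.
    fn push(&mut self, c: char) {
        self.pending.push(Op::Push(c));
    }

    // One "server call" applies every buffered operation at once.
    fn flush(&mut self) {
        self.round_trips += 1;
        for op in self.pending.drain(..) {
            match op {
                Op::Push(c) => self.committed.push(c),
            }
        }
    }
}

fn main() {
    let mut s = BatchedStream::new();
    for c in "struct Foo;".chars() {
        s.push(c);
    }
    s.flush();
    assert_eq!(s.committed, "struct Foo;");
    assert_eq!(s.round_trips, 1); // 11 appends, a single batched call
}
```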

I will file issues in proc-macro2 and quote as well to see if anything can be improved on their end.

FYI @eddyb @petrochenkov @alexcrichton

@dtolnay dtolnay added I-compiletime Issue: Problems and improvements with respect to compile times. A-macros Area: All kinds of macros (custom derive, macro_rules!, proc macros, ..) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 4, 2019
@Mark-Simulacrum
Member

From a brief analysis, this appears (AFAICT) to be because the code allocates a fresh vector for the entire tokenstream each time we append one token. I believe that is because our Extend<TokenTree> for TokenStream implementation maps the token tree to a token stream with two elements: tree, and NonJoint, presumably called from this From<TokenTree> impl.
Ultimately, the root cause boils down to this piece of code:

```rust
#[stable(feature = "token_stream_extend", since = "1.30.0")]
impl Extend<TokenTree> for TokenStream {
    fn extend<I: IntoIterator<Item = TokenTree>>(&mut self, trees: I) {
        self.extend(trees.into_iter().map(TokenStream::from));
    }
}

#[stable(feature = "token_stream_extend", since = "1.30.0")]
impl Extend<TokenStream> for TokenStream {
    fn extend<I: IntoIterator<Item = TokenStream>>(&mut self, streams: I) {
        // FIXME(eddyb) Use an optimized implementation if/when possible.
        *self = iter::once(mem::replace(self, Self::new())).chain(streams).collect();
    }
}
```

That has a FIXME from eddyb that this is super inefficient, and rightfully so. We are essentially always creating a TokenStreamBuilder, pushing on ~two streams, one with lots of elements and the other with two (token from original extend, as well as non joint). Having done that, we call TokenStreamBuilder::build, which calls TokenStream::from_streams with two streams, which allocates the vector for the contents of both streams, and we have a new tokenstream.

I believe this means that in order to append N tokens we will create N vectors of 1, 2, 3, 4, .. N elements (or so, possibly modulo a constant factor of some kind). If N is large, this is really slow.
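The cost described above can be modeled with a few lines of plain Rust. `Tree` and `Stream` are simplified stand-ins, not rustc's actual types; the point is only that rebuilding the vector on every append makes the total element-copy count 0 + 1 + ... + (N - 1) = N(N - 1)/2, i.e. quadratic in N.

```rust
// A minimal model of the append pattern described above.
#[derive(Clone)]
struct Tree(u32);

#[derive(Default)]
struct Stream(Vec<Tree>);

impl Stream {
    // Mirrors the old extend path: build a brand-new vector per appended token.
    fn append(&self, tree: Tree) -> (Stream, usize) {
        let copied = self.0.len(); // elements copied by this one append
        let mut v = self.0.clone();
        v.push(tree);
        (Stream(v), copied)
    }
}

fn main() {
    let n: usize = 1000;
    let mut stream = Stream::default();
    let mut total_copied = 0;
    for i in 0..n {
        let (next, copied) = stream.append(Tree(i as u32));
        total_copied += copied;
        stream = next;
    }
    assert_eq!(total_copied, n * (n - 1) / 2); // 499_500 copies: quadratic growth
}
```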

cc @pnkfelix as well, as they did some investigating into a similar issue in #57735 (comment).


I am not sure what we can do about this -- ideally, we'd not be re-creating new tokenstreams on every appended token, but TokenStream seems to have a structure that isn't really amenable to us appending tokens into it (despite the name). Maybe we can have proc macro keep a TokenStreamBuilder instead of a TokenStream around, which would let us append a bunch of tiny tokenstreams together before collating them all, but that also seems non-ideal...

@Mark-Simulacrum
Member

cc @nnethercote as well

@eddyb
Member

eddyb commented Oct 4, 2019

I remember the FIXME but I also remember this not being that big of a problem for some reason. I guess if the first impl wasn't ending up using the second impl, it would be faster?

TokenStreamBuilder could totally take individual TokenTrees as well, AFAIK, and you could at least keep one builder for the entire extend (but if it's called with iter::once(tt) that would require even more optimization work).

Maybe the main problem here is that there is no API for creating a builder out of a TokenStream in CoW fashion?

Sadly the original vision of using ropes falls apart when you can't even slice a TokenStream.

@eddyb
Member

eddyb commented Oct 4, 2019

> Maybe we can have proc macro keep a TokenStreamBuilder instead of a TokenStream around, which would let us append a bunch of tiny tokenstreams together before collating them all, but that also seems non-ideal.

One nice way of doing this (if all else fails) is representing TokenStream on the server side (for proc macros not rustc in general) as TokenStream | TokenStreamBuilder and finishing the builder when any of the TokenStream methods are used which we can't easily implement on the builder itself.
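That "stream or builder" representation can be sketched as a two-variant enum. This is a hedged illustration of the idea only: `ServerStream` and its `u32` "tokens" are invented stand-ins, not rustc's proc-macro server types.

```rust
// Sketch: the server-side value is either a finished stream or a pending
// builder; the builder is only "finished" when a method needs the flat stream.
enum ServerStream {
    Finished(Vec<u32>),
    Building(Vec<Vec<u32>>), // appended sub-streams, concatenated lazily
}

impl ServerStream {
    // Appending never concatenates: it just queues another sub-stream.
    fn extend(&mut self, chunk: Vec<u32>) {
        match self {
            ServerStream::Building(parts) => parts.push(chunk),
            ServerStream::Finished(done) => {
                let done = std::mem::take(done);
                *self = ServerStream::Building(vec![done, chunk]);
            }
        }
    }

    // Any operation needing the flat stream pays the concatenation cost once.
    fn finish(&mut self) -> &[u32] {
        if let ServerStream::Building(parts) = self {
            let flat: Vec<u32> = parts.drain(..).flatten().collect();
            *self = ServerStream::Finished(flat);
        }
        match self {
            ServerStream::Finished(v) => v.as_slice(),
            ServerStream::Building(_) => unreachable!(),
        }
    }
}

fn main() {
    let mut s = ServerStream::Finished(vec![1, 2]);
    s.extend(vec![3]);
    s.extend(vec![4, 5]);
    assert_eq!(s.finish().to_vec(), vec![1, 2, 3, 4, 5]);
}
```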

@jonas-schievink jonas-schievink added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Oct 4, 2019
@nnethercote
Contributor

@matklad: this might affect your approach in #64782.

@nnethercote
Contributor

> our Extend<TokenTree> for TokenStream implementation maps the token tree to a token stream with two elements: tree, and NonJoint

I think you misread that code. It's a token stream with a single element; that element is a 2-tuple.
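The distinction can be made concrete with a small sketch. `TokenTree` here is a stand-in, but the (tree, joint-ness) pairing mirrors the rustc definition under discussion.

```rust
// Sketch of the shape in question: one stream element is a single 2-tuple,
// not two separate elements.
#[derive(Clone, Debug, PartialEq)]
enum IsJoint {
    Joint,
    NonJoint,
}

#[derive(Clone, Debug, PartialEq)]
struct TokenTree(char); // stand-in for the real token tree

type TreeAndJoint = (TokenTree, IsJoint);

fn main() {
    // Converting one token tree produces a one-element stream...
    let stream: Vec<TreeAndJoint> = vec![(TokenTree('x'), IsJoint::NonJoint)];
    assert_eq!(stream.len(), 1); // ...whose single element carries both parts.
}
```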

> we call TokenStreamBuilder::build, which calls TokenStream::from_streams with two streams, which allocates the vector for the contents of both streams, and we have a new tokenstream.

I have confirmed (via profiling and println! insertion) that this part is true. In the relevant part of execution, the second stream passed to TokenStream::from_streams always has a single element, while the first stream is one larger than the previous call.

@nnethercote
Contributor

I tried changing TokenStream from this:

```rust
pub struct TokenStream(Option<Lrc<Vec<TreeAndJoint>>>);
```

to this:

```rust
pub struct TokenStream(Vec<TreeAndJoint>);
```

This drastically sped up the microbenchmark under discussion, because new elements can be trivially appended to an existing TokenStream, rather than having to duplicate before appending.

However, it was a big slowdown for benchmarks. Most of them slowed down 2-20%, but script-servo and style-servo were both almost 5x worse. This is because the second representation is much more expensive to clone, something which happens frequently and from many places.
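The tradeoff is easy to demonstrate with `std::rc::Rc` (rustc's `Lrc` is the thread-safe analogue): cloning a refcounted handle bumps a counter in O(1), while cloning a bare `Vec` deep-copies every element in O(n).

```rust
use std::rc::Rc;

// Why the plain-Vec representation regressed clone-heavy workloads.
fn main() {
    let shared: Rc<Vec<u32>> = Rc::new((0..100_000).collect());
    let shared_clone = Rc::clone(&shared);
    // Both handles point at the same heap allocation: the clone copied nothing.
    assert!(Rc::ptr_eq(&shared, &shared_clone));
    assert_eq!(Rc::strong_count(&shared), 2);

    let owned: Vec<u32> = (0..100_000).collect();
    let owned_clone = owned.clone();
    // Distinct allocations: all 100_000 elements were copied.
    assert_ne!(owned.as_ptr(), owned_clone.as_ptr());
    assert_eq!(owned, owned_clone);
}
```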

@alexcrichton
Member

@nnethercote I'm not entirely sure where this bottoms out so this may be a distracting comment, but would Rc::make_mut and such make those two representations "effectively equivalent" for the purposes of the benchmark here? That should still give us cheap clones and pushing a bunch of stuff onto one clone should still be quite cheap since it doesn't have to reallocate.
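The `make_mut` behavior being proposed looks like this in miniature (using `Rc` where rustc uses `Lrc`): clones stay O(1), appends mutate in place while the handle is unique, and only a live clone forces a single copy-on-write.

```rust
use std::rc::Rc;

// Sketch of the Rc::make_mut suggestion: cheap clones plus in-place appends.
fn main() {
    let mut stream: Rc<Vec<u32>> = Rc::new(vec![1, 2, 3]);
    let unique_alloc = Rc::as_ptr(&stream);
    Rc::make_mut(&mut stream).push(4); // unique handle: mutated in place
    assert_eq!(Rc::as_ptr(&stream), unique_alloc);

    let snapshot = Rc::clone(&stream); // cheap clone; stream is now shared
    Rc::make_mut(&mut stream).push(5); // shared: one copy, then mutate
    assert!(!Rc::ptr_eq(&stream, &snapshot));
    assert_eq!(*snapshot, vec![1, 2, 3, 4]); // the old handle is untouched
    assert_eq!(*stream, vec![1, 2, 3, 4, 5]);
}
```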

@nnethercote
Contributor

@alexcrichton: Indeed, I had the same thought late last night! I have used make_mut successfully in TokenStream::from_streams to fix half of the quadratic behaviour. I am now working at using it in TokenStreamBuilder::push.

@nnethercote
Contributor

nnethercote commented Oct 8, 2019

#65198 fixes this so that the TokenStream code is roughly 10x slower than the string code, and the slowdown is consistent for different values of N (i.e. no more quadratic behaviour). There is room for additional improvement, but this is a good start.

It would make sense to add a benchmark to rustc-perf for this kind of code. The given microbenchmark is a possibility, but I'd prefer to use real-world code that demonstrated the problem, if possible.

bors added a commit that referenced this issue Oct 8, 2019
Speed up `TokenStream` concatenation

This PR fixes the quadratic behaviour identified in #65080.

r? @Mark-Simulacrum
@Mark-Simulacrum
Member

I'd personally prefer the microbenchmark as it's less likely to stop building in the future in some sense :)

But that PR looks excellent. Do we know if the remaining 10x cost is for some particular reason, or are strings just that much faster than the token code? (Lrc, etc.)

@nnethercote
Contributor

The 10x is because there's a lot of faffing around: a couple of once iterators, a chain iterator, a conversion from a TokenTree to a TokenStream, creating a TokenStreamBuilder, pushing onto it, Lrc, etc.

bors added a commit that referenced this issue Oct 9, 2019
Speed up `TokenStream` concatenation

This PR fixes the quadratic behaviour identified in #65080.

r? @Mark-Simulacrum
@eddyb
Member

eddyb commented Oct 9, 2019

Oh I thought the implementation must be using Lrc<[TreeAndJoint]>, I'm really glad to be wrong and that this is getting fixed 🎉

There's likely no benefit in representing a proc_macro::TokenStream as a builder server-side but we could probably add a way to extend a proc_macro::TokenStream with a proc_macro::TokenTree.

@nnethercote
Contributor

@dtolnay: is the improved performance good enough for your purposes?

@dtolnay
Member Author

dtolnay commented Oct 11, 2019

I confirmed that the performance is massively improved in nightly-2019-10-10. Thanks!

FYI @illicitonion @anlumo
