Tiny Vecs are dumb. #72227
Some local instruction count results, for
@bors try @rust-timer queue |
Awaiting bors try build completion |
⌛ Trying commit dac8c474a4a0cc81ba4936bf0d55841c79e1fb0d with merge 291f8c65f12a6ed4401e0c6cb477f3429c32b9ac... |
I wonder if we can do the same thing for |
In Python even small associative arrays are used often; in Rust I think they are considerably less common (for various reasons, one being the lack of a handy hashmap literal syntax). |
☀️ Try build successful - checks-azure |
Queued 291f8c65f12a6ed4401e0c6cb477f3429c32b9ac with parent 0271499, future comparison URL. |
I thought the same. It turns out that used to be the case until #50739. We will definitely need a regression test for this. |
Finished benchmarking try commit 291f8c65f12a6ed4401e0c6cb477f3429c32b9ac, comparison URL. |
max-rss shows a regression in the results, but I'm not sure how reliable that indicator is or how much we care about it. |
I wonder how much more speedup we can get by skipping straight to 8 elements, but I am hesitant since we don't seem to have a good way of tracking memory usage. |
I'm not sure how representative the rustc results really are, since |
@Amanieu: The bad news is that max-rss is highly unreliable, alas. E.g. look at this noise run, which compares two revisions with no significant differences. The good news is that I can get accurate peak heap memory measurements with DHAT. I can do some measurements with skip-to-4 and skip-to-8 on Monday. |
It's true that rustc uses
For vecs with four or more elements, the number of allocations in the |
Having said that, I can see that rustc's heavy use of But in general I'm not worried about memory usage. |
Fair point, but there are also factors pulling in the opposite direction. For Vecs with 1 element (also a common case in some parts of rustc), you don't save any reallocations. Even with 2 elements, you "only" save one reallocation instead of two, so depending on the distribution of Vec sizes in a program you'll see a proportionally smaller speed-up (imagine an application spending about as much time on reallocating Vecs as rustc does today, but more of it on tiny Vecs). And of course, the impact on memory usage is not entirely separate from performance (e.g. in a typical size-class-based allocator, Vecs with 1-2 small elements may waste half of each cache line that could hold other useful data). None of this is to say I think this is a bad change overall, it's just not quite so obvious to me how the effects on rustc will translate to other code bases and I really wish we had a good benchmark suite for runtime (not just compile time) of Rust applications. |
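The point about how many reallocations are saved at each final length can be checked with a small model. This is only a sketch: `allocations_for_len` is a hypothetical helper, and it assumes pure one-at-a-time pushes with a doubling rule and a configurable minimum non-zero capacity.

```rust
/// Hypothetical helper: counts how many times the buffer is (re)allocated
/// while pushing `final_len` elements one at a time into an empty vector.
/// Assumes capacity doubles on growth and never drops below `min_non_zero_cap`.
fn allocations_for_len(final_len: usize, min_non_zero_cap: usize) -> usize {
    let mut cap: usize = 0;
    let mut allocations = 0;
    for len in 0..final_len {
        if len == cap {
            // The buffer is full: grow it before storing the next element.
            cap = (cap * 2).max(len + 1).max(min_non_zero_cap);
            allocations += 1;
        }
    }
    allocations
}
```

Under this model a 1-element vector needs one allocation with either minimum, a 2-element vector goes from two allocations to one, and a 4-element vector goes from three to one, which matches the proportions described in the comment.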
Would it be worth making the size increase dependent on the size of the underlying object? In particular, if it is very small (1 or 2 bytes), you probably lose nothing at all starting at an 8 or 16 byte allocation (depending on memory manager). |
@ChrisJefferson: that's a good idea. And if the elements are really big (256B? 1024B?) we could ratchet down the minimum capacity. I will do some measurements. |
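A minimal sketch of a size-dependent minimum capacity, for illustration. The function name `min_non_zero_cap` is hypothetical, and the thresholds (8 for single-byte elements, 4 up to 1024 bytes, 1 above that) are the values mentioned elsewhere in this thread rather than something this comment settles on.

```rust
use std::mem::size_of;

/// Hypothetical helper: picks the smallest non-zero capacity to allocate,
/// based on the element size. Thresholds are taken from the discussion
/// in this thread and are assumptions, not a guaranteed `Vec` behavior.
fn min_non_zero_cap<T>() -> usize {
    let elem_size = size_of::<T>();
    if elem_size == 1 {
        // Single-byte elements: 8 bytes is about the smallest useful allocation.
        8
    } else if elem_size <= 1024 {
        // Small and medium elements: skip capacities 1 and 2.
        4
    } else {
        // Very large elements: do not over-allocate.
        1
    }
}
```

For example, `min_non_zero_cap::<u8>()` gives 8 and `min_non_zero_cap::<[u8; 2048]>()` gives 1. Zero-sized types are not modelled here.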
Here are the results for four different minimum non-zero sizes: 1 (original), 2, 4, and 8.
Tiny2 is a clear improvement. Tiny4 is roughly 2x better than Tiny2 on all except
Tiny2 gives a tiny increase, Tiny4 a little bigger, and Tiny8 much more. Things to note:
Tiny4 looks like the sweet spot, giving near-maximal speed improvements for a modest peak memory cost. I have uploaded code containing Tiny4 with slight modifications:
I measured that too and it gave results incredibly similar to Tiny4, but it could help with some cases that might show up sometimes. |
r? @Amanieu |
Currently, if you repeatedly push to an empty vector, the capacity growth sequence is 0, 1, 2, 4, 8, 16, etc. This commit changes the relevant code (the "amortized" growth strategy) to skip 1 and 2 in most cases, instead using 0, 4, 8, 16, etc. (You can still get a capacity of 1 or 2 using the "exact" growth strategy, e.g. via `reserve_exact()`.) This idea (along with the phrase "tiny Vecs are dumb") comes from the "doubling" growth strategy that was removed from `RawVec` in rust-lang#72013. That strategy was barely ever used -- only when a `VecDeque` was grown, oddly enough -- which is why it was removed in rust-lang#72013. (Fun fact: until just a few days ago, I thought the "doubling" strategy was used for repeated push case. In other words, this commit makes `Vec`s behave the way I always thought they behaved.) This change reduces the number of allocations done by rustc itself by 10% or more. It speeds up rustc, and will also speed up any other Rust program that uses `Vec`s a lot.
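The two capacity sequences in the commit message can be reproduced with a small model. This is a sketch only: `next_capacity` and `push_growth_sequence` are hypothetical names, and the model is a simplification, not the actual `RawVec` code.

```rust
/// Hypothetical model of the "amortized" growth rule: double the current
/// capacity, but never go below what is required or below the minimum
/// non-zero capacity.
fn next_capacity(current_cap: usize, required_cap: usize, min_non_zero_cap: usize) -> usize {
    (current_cap * 2).max(required_cap).max(min_non_zero_cap)
}

/// Capacities observed while pushing `pushes` elements one at a time,
/// starting from an empty vector with capacity 0.
fn push_growth_sequence(min_non_zero_cap: usize, pushes: usize) -> Vec<usize> {
    let mut caps = vec![0];
    let mut cap: usize = 0;
    let mut len: usize = 0;
    for _ in 0..pushes {
        if len == cap {
            // Full: grow to hold at least one more element.
            cap = next_capacity(cap, len + 1, min_non_zero_cap);
            caps.push(cap);
        }
        len += 1;
    }
    caps
}
```

With a minimum of 1 and 16 pushes, the model yields 0, 1, 2, 4, 8, 16; with a minimum of 4 it yields 0, 4, 8, 16.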
With regard to the explicit check for size 1, it looks to me like when you allocate memory in Rust you get back the "actually allocated" amount of space -- is that right? In that case, assuming that malloc never returns less than 8 bytes of space, there is no need for an explicit check. However, there seem to be various different memory interfaces with different properties in this area. (I was trying to do this myself, but got stuck trying to add debugging output inside liballoc, sorry) |
That's a good question! The short answer is "no", and for the purposes of this PR, it means that choosing a capacity of 8 for single-byte elements is reasonable, and so I don't need to make any changes to this PR's code. The long answer is murkier.
Next, the implementation.
In conclusion: There is also It's a shame that there doesn't seem to be a way to accurately get the actual size of an allocation with @Amanieu: have I got all that right? Any additional thoughts? |
AFAIK this is mostly historic -- After all, it took a while to devise |
…owup, r=Amanieu Adjust the zero check in `RawVec::grow`. This was supposed to land as part of rust-lang#72227. (I wish `git push` would abort when you have uncommitted changes.) r? @Amanieu
I'd like to chime in and mention that this does indeed increase memory usage in practice, not just in theory. In the past, Rust had exactly this behavior, and when we worked on integrating Servo's CSS engine into Gecko (the Stylo project), we found it to be one important reason (among several others) for much larger (actually double) memory consumption in Stylo than in Gecko with Gmail. See Gecko bug 1392314. The reason is that Gmail had lots of In my experience, it is actually very common, at least in browser development, to have a Vec which the majority of the time holds zero or one item, in very rare cases goes to two, and only in some extreme cases goes even further. In that kind of scenario, it is probably tolerable for the item size to be large, as you are unlikely to reallocate a lot. If WDYT? |
This patch should help with that problem, as it only over-allocates if the objects allocated are not too big. I do wonder if the point where objects are considered "large" (1024 bytes) could be too large? Something much smaller (32 or even 16 bytes) might reduce memory wastage? |
That sounds like a perfect scenario for SmallVec/TinyVec TBH. Although https://doc.rust-lang.org/std/vec/struct.Vec.html#capacity-and-reallocation could be slightly expanded for really small vectors. |
Good point. I didn't notice that.
Based on the analysis in the related bug, each item was about 192 bytes, so having to allocate 4 makes it 768 bytes, which the allocator rounds up to 1k, and there are ~12k such With no overallocating, 192 bytes would take exactly 192 bytes (because of the allocator strategy), costing only ~2MB.
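The arithmetic behind these numbers can be sketched as follows. The rounding rule in `rounded_allocation_size` is a hypothetical stand-in for the allocator's size classes, chosen only so that it matches the two data points in the comment (192 bytes stays 192, 768 bytes becomes 1024). It is not a description of any real allocator.

```rust
/// Hypothetical size-class rounding (an assumption for illustration):
/// requests up to 512 bytes round up to a multiple of 16, larger
/// requests round up to the next power of two.
fn rounded_allocation_size(request: usize) -> usize {
    if request <= 512 {
        (request + 15) / 16 * 16
    } else {
        request.next_power_of_two()
    }
}

/// Total heap bytes used by `vec_count` vectors, each holding a buffer
/// of `capacity` elements of `elem_size` bytes.
fn total_bytes(elem_size: usize, capacity: usize, vec_count: usize) -> usize {
    rounded_allocation_size(elem_size * capacity) * vec_count
}
```

With 12,000 vectors of 192-byte items, a capacity of 4 costs 12,288,000 bytes under this model, while a capacity of 1 costs 2,304,000 bytes.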
Not necessarily. SmallVec/TinyVec in that case would need extra space on the stack or in another struct, which can lead to extra memory copies and further bloat the structs holding it. This is especially a problem when the most common case is zero items. |
I will repeat my comment from above: "Vec makes no promises about capacities. If you have a program where the capacity of 1 and 2 length Vec's has a critical impact on memory usage (e.g. due to having many short vectors and/or short vectors with very large elements) then you should use a collection type that provides clear guarantees about capacity." |
Note that this change doesn't just affect dumb E.g. this new behavior could be limited to cases where |
Facebook did some research for their FBVector and picked 1.5 growth factor. They explicitly moved away from 2x growth because of cache unfriendliness. I know next to nothing about memory allocation in rust, but thought it worth mentioning in case it is applicable here. |
The argument for 1.5x is about memory reuse, rather than cache friendliness. It's a silly argument, IMO. That document says this:
The text I have emphasised is false. Modern allocators typically have size classes, which means that allocations of different sizes (e.g. 128 bytes vs 192 bytes) have no chance of being put next to each other. jemalloc, which Facebook uses, is such an allocator. Indeed, the next section of that document then goes on to talk about jemalloc's size classes, without realizing that they invalidate the reasoning in the previous section. |
But it seems like with a growth factor of 1.5, we get a guarantee that at least 66% of the allocated memory is actually used (assuming only pushes), whereas a growth factor of 2 makes that 50%? That could waste a lot of memory. |
Sure, there's a trade-off between frequency of reallocation and potential for unused memory no matter what growth factor you use. It's the "2x is the theoretical worst possible number" argument that I object to. It's false, and it also overlooks the benefits of 2x: 2x is simple, it is the cheapest possible multiplication, and powers of two are most likely to result in allocation sizes that match allocator size classes. |
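The utilization figures in this exchange (at least about 66% for 1.5x, at least 50% for 2x) can be checked with a small simulation. This is a sketch under stated assumptions: pushes only, no size-class rounding, and a hypothetical helper `worst_utilization_after_growth` that takes the growth factor as a fraction `num / den`.

```rust
/// Hypothetical helper: simulates `pushes` one-at-a-time pushes with a
/// growth factor of `num / den` and returns the lowest fraction of the
/// buffer in use immediately after any growth step.
fn worst_utilization_after_growth(num: usize, den: usize, pushes: usize) -> f64 {
    let mut cap: usize = 0;
    let mut worst = 1.0_f64;
    for len in 0..pushes {
        if len == cap {
            // Grow by the factor, but always make room for one more element.
            cap = (cap * num / den).max(len + 1);
            // After this push, `len + 1` slots are in use.
            let used = (len + 1) as f64 / cap as f64;
            worst = worst.min(used);
        }
    }
    worst
}
```

For 1000 pushes, the worst case stays just above one half with a factor of 2 and just above two thirds with a factor of 1.5, so the memory side of the trade-off is as described. The model ignores reallocation cost and allocator size classes.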
Isn't it cache friendly to reuse memory?
I guess it is easy to measure if it has any visible effect on performance and/or memory usage. |
While we can tune this, until the memory allocator can accurately tell us the "true" amount of allocated memory, numbers like 1.5x are also much more likely to lead to entirely wasted memory. On most systems if you allocate (say) 600 bytes, you are probably getting 1024 anyway, and once you get up to page size you get whole pages. (NOTE: That is my belief from reading one memory manager long ago. I haven't actually read carefully the different mallocs which rust uses, which is I suppose proving my point, we don't know what we "actually" get). |
Yes. Here are the size classes from an older version of jemalloc.
The spacings increase more smoothly in recent versions (I can't find an exact list right now) but the general idea still holds. |
You can find the latest size classes from jemalloc's man page. |
Jemalloc is not the default allocator for Rust though. Do you know how the platform allocators that we use by default behave? Seems to make most sense to measure with the default setup. |
@RalfJung it is. At least when building Rust for Linux on its CI:
Manual builds use system allocator. |
@mati865 Most users of |
But perf.rlo can only measure Rust performance. So all "official" measurements are done with jemalloc. |
That is correct. So what? The question was if 2x or 1.5x (or something else) is the better growth factor for |
There is a dedicated issue for the growth strategy of |
Layer::grow relies on reserving exactly as many bytes as specified in the argument. And it apparently has worked as long as the argument was a power of 2, which it was. This has changed for small vectors since: rust-lang/rust#72227 The fix is to use `reserve_exact` instead.
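A short sketch of the difference between the two reservation calls referenced here. The function `capacities_after_reserving_one` is a hypothetical helper. The only property `Vec` documents for either call is that the resulting capacity is at least the requested amount, so the exact numbers returned are not something to rely on.

```rust
/// Hypothetical helper: reserves room for one element in two empty
/// vectors, once with the "exact" strategy and once with the
/// "amortized" strategy, and returns both resulting capacities.
fn capacities_after_reserving_one() -> (usize, usize) {
    let mut exact: Vec<u32> = Vec::new();
    // Requests the minimum capacity for one more element.
    exact.reserve_exact(1);

    let mut amortized: Vec<u32> = Vec::new();
    // May over-allocate to avoid frequent reallocations.
    amortized.reserve(1);

    (exact.capacity(), amortized.capacity())
}
```

Code that needs a specific capacity should use `reserve_exact` (or `with_capacity`) and should still treat the result as a lower bound.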
This micro-optimization makes Vecs harder to reason about, in my opinion. IIUC, given elements of size 1024 bytes, these Vecs will start off life in the memory allocator pools meant for 4*1024 = 4096 bytes. That is a big change from starting off in the pools for 1024 byte objects. Instead, what about calculating the cutoff such that the preallocation fits into a cache line? Or just recommend that users call cc @nnethercote |
Do not inline finish_grow Fixes rust-lang#78471. Looking at libgkrust.a in Firefox, the sizes for the `gkrust.*.o` file is: - 18584816 (text) 582418 (data) with unmodified master - 17937659 (text) 582554 (data) with rust-lang#72227 reverted - 17968228 (text) 582858 (data) with `#[inline(never)]` on `grow_amortized` and `grow_exact`, but that has some performance consequences - 17927760 (text) 582322 (data) with this change So in terms of size, at least in the case of Firefox, this patch more than undoes the regression. I don't think it should affect performance, but we'll see.
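The general pattern behind this kind of fix can be sketched as a thin generic wrapper around a non-generic, non-inlined helper, so the bulk of the code is compiled once rather than once per element type. The names `grow_sketch` and `finish_grow_sketch` are hypothetical and the bodies are placeholders; this is not the actual `RawVec` code.

```rust
use std::alloc::{Layout, LayoutError};

/// Hypothetical non-generic helper. Marking it `#[inline(never)]` keeps a
/// single copy of its body instead of one copy per call site.
#[inline(never)]
fn finish_grow_sketch(new_layout: Result<Layout, LayoutError>) -> Option<usize> {
    // Placeholder for the shared work: here we only report the byte size.
    let layout = new_layout.ok()?;
    Some(layout.size())
}

/// Hypothetical generic wrapper: only the layout computation depends on `T`.
fn grow_sketch<T>(new_cap: usize) -> Option<usize> {
    finish_grow_sketch(Layout::array::<T>(new_cap))
}
```

For instance, `grow_sketch::<u64>(4)` reports 32 bytes, and a request whose size overflows yields `None`.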
Currently, if you repeatedly push to an empty vector, the capacity growth sequence is 0, 1, 2, 4, 8, 16, etc. This commit changes the relevant code (the "amortized" growth strategy) to skip 1 and 2, instead using 0, 4, 8, 16, etc. (You can still get a capacity of 1 or 2 using the "exact" growth strategy, e.g. via `reserve_exact()`.)
This idea (along with the phrase "tiny Vecs are dumb") comes from the "doubling" growth strategy that was removed from `RawVec` in #72013. That strategy was barely ever used -- only when a `VecDeque` was grown, oddly enough -- which is why it was removed in #72013.
(Fun fact: until just a few days ago, I thought the "doubling" strategy was used for repeated push case. In other words, this commit makes `Vec`s behave the way I always thought they behaved.)
This change reduces the number of allocations done by rustc itself by 10% or more. It speeds up rustc, and will also speed up any other Rust program that uses `Vec`s a lot.
In theory, the change could increase memory usage, but in practice it doesn't. It would be an unusual program where very small `Vec`s having a capacity of 4 rather than 1 or 2 would make a difference. You'd need a lot of very small `Vec`s, and/or some very small `Vec`s with very large elements.
r? @Amanieu