String::as_mut_vec prevents small string optimization #20198
Comments
I do not see |
While this approach is possible, unlike SSO for strings it's not adopted by popular implementations of generic vector - folly::fbvector, boost::vector and not allowed by std::vector. There must be reasons, probably of statistical nature (I can start speculating on them but would prefer not to). |
The main reference I can provide for something like this on vectors is LLVM, but in that case the number of elements is parametrised (the user is free to choose it and the compiler is able to optimise it... AFAICT this would not be possible in rust). I believe you're right about the reason why SSO is more common than similar optimisations on other data structures. Also, the absence of a variable-sized type parameter might contribute in making it easier to find a reasonable default "local cache size" for strings. |
In the worst case, this can still be provided by upgrading an SSO to a |
@petrochenkov: It wouldn't make sense to provide a small string type without providing a small vector type since it needs to be written anyway. The small string / vector optimization increases the code size and adds extra branches / operations. Rust is always going to provide this string type based on It's important to note that |
The libc++ |
If you are really going to do that, then
This is what I would expect from "small optimized" vector too. |
Some hard data: as_mut_vec is not used anywhere in Rust and it is used four times in Rust outside of the String implementation itself. I suggest to make as_mut_vec private. You know: better safe than sorry. |
If there was a small string there would be an underlying small vector type to go along with it. Either way, there is going to be a string type to go along with |
I don't understand why the string type needs to have a vector type as underlying buffer. Strings and Vecs are pretty similar in how they work but completely different in what they are used for. Hence we need to treat String as another type with its own challenges IMO. |
The design of strings is that they're a thin wrapper around the corresponding slice/vector type. A small string type would be a wrapper around a small vector type. It's a design trade-off so there would be a choice between the types as there always is. Unless someone can explain why an asymmetric string / vector design makes any sense... ? |
@LukasKalbertodt: Strings in Rust are slice/vector wrappers providing a UTF-8 encoding guarantee. There is always going to be a string type implemented as a wrapper around The preferred default string type is a separate issue from whether a string wrapper around vectors exists. Note that there is zero language support for either |
I cannot say whether it makes any sense, but I do not see any sense in it the other way round. Firstly, String is a sequence of "characters" in UTF-8 encoding, so I do not see any direct similarity to Vec. I do not want to argue about implementing String using Vec or whatever. The issue is just about future-proofing the design. At some point it may turn out to be better to have independent String and Vec, but this little as_mut_vec method just makes that harder (e.g. maybe SSO can be done better in strings than on a generic vector). And the sad thing about it is that nobody cares about that method. |
It would also make sense to have a string implemented as a rope type, and implementing that is going to result in an underlying general purpose rope-based vector type. It's just the logical way to do things... there's no good reason to make the implementation unusable for other purposes than storing UTF-8. |
There is one UTF-8 string type that's a wrapper around
There is always going to be a UTF-8 string type that's a wrapper around |
If someone wants to write an RFC addressing the various points, then they're free to do so. The issue tracker is for actionable, concrete issues in the language / libraries rather than very subjective changes that are open to debate. I certainly think SSO is important and I'm not arguing against providing it... I am just explaining that the string type exists solely to provide a UTF-8 guarantee. It would not exist if Rust used (native endian) UTF-32 since it could just use |
I understand that the string type should just be a UTF-8 Wrapper with various underlying data structures. I am just worried that We should limit the String type so that we don't make assumptions about the internal implementation. But you are right, this issue page is probably the wrong place to discuss that. So we should continue the discussion here: http://discuss.rust-lang.org/t/small-string-optimization-remove-as-mut-vec/1320 |
A string wrapper around The |
I don't see why "I want option X" needs to become "cripple option Y". There is nothing preventing small string / vector types from having the same level of support as the ones without that optimization trade-off. It's certainly not a clear win because it adds a branch to many operations - it is, as these things tend to be, a compromise that makes sense in some cases and not others. |
Exactly which points are you talking about?
I disagree: The name is the problem. A type in the standard library that is called
I agree. What about that?
TL;DR: Create any string implementation with any special implementation dependent operations, but don't call it |
Sure, I agree that I would support an RFC proposing a rename but I don't expect that it would succeed. |
I don't think we need a type that's an "abstract" string. Individual projects are already free to use type aliases like that or simply mass-change from one string type to another. The hard part is not converting from one type to another but rather figuring out the cases where it makes sense to pay the cost of using small string optimization. LLVM's usage of small strings / vectors is a great example of this: all of the effort is spent on profiling / statistics, not changing code. |
Who "intended" that? Shouldn't the design of Maybe I will create a pull request that adds another SmallString class to |
@thestinger |
The ability to convert between the string and underlying byte vector type without copies is very important. Again, the |
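The copy-free conversion mentioned here can be observed directly. This is a minimal sketch against the current standard library; the helper name round_trip_keeps_buffer is hypothetical and exists only for illustration. Note that String::from_utf8 still scans the bytes to validate them, it just does not allocate or copy.

```rust
/// Hypothetical helper: round-trips a String through its byte vector and
/// reports whether the heap buffer address stayed the same (no copy made).
fn round_trip_keeps_buffer(s: String) -> bool {
    let before = s.as_ptr();
    // String -> Vec<u8>: hands over the existing buffer.
    let bytes: Vec<u8> = s.into_bytes();
    // Vec<u8> -> String: validates UTF-8, then reuses the same buffer.
    let back = String::from_utf8(bytes).expect("bytes came from a valid String");
    before == back.as_ptr()
}

fn main() {
    let same = round_trip_keeps_buffer(String::from("hello, world"));
    println!("buffer reused: {}", same);
}
```

A string type that stores short contents inline could not hand over a heap buffer for those short values, so at least the small case would need an allocation or a copy at this boundary.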
@comex: Also, there's no need to rename everything. Rust has namespaces, so you can just change the imports across the project without touching any of the code. |
(You can import |
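The import-swapping idea can be sketched as follows. Everything in the small module is a hypothetical stand-in for a crate offering another string type, and the alias name Str is likewise an assumption made for this example.

```rust
// Hypothetical module standing in for a crate that provides another string type.
mod small {
    pub struct SmallString(String); // placeholder representation, not real SSO

    impl SmallString {
        pub fn from(s: &str) -> Self {
            SmallString(String::from(s))
        }
        pub fn push_str(&mut self, s: &str) {
            self.0.push_str(s)
        }
        pub fn as_str(&self) -> &str {
            self.0.as_str()
        }
    }
}

// The only line that changes when switching string types.
// Alternative: `use std::string::String as Str;`
use small::SmallString as Str;

// Code below is written against the alias and does not change.
fn greet(name: &str) -> Str {
    let mut s = Str::from("hello, ");
    s.push_str(name);
    s
}

fn main() {
    println!("{}", greet("rust").as_str());
}
```

This only works as long as both types offer the methods the calling code uses, which is the part of the migration that a simple import change does not cover.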
Since there would be no room for the argument if it was called |
@thestinger Can you quantify how important is it to have cheap conversion from How we can know that cheap conversion is bigger performance win than SSO? |
The thing is that the purpose of the |
Right. This basically describes my thought process here too, a question I answered in IRC about it, which kicked off this new discussion. After reading said discussion, I'm in agreement with your position here. But everyone has to come along on that journey, you know? Not everyone will agree all the time on a particular technical tradeoff. Anyway, this isn't even just about you here. The "Apparently some Rust developer are not interested in a future proof interface..." isn't helpful either. I'd just prefer that we stick to the technical side of things without getting all nasty. It really squashes discussion. |
A summary of the facts from the discussion:
@steveklabnik: I don't think there's much of a technical trade-off here. Unicode ruins mutable strings beyond the string builder use case, which is what |
Oh and |
@pepp-cz: It used to be called |
You are right and I am sorry, but I was a little bit shocked that @thestinger whose profile gave me the impression that he is an experienced Rust developer (professional) called my argumentation bullshit. Let's forget about that...
I agree with 1-3. I don't understand the last one though. Sure: It makes allocations on clones unnecessary but we still have to allocate once to create the string. And like Andrei Alexandrescu said: "No work is less work than some work". I think there are enough situations in which you want SSO for immutable strings. What about a huge vector of forum usernames (mostly < 23 bytes)? I know... it's a strange example but still. It's important to remember that most users of a language are not experts in either the language or the theory of UTF-8 strings (string theory if you like :P). And (IMO) most users that use the default String type don't care about |
It's not strange, a lot of strings are normal human words or word combinations. |
It was. You started hand waving about some future optimization that no one can think of today, which could be used to justify removing anything if it was considered a valid argument.
No it doesn't. Again, it's not the only method derived from the underlying implementation.
It's pretty clear that it slows down the cases that string builders (
Removing these methods hurts performance by forcing copies. You haven't shown that we get anything in return for hurting performance by crippling the API. The |
@thestinger Unfortunately |
It works fine, but those are not great types to use for immutable data either.
Lack of knowledge and missing features in the standard libraries doesn't make these appropriate types for this. |
So, the title of this issue might as well be "Mixed intents on usage of |
@thestinger When StrBuf was renamed to String, I remember someone arguing that there was not much point carefully distinguishing immutable and mutable strings, because the capacity field was not that much overhead; and now that it is renamed, it is quite hopeless to expect most people to use something else for general strings (which are usually not mutated - c.f. the decision of most high level languages to make strings immutable by default). Also, if
Therefore (to say this as neutrally as possible), the way the current 1.0 API is designed, any future optimizations that help immutable strings but hurt mutable strings are likely to speed up most programs in practice. (I would argue that immutable use not being 'appropriate' per implementation design or whatever is pretty irrelevant if people will be strongly encouraged to use them as such anyway.) |
@thestinger And I'm not buying the argument that a string builder would not benefit from SSO in all usage scenarios, either. The performance analysis that brought up the whole thing into being seems to indicate that there is any number of places where strings are built out of frequently small sources. If you use an insert-optimized, always allocating builder in these cases, you have already lost the main benefit of SSO. It appears that all major C++ standard library implementations nowadays use SSO for the bog standard string type, |
I don't see how renaming it is connected to a not yet existent immutable string type.
I seriously doubt that, and you've shown no evidence of it.
It does work fine, just as
Uh,
So then fix the real problem, the lack of proper immutable string support. It seems you have a problem with the name so you want to go out of your way to ruin the type. |
Please tell us more about your measurements then. We need a little bit more information and data to understand this measurement. It should be pretty clear that this measurement is highly dependent on your test data. If you are just pushing 1000 byte strings, it's obvious that SSO would be a bad idea. So yeah: It would be interesting if you'd provide more information. Also: I guess measuring the performance for small and larger strings is one thing... But knowing what kind of strings are mostly used in "the real world"... is another important thing. |
It is connected because everyone is going to use a type named
You have done microbenchmarks, which are useful, but not the macrobenchmarks which would be needed to resolve the question; and of course the results of such benchmarks would depend on the precise choices of representation for and amount of microoptimization done on the type in question, which would take quite a bit of time. Thankfully, I am in support merely of future proofing the design (a future decision in any case could take advantage of a number of large real-world Rust applications hopefully greater than two).
I humbly suggest that, as names and API designs stand, the best approach might be to view it the other way around: that if, in the future, SSO or other optimizations to edit: also, if you want absolute highest performance from a type, building UTF-8 validation into it by default sounds like a bad idea to me... YMMV |
It slows down mutation and other operations like slicing by adding branches. It doesn't matter if you buy it or not, that's how things are. Strings as large as 3 pointers are not uncommon so it fills the code with branches that are prone to being mispredicted, which is awful.
In every case, branches would be added. In every case, conversions to and from
That's what
SSO is not more balanced. It bloats the code and hurts performance in most cases. Here's the thing: I have done extensive measuring / profiling / statistics in regards to this and gave up on my original plan to implement SSO for the good old
For the nth time, In C++, the copy constructor calls are also implicit. It doesn't offer an efficient
Slowing down every operation via branches to speed up construction doesn't seem like a good trade-off for most applications. Cloning strings is essentially free because most strings are immutable and cloning will just be a reference count. Poorly written applications aren't the case that's optimized for by Rust. |
There doesn't actually have to be a branch. Conditional move should be doable... and there are other tricks. Which perhaps you have already tried, per your extensive measuring / profiling - but these decisions are basically wagers. To hear vague references to profiling on an old version of Rust increases my personal believed probability that SSO is a bad idea, but not enough that it would be worth making non-future-proof decisions at such an early stage, when the alternative is to temporarily sacrifice one method that is basically unnecessary in most cases, since you can still convert to (You say that other methods implicitly depend on there being a This issue is not a decision about whether or not to use SSO. |
The string slice type is
Yes, I have done macro-benchmarks. I have measured the impact on both the Rust compiler and Servo of various allocator optimizations / tradeoffs and various string / vector implementations. Crippling the type by forcing allocation/copying in cases that are currently free of allocation/copying is not "future proofing", it would simply be misguided and poor API design. As I've already pointed out several times, migrating in between different vector / string types is one of the most trivial sweeping changes you can make. The LLVM project does it regularly, because they have widespread usage of small strings and vectors and need to carefully identify when it makes sense to use them. It causes significant code bloat and hurts the performance of the non-copying / non-creation methods in essentially every case. The vast majority of code is relatively cold code, and it should be optimized for size rather than raw performance. With this in mind, it would make absolutely no sense to pervasively use the small vector optimization.
The small vector optimization hurts the performance of every operation other than construction and cloning in every case. It hurts slicing performance, it hurts fetching the length, it hurts bounds checking performance and so on. It adds branches to these operations, and they are not trivially predictable ones. It also significantly bloats the code size. You're the one proposing that the status quo should be changed by crippling the type, so the burden of proof is on you to demonstrate that it's a good idea. I guess I need to restate for what seems like the 10th time that the |
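The point about accessors having to distinguish two representations can be illustrated with a deliberately simple sketch. The SmallString type below is hypothetical and is not how any production SSO is laid out (real implementations pack the tag into the three-word footprint; this enum is larger). It only shows that len and as_str must first decide which representation is active. Whether that decision compiles to a branch or a conditional move depends on the compiler and target.

```rust
/// Hypothetical inline capacity, chosen to match the "23 bytes" figure.
const INLINE_CAP: usize = 23;

/// Hypothetical small-string type, for illustration only.
enum SmallString {
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    Heap(String),
}

impl SmallString {
    fn new(s: &str) -> Self {
        if s.len() <= INLINE_CAP {
            let mut buf = [0u8; INLINE_CAP];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallString::Inline { len: s.len() as u8, buf }
        } else {
            SmallString::Heap(s.to_owned())
        }
    }

    /// Even fetching the length has to check the representation first.
    fn len(&self) -> usize {
        match self {
            SmallString::Inline { len, .. } => *len as usize,
            SmallString::Heap(s) => s.len(),
        }
    }

    /// Slicing has the same extra step before it can produce a &str.
    fn as_str(&self) -> &str {
        match self {
            SmallString::Inline { len, buf } => {
                // The bytes were copied from a complete &str, so they are valid UTF-8.
                std::str::from_utf8(&buf[..*len as usize]).expect("copied from a valid &str")
            }
            SmallString::Heap(s) => s.as_str(),
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, SmallString::Inline { .. })
    }
}

fn main() {
    let short = SmallString::new("short");
    let long = SmallString::new(&"x".repeat(40));
    println!("{} inline={}", short.as_str(), short.is_inline());
    println!("len={} inline={}", long.len(), long.is_inline());
}
```

A plain String has a single representation, so the same accessors are straight field reads.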
(Not to imply that thestinger should have read what I wrote a few seconds ago, but just to keep the conversation in order, I asked in the last post for clarification on what these other methods from the last paragraph are.) |
A conditional move isn't enough without a whole bunch of extra bloat / hacks (at least for most of the operations) and isn't even an option everywhere. The more hacks like small vector optimization that are used, the more the code is bloated - hurting caching and branch prediction. It's better to have smaller code in nearly every case because few cases are directly in the hot inner loops. The sacrifice is more than one method. It introduces asymmetry into what should be a simple system where There isn't a performance argument for doing it. Small strings really only belong in carefully chosen hot code because they're just plain bloated. Storing 11 or 23 bytes inline isn't enough to justify it at all. The cases where they're great almost always want to add a bunch of padding to store far more inline, considering that they're paying the price of branches and code bloat for it. This is how it's done in LLVM, where they're fairly widely used, but aren't the majority of vectors / strings.
It's not a decision about anything because it's RFC material, not issue tracker material. The discussion here is a waste of time beyond developing points to copy-paste into an RFC discussion. |
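For reference, the "11 or 23 bytes" figure above is consistent with reusing the three pointer-sized words of a (pointer, capacity, length) triple and reserving one byte for the length or tag. That one-byte reservation is an assumption for this sketch (actual implementations differ in detail), and inline_capacity is a hypothetical helper.

```rust
use std::mem::size_of;

/// Hypothetical: bytes available inline if a small string reuses three
/// pointer-sized words and reserves one byte for the length/tag.
const fn inline_capacity(word_size: usize) -> usize {
    3 * word_size - 1
}

fn main() {
    println!("32-bit target: {} bytes inline", inline_capacity(4));
    println!("64-bit target: {} bytes inline", inline_capacity(8));
    println!("this target:   {} bytes inline", inline_capacity(size_of::<usize>()));
    // Size of the current String on this target, for comparison.
    println!("size_of::<String>() = {}", size_of::<String>());
}
```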
It's very clean and useful to have UTF-8 wrappers around general purpose vector types. There simply isn't a valid reason to provide a string without exposing the underlying vector type. It would be a far inferior API and would bring Rust a lot closer to the mess of C++ collections ( |
Ultimately, it would be nice if the implementation of the string type could be extracted into a reusable facade to wrap around anything able to expose the random-access / contiguous vector API. Turning it into more than just a vector wrapper is stepping away from that kind of clean, reusable design. |
Can you please answer my question about the other methods (inherently) derived from the underlying representation being |
The allocation/copy-free conversion methods to/from |
If the string is not small, then this is possible regardless of the actual underlying type. If it is small - who said these had to be allocation free? You were just saying how cheap allocations were... and the copy for a small string would be all of 23 bytes (at maximum). |
As @steveklabnik was alluding to, the Rust community greatly values respectful, technical discussion and recognizes that most decisions involve tradeoffs between competing concerns. Dismissive or aggressive argumentation makes forums unwelcoming, rather than fostering meaningful exchange or insight. I don’t feel that this comment thread has lived up to the high standards we strive for. That said, in general the issue tracker is not a good place to discuss design changes, in part because of its relatively low visibility to the broader community. At this point, I don’t think that continued discussion on this github issue is likely to be productive. For the time being, I'd like to ask that discussion move to the discuss thread on the topic. I’ll help draw more attention to it, to ensure that it is seen more widely by the community. Finally, when previous design decisions have turned out not to address specific concerns, we’ve tried hard to revisit them. While there is limited time for such changes before 1.0, it might be worth trying to lay out the concerns in an RFC to foster a more in-depth, community-wide discussion. Hopefully such an RFC would draw from the conclusions from the discuss thread. I’d be glad to help with this process as well. |
Piling on dismissive, condescending stuff like this isn't raising the level of discussion at all.
I don't participate there because it's even more awful than the discussion system here. I think I've said everything that I plan on saying anyway, and I'll just refine the arguments and make a more coherent case against this if an RFC is actually filed. This is just a practice game. |
To cut back on unnecessary argument, perhaps it would be good to narrow down the issue:
|
It's surprising that indexing/slicing is brought up. Safe indexing/slicing cannot be fast on Just about the only safe way to read strings that's worth optimizing is sequential iteration. For ultra-fast access to raw bytes, transform into |
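A short sketch of why safe slicing on UTF-8 carries a check regardless of the underlying buffer type: byte offsets must fall on character boundaries. The helper name checked_prefix is hypothetical; the calls it uses (str::get, char_indices, as_bytes) are standard library methods.

```rust
/// Hypothetical helper: returns the first `end` bytes as a &str, or None
/// if `end` is out of range or falls inside a multi-byte character.
fn checked_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(..end)
}

fn main() {
    // U+00E9 (e with acute accent) takes two bytes in UTF-8.
    let s = "h\u{e9}llo";
    println!("bytes = {}, chars = {}", s.len(), s.chars().count());

    // Offset 2 is in the middle of the two-byte character, offset 3 is not.
    println!("{:?}", checked_prefix(s, 2));
    println!("{:?}", checked_prefix(s, 3));

    // Sequential iteration yields characters together with their byte offsets.
    for (offset, ch) in s.char_indices() {
        println!("{} -> {}", offset, ch);
    }

    // Raw byte access skips character-boundary checks entirely.
    println!("{:?}", s.as_bytes());
}
```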
SSO is a popular optimization technique and is currently implemented in all the major C++ standard libraries*.
If Rust decides to adopt it too then it will come into contradiction with the String::as_mut_vec method, which exposes details of the current String implementation incompatible with SSO.
I suppose, the choice between these two has to be made before marking as_mut_vec as stable.
*Although it's not used by default in libstdc++ due to ABI compatibility