Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String::as_mut_vec prevents small string optimization #20198

Closed
petrochenkov opened this issue Dec 24, 2014 · 88 comments
Closed

String::as_mut_vec prevents small string optimization #20198

petrochenkov opened this issue Dec 24, 2014 · 88 comments

Comments

@petrochenkov
Copy link
Contributor

SSO is a popular optimization technique and is currently implemented in all the major C++ standard libraries*.
If Rust decides to adopt it too then it will come into contradiction with the String::as_mut_vec method, which exposes details of the current String implementation incompatible with SSO.
I suppose, the choice between these two has to be made before marking as_mut_vec as stable.

*Although it's not used by default in libstdc++ due to ABI compatibility

@ranma42
Copy link
Contributor

ranma42 commented Dec 24, 2014

I do not see String::as_mut_vec as preventing SSO, instead I think it simply makes it natural to apply SSO to the underlying data structure (i.e. Vec). This would make it possible to keep String::as_mut_vec and to get the advantages of SSO (trading an additional comparison to gain more cache locality, to avoid allocating and dereferencing) on all Vec-based data structures.

@petrochenkov
Copy link
Contributor Author

instead I think it simply makes it natural to apply SSO to the underlying data structure (i.e. Vec)

While this approach is possible, unlike SSO for strings it's not adopted by popular implementations of generic vector - folly::fbvector, boost::vector and not allowed by std::vector. There must be reasons, probably of statistical nature (I can start speculating on them but would prefer not to).

@ranma42
Copy link
Contributor

ranma42 commented Dec 24, 2014

The main reference I can provide for something like this on vectors is LLVM, but in that case the number of elements is parametrised (the user is free to choose it and the compiler is able to optimise it... AFAICT this would not be possible in rust).
An interesting analysis of the results of using SSO to optimise JavaScript strings is available here. It might provide a starting point for choosing what sizes can be most effective.

I believe you're right about the reason why SSO is more common than similar optimisations on other data structures. Also, the absence of a variable-sized type parameter might contribute in making it easier to find a reasonable default "local cache size" for strings.
It might also make sense to keep this optimisation in mind when implementing ropes #7628

@huonw
Copy link
Member

huonw commented Dec 24, 2014

In the worst case, this can still be provided by upgrading an SSO to a Vec although it's a little sad to make as_... possibly expensive.

@thestinger
Copy link
Contributor

@petrochenkov: It wouldn't make sense to provide a small string type without providing a small vector type since it needs to be written anyway. The small string / vector optimization increases the code size and adds extra branches / operations. Rust is always going to provide this string type based on Vec even if the small string optimization is implemented because there's a trade-off.

It's important to note that std::string in C++ is an array of arbitrary bytes - there's no equivalent to Rust's string types. The only real difference between it and std::vector<char> is that it needs to maintain an internal NUL byte at the end for the data and c_str methods. Rust's string types are wrappers around byte slices / vectors enforcing UTF-8 encoding.

@thestinger
Copy link
Contributor

The libc++ std::string is 24 bytes on x86_64 just like the obvious implementation with a length, capacity and data pointer. It only uses the small string optimization for strings up to 23 bytes in length (including the trailing NUL). For small vectors, more flexibility is needed (integer type parameter) because some types are many bytes in length and the usage patterns vary more. Ideally, a small string would also expose that integer type parameter and pass it along to the underlying small vector.

@petrochenkov
Copy link
Contributor Author

Rust is always going to provide this string type based on Vec even if the small string optimization is implemented

If you are really going to do that, then as_mut_vec is not an issue, it's just the decision to make Vec "small optimized" is much more serious than making as_mut_vec experimental or replacing it with something.

... vectors is LLVM, but in that case the number of elements is parametrised


For small vectors, more flexibility is needed (integer type parameter)

This is what I would expect from "small optimized" vector too.
Then as_mut_vec will have to return something like &mut Vec<u8, MAGIC_BUFFER_SIZE> to work correcly, assuming the small buffer size is not kept at runtime. Edit: Interesting, LLVM SmallVector does seem to keep the capacity of small vectors in runtime.

@pepp-cz
Copy link

pepp-cz commented Jan 14, 2015

Some hard data: as_mut_vec is not used anywhere in Rust and it is used four times in Rust outside of the String implementation itself.

I suggest to make as_mut_vec private. You know: better safe than sorry.

@thestinger
Copy link
Contributor

If there was a small string there would be an underlying small vector type to go along with it. Either way, there is going to be a string type to go along with Vec<T> so there is no problem here.

@LukasKalbertodt
Copy link
Member

I don't understand why the string type needs to have a vector type as underlying buffer. Strings and Vecs are pretty similar in how they work but completely different in what they are used for. Hence we need to treat String as another type with it's own challenges IMO.
We discussed the issue here: http://discuss.rust-lang.org/t/small-string-optimization-remove-as-mut-vec/1320

@thestinger
Copy link
Contributor

The design of strings is that they're a thin wrapper around the corresponding slice/vector type. A small string type would be a wrapper around a small vector type. It's a design trade-off so there would be a choice between the types as there always is. Unless someone can explain why an asymmetric string / vector design makes any sense... ?

@thestinger
Copy link
Contributor

@LukasKalbertodt: Strings in Rust are slice/vector wrappers providing a UTF-8 encoding guarantee. There is always going to be a string type implemented as a wrapper around Vec<u8>. In the future, Rust can have a string type with small string optimization but writing that implies writing a small vector type and it wouldn't make sense to provide one without the other.

The preferred default string type is a separate issue from whether a string wrapper around vectors exists. Note that there is zero language support for either Vec<u8> or String, they are entirely unprivileged library types beyond the fact that a default prelude exists.

@pepp-cz
Copy link

pepp-cz commented Jan 14, 2015

I cannot say if it makes any sense but I do not any sense if it is other way round. Firstly Sting is a sequence of "characters" in UTF-8 endcoding so I do see any direct similarity to Vec.

I do not want to argue about implementation of String using Vec or whatsoever. The issue is just about future-proofing the design. At some time it can happen that it is better to have independent String and Vec but this little as_mut_vec methods just makes it harder (e.g. maybe SSO can be done better in strings that on generic vector). And the sad thing about it is that nobody cares about that method.

@thestinger
Copy link
Contributor

It would also make sense to have a string implemented as a rope type, and implementing that is going to result in an underlying general purpose rope-based vector type. It's just the logical way to do things... there's no good reason to make the implementation unusable for other purposes than storing UTF-8.

@thestinger
Copy link
Contributor

I cannot say if it makes any sense but I do not any sense if it is other way round. Firstly Sting is a sequence of "characters" in UTF-8 endcoding so I do see any direct similarity to Vec.

There is one UTF-8 string type that's a wrapper around [u8] and another that's a wrapper around Vec<u8>. The types exist to provide a UTF-8 guarantee while filling the same purpose as the corresponding byte array types. It's more than just a direct similarity - it's by design.

I do not want to argue about implementation of String using Vec or whatsoever. The issue is just about future-proofing the design. At some time it can happen that it is better to have independent String and Vec but this little as_mut_vec methods just makes it harder (e.g. maybe SSO can be done better in strings that on generic vector). And the sad thing about it is that nobody cares about that method.

There is always going to be a UTF-8 string type that's a wrapper around Vec<u8>. There can be a small string type wrapping a small vector, but it will never obsolete the use case for the type wrapping the normal vectors. There is nothing forcing the small string type to have a method converting to &mut Vec<u8>. It would instead have a method converting to &mut SmallVec<u8, N>.

@thestinger
Copy link
Contributor

If someone wants to write an RFC addressing the various points, then they're free to do so. The issue tracker is for actionable, concrete issues in the language / libraries rather than very subjective changes that are open to debate. I certainly think SSO is important and I'm not arguing against providing it...

I am just explaining that the string type exists solely to provide a UTF-8 guarantee. It would not exist if Rust used (native endian) UTF-32 since it could just use Vec<char>. There is a use case for a wrapper type providing a UTF-8 guarantee for vectors, small vectors and ropes. The existing type wrapping Vec<u8> isn't going to be removed regardless of additional collection types being added.

@LukasKalbertodt
Copy link
Member

I understand that the string type should just be a UTF-8 Wrapper with various underlying data structures. I am just worried that std::string::String will have a method called as_mut_vec when Rust 1.0 is released. If that happens, we wouldn't be able to revert it since we would break production code.

We should limit the String type so that we don't make assumptions about the internal implementation. But you are right, this issue page is probably the wrong place to discuss that. So we should continue the discussion here: http://discuss.rust-lang.org/t/small-string-optimization-remove-as-mut-vec/1320

@thestinger
Copy link
Contributor

A string wrapper around Vec<u8> exists and is going to continue to exist. There is no integration of either type into the language, so there is absolutely no risk because it's not a standard string type beyond the fact that it's provided by the standard libraries. Additional string types wrapping different underlying data structures can be provided in the future without any problems.

The as_mut_vec method is hardly the only one derived from the fact that it's a wrapper around Vec<u8>. There are various methods taking advantage of the ability to convert it to and from a byte vector without a copy. A string implemented as a small vector type would still provide all of these methods but would convert to the underlying small vector type instead. I fail to see why this is a problem in any way, and no one seems willing to respond to the points I've raised.

@thestinger
Copy link
Contributor

I don't see why "I want option X" needs to become "cripple option Y". There is nothing preventing small string / vector types from having the same level of support as the ones without that optimization trade-off. It's certainly not a clear win because it adds a branch to many operations - it is, as these things tend to be, a compromise that makes sense in some cases and not others.

@LukasKalbertodt
Copy link
Member

[...] no one seems willing to respond to the points I've raised.

Exactly which points are you talking about?

[...] so there is absolutely no risk because it's not a standard string type beyond the fact that it's provided by the standard libraries.

I disagree: The name is the problem. A type in the standard library that is called std::string::String is assumed to be the standard string type of the language. And yes sure, I don't argue against a wrapper around Vec, so there should be this kind of type. But it shouldn't be called String

It's certainly not a clear win because it adds a branch to many operations - it is, as these things tend to be, a compromise that makes sense in some cases and not others.

I agree. What about that?

  • VecString: UTF8 wrapper around Vec
  • SmallString: UTF8 wrapper around a SmallVec
  • RawString: Custom buffer and memory handling with SSO support.
  • WhateverString: Another implementation
  • String: Typedefs one of the types listed above. But the language should make clear that this standard string type doesn't have implementation dependent methods, like as_mut_vec

TL;DR: Create any string implementation with any special implementation dependent operations, but don't call it String.

@thestinger
Copy link
Contributor

Sure, I agree that String is a bad name for the type. It doesn't make sense to remove methods from it though, since the intended purpose is wrapping Vec<u8> and providing the UTF-8 guarantee.

I would support an RFC proposing a rename but I don't expect that it would succeed.

@thestinger
Copy link
Contributor

I don't think we need a type that's an "abstract" string. Individual projects are already free to use type aliases like that or simply mass-change from one string type to another. The hard part is not converting from one type to another but rather figuring out the cases where it makes sense to pay the cost of using small string optimization. LLVM's usage of small strings / vectors is a great example of this: all of the effort is spent on profiling / statistics, not changing code.

@LukasKalbertodt
Copy link
Member

[...] since the intended purpose is wrapping Vec<u8> and providing the UTF-8 guarantee.

Who "intended" that? Shouldn't the design of String just say: A growable sequence of characters with UTF-8 guarantee?
I think this is exactly what users want. I just don't see why this type was designed so closely related to Vec... Who said that String has to use a Vec as underlying data structure? I think those classes shouldn't even know of each other.

Maybe I will create a pull request that adds another SmallString class to std::string... But I will continue this discussion in the discuss-thread which link I posted twice already.

@comex
Copy link
Contributor

comex commented Jan 15, 2015

@thestinger
There may always be a string type implemented as a wrapper around Vec<u8> (though I'm not sure I see the use), but that string type doesn't have to be String. If String removed this single method exposing its Vec<u8> internals, then it would be feasible to switch it in the future to be based on a SSO vector type without requiring the entire Rust ecosystem to mass rename their Strings to SuperSSOString, or as_mut_vec to suddenly have to do some weird expensive conversion. Well, maybe not every last use of the struct in the ecosystem, but, per C++ discussion, it seems like SSO would be better for most people who just want a string; I doubt there are many people who have actually thought about it and decided they like the guarantee that String is a Vec<u8>.

@thestinger
Copy link
Contributor

The ability to convert between the string and underlying byte vector type without copies is very important. Again, the as_mut_vec method is hardly the only thing derived from the representation of String. The time spent performing small dynamic allocations is tiny. The real win from SSO is data locality, but it comes at the cost of a huge number of extra branches and splitting the vector ecosystem is going to cause expensive copies.

@thestinger
Copy link
Contributor

@comex: Also, there's no need to rename everything. Rust has namespaces, so you can just change the imports across the project without touching any of the code.

@thestinger
Copy link
Contributor

(You can import SmallString as Str and String as StrVec or whatever else you want to do in your project - names are not monopolized across the ecosystem simply because one library chooses to use it.)

@thestinger
Copy link
Contributor

Since there would be no room for the argument if it was called StrVec, this is simply a naming bikeshed. There's clearly no backwards compatibility issue.

@pepp-cz
Copy link

pepp-cz commented Jan 15, 2015

@thestinger Can you quantify how important is it to have cheap conversion from String to Vec<u8> and in the opposite direction? Uses of as_mut_vec and into_bytes on String are pretty rare in rustc and servo.
The from_utf8 is used more often but not very often.

How we can know that cheap conversion is bigger performance win than SSO?

@pepp-cz
Copy link

pepp-cz commented Jan 15, 2015

The thing is that the purpose of the String type is not clear from the documentation and the name itself suggest that it is some general purpose best-most-of-the time implementation of strings. Once this was revealed than there is nothing to argue about in technical terms. The thing that still can be disussed is the naming of the type and what properties should the default string type (if there is any) have but that should take place somewhere else.

@steveklabnik
Copy link
Member

At one point I wondered if it would make sense to use by default, but I changed my mind based on the evidence that's available.

Right. This basically describes my thought process here too, a question I answered in IRC about it, which kicked off this new discussion. After reading said discussion, I'm in agreement with your position here. But everyone has to come along on that journey, you know? Not everyone will agree all the time on a particular technical tradeoff.

Anyway, this isn't even just about you here. The "Apparently some Rust developer are not interested in a future proof interface..." isn't helpful either. I'd just prefer that we stick to the technical side of things without getting all nasty. It really squashes discussion.

@thestinger
Copy link
Contributor

A summary of the facts from the discussion:

  • String exists to provide a mutable, resizeable string. Mutation without the ability to perform in-place resizes would not be useful, due to the realities of Unicode. The only efficient mutable operations are appending and truncating, again thanks to Unicode.
  • An immutable string is inherently more efficient when resizing is not required, because clones do not need to be implemented by making a new dynamic memory allocation and copying the data.
  • The small string optimization can take advantage of the fact that dynamic arrays are represented as 3 words to store 23 bytes (with a 64-bit address space) in-place, avoiding a small memory allocation in many cases and somewhat improving data locality. The cost is increased code size and lots of extra branches.
  • The small string optimization really only makes sense for mutable strings, because immutability makes allocations on clones unnecessary.

@steveklabnik: I don't think there's much of a technical trade-off here. Unicode ruins mutable strings beyond the string builder use case, which is what String is. The small vector optimization only makes sense when the vector is mutable, so while it's compelling for many vector use cases the niche with strings is pretty small. Even string builders that are almost always small are going to take a big hit from the extra SSO branches.

@thestinger
Copy link
Contributor

Oh and Box<str> as a way to drop one word from String for long-term storage wouldn't work if it used SSO. However, that has very a dubious value. It makes a lot more sense for applications to prefer using task-local immutable strings for efficiency, as long as they don't need (frequent) appends.

@thestinger
Copy link
Contributor

@pepp-cz: It used to be called StrBuf but String is a nicer name. It's likely that immutable strings are more common in an application but there are lots of other nice names available (Text, for one).

@LukasKalbertodt
Copy link
Member

I'd just prefer that we stick to the technical side of things without getting all nasty. It really squashes discussion.

You are right and I am sorry, but I was a little bit shocked that @thestinger whose profile gave me the impression that he is a experienced Rust developer (professional) called my argumentation bullshit. Let's forget about that...

A summary of the facts from the discussion: [...]

I agree with 1-3. I don't understand the last one though. Sure: It makes allocations on clones unnecessary but we still have to allocate once to create the string. And like Andrei Alexandrescu said: "No work is less work than some work". I think there are enough situations in which you want SSO for immutable strings. What about a huge vector of forum usernames (mostly < 23 bytes)? I know... it's a strange example but still.

It's important to remember that most users of a language are no experts in either the language nor the theory of UTF8 strings (string theory if you like :P). And (IMO) most users that use the default String type don't care about as_mut_vec or SSO at all. To decide which implementation is better as default implementation we would need to spend a lot of time testing and benchmarking that. Removing as_mut_string gives us the opportunity to delay this decision. If we keep that method and it turns out that SSO indeed doesn't make sense in most cases, everything is fine. But if it turns out that SSO becomes even more important in the future, Rust's default string type is unable to use it. And that would be sad.

@petrochenkov
Copy link
Contributor Author

@LukasKalbertodt

What about a huge vector of forum usernames (mostly < 23 bytes)? I know... it's a strange example but still.

It's not strange, a lot of strings are normal human words or word combinations.
As an especially biased example, these days I write a lot of text processing scripts where the most of strings are words or bigrams, which would perfectly fit into 23 byte buffer.

@thestinger
Copy link
Contributor

You are right and I am sorry, but I was a little bit shocked that @thestinger whose profile gave me the impression that he is a experienced Rust developer (professional) called my argumentation bullshit.

It was. You started hand waving about some future optimization that no one can think of today, which could be used to justify removing anything if it was considered a valid argument.

It makes allocations on clones unnecessary but we still have to allocate once to create the string. And like Andrei Alexandrescu said: "No work is less work than some work". I think there are enough situations in which you want SSO for immutable strings.

String is not an immutable string. It's not a good design to use if the strings aren't being mutated. It may or may not make sense to use SSO for immutable strings but it has little to do with String.

Removing as_mut_string gives us the opportunity to delay this decision.

No it doesn't. Again, it's not the only method derived from the underlying implementation.

To decide which implementation is better as default implementation we would need to spend a lot of time testing and benchmarking that.

It's pretty clear that it slows down the cases that string builders (String) are actually designed for. I have already implemented and measured it. It slows down pushes by nearly 50% for vectors.

Removing as_mut_string gives us the opportunity to delay this decision. If we keep that method and it turns out that SSO indeed doesn't make sense in most cases, everything is fine. But if it turns out that SSO becomes even more important in the future, Rust's default string type is unable to use it. And that would be sad.

Removing these methods hurts performance by forcing copies. You haven't shown that we get anything in return for hurting performance by crippling the API. The String type is not a "default string type", it's a mutable string builder. The language has string literals, but it has no idea that String even exists. It's entirely implemented in the libraries without any privileged support in the language.

@mzabaluev
Copy link
Contributor

@thestinger Unfortunately String and Vec are also the go-to types for an owned (not necessarily mutable) string and vector, respectively. There is Cow infrastructure pointing their way, and so on. Boxed DSTs would be an alternative, but as far as I know that doesn't work.

@thestinger
Copy link
Contributor

Boxed DSTs would be an alternative, but as far as I know that doesn't work.

It works fine, but those are not great types to use for immutable data either.

Unfortunately String and Vec are also the go-to types for an owned (not necessarily mutable) string and vector, respectively.

Lack of knowledge and missing features in the standard libraries doesn't make these appropriate types for this.

@mzabaluev
Copy link
Contributor

So, the title of this issue might as well be "Mixed intents on usage of String prevent small string optimization for a common use case".

@comex
Copy link
Contributor

comex commented Jan 15, 2015

@thestinger When StrBuf was renamed to String, I remember someone arguing that there was not much point carefully distinguishing immutable and mutable strings, because the capacity field was not that much overhead; and now that it is renamed, it is quite hopeless to expect most people to use something else for general strings (which are usually not mutated - c.f. the decision of most high level languages to make strings immutable by default).

Also, if Box<str> works, that is news to me;

help.rs:4:59: 4:68 error: the trait `core::iter::FromIterator<char>` is not implemented for the type `Box<str>`
help.rs:4     let bs: Box<str> = ['a', 'b', 'c'].iter().map(|&x| x).collect();

Therefore (to say this as neutrally as possible), the way the current 1.0 API is designed, any future optimizations that help immutable strings but hurt mutable strings are likely to speed up most programs in practice.

(I would argue that immutable use not being 'appropriate' per implementation design or whatever is pretty irrelevant if people will be strongly encouraged to use them as such anyway.)

@mzabaluev
Copy link
Contributor

@thestinger And I'm not buying the argument that a string builder would not benefit from SSO in all usage scenarios, either.

The performance analysis that brought up the whole thing into being seems to indicate that there is any number of places where strings are built out of frequently small sources. If you use an insert-optimized, always allocating builder in these cases, you have already lost the main benefit of SSO.
As std probably shouldn't offer a wild variety of micro-optimized string types, the question is of preference and trade-offs: in the place of the go-to mutable string type, do we want a string builder for fast inserts, or a more balanced container that may not, however, be ideal for higher-volume applications? Note that the latter might be unhappy with the flat backing vector, either.

It appears that all major C++ standard library implementations nowadays use SSO for the bog standard string type, std::string. Because people, call them ignorant as you want, use it to build any kind of dynamic strings, and those strings often end up small enough to fit in 23 characters. For high-volume applications there is rope and other non-standard solutions.

@thestinger
Copy link
Contributor

When StrBuf was renamed to String, I remember someone arguing that there was not much point carefully distinguishing immutable and mutable strings, because the capacity field was not that much overhead; and now that it is renamed.

Box<str> is not any more immutable than &mut str, but in practice there are no useful mutable operations to perform without the ability to resize due to Unicode.

now that it is renamed, it is quite hopeless to expect most people to use something else for general strings (which are usually not mutated - c.f. the decision of most high level languages to make strings immutable by default).

I don't see how renaming it is connected to a not yet existent immutable string type.

Therefore (to say this as neutrally as possible), the way the current 1.0 API is designed, any future optimizations that help immutable strings but hurt mutable strings are likely to speed up most programs in practice.

I seriously doubt that, and you've shown no evidence of it.

Also, if Box works, that is news to me;

It does work fine, just as Box<[T]> does. It doesn't have an implementation of that trait. The Box<[T]> type does have allocation/copy-free conversions to and from Vec<T> and it can and should be added for Box<str> <-> String too.

Therefore (to say this as neutrally as possible), the way the current 1.0 API is designed, any future optimizations that help immutable strings but hurt mutable strings are likely to speed up most programs in practice.

Uh, String is a mutable string builder. That is the purpose of the type. Hurting the performance of the one thing it exists to do would be idiotic. More claims about performance from people who haven't done any measuring / profiling too, nice.

(I would argue that immutable use not being 'appropriate' per implementation design or whatever is pretty irrelevant if people will be strongly encouraged to use them as such anyway.)

So then fix the real problem, the lack of proper immutable string support. It seems you have a problem with the name so you want to go out of your way to ruin the type.

@LukasKalbertodt
Copy link
Member

@thestinger

I have already implemented and measured it. It slows down pushes by nearly 50% for vectors.
[...]
More claims about performance from people who haven't done any measuring / profiling too, nice.

Please tell us more about your measurements then. We need a little bit more information and data to understand this measurement. It should be pretty clear that this measurement is highly dependent on your test data. If you are just pushing 1000 byte strings, it's obvious that SSO would be a bad idea. So yeah: It would be interesting if you'd provide more information.

Also: I guess measuring the performance for small and larger strings is one thing... But knowing what kind of strings are mostly used in "the real world"... is another important thing.

@comex
Copy link
Contributor

comex commented Jan 15, 2015

I don't see how renaming it is connected to a not yet existent immutable string type.

It is connected because everyone is going to use a type named String for what they use similarly named types in every other language for, which usually includes a lot of immutable strings. (Without appropriate trait implementations for Box<str> [or Rc<str> even], most will likely do this even if they are aware of the issue. And in any case, for casual use Rust already has two 'string types', &str and String, so adding a third is somewhat confusing.)

Uh, String is a mutable string builder. That is the purpose of the type. Hurting the performance of the one thing it exists to do would be idiotic. More claims about performance from people who haven't done any measuring / profiling too, nice.

You have done microbenchmarks, which are useful, but not the macrobenchmarks which would be needed to resolve the question; and of course the results of such benchmarks would depend on the precise choices of representation for and amount of microoptimization done on the type in question, which would take quite a bit of time. Thankfully, I am in support merely of future proofing the design (a future decision in any case could take advantage of a number of large real-world Rust applications hopefully greater than two).

So then fix the real problem, the lack of proper immutable string support. It seems you have a problem with the name so you want to go out of your way to ruin the type.

I humbly suggest that, as names and API designs stand, the best approach might be to view it the other way around: that if, in the future, SSO or other optimizations to String make the mutable case significantly suboptimal (yes, 50% sounds rather worse than suboptimal, but without more details I'm not convinced your benchmark is representative of real world use cases), a separate type should be created for faster string building, rather than creating a new type for immutable strings. Or at least that the decision of whether to do this does not need to be made now, before all the data is available (since the optimizations in question have not even been written yet, and the impact of removing this one method is so low). Of course, in the meantime, since you are advocating library-based approaches in general, you could always write your own 'StrVec' type.

edit: also, if you want absolute highest performance from a type, building UTF-8 validation into it by default sounds like a bad idea to me... YMMV

@thestinger
Copy link
Contributor

And I'm not buying the argument that a string builder would not benefit from SSO in all usage scenarios, either.

It slows down mutation and other operations like slicing by adding branches. It doesn't matter if you buy it or not, that's how things are. Strings as large as 3 pointers are not uncommon so it fills the code with branches that are prone to being mispredicted, which is awful.

The performance analysis that brought up the whole thing into being seems to indicate that there is any number of places where strings are built out of frequently small sources.

In every case, branches would be added. In every case, conversions to and from Box<str> would be crippled. In every case, conversions to and from the type it is intended to be a UTF-8 wrapper for (Vec<u8>) would be crippled. The allocations / reallocations are dwarfed by the other work in nearly every case, because that's how the types are designed - they amortize the cost. I don't think you realize how cheap small allocations are. Branch misses are much worse, and the cost of large copies dwarfs anything else in most programs - especially rustc.

If you use an insert-optimized, always allocating builder in these cases, you have already lost the main benefit of SSO.

That's what String is for, it's a string builder type.

As std probably shouldn't offer a wild variety of micro-optimized string classes, the question is of preference and trade-offs: in the place of the go-to mutable string type, do we want a string builder for fast inserts, or a more balanced container that may not, however, be ideal for higher-volume applications? Note that the latter might be unhappy with the flat backing vector, either.

SSO is not more balanced. It bloats the code and hurts performance in most cases. Here's the thing: I have done extensive measuring / profiling / statistics in regards to this and gave up on my original plan to implement SSO for the good old ~str type while you're just making up bullshit as you go. It obviously slows down read operations too because you're adding branches to all of them...

It appears that all major C++ standard library implementations nowadays use SSO for the bog standard string type, std::string.

For the nth time, std::string has very little to with strings in Rust. It is a mutable vector of arbitrary bytes terminated by \0. It pays the cost of having plenty of bloat to maintain the \0 at the end and has no other real distinctions from the vector type. Some major implementations (not all as you incorrectly claim) use the small string optimization and a major part of the rationale is because it doesn't make sense to provide the same type twice (std::vector<T>, std::basic_string<T>).

In C++, the copy constructor calls are also implicit. It doesn't offer an efficient Rc<str> equivalent without double indirection either, although Rust can do better than it already does.

Because people, call them ignorant as you want, use it to build any kind of dynamic strings, and those strings often end up small enough to fit in 23 characters. For high-volume applications there is rope and other non-standard solutions.

Slowing down every operation via branches to speed up construction doesn't seem like a good trade-off for most applications. Cloning strings is essentially free because most strings are immutable and cloning will just be a reference count. Poorly written applications aren't the case that's optimized for by Rust.

@comex
Copy link
Contributor

comex commented Jan 15, 2015

SSO is not more balanced. It bloats the code and hurts performance in most cases. Here's the thing: I have done extensive measuring / profiling / statistics in regards to this and gave up on my original plan to implement SSO for the good old ~str type while you're just making up bullshit as you go. It obviously slows down read operations too because you're adding branches to all of them...

There doesn't actually have to be a branch. Conditional move should be doable... and there are other tricks. Which perhaps you have already tried, per your extensive measuring / profiling - but these decisions are basically wagers. To hear vague references to profiling on an old version of Rust increases my personal believed probability that SSO is a bad idea, but not enough that it would be worth making non-future-proof decisions at such an early stage, when the alternative is to temporarily sacrifice one method that is basically unnecessary in most cases, since you can still convert to Vec<u8> by value without copies – hardly crippling the API!

(You say that other methods implicitly depend on there being a Vec<u8> inside String, but I do not understand what you mean. Please elaborate?)

This issue is not a decision about whether or not to use SSO.

@thestinger
Copy link
Contributor

It is connected because everyone is going to use a type named String for what they use similarly named types in every other language for, which usually includes a lot of immutable strings. (Without appropriate trait implementations for Box [or Rc even], most will likely do this even if they are aware of the issue. And in any case, for casual use Rust already has two 'string types', &str and String, so adding a third is somewhat confusing.)

The string slice type is str. Rust users will already know how &T, Box<T>, Rc<T> and Arc<T> works separate from any usage of strings. Missing trait implementations are issues that need to fixed, but most of it is already dealt with. The Box<[T]> incarnation of slices is already fully usable via allocation/copy-free conversions to and from Vec<T> and the same thing will be provided for String. The Box<str> type is not an immutable string though and does not offer the performance advantages of one, as I already pointed out above.

You have done microbenchmarks, which are useful, but not the macrobenchmarks which would be needed to resolve the question; and of course the results of such benchmarks would depend on the precise choices of representation for and amount of microoptimization done on the type in question, which would take quite a bit of time. Thankfully, I am in support merely of future proofing the design (a future decision in any case could take advantage of a number of large real-world Rust applications hopefully greater than two).

Yes, I have done macro-benchmarks. I have measured the impact on both the Rust compiler and Servo of various allocator optimizations / tradeoffs and various string / vector implementations. Crippling the type by forcing allocation/copying in cases that are current free of allocation/copying is not "future proofing", it would simply be misguided and poor API design.

As I've already pointed out several times, migrating in between different vector / string types is one of the most trivial sweeping changes you can make. The LLVM project does it regularly, because they have widespread usage of small strings and vectors and need to carefully identify when it makes sense to use them. It causes significant code bloat and hurts the performance of the non-copying / non-creation methods in essentially every case.

The vast majority of code is relatively cold code, and it should be optimized for size rather than raw performance. With this in mind, it would make absolutely no sense to pervasively use the small vector optimization.

I humbly suggest that, as names and API designs stand, the best approach might be to view it the other way around: that if, in the future, SSO or other optimizations to String make the mutable case significantly suboptimal (yes, 50% sounds rather worse than suboptimal, but without more details I'm not convinced your benchmark is representative of real world use cases), a separate type should be created for faster string building, rather than creating a new type for immutable strings. Or at least that the decision of whether to do this does not need to be made now, before all the data is available (since the optimizations in question have not even been written yet, and the impact of removing this one method is so low). Of course, in the meantime, since you are advocating library-based approaches in general, you could always write your own 'StrVec' type.

The small vector optimization hurts the performance of every operation other than construction and cloning in every case. It hurts slicing performance, it hurts fetching the length, it hurts bounds checking performance and so on. It adds branches to these operations, and they are not trivially predictable ones. It also significantly bloats the code size.

You're the one proposing that the status quo should be changed by crippling the type, so the burden of proof is on you to demonstrate that it's a good idea.

I guess I need to restate for what seems like the 10th time that the as_mut_vec method is certainly not the only one derived from the underlying representation being Vec<u8>.

@comex
Copy link
Contributor

comex commented Jan 16, 2015

(Not to imply that thestinger should have read what I wrote a few seconds ago, but just to keep the conversation in order, I asked in the last post for clarification on what these other methods from the last paragraph are.)

@thestinger
Copy link
Contributor

There doesn't actually have to be a branch. Conditional move should be doable... and there are other tricks. Which perhaps you have already tried, per your extensive measuring / profiling - but these decisions are basically wagers. To hear vague references to profiling on an old version of Rust increases my personal believed probability that SSO is a bad idea, but not enough that it would be worth making non-future-proof decisions at such an early stage, when the alternative is to temporarily sacrifice one method that is basically unnecessary in most cases, since you can still convert to Vec by value without copies – hardly crippling the API!

A conditional move isn't enough without a whole bunch of extra bloat / hacks (at least for most of the operations) and isn't even an option everywhere. The more hacks like small vector optimization that are used, the more the code is bloated - hurting caching and branch prediction. It's better to have smaller code in nearly every case because few cases are directly in the hot inner loops.

The sacrifice is more than one method. It introduces asymmetry into what should be a simple system where Box<str> and Box<[T]> are the owned slice types, with String and Vec<T> having free (assuming no extra capacity needs to be dropped) conversions to and from them. It means an end to reliable cheap conversions between String and Vec<u8> too, along with introducing a lot of pain for FFI.

There isn't a performance argument for doing it. Small strings really only belong in carefully chosen hot code because they're just plain bloated. Storing 11 or 23 bytes inline isn't enough to justify it at all. The cases where they're great almost always want to add a bunch of padding to store far more inline, considering that they're paying the price of branches and code bloat for it. This is how it's done in LLVM, where they're fairly widely used, but aren't the majority of vectors / strings.

This issue is not a decision about whether or not to use SSO.

It's not a decision about anything because it's RFC material, not issue tracker material. The discussion here is a waste of time beyond developing points to copy-paste into an RFC discussion.

@thestinger
Copy link
Contributor

It's very clean and useful to have UTF-8 wrappers around general purpose vector types. There simply isn't a valid reason to provide a string without exposing the underlying vector type. It would be a far inferior API and would bring Rust a lot closer to the mess of C++ collections (char[n], std::array<char, N>, std::unique_ptr<char[]>, std::vector<char>, std::basic_string<char> - all totally incompatible for no reason at all, and forcing many inefficiencies too - not a design with much to learn from), whether the excuse is "future proofing" or not.

@thestinger
Copy link
Contributor

Ultimately, it would be nice if the implementation of the string type could be extracted into a reusable facade to wrap around anything able to expose the random-access / contiguous vector API. Turning it into more than just a vector wrapper is stepping away from that kind of clean, reusable design.

@comex
Copy link
Contributor

comex commented Jan 16, 2015

Can you please answer my question about the other methods (inherently) derived from the underlying representation being Vec<u8>?

@thestinger
Copy link
Contributor

The allocation/copy-free conversion methods to/from Vec<u8> and soon Box<str> (just as Vec<T> provides those for Box<[T]>), among others.

@comex
Copy link
Contributor

comex commented Jan 16, 2015

If the string is not small, then this is possible regardless of the actual underlying type. If it is small - who said these had to be allocation free? You were just saying how cheap allocations were... and the copy for a small string would be all of 23 bytes (at maximum).

@aturon
Copy link
Member

aturon commented Jan 16, 2015

As @steveklabnik was alluding to, the Rust community greatly values respectful, technical discussion and recognizes that most decisions involve tradeoffs between competing concerns. Dismissive or aggressive argumentation makes forums unwelcoming, rather than fostering meaningful exchange or insight. I don’t feel that this comment thread has lived up to the high standards we strive for.


That said, in general the issue tracker is not a good place to discuss design changes, in part because of its relatively low visibility to the broader community.

At this point, I don’t think that continued discussion on this github issue is likely to be productive. For the time being, I'd like to ask that discussion move to the discuss thread on the topic. I’ll help draw more attention to it, to ensure that it is seen more widely by the community.

Finally, when previous design decisions have turned out not to address specific concerns, we’ve tried hard to revisit them. While there is limited time for such changes before 1.0, it might be worth trying to lay out the concerns in an RFC to foster a more in-depth, community-wide discussion. Hopefully such an RFC would draw from the conclusions from the discuss thread. I’d be glad to help with this process as well.

@thestinger
Copy link
Contributor

As @steveklabnik was alluding to, the Rust community greatly values respectful, technical discussion and recognizes that most decisions involve tradeoffs between competing concerns. Dismissive or aggressive argumentation makes forums unwelcoming, rather than fostering meaningful exchange or insight. I don’t feel that this comment thread has lived up to the high standards we strive for.

Piling on dismissive, condescending stuff like this isn't raising the level of discussion at all.

I'd like to ask that discussion move to the discuss thread on the topic. I’ll help draw more attention to it, to ensure that it is seen more widely by the community.

I don't participate there because it's even more awful than the discussion system here. I think I've said everything that I plan on saying anyway, and I'll just refine the arguments and make a more coherent case against this if an RFC is actually filed. This is just a practice game.

@mzabaluev
Copy link
Contributor

To cut back on unnecessary argument, perhaps it would be good to narrow down the issue:

  • as_mut_vec is the only method on String so far identified that necessarily exposes Vec, going above the general assumption of an amortized-linear buffer as the backing store.
  • Uses of this method, rare as they are, can be easily replaced with moving transformations: into_bytes to change the facade to Vec, and String::from_utf8_unchecked to change it back to String when done mutating the vector. If this is found happening on a hot path (and it has to be hot for such modification to have significant impact), consider taking such transformations outside of the hot path or using a specialized solution.
  • The presence of this method limits options for future optimization on String if needs for such optimization, specific to usage patterns of mutable strings, are discovered in later applications of Rust which is going to be much more extensive and varied than they are now. SSO is not necessarily the only possible optimization.

@mzabaluev
Copy link
Contributor

It's suprising that indexing/slicing is brought up. Safe indexing/slicing cannot be fast on String, or &str for that matter, because it has to assert UTF-8 sequence boundaries, so it has to have lots-o-branches regardless. Though many of those branches will be well-predicted and should be statically hotpath-optimized, doing such operations on a massive scale is not conductive to good performance.

Just about the only safe way to read strings that's worth optimizing is sequential iteration. For ultra-fast access to raw bytes, transform into Vec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants