Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is String allowed to switch to a small-string optimization? #499

Open
Manishearth opened this issue Mar 19, 2024 · 10 comments
Open

Is String allowed to switch to a small-string optimization? #499

Manishearth opened this issue Mar 19, 2024 · 10 comments

Comments

@Manishearth
Copy link
Member

In theory, String could implement a small-string optimization where in certain cases the string is stored inline and Deref returns a pointer to the flat in-memory representation.

The only thing that really changes here would be the behavior of into_raw_parts(), it would be forced to perform a last minute allocation.

And that would potentially break Rust code that was calling this function with an older pointer.

It also of course breaks StableDeref, though there are already open questions about that.

Concretely, I have the following questions:

  • Would it make sense to be explicit about this on the invariants of String, such that people maintaining a copy of the stdlib can make such a change, and/or potentially contribute it upstream (making no judgement on whether SSOs are a good idea for Rust in general, just whether such a change should be allowed by Rust's rules)
  • Would it make sense to have a miri check that ensures that pointers obtained via Deref are not valid for from_raw_parts, only ones from into_raw_parts()?

even if we decide that StableDeref is an important property to maintain here, it might be worth getting unsafe code to not rely on into_raw_parts()' behavior, so that people tweaking the stdlib can still have some hope of having this work (e.g. "as long as we patch out impl StableDeref for String, we are fine")

Anyway, curious about thoughts.

cc @gribozavr

@RalfJung
Copy link
Member

I think this is primarily a t-libs / t-libs-api question; in this repo we're mostly talking about language-level questions.

@bjorn3
Copy link
Member

bjorn3 commented Mar 19, 2024

We can't do this for Vec at least due to the guarantees we provide for it. In fact this is explicitly called out in the documentation:

Vec will never perform a “small optimization” where elements are actually stored on the stack for two reasons:

  • It would make it more difficult for unsafe code to correctly manipulate a Vec. The contents of a Vec wouldn’t have a stable address if it were only moved, and it would be more difficult to determine if a Vec had actually allocated memory.
  • It would penalize the general case, incurring an additional branch on every access.

String is currently a Vec wrapper and even has two method to get the inner Vec: String::as_mut_vec and String::into_bytes. These would both need to allocate if we did small-string optimization.

@Manishearth
Copy link
Member Author

@RalfJung fair, though I think specifically the "can miri check this" is relevant here too

cc @rust-lang/libs-api

I would like this discussion to focus on whether we want to allow for the possibility, not whether we want to make such a change.

@RalfJung
Copy link
Member

I think specifically the "can miri check this" is relevant here too

There's a general question of whether and how Miri can and should detect library UB. For some cases that's trivial, e.g. all assert_unsafe_precondition! for library UB are checked by Miri. But this here is a very non-trivial case and it's not clear to me how we could model an interface that would let the String type inform Miri about this kind of UB. I don't want to hard-code the String type somewhere in the aliasing model, that code is already complicated enough as it is.

@CAD97
Copy link

CAD97 commented Mar 19, 2024

In discussion around the storages proposal, we've discussed the possibility of a "fixed size storage" which could allow String<InlineStorage<[u8; N]>> to become roughly { len: usize, data: [u8; N] }. Theoretically it could be possible to have the storage handle store capacity as well, such that it could be niched into the inline data space when data outlines onto the heap.

However, such tricks add further complexity into an already complex API surface and thus don't seem particularly helpful. Furthermore, a general storage can't niche data into the len field nor take advantage of the UTF-8 requirement on the data, so while related, can't fully replace a specific SSO-performing type.

As String::into_raw_parts is still unstable, people do open code it using as_mut_ptr. ... Which is potentially unsound since that's str::as_mut_ptr, String probably needs that method to shadow the str one, like Vec shadows the slice method. (Edit: yep, modifying the example to use a nonexact capacity causes Miri to complain about UB. rust-lang/rust#106593)

@the8472
Copy link
Member

the8472 commented Mar 20, 2024

pub unsafe fn as_mut_vec(&mut self) -> &mut Vec<u8> exists. And String has the same size as Vec, so to be able to take this reference they must have the same layout, at least for the global allocator.

@Lokathor
Copy link
Contributor

bjorn3 mentioned that. The method could theoretically allocate a small-optimized string before returning the vec reference.

@chorman0773
Copy link
Contributor

String also isn't guaranteed to have the same size of Vec<u8>

@m-ou-se
Copy link
Member

m-ou-se commented Mar 20, 2024

FWIW, the String docs currently say:

A String is made up of three components: a pointer to some bytes, a length, and a capacity. The pointer points to an internal buffer String uses to store its data. The length is the number of bytes currently stored in the buffer, and the capacity is the size of the buffer in bytes. As such, the length will always be less than or equal to the capacity.

This buffer is always stored on the heap.

@m-ou-se
Copy link
Member

m-ou-se commented Mar 20, 2024

I don't think alloc::string::String can implement small string optimization. However, I'd really like to see a future where alloc::string::String is not special, and it's effortless to just use sso_string::String or refcounted_string::String or whatever variant you like, with no downsides over alloc::string::String.

That means that things like str::repeat() and format!() would have to change their return type from -> String to -> impl StringTrait or something like that, which is probably not fully backwards compatible, but might be doable in some way.

There are a few other places where alloc::string::String is special, such as in panic handling, but I'm hoping we can find some nice solution for that too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants