Settle conventions for ~str/StrBuf, ~[T]/Vec #13717

Closed
brson opened this issue Apr 24, 2014 · 14 comments

@brson
Contributor

brson commented Apr 24, 2014

There's no consensus yet on when to use each. Nominating.

@huonw
Member

huonw commented Apr 24, 2014

I strongly believe the types that retain more information, and are more flexible (Vec and StrBuf) should be the default, while any uses of the less flexible/"lossier" ~ types need to be carefully justified.

I know this is controversial and others would (somewhat) prefer the ~ types to be the default. In that case, I think it would be good to have a write up of the benefits of the ~ types and how these benefits outweigh the costs of always converting to a ~ type on return (rather than just letting the user do it when they actually need a ~ type), as well as how the ~ types will avoid their allocator deficiency (no way to specify an allocator parameter with ~).

@thestinger
Contributor

If we're ignoring performance and size considerations, I don't think ~[T] has any place at all. It would just cause the need to do conversions to and from Vec<T>, and that's extra noise Rust code hasn't needed to deal with in the past. If there's a semantic use case for a vector without a capacity, then I have to wonder why we're not providing the same thing for other containers like hash tables. The use case is that it's 2 words instead of 3, although we could also have a 1-word vector.
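The "2 words instead of 3" point can be checked directly. This is only a sketch in today's Rust, not the 2014 syntax used in this thread: it assumes `Box<[T]>` as the modern spelling of `~[T]`, and `words` is a hypothetical helper introduced for illustration.

```rust
use std::mem::size_of;

/// Size of `T` measured in machine words (`usize`-sized units).
fn words<T>() -> usize {
    size_of::<T>() / size_of::<usize>()
}

fn main() {
    // Growable vector: pointer + capacity + length.
    println!("Vec<u8>:   {} words", words::<Vec<u8>>());
    // Owned, non-growable slice (the modern `~[T]`): pointer + length.
    println!("Box<[u8]>: {} words", words::<Box<[u8]>>());
    // A borrowed slice has the same pointer + length layout.
    println!("&[u8]:     {} words", words::<&[u8]>());
}
```

The saving is one word per value, independent of how many elements the vector holds.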

There's no allocator parameter on the current unique pointers, but under the assumption that a future Uniq<T> type with an optional allocator parameter will exist there's still the issue of whether Vec<T> or Uniq<[T]> should be used.

Conversions from ~[T] to Vec<T> will be free, so from a performance point of view there's no issue with using ~[T] if it's all you need and then converting to Vec<T> later. There is an issue with converting Vec<T> to ~[T] as the known capacity is lost. In addition to that, you have to free all of the capacity if the length is zero due to the Option<T> optimization.

If our allocators take a mandatory size parameter when deallocating like C++ allocators then there's also a need to drop the excess capacity with shrink_to_fit() when the length is non-zero too.
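A rough illustration of the two conversion directions, again assuming modern Rust where `Box<[T]>` stands in for `~[T]`. The function name `round_trip` is hypothetical. `Vec::into_boxed_slice` discards excess capacity first, which is the lossy step described above, while the way back reuses the allocation.

```rust
/// Converts a growable vector into an owned, fixed-length slice and back,
/// returning (capacity before, boxed length, capacity after).
fn round_trip(len: usize, reserve: usize) -> (usize, usize, usize) {
    let mut v: Vec<u32> = Vec::with_capacity(reserve);
    v.extend(0..len as u32);
    let before = v.capacity();

    // Vec -> Box<[T]>: excess capacity is dropped, as with shrink_to_fit().
    let boxed: Box<[u32]> = v.into_boxed_slice();
    let boxed_len = boxed.len();

    // Box<[T]> -> Vec: no copy; the capacity starts out equal to the length.
    let back: Vec<u32> = boxed.into_vec();
    (before, boxed_len, back.capacity())
}

fn main() {
    let (before, len, after) = round_trip(3, 10);
    println!("capacity before: {before}, length: {len}, capacity after: {after}");
}
```

Any capacity reserved before the conversion is gone afterwards, so a later `push` on the recovered `Vec` has to reallocate.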

@Kimundi
Member

Kimundi commented Apr 24, 2014

Hm, let me try making a bullet list of all relevant aspects and assumptions that influence this decision. This is going to be a bit train-of-thought-y, and not based on first-hand experience, so I might be talking bullshit or considering unimportant details. ;)

I'm assuming ~[] to be a non-growable owned slice under DST, and also that strings will be handled identically, so Vec/StrBuf and ~[]/~Str are interchangeable here. Furthermore, I'm assuming that the two valid options are "Vec everywhere as recommended default" and "Both Vec and ~[] as recommended default".

So, as far as I can see at least these aspects are important:

  1. Requesting and resizing memory allocations is not zero-cost.
    • Depending on allocator some operations might be zero cost, but you can't assume it in general
  2. Forgetting about a suffix of an allocation does not work with all allocators.
    • Dropping excess capacity needs some kind of shrink_allocation operation in the generic case.
  3. Memory usage and time complexity in general should be as small as possible.
  4. An allocator might provide you with more memory than asked for, with zero cost penalty.
    • Meaning you can gain capacity for free sometimes.
  5. User might desire a way to encode "this will never grow" into the type of a value.
    • The usage is to improve concentration and ability to reason about code by knowing that a type is inherently ungrowable.
  6. The convention about which type to use should be as simple as possible.
  7. It should be easy for the user to decide which type to use.
  8. Internal implementation details should not leak because of this convention.
  9. People will make mistakes, or not read the docs, so the convention should not lead to errors spreading through a code base.
  10. Some people think that because ~[T] is a more built-in/vectory syntax than Vec<T>, it should be used more.

If it were just about 1-3, then the logical choice would be that everything that returns a vector that could conceivably have excess capacity (due to incremental build-up or similar) returns a Vec, and everything that returns a vector that just requires a single allocation returns a ~[]. If you want to grow it further, the latter can be turned into a Vec for zero cost, and if you don't want to grow you can just store both as is, with the only wasted space being the unused capacity field for a growable vector, which is a constant cost as opposed to the potential O(n) cost of shrink_allocation.

If you also consider 4, then the choice of using ~[] at all becomes harder, as it's now a choice between allowing the user to save one word of memory for zero cost, or potentially allowing the user to grow the allocation more cheaply than expected for zero cost, both being relatively weak optimizations.

5 might be an argument in favor of ~[], but we already offer better support than other languages for encoding this without using ~[]: A vector in an immutable location is already impossible to change, and both &[] and &mut [] offer a very cheap way to arrive at a type that is inherently not growable. However, both options are not exactly free of cognitive overhead, as they both restrict the base type Vec to behave like ~[], rather than having a restricted type to begin with. (In other words, it's slightly more complex.)
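To make point 5 concrete, here is a small sketch in current Rust syntax of how borrowed slices and immutable bindings already encode "this will never grow". The function names `double_all` and `total` are hypothetical.

```rust
/// Can change elements in place, but has no way to change the length:
/// `push`, `pop`, and `truncate` are `Vec` methods, not slice methods.
fn double_all(values: &mut [i32]) {
    for v in values.iter_mut() {
        *v *= 2;
    }
}

/// Read-only view: neither the elements nor the length can change.
fn total(values: &[i32]) -> i32 {
    values.iter().sum()
}

fn main() {
    let mut v = vec![1, 2, 3];
    double_all(&mut v); // `&mut Vec<i32>` coerces to `&mut [i32]`

    // Rebinding without `mut` freezes the vector entirely.
    let frozen = v;
    // frozen.push(4); // would not compile: `frozen` is not declared mutable
    println!("sum = {}", total(&frozen));
}
```

The restriction lives in the binding or the borrow rather than in the owning type, which is the extra bit of complexity mentioned above.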

But, there is also 6 and 7 to think about: Rust is already a complex language, so any convention a new user of the language has to learn up front should be as simple as is practical. And arrays and vectors, while being as basic as it gets, already are more complex than in other languages due to unsized [T], ownership, and unboxed values. And, as trivial as it sounds, always using the same type is easier than needing to decide or convert between two different types all the time.

For 8, if the convention becomes to choose between ~[] and Vec<T>, then this can leak implementation details, as the decision on what to return is partly based on how it's constructed. This can also lead to API instability or performance problems, as a change in the implementation either causes the API to change, or requires calling shrink_allocation to keep the API stable. Again, using Vec consistently would solve this simply by adding the constant cost of a capacity field.
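A sketch of the API-stability argument, in modern Rust with hypothetical function names. Both functions compute the same result with different construction strategies, yet share one signature, so swapping one implementation for the other would not be visible to callers.

```rust
/// Built incrementally with `push`, so the result may carry spare capacity.
fn evens_incremental(limit: u32) -> Vec<u32> {
    let mut out = Vec::new();
    for n in 0..limit {
        if n % 2 == 0 {
            out.push(n);
        }
    }
    out
}

/// Built by collecting an iterator whose length is known up front,
/// which typically needs only a single allocation.
fn evens_exact(limit: u32) -> Vec<u32> {
    (0..limit).step_by(2).collect()
}

fn main() {
    // Same return type either way; only the spare capacity may differ.
    println!("{:?}", evens_incremental(7));
    println!("{:?}", evens_exact(7));
}
```

Under a "return the tightest type" convention, the second function would return the non-growable type, and rewriting it in the style of the first would either change its signature or force a shrink before returning.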

The problem with 9 is that a wrongly used ~[] can bubble up through API layers, as other users assume it's being chosen for the correct reasons, or worse, convert it to a Vec along the way, with a potentially useful capacity getting lost without anyone noticing.

Lastly, with 10 I think this is the same "provide short, intuitive built-in syntax for common structures" desire that gets expressed every time language features move into the library or get more generic. Certain things simply look way more verbose in today's Rust than in that of one or two years ago, so every time a new verbosity/complexity develops, people try to find ways to minimize its impact. (I remember falling into the same trap at the time removing @T was first talked about.)
In this case, this is about losing ~[T] completely as a short form (at least as a recommended default type). Seeing how this argument is really just about syntax and DST in general, I don't think it should be considered for this decision, as the semantic and performance implications are way more important, and the ergonomic ones more far-reaching.

So in conclusion I think using Vec and StrBuf everywhere by default, instead of also ~[] and ~Str, is the better choice, as the advantages far outweigh the disadvantages.

@alexcrichton
Member

One concern I had recently is what to do about string literals. I would imagine that the type of a string literal will very much wish itself to be the default string type (whatever it is).

If the default string type were StrBuf, then how would string literals be dealt with? (I'm actually not sure how this would work)

@eddyb
Member

eddyb commented Apr 24, 2014

About string literals, keep in mind that you can have unsized values under DST, as long as they're lvalues.

Under DST, with the current types, *"foo" would yield an unsized lvalue of type str and static lifetime.
If we change the string literal itself to that, "foo" becomes str, &"foo" then is consistent as &'static str and you "only" need autoref to have it working in most places where it works today.

I would expect that boxing such a str would allow creating any sort of smart pointer to it, but that conflicts a bit with StrBuf.
Then again... how often is the conversion from &'static str to StrBuf actually desired? I wish we could have metrics for it.

@thestinger
Contributor

> One concern I had recently is what to do about string literals. I would imagine that the type of a string literal will very much wish itself to be the default string type (whatever it is).

I don't think the literal should be allocating memory. There's always a string slice involved, and I think dynamically allocating memory and copying from the string slice might as well be explicit. The literals for ~str are already going to be gone so StrBuf is no different.
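For illustration, this is roughly what "explicit" looks like in current Rust, where `String` plays the role of StrBuf: the literal is a borrowed `&'static str`, and a heap-allocated copy only appears when asked for. The function name `greeting` is hypothetical.

```rust
/// Explicitly copies borrowed string slices into a growable, heap-allocated buffer.
fn greeting(name: &str) -> String {
    let mut out = String::from("hello, "); // explicit allocation and copy
    out.push_str(name); // growing requires the owned, growable type
    out
}

fn main() {
    // The literal itself is just a slice of static data; nothing is allocated.
    let literal: &'static str = "foo";

    // Allocation happens only at an explicit conversion.
    let owned: String = literal.to_owned();

    println!("{} / {} / {}", literal, owned, greeting("world"));
}
```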

@alexcrichton
Member

Ah, I found a better way to phrase my concern. In the original DST proposal, Vec<T> is to [T] as StrBuf is to Str.

Competing proposals sound like they want to answer the question: Vec<T> is to [T] as Str is to ??. I am concerned about what ?? would be.

@thestinger
Contributor

I think array ([T, ..n]), slice ([T]) and vector (Vec<T>) are workable names for the normal types.
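A short sketch of the three kinds of sequence side by side. It uses current Rust syntax, in which the fixed-size array type is written `[T; N]` rather than the `[T, ..n]` form above. The function name `largest` is hypothetical.

```rust
/// Works on any contiguous sequence, whatever owns it.
fn largest(items: &[i32]) -> Option<i32> {
    items.iter().copied().max()
}

fn main() {
    let array: [i32; 3] = [4, 9, 2]; // array: fixed size, known at compile time
    let vector: Vec<i32> = vec![7, 1]; // vector: growable, heap-allocated
    let slice: &[i32] = &array[1..]; // slice: borrowed view of part of the array

    // Arrays and vectors both coerce to a slice at the call site.
    println!("{:?}", largest(&array));
    println!("{:?}", largest(&vector));
    println!("{:?}", largest(slice));
}
```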

I wouldn't mind just doing something similar for strings and using Text to go along with Str. It doesn't really make any less sense than saying vectors are dynamic arrays. It's the language itself giving it that meaning.

brson added this to the 1.0 milestone Apr 25, 2014
@brson
Contributor Author

brson commented Apr 25, 2014

P-backcompat-libs 1.0

@pnkfelix
Member

@alexcrichton I was just idly wondering about the names here. As a Lisper, I like Vec<T> is to [T] as Str is to Sym. (The intention being that Sym (as in "symbol") is an immutable (and often interned) character string.)

Having said that, I do not mind @thestinger's suggestion of Text (though I do not yet have an intuition as to which type Text now maps to versus what Str maps to).

@zkamsler
Contributor

I think Sym for the DST string type may be a false analogy, as it is unlikely to be either immutable or interned.

I think part of the difficulty here is that, while there are plenty of words to choose from for vector-like containers, almost all languages have called strings strings. Any string-like type that does not have string in the name will be at a disadvantage relative to the one that does.

@pnkfelix
Member

@zkamsler It's hard in general to mutate Rust (UTF-8) strings while leaving their length unchanged.

My intuition is that for the string analogue to [T], we are not going to be able to readily change their length (just like with [T]), and therefore they will tend to be immutable. Admittedly I am just guessing about that, maybe my intuition here is wrong.
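A minimal sketch of why length-preserving mutation is the exception for UTF-8 strings, written in current Rust. The function names `shout` and `utf8_width` are hypothetical. ASCII case changes keep every character at one byte, so they work through `&mut str`, while replacing a character with one of a different encoded width would change the byte length.

```rust
/// Uppercases ASCII letters in place. This is possible through `&mut str`
/// because every ASCII character keeps its one-byte encoding.
fn shout(s: &mut str) {
    s.make_ascii_uppercase();
}

/// Number of bytes one character occupies when encoded as UTF-8.
fn utf8_width(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    let mut owned = String::from("cafe");
    shout(&mut owned); // `&mut String` coerces to `&mut str`
    println!("{}", owned);

    // Swapping 'e' for U+00E9 cannot be done in place:
    // the replacement needs more bytes than the original.
    println!("{} vs {}", utf8_width('e'), utf8_width('\u{e9}'));
}
```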

@thestinger
Contributor

You'll still be able to perform all of the &mut [T] operations on ~[T]. I don't think they'll tend to be either mutable or immutable because I don't really see a use case for it :P. You do need to build the vector at some point, and ~[T] is going to make that painful.
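As a sketch of both halves of that claim, using `Box<[T]>` as the modern stand-in for `~[T]` (an assumption about the mapping, not syntax from this thread): in-place slice operations work on the owned slice, but constructing it goes through a `Vec` first. The function name `sort_descending` is hypothetical.

```rust
/// Sorts an owned, fixed-length slice in descending order, in place.
fn sort_descending(mut items: Box<[i32]>) -> Box<[i32]> {
    items.sort(); // a `&mut [T]` operation, available through deref
    items.reverse(); // likewise
    items
}

fn main() {
    // There is no `push` on a boxed slice, so it is built as a Vec and converted.
    let built: Box<[i32]> = vec![3, 1, 2].into_boxed_slice();
    let sorted = sort_descending(built);
    println!("{:?}", sorted);
}
```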

@alexcrichton
Member

Closing, String and &str will be the predominant string types, and Vec with &[T] will be the predominant vector types.

See #60 for the string debate; the vector decision was made in a meeting.
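In practice the resulting convention tends to look like the following sketch: borrow with `&str` and `&[T]` in parameters, return owned `String` and `Vec<T>`. The function names `join_words` and `squares` are hypothetical examples, not APIs discussed in this issue.

```rust
/// Takes borrowed string slices, returns an owned, growable `String`.
fn join_words(words: &[&str], sep: &str) -> String {
    words.join(sep)
}

/// Takes a borrowed slice, returns an owned, growable `Vec`.
fn squares(xs: &[i64]) -> Vec<i64> {
    xs.iter().map(|x| x * x).collect()
}

fn main() {
    let joined = join_words(&["settle", "conventions"], " ");
    let sq = squares(&[1, 2, 3]);
    println!("{joined}: {sq:?}");
}
```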

arcnmx pushed a commit to arcnmx/rust that referenced this issue Dec 17, 2022
…eykril

Handle raw identifiers in proc macro server

Fixes rust-lang#13706

When proc macros create `proc_macro::Ident`s, they pass an identifier text without "r#" prefix and a flag `is_raw` to proc macro server. Our `tt::Ident` currently stores the text *with* "r#" so we need to adjust them somewhere.

Rather than following rustc and adding `is_raw` field to our `tt::Ident`, I opted for adjusting the representation of identifiers in proc macro server, because we don't need the field outside it.

It's hard to write regression test for this, but at least I:
- ran `cargo +nightly t --features sysroot-abi` and all the tests passed
- built proc macro server with `cargo +nightly b -r --bin rust-analyzer-proc-macro-srv --features sysroot-abi` and made sure rust-lang#13706 resolved
  - For the record, the nightly versions used are `rustc 1.67.0-nightly (32e613b 2022-12-02)` and `cargo 1.67.0-nightly (e027c4b5d 2022-11-25)`.
8 participants