Commit cef8295

Return the String guide to its former glory.
When we moved over to the book, we lost this.
1 parent b7930d9 commit cef8295

File tree

2 files changed

+284
-0
lines changed

src/doc/trpl/SUMMARY.md

+1
@@ -16,6 +16,7 @@
16 16
* [Standard Input](standard-input.md)
17 17
* [Guessing Game](guessing-game.md)
18 18
* [II: Intermediate Rust](intermediate.md)
19+
* [More Strings](more-strings.md)
19 20
* [Crates and Modules](crates-and-modules.md)
20 21
* [Testing](testing.md)
21 22
* [Pointers](pointers.md)

src/doc/trpl/more-strings.md

+283
@@ -0,0 +1,283 @@
1+
% More Strings
2+
3+
Strings are an important concept to master in any programming language. If you
4+
come from a managed language background, you may be surprised at the complexity
5+
of string handling in a systems programming language. Efficient access and
6+
allocation of memory for a dynamically sized structure involves a lot of
7+
details. Luckily, Rust has lots of tools to help us here.
8+
9+
A **string** is a sequence of Unicode scalar values encoded as a stream of
10+
UTF-8 bytes. All strings are guaranteed to be validly encoded UTF-8 sequences.
11+
Additionally, strings are not null-terminated and can contain null bytes.
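
A minimal sketch of the last point, written against current stable Rust rather than the compiler this guide originally targeted. The helper name `byte_len` is hypothetical and exists only for illustration; it shows that a NUL byte is ordinary data that counts toward the length:

```rust
/// Hypothetical helper: returns the length of a string slice in bytes.
fn byte_len(s: &str) -> usize {
    s.len()
}

fn main() {
    // A string with a NUL byte in the middle.
    let with_nul = "ab\0cd";

    // The length is stored alongside the pointer, so nothing stops at the NUL.
    println!("{}", byte_len(with_nul)); // 5
    println!("{}", with_nul.contains('\0')); // true
}
```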
12+
13+
Rust has two main types of strings: `&str` and `String`.
14+
15+
# &str
16+
17+
The first kind is a `&str`. This is called a 'string slice'.
18+
String literals are of the type `&str`:
19+
20+
```
21+
let string = "Hello there.";
22+
```
23+
24+
Like any Rust reference, string slices have an associated lifetime. A string
25+
literal is a `&'static str`. A string slice can be written without an explicit
26+
lifetime in many cases, such as in function arguments. In these cases the
27+
lifetime will be inferred:
28+
29+
```
30+
fn takes_slice(slice: &str) {
31+
    println!("Got: {}", slice);
32+
}
33+
```
34+
35+
Like vector slices, string slices are simply a pointer plus a length. This
36+
means that they're a 'view' into an already-allocated string, such as a
37+
string literal or a `String`.
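
The 'pointer plus a length' description can be checked directly. This is a sketch in current stable Rust; `first_bytes` is a hypothetical helper, not part of the standard library:

```rust
use std::mem::size_of;

/// Hypothetical helper: borrows the first `n` bytes of `s` as a string slice.
/// Panics if `n` does not fall on a character boundary.
fn first_bytes(s: &str, n: usize) -> &str {
    &s[..n]
}

fn main() {
    let owned = String::from("Hello there.");
    let view = first_bytes(&owned, 5);

    println!("{}", view); // Hello

    // The slice points into the String's existing buffer; nothing is copied.
    println!("{}", view.as_ptr() == owned.as_ptr()); // true

    // A `&str` occupies two machine words: the pointer and the length.
    println!("{}", size_of::<&str>() == 2 * size_of::<usize>()); // true
}
```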
38+
39+
# String
40+
41+
A `String` is a heap-allocated string. This string is growable, and is also
42+
guaranteed to be UTF-8.
43+
44+
```
45+
let mut s = "Hello".to_string();
46+
println!("{}", s);
47+
48+
s.push_str(", world.");
49+
println!("{}", s);
50+
```
51+
52+
You can coerce a `String` into a `&str` by dereferencing it:
53+
54+
```
55+
fn takes_slice(slice: &str) {
56+
    println!("Got: {}", slice);
57+
}
58+
59+
fn main() {
60+
    let s = "Hello".to_string();
61+
    takes_slice(&*s);
62+
}
63+
```
64+
65+
You can also get a `&str` from a stack-allocated array of bytes:
66+
67+
```
68+
use std::str;
69+
70+
let x: &[u8] = &[b'a', b'b'];
71+
let stack_str: &str = str::from_utf8(x).unwrap();
72+
```
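
Because every `&str` must be valid UTF-8, `str::from_utf8` returns a `Result`, and calling `unwrap()` on it panics when the bytes are malformed. The following sketch, in current stable Rust, handles the error instead; `text_or` is a hypothetical helper introduced only for this example:

```rust
use std::str;

/// Hypothetical helper: returns the bytes as text if they are valid UTF-8,
/// and `fallback` otherwise.
fn text_or<'a>(bytes: &'a [u8], fallback: &'a str) -> &'a str {
    match str::from_utf8(bytes) {
        Ok(s) => s,
        Err(_) => fallback,
    }
}

fn main() {
    // Plain ASCII bytes are valid UTF-8.
    println!("{}", text_or(&[b'a', b'b'], "<invalid>")); // ab

    // The byte 0xff never appears in well-formed UTF-8.
    println!("{}", text_or(&[0xff, b'a'], "<invalid>")); // <invalid>
}
```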
73+
74+
# Best Practices
75+
76+
## `String` vs. `&str`
77+
78+
In general, you should prefer `String` when you need ownership, and `&str` when
79+
you just need to borrow a string. This is very similar to using `Vec<T>` vs. `&[T]`,
80+
and `T` vs. `&T` in general.
81+
82+
This means starting off with this:
83+
84+
```{rust,ignore}
85+
fn foo(s: &str) {
86+
```
87+
88+
and only moving to this:
89+
90+
```{rust,ignore}
91+
fn foo(s: String) {
92+
```
93+
94+
if you have a good reason. It's not polite to hold on to ownership you don't
95+
need, and it can make your lifetimes more complex.
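
As a rough illustration of the guideline, in current stable Rust, here is one function that only reads its argument and one that has a reason to own it. Both `shout` and `with_suffix` are hypothetical names chosen for this sketch:

```rust
/// Only reads the text, so borrowing is enough.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

/// Reuses the caller's buffer, which is a reasonable case for taking ownership.
fn with_suffix(mut s: String, suffix: &str) -> String {
    s.push_str(suffix);
    s
}

fn main() {
    let greeting = String::from("hello");

    // Borrowing leaves `greeting` usable afterwards.
    println!("{}", shout(&greeting)); // HELLO

    // Passing by value moves `greeting`; it cannot be used after this call.
    let longer = with_suffix(greeting, ", world");
    println!("{}", longer); // hello, world
}
```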
96+
97+
## Generic functions
98+
99+
To write a function that's generic over types of strings, use `&str`.
100+
101+
```
102+
fn some_string_length(x: &str) -> usize {
103+
    x.len()
104+
}
105+
106+
fn main() {
107+
    let s = "Hello, world";
108+
109+
    println!("{}", some_string_length(s));
110+
111+
    let s = "Hello, world".to_string();
112+
113+
    println!("{}", some_string_length(&s));
114+
}
115+
```
116+
117+
Both of these lines will print `12`.
118+
119+
## Indexing strings
120+
121+
You may be tempted to try to access a certain character of a `String`, like
122+
this:
123+
124+
```{rust,ignore}
125+
let s = "hello".to_string();
126+
127+
println!("{}", s[0]);
128+
```
129+
130+
This does not compile. This is on purpose. In the world of UTF-8, direct
131+
indexing is basically never what you want to do. The reason is that each
132+
character can occupy a variable number of bytes (one to four). This means that you have to iterate
133+
through the characters anyway, which is an O(n) operation.
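
A sketch of what can be done instead of indexing, in current stable Rust. The helper `nth_char` is hypothetical; it makes the linear scan explicit, and the checked `get` method shows why byte offsets are not character offsets:

```rust
/// Hypothetical helper: returns the `n`th code point, walking the string from
/// the start. This is O(n), which is the cost that indexing would hide.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    // 'h', then U+00E9 (two bytes in UTF-8), then "llo".
    let s = "h\u{e9}llo";

    println!("{:?}", nth_char(s, 1)); // Some('é')

    // Byte 1 is only the first half of that character.
    println!("{}", s.as_bytes()[1]); // 195

    // Slicing by byte range must land on character boundaries;
    // `get` returns None instead of panicking when it does not.
    println!("{:?}", s.get(0..2)); // None
    println!("{:?}", s.get(0..3)); // Some("hé")
}
```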
134+
135+
There are three basic levels of Unicode (and its encodings):
136+
137+
- code units, the underlying data type used to store everything
138+
- code points/Unicode scalar values (`char`)
139+
- graphemes (visible characters)
140+
141+
Rust provides iterators for each of these situations:
142+
143+
- `.bytes()` will iterate over the underlying bytes
144+
- `.chars()` will iterate over the code points
145+
- `.graphemes()` will iterate over each grapheme
146+
147+
Usually, the `graphemes()` method on `&str` is what you want:
148+
149+
```
150+
let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé";
151+
152+
for l in s.graphemes(true) {
153+
    println!("{}", l);
154+
}
155+
```
156+
157+
This prints:
158+
159+
```text
160+
u͔
161+
n͈̰̎
162+
i̙̮͚̦
163+
c͚̉
164+
o̼̩̰͗
165+
d͔̆̓ͥ
166+
é
167+
```
168+
169+
Note that `l` has the type `&str` here, since a single grapheme can consist of
170+
multiple code points, so a `char` wouldn't be appropriate.
171+
172+
This will print out each visible character in turn, as you'd expect: first "u͔", then
173+
"n͈̰̎", etc. If you want each individual code point of each grapheme, you can use `.chars()`:
174+
175+
```
176+
let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé";
177+
178+
for l in s.chars() {
179+
    println!("{}", l);
180+
}
181+
```
182+
183+
This prints:
184+
185+
```text
186+
u
187+
͔
188+
n
189+
̎
190+
͈
191+
̰
192+
i
193+
̙
194+
̮
195+
͚
196+
̦
197+
c
198+
̉
199+
͚
200+
o
201+
͗
202+
̼
203+
̩
204+
̰
205+
d
206+
̆
207+
̓
208+
ͥ
209+
͔
210+
e
211+
́
212+
```
213+
214+
You can see how some of them are combining characters, and therefore the output
215+
looks a bit odd.
216+
217+
If you want the individual byte representation of each code point, you can use
218+
`.bytes()`:
219+
220+
```
221+
let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé";
222+
223+
for l in s.bytes() {
224+
    println!("{}", l);
225+
}
226+
```
227+
228+
This will print:
229+
230+
```text
231+
117
232+
205
233+
148
234+
110
235+
204
236+
142
237+
205
238+
136
239+
204
240+
176
241+
105
242+
204
243+
153
244+
204
245+
174
246+
205
247+
154
248+
204
249+
166
250+
99
251+
204
252+
137
253+
205
254+
154
255+
111
256+
205
257+
151
258+
204
259+
188
260+
204
261+
169
262+
204
263+
176
264+
100
265+
204
266+
134
267+
205
268+
131
269+
205
270+
165
271+
205
272+
148
273+
101
274+
204
275+
129
276+
```
277+
278+
Many more bytes than graphemes!
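
The gap between the levels can also be measured rather than printed. This sketch uses current stable Rust and a hypothetical helper, `counts`. It covers only bytes and code points, because in current Rust grapheme iteration is provided by the external `unicode-segmentation` crate rather than the standard library:

```rust
/// Hypothetical helper: returns (number of bytes, number of code points).
fn counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // 'e' followed by U+0301 (combining acute accent): one visible character.
    let s = "e\u{301}";

    let (bytes, code_points) = counts(s);
    println!("{} bytes, {} code points", bytes, code_points); // 3 bytes, 2 code points
}
```

The three bytes here are `101`, `204` and `129`, the same values that end the byte listing above.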
279+
280+
# Other Documentation
281+
282+
* [the `&str` API documentation](std/str/index.html)
283+
* [the `String` API documentation](std/string/index.html)
