Fixes nits in string guide #17453
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -96,12 +96,11 @@ need, and it can make your lifetimes more complex. | |
|
||
## Generic functions | ||
|
||
To write a function that's generic over types of strings, use [the `Str` | ||
trait](http://doc.rust-lang.org/std/str/trait.Str.html): | ||
To write a function that's generic over types of strings, use `&str`. | ||
|
||
```{rust} | ||
fn some_string_length<T: Str>(x: T) -> uint { | ||
x.as_slice().len() | ||
fn some_string_length(x: &str) -> uint { | ||
x.len() | ||
} | ||
|
||
fn main() { | ||
|
@@ -111,15 +110,12 @@ fn main() { | |
|
||
let s = "Hello, world".to_string(); | ||
|
||
println!("{}", some_string_length(s)); | ||
println!("{}", some_string_length(s.as_slice())); | ||
} | ||
``` | ||
|
||
Both of these lines will print `12`. | ||
|
||
The only method that the `Str` trait has is `as_slice()`, which gives you | ||
access to a `&str` value from the underlying string. | ||
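
A minimal sketch of the `&str` version in present-day Rust, which is an assumption about the reader's toolchain since this diff predates 1.0. There, `uint` has been renamed `usize`, and a `&String` coerces to `&str`, so no `as_slice()` call is needed at the call site:

```rust
/// Accepts anything that can be borrowed as `&str`.
fn some_string_length(x: &str) -> usize {
    x.len() // length in bytes, not in characters
}

fn main() {
    let literal = "Hello, world";
    let owned = String::from("Hello, world");

    println!("{}", some_string_length(literal));
    // `&owned` is a `&String`, which deref-coerces to `&str`.
    println!("{}", some_string_length(&owned));
}
```

Both calls return `12` here because the text is pure ASCII, so bytes and characters coincide.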
|
||
## Comparisons | ||
|
||
To compare a String to a constant string, prefer `as_slice()`... | ||
|
@@ -161,25 +157,93 @@ indexing is basically never what you want to do. The reason is that each | |
character can be a variable number of bytes. This means that you have to iterate | ||
through the characters anyway, which is a O(n) operation. | ||
|
||
To iterate over a string, use the `graphemes()` method on `&str`: | ||
There's 3 basic levels of unicode (and its encodings): | ||
|
||
- code units, the underlying data type used to store everything | ||
- code points/unicode scalar values (char) | ||
- graphemes (visible characters) | ||
|
||
Rust provides iterators for each of these situations: | ||
|
||
- `.bytes()` will iterate over the underlying bytes | ||
- `.chars()` will iterate over the code points | ||
- `.graphemes()` will iterate over each grapheme | ||
|
||
Usually, the `graphemes()` method on `&str` is what you want: | ||
|
||
```{rust} | ||
let s = "αἰθήρ"; | ||
let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé"; | ||
Could you check that all this combining character stuff renders correctly in a few different browsers etc.? (GitHub is certainly doing a bad job of it.) It may be worth reducing the number slightly.
Really? It renders just fine on GitHub for me.
For example, this is what I see: https://www.dropbox.com/s/chaumhg5nla1tui/Screenshot%202014-09-23%2009.41.25.png?dl=0
I took my screenshot in GitHub. I'd actually bet that it's probably that I have fonts installed that you don't?
Perhaps? My first screenshot was in GitHub and the second was in Gmail; not sure what's going on.
Safari is the weirdest one:
(Woah, Safari is doing a bad job.) The best example of the problems is the grapheme section below; I find it hard to believe that the combining characters shouldn't be over their main letter. It may just be an issue of fonts, but we could try to tweak the example so it's slightly more portable. Also, I'm most concerned about this when actually rendered, since that's how most people will be viewing it. @steveklabnik could you put up a |
||
|
||
for l in s.graphemes(true) { | ||
println!("{}", l); | ||
} | ||
``` | ||
|
||
This prints: | ||
|
||
```{notrust,ignore} | ||
u͔ | ||
n͈̰̎ | ||
i̙̮͚̦ | ||
c͚̉ | ||
o̼̩̰͗ | ||
d͔̆̓ͥ | ||
é | ||
``` | ||
|
||
Note that `l` has the type `&str` here, since a single grapheme can consist of | ||
multiple codepoints, so a `char` wouldn't be appropriate. | ||
|
||
This will print out each character in turn, as you'd expect: first "α", then | ||
"ἰ", etc. You can see that this is different than just the individual bytes. | ||
Here's a version that prints out each byte: | ||
This will print out each visible character in turn, as you'd expect: first "u͔", then | ||
"n͈̰̎", etc. If you wanted each individual codepoint of each grapheme, you can use `.chars()`: | ||
|
||
```{rust} | ||
let s = "αἰθήρ"; | ||
let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé"; | ||
|
||
for l in s.chars() { | ||
println!("{}", l); | ||
} | ||
``` | ||
|
||
This prints: | ||
|
||
```{notrust,ignore} | ||
u | ||
͔ | ||
n | ||
̎ | ||
͈ | ||
̰ | ||
i | ||
̙ | ||
̮ | ||
͚ | ||
̦ | ||
c | ||
̉ | ||
͚ | ||
o | ||
͗ | ||
̼ | ||
̩ | ||
̰ | ||
d | ||
̆ | ||
̓ | ||
ͥ | ||
͔ | ||
e | ||
́ | ||
``` | ||
|
||
You can see how some of them are combining characters, and therefore the output | ||
looks a bit odd. | ||
|
||
If you want the individual byte representation of each codepoint, you can use | ||
`.bytes()`: | ||
|
||
```{rust} | ||
let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé"; | ||
|
||
for l in s.bytes() { | ||
println!("{}", l); | ||
|
@@ -189,16 +253,50 @@ for l in s.bytes() { | |
This will print: | ||
|
||
```{notrust,ignore} | ||
206 | ||
177 | ||
225 | ||
188 | ||
117 | ||
205 | ||
148 | ||
110 | ||
204 | ||
142 | ||
205 | ||
136 | ||
204 | ||
176 | ||
206 | ||
184 | ||
206 | ||
105 | ||
204 | ||
153 | ||
204 | ||
174 | ||
207 | ||
205 | ||
154 | ||
204 | ||
166 | ||
99 | ||
204 | ||
137 | ||
205 | ||
154 | ||
111 | ||
205 | ||
151 | ||
204 | ||
188 | ||
204 | ||
169 | ||
204 | ||
176 | ||
100 | ||
204 | ||
134 | ||
205 | ||
131 | ||
205 | ||
165 | ||
205 | ||
148 | ||
101 | ||
204 | ||
129 | ||
``` | ||
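
The last three values in the listing (`101`, `204`, `129`) are the letter `e` followed by U+0301, the combining acute accent. A minimal sketch of how code points map to bytes, using only the standard-library methods `chars`, `bytes`, and `len_utf8`. The helper name `utf8_widths` is hypothetical, and the string is written with an escape so the combining mark is unambiguous:

```rust
/// Pairs each code point of `s` with its UTF-8 width in bytes.
fn utf8_widths(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one visible character, two code points, three bytes.
    let s = "e\u{301}";

    for (c, width) in utf8_widths(s) {
        println!("U+{:04X} takes {} byte(s)", c as u32, width);
    }

    // The raw bytes, as `.bytes()` yields them.
    println!("{:?}", s.bytes().collect::<Vec<u8>>());
}
```

Grapheme iteration is left out of the sketch because, in current Rust, it is no longer part of the standard library.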
|
||
|
I feel like this sentence is being too strong and prescriptive. I'm not really sure...
If your intention is to 'fetch character x' like you would an ASCII string, I think this is correct.
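
The 'fetch character x' case can be sketched with standard-library calls, assuming "character" means a code point (`char`). Reaching the n-th one means walking the string from the start, while byte offsets are cheap but need not land on a character boundary. The helper name `nth_char` is hypothetical:

```rust
/// Returns the n-th code point of `s` (zero-based).
/// This walks the string from the start, so it is O(n).
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let s = "αἰθήρ";

    println!("{:?}", nth_char(s, 0));

    // 'α' is two bytes long, so byte offset 1 is not a character
    // boundary and `get` returns None instead of panicking.
    println!("{:?}", s.get(0..1));
    println!("{:?}", s.get(0..2));
}
```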