Commit 09b003a

Literal expressions: document non-C-string textual literals
1 parent 8c77e8b commit 09b003a

1 file changed

+193
-5
lines changed

src/expressions/literal-expr.md

Lines changed: 193 additions & 5 deletions
@@ -26,29 +26,208 @@ Each of the lexical [literal][literal tokens] forms described earlier can make u
2626
5; // integer type
2727
```
2828

29+
In the descriptions below, the _string representation_ of a token is the sequence of characters from the input which matched the token's production in a *Lexer* grammar snippet.
30+
31+
> **Note**: this string representation never includes a character `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
32+
33+
## Escapes
34+
35+
The descriptions of textual literal expressions below make use of several forms of _escape_.
36+
37+
Each form of escape is characterised by:
38+
* an _escape sequence_: a sequence of characters, which always begins with `U+005C` (`\`)
39+
* an _escaped value_: either a single character or an empty sequence of characters
40+
41+
In the definitions of escapes below:
42+
* An _octal digit_ is any of the characters in the range \[`0`-`7`].
43+
* A _hexadecimal digit_ is any of the characters in the ranges \[`0`-`9`], \[`a`-`f`], or \[`A`-`F`].
44+
45+
### Simple escapes
46+
47+
Each sequence of characters occurring in the first column of the following table is an escape sequence.
48+
49+
In each case, the escaped value is the character given in the corresponding entry in the second column.
50+
51+
| Escape sequence | Escaped value |
52+
|-----------------|--------------------------|
53+
| `\0` | U+0000 (NUL) |
54+
| `\t` | U+0009 (HT) |
55+
| `\n` | U+000A (LF) |
56+
| `\r` | U+000D (CR) |
57+
| `\"` | U+0022 (QUOTATION MARK) |
58+
| `\'` | U+0027 (APOSTROPHE) |
59+
| `\\` | U+005C (REVERSE SOLIDUS) |
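As a quick check of the table, the sketch below pairs each simple escape, written as a `char` literal, with the scalar value listed for it. The function name `simple_escape_values` is made up for illustration and is not part of the reference or any library.

```rust
/// Returns each simple escape (written as a `char` literal) paired with the
/// Unicode scalar value listed for it in the table.
/// `simple_escape_values` is a hypothetical helper used only for illustration.
fn simple_escape_values() -> [(char, u32); 7] {
    [
        ('\0', 0x0000), // NUL
        ('\t', 0x0009), // HT
        ('\n', 0x000A), // LF
        ('\r', 0x000D), // CR
        ('\"', 0x0022), // QUOTATION MARK
        ('\'', 0x0027), // APOSTROPHE
        ('\\', 0x005C), // REVERSE SOLIDUS
    ]
}
```

Casting the first element of each pair with `as u32` should give the second element.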
60+
61+
### 8-bit escapes
62+
63+
The escape sequence consists of `\x` followed by two hexadecimal digits.
64+
65+
The escaped value is the character whose [Unicode scalar value] is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by [`u8::from_str_radix`] with radix 16.
66+
67+
> **Note**: the escaped value therefore has a [Unicode scalar value] in the range of [`u8`][numeric types].
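The "as if by `u8::from_str_radix`" wording can be sketched as a small function. The name `eight_bit_escape_value` is an assumption made for this example. The explicit digit check is also an addition: `u8::from_str_radix` on its own would accept a leading `+`, which is not a hexadecimal digit.

```rust
/// Computes the escaped value of an 8-bit escape from the two characters
/// that follow `\x`. Returns `None` if the input is not exactly two
/// hexadecimal digits.
/// `eight_bit_escape_value` is a hypothetical helper used only for illustration.
fn eight_bit_escape_value(two_hex_digits: &str) -> Option<u8> {
    let all_hex = two_hex_digits.chars().all(|c| c.is_ascii_hexdigit());
    if two_hex_digits.len() != 2 || !all_hex {
        return None;
    }
    // Two hexadecimal digits always fit in a `u8` (maximum 0xFF).
    u8::from_str_radix(two_hex_digits, 16).ok()
}
```

For instance, `eight_bit_escape_value("A0")` agrees with the byte literal `b'\xA0'`.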
68+
69+
### 7-bit escapes
70+
71+
The escape sequence consists of `\x` followed by an octal digit then a hexadecimal digit.
72+
73+
The escaped value is the character whose [Unicode scalar value] is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer, as if by [`u8::from_str_radix`] with radix 16.
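The only difference from an 8-bit escape is the restriction on the first digit, which keeps the value at or below `0x7F`. A minimal sketch of that rule follows; the helper name `seven_bit_escape_value` is invented for this example.

```rust
/// Returns the escaped character for the two characters following `\x` in a
/// 7-bit escape, or `None` if they are not an octal digit followed by a
/// hexadecimal digit.
/// `seven_bit_escape_value` is a hypothetical helper used only for illustration.
fn seven_bit_escape_value(digits: &str) -> Option<char> {
    let mut chars = digits.chars();
    let (first, second) = (chars.next()?, chars.next()?);
    let well_formed = chars.next().is_none()
        && ('0'..='7').contains(&first)
        && second.is_ascii_hexdigit();
    if !well_formed {
        return None;
    }
    // The octal first digit bounds the result to 0x00..=0x7F.
    u8::from_str_radix(digits, 16).ok().map(char::from)
}
```

Under this sketch `"7F"` is accepted and `"80"` is rejected, which mirrors why `'\x80'` is not a valid character literal.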
74+
75+
### Unicode escapes
76+
77+
The escape sequence consists of `\u{`, followed by a sequence of characters each of which is a hexadecimal digit or `_`, followed by `}`.
78+
79+
The escaped value is the character whose [Unicode scalar value] is the result of interpreting the hexadecimal digits contained in the escape sequence as a hexadecimal integer, as if by `u32::from_str_radix` with radix 16.
80+
81+
> **Note**: the permitted forms of a [CHAR_LITERAL] or [STRING_LITERAL] token ensure that there is such a character.
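The rule can be sketched as follows. The function name `unicode_escape_value` is made up for this example, and it takes only the text between `\u{` and `}`. The sketch parses with `u32::from_str_radix`, since a Unicode scalar value can exceed the range of `u8`. It does not enforce the token-level restrictions, such as the limit on the number of digits.

```rust
/// Computes the escaped value of a Unicode escape from the text between
/// `\u{` and `}`: underscores are skipped and the remaining hexadecimal
/// digits are read as a single integer.
/// `unicode_escape_value` is a hypothetical helper used only for illustration.
fn unicode_escape_value(between_braces: &str) -> Option<char> {
    let digits: String = between_braces.chars().filter(|&c| c != '_').collect();
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    let scalar = u32::from_str_radix(&digits, 16).ok()?;
    // `char::from_u32` rejects surrogates and values above U+10FFFF.
    char::from_u32(scalar)
}
```

With this helper, `"00E6"` yields the same character as the literal `'\u{00E6}'`, while `"D800"` (a surrogate) yields `None`.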
82+
83+
### String continuation escapes
84+
85+
The escape sequence consists of `\` followed immediately by `U+000A` (LF), and all following whitespace characters before the next non-whitespace character.
86+
For this purpose, the whitespace characters are `U+0009` (HT), `U+000A` (LF), `U+000D` (CR), and `U+0020` (SPACE).
87+
88+
The escaped value is an empty sequence of characters.
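A short illustration, with an invented function name `continued`: the `\` at the end of the first line, the line break, and the indentation on the next line together form one string continuation escape, so none of them appear in the resulting string.

```rust
/// Returns a string literal written across two source lines using a string
/// continuation escape. The backslash, the line break, and the leading
/// spaces before `bar` contribute nothing to the represented string.
/// `continued` is a hypothetical helper used only for illustration.
fn continued() -> &'static str {
    "foo\
     bar"
}
```

The result contains neither a line feed nor any spaces.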
89+
2990
## Character literal expressions
3091

3192
A character literal expression consists of a single [CHAR_LITERAL] token.
3293

33-
> **Note**: This section is incomplete.
94+
The expression's type is the primitive [`char`][textual types] type.
95+
96+
The token must not have a suffix.
97+
98+
The token's _literal content_ is the sequence of characters following the first `U+0027` (`'`) and preceding the last `U+0027` (`'`) in the string representation of the token.
99+
100+
The literal expression's _represented character_ is derived from the literal content as follows:
101+
102+
* If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
103+
* [Simple escapes]
104+
* [7-bit escapes]
105+
* [Unicode escapes]
106+
107+
* Otherwise the represented character is the single character that makes up the literal content.
108+
109+
The expression's value is the [`char`][textual types] corresponding to the represented character's [Unicode scalar value].
110+
111+
> **Note**: the permitted forms of a [CHAR_LITERAL] token ensure that these rules always produce a single character.
112+
113+
Examples of character literal expressions:
114+
115+
```rust
116+
'R'; // R
117+
'\''; // '
118+
'\x52'; // R
119+
'\u{00E6}'; // LATIN SMALL LETTER AE (U+00E6)
120+
```
34121

35122
## String literal expressions
36123

37124
A string literal expression consists of a single [STRING_LITERAL] or [RAW_STRING_LITERAL] token.
38125

39-
> **Note**: This section is incomplete.
126+
The expression's type is a shared reference (with `static` lifetime) to the primitive [`str`][textual types] type.
127+
That is, the type is `&'static str`.
128+
129+
The token must not have a suffix.
130+
131+
The token's _literal content_ is the sequence of characters following the first `U+0022` (`"`) and preceding the last `U+0022` (`"`) in the string representation of the token.
132+
133+
The literal expression's _represented string_ is a sequence of characters derived from the literal content as follows:
134+
135+
* If the token is a [STRING_LITERAL], each escape sequence of any of the following forms occurring in the literal content is replaced by the escape sequence's escaped value.
136+
* [Simple escapes]
137+
* [7-bit escapes]
138+
* [Unicode escapes]
139+
* [String continuation escapes]
140+
141+
These replacements take place in left-to-right order.
142+
For example, the token `"\\x41"` is converted to the characters `\` `x` `4` `1`.
143+
144+
* If the token is a [RAW_STRING_LITERAL], the represented string is identical to the literal content.
145+
146+
The expression's value is a reference to a statically allocated [`str`][textual types] containing the UTF-8 encoding of the represented string.
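Two of the points above can be checked with a small sketch; the function names `ae_bytes` and `backslash_x41` are assumptions made for this example. The first shows that the stored bytes are the UTF-8 encoding of the represented string, and the second shows the left-to-right replacement order.

```rust
/// Returns the bytes of the `str` produced by a string literal containing a
/// Unicode escape. U+00E6 is encoded in UTF-8 as two bytes.
/// `ae_bytes` is a hypothetical helper used only for illustration.
fn ae_bytes() -> &'static [u8] {
    "\u{00E6}".as_bytes()
}

/// Escapes are replaced left to right, so `\\` is consumed first and the
/// following `x41` is left as ordinary characters rather than read as `\x41`.
/// `backslash_x41` is a hypothetical helper used only for illustration.
fn backslash_x41() -> &'static str {
    "\\x41"
}
```

Here `ae_bytes()` holds the bytes `0xC3 0xA6`, and `backslash_x41()` equals the raw string `r"\x41"`.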
147+
148+
Examples of string literal expressions:
149+
150+
```rust
151+
"foo"; r"foo"; // foo
152+
"\"foo\""; r#""foo""#; // "foo"
153+
154+
"foo #\"# bar";
155+
r##"foo #"# bar"##; // foo #"# bar
156+
157+
"\x52"; "R"; r"R"; // R
158+
"\\x52"; r"\x52"; // \x52
159+
```
40160

41161
## Byte literal expressions
42162

43163
A byte literal expression consists of a single [BYTE_LITERAL] token.
44164

45-
> **Note**: This section is incomplete.
165+
The expression's type is the primitive [`u8`][numeric types] type.
166+
167+
The token must not have a suffix.
168+
169+
The token's _literal content_ is the sequence of characters following the first `U+0027` (`'`) and preceding the last `U+0027` (`'`) in the string representation of the token.
170+
171+
The literal expression's _represented character_ is derived from the literal content as follows:
172+
173+
* If the literal content is one of the following forms of escape sequence, the represented character is the escape sequence's escaped value:
174+
* [Simple escapes]
175+
* [8-bit escapes]
176+
177+
* Otherwise the represented character is the single character that makes up the literal content.
178+
179+
The expression's value is the represented character's [Unicode scalar value].
180+
181+
> **Note**: the permitted forms of a [BYTE_LITERAL] token ensure that these rules always produce a single character, whose Unicode scalar value is in the range of [`u8`][numeric types].
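Unlike a character literal, a byte literal may use an 8-bit escape above `0x7F`. A minimal sketch follows; the function name `high_byte` is invented for this example.

```rust
/// Returns a byte literal written with an 8-bit escape above 0x7F.
/// Its value is the represented character's Unicode scalar value, U+00A0.
/// `high_byte` is a hypothetical helper used only for illustration.
fn high_byte() -> u8 {
    b'\xA0'
}
```

Converting the result back with `char::from` gives the character U+00A0, which matches the "represented character" wording.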
182+
183+
Examples of byte literal expressions:
184+
185+
```rust
186+
b'R'; // 82
187+
b'\''; // 39
188+
b'\x52'; // 82
189+
b'\xA0'; // 160
190+
```
46191

47192
## Byte string literal expressions
48193

49-
A string literal expression consists of a single [BYTE_STRING_LITERAL] or [RAW_BYTE_STRING_LITERAL] token.
194+
A byte string literal expression consists of a single [BYTE_STRING_LITERAL] or [RAW_BYTE_STRING_LITERAL] token.
50195

51-
> **Note**: This section is incomplete.
196+
The expression's type is a shared reference (with `static` lifetime) to an array whose element type is [`u8`][numeric types].
197+
That is, the type is `&'static [u8; N]`, where `N` is the number of bytes in the represented string described below.
198+
199+
The token must not have a suffix.
200+
201+
The token's _literal content_ is the sequence of characters following the first `U+0022` (`"`) and preceding the last `U+0022` (`"`) in the string representation of the token.
202+
203+
The literal expression's _represented string_ is a sequence of characters derived from the literal content as follows:
204+
205+
* If the token is a [BYTE_STRING_LITERAL], each escape sequence of any of the following forms occurring in the literal content is replaced by the escape sequence's escaped value.
206+
* [Simple escapes]
207+
* [8-bit escapes]
208+
* [String continuation escapes]
209+
210+
These replacements take place in left-to-right order.
211+
For example, the token `b"\\x41"` is converted to the characters `\` `x` `4` `1`.
212+
213+
* If the token is a [RAW_BYTE_STRING_LITERAL], the represented string is identical to the literal content.
214+
215+
The expression's value is a reference to a statically allocated array containing the [Unicode scalar values] of the characters in the represented string, in the same order.
216+
217+
> **Note**: the permitted forms of [BYTE_STRING_LITERAL] and [RAW_BYTE_STRING_LITERAL] tokens ensure that these rules always produce array element values in the range of [`u8`][numeric types].
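The array length in the type can be observed directly. In the sketch below, the function names `foo_bytes` and `escaped_bytes` are made up for illustration. The return type annotations compile only because `N` matches the number of characters in each represented string.

```rust
/// The represented string of `b"foo"` has three characters, so the
/// expression has type `&'static [u8; 3]`.
/// `foo_bytes` is a hypothetical helper used only for illustration.
fn foo_bytes() -> &'static [u8; 3] {
    b"foo"
}

/// An 8-bit escape contributes a single array element, even above 0x7F.
/// `escaped_bytes` is a hypothetical helper used only for illustration.
fn escaped_bytes() -> &'static [u8; 2] {
    b"\xA0R"
}
```

Changing either length in the signatures should produce a type mismatch at compile time.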
218+
219+
Examples of byte string literal expressions:
220+
221+
```rust
222+
b"foo"; br"foo"; // foo
223+
b"\"foo\""; br#""foo""#; // "foo"
224+
225+
b"foo #\"# bar";
226+
br##"foo #"# bar"##; // foo #"# bar
227+
228+
b"\x52"; b"R"; br"R"; // R
229+
b"\\x52"; br"\x52"; // \x52
230+
```
52231

53232
## C string literal expressions
54233

@@ -167,6 +346,11 @@ The expression's type is the primitive [boolean type], and its value is:
167346
* false if the keyword is `false`
168347

169348

349+
[Simple escapes]: #simple-escapes
350+
[8-bit escapes]: #8-bit-escapes
351+
[7-bit escapes]: #7-bit-escapes
352+
[Unicode escapes]: #unicode-escapes
353+
[String continuation escapes]: #string-continuation-escapes
170354
[boolean type]: ../types/boolean.md
171355
[constant expression]: ../const_eval.md#constant-expressions
172356
[floating-point types]: ../types/numeric.md#floating-point-types
@@ -177,12 +361,16 @@ The expression's type is the primitive [boolean type], and its value is:
177361
[suffix]: ../tokens.md#suffixes
178362
[negation operator]: operator-expr.md#negation-operators
179363
[overflow]: operator-expr.md#overflow
364+
[textual types]: ../types/textual.md
365+
[Unicode scalar value]: http://www.unicode.org/glossary/#unicode_scalar_value
366+
[Unicode scalar values]: http://www.unicode.org/glossary/#unicode_scalar_value
180367
[`f32::from_str`]: ../../core/primitive.f32.md#method.from_str
181368
[`f32::INFINITY`]: ../../core/primitive.f32.md#associatedconstant.INFINITY
182369
[`f32::NAN`]: ../../core/primitive.f32.md#associatedconstant.NAN
183370
[`f64::from_str`]: ../../core/primitive.f64.md#method.from_str
184371
[`f64::INFINITY`]: ../../core/primitive.f64.md#associatedconstant.INFINITY
185372
[`f64::NAN`]: ../../core/primitive.f64.md#associatedconstant.NAN
373+
[`u8::from_str_radix`]: ../../core/primitive.u8.md#method.from_str_radix
186374
[`u128::from_str_radix`]: ../../core/primitive.u128.md#method.from_str_radix
187375
[CHAR_LITERAL]: ../tokens.md#character-literals
188376
[STRING_LITERAL]: ../tokens.md#string-literals
