Documentation/Evolution/DelimiterSyntax.md
Lines changed: 23 additions & 29 deletions
@@ -21,7 +21,7 @@ This proposal helps complete the story told in [Regex Type and Overview][regex-t
 
 **TODO: But is it?**
 
-A regex literal will be introduced using `/.../` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]):
+A regex literal will be introduced using `/.../` delimiters, within which the compiler will parse a regex (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]):
 
 ```swift
 // Matches "<identifier> = <hexadecimal value>", extracting the identifier and hex number
@@ -34,7 +34,7 @@ Due to the existing use of `/` in comment syntax and operators, there are some s
 
 ## Detailed design
 
-Choice of `/` as the regex literal delimiter requires a number of ambiguities to be resolved. And it requires some existing features of the language to be disallowed.
+Choosing `/` as the regex literal delimiter requires a number of ambiguities to be resolved. It also requires a couple of source breaking language changes to be introduced in a new language mode.
 
 ### Ambiguities with comment syntax
 
@@ -50,7 +50,7 @@ Perhaps the most obvious parsing ambiguity with `/.../` delimiters is with comme
 */
 ```
 
-In this case, the block comment would prematurely end on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal, however it is much more likely to occur in a regular expression given the prevalence of the `*` quantifier.
+In this case, the block comment would prematurely end on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal, though it is more likely to occur in a regex given the prevalence of the `*` quantifier. This issue can be avoided in many cases by using line comment syntax `//` instead, which it should be noted is the syntax that Xcode uses when commenting out multiple lines.
 
 - Block comment syntax also means that a regex literal would not be able to start with the `*` character, however this is less of a concern as it would not be valid regex syntax.
 
@@ -63,7 +63,7 @@ There would be a minor ambiguity with infix operators used with regex literals.
 
 In order to help avoid further parsing ambiguities, a regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax.
 
-<details><summary>Rationale</summary>
+#### Rationale
 
 This is due to 2 main ambiguities. The first of which arises when a `/.../` regex literal starts a new line. This is particularly problematic for result builders, where we expect it to be frequently used, for example:
 
@@ -75,7 +75,7 @@ Builder {
 }
 ```
 
-This is parsed as a single operator chain, however it is likely the user is expecting a regex literal. To resolve this ambiguity, a regex literal may not start with a space or tab character. This takes advantage of the fact that infix operators require consistent spacing on either side.
+This is parsed as a single operator chain, however it is likely the user is expecting a regex literal. To resolve this ambiguity, a regex literal may not start with a space or tab character. The above therefore remains an operator chain. This takes advantage of the fact that infix operators require consistent spacing on either side.
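As a minimal sketch of the spacing rule (the constant name `value` and the numbers are illustrative, not taken from the proposal), a `/` with whitespace on both sides is an infix operator, so the following stays a single division chain rather than gaining a regex literal on its second line:

```swift
// `/ 3.0 /` begins with a space, so under the rule described here it is
// not treated as the start of a regex literal. Both slashes have
// whitespace on each side and are parsed as infix division operators.
let value = 12.0
  / 3.0 /
  2.0

print(value) // prints "2.0", i.e. (12.0 / 3.0) / 2.0
```

Because the rule keys off the character immediately after the opening `/`, existing code formatted this way should keep its meaning, assuming the rule is adopted as written.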
 
 If a space or tab is needed as the first character, it must be escaped, e.g:
 
@@ -87,7 +87,7 @@ Builder {
 }
 ```
 
-The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function, for example:
+The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function or subscript, for example:
 
 ```swift
 let arr: [Double] = [2, 3, 4]
@@ -98,37 +98,31 @@ The `/` in the call to `reduce` is in a valid expression context, and as such co
 
 It should be noted that this only mitigates the issue, as it does not handle the case where the next character is a comma or right square bracket. These cases are explored further in the following section.
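The case that the `)` rule does cover can be sketched as follows. The array contents and constant names are hypothetical, picked so that the division is exact. Both forms place a `)` directly after the `/`, so neither should be read as the start of a regex literal:

```swift
// Hypothetical values, chosen as powers of two so the division is exact.
let values: [Double] = [2, 4, 8]

// `/)` cannot begin a regex literal, so this `/` remains an unapplied
// operator reference: ((64 / 2) / 4) / 8.
let direct = values.reduce(64, /)

// Wrapping the operator in parentheses keeps the same meaning and also
// puts a `)` immediately after the `/`.
let wrapped = values.reduce(64, (/))

print(direct, wrapped) // prints "1.0 1.0"
```

The parenthesized spelling is the one that also extends to argument positions where the `/` would otherwise be followed by a comma or right square bracket.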
 
-</details>
-
 ### Language changes required
 
 In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes in Swift 6 mode:
 
 - Deprecation of prefix operators containing the `/` character.
 - Parsing `/,` and `/]` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments.
-
-<details><summary>Rationale</summary>
 
-#### Prefix operators starting with `/`
+#### Prefix operators containing `/`
 
-We'd need to ban prefix operators starting with `/`, to avoid ambiguity with cases such as:
+We need to ban prefix operators starting with `/`, to avoid ambiguity with cases such as:
 
 ```swift
 let x = /0; let y = 1/
 let z = /^x^/
 ```
-
-Postfix `/` operators would be okay, as they'd only be treated as regex literal delimiters if we were already trying to lex as a regex literal.
 
-#### Prefix operators containing `/`
-
-Prefix operators *containing* `/` (not just at the start) need banning too, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g:
+Prefix operators containing `/` more generally also need banning, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g:
 
 ```swift
 let x = !/y / .foo()
 ```
-
-Otherwise it would be interpreted as the prefix operator `!/` by default, and require parens `!(/y /)` for regex parsing.
+
+Today, this is interpreted as the prefix operator `!/` on `y`. With the banning of prefix operators containing `/`, it becomes prefix `!` on a regex literal, with a member access `.foo`.
+
+Postfix `/` operators do not require banning, as they'd only be treated as regex literal delimiters if we are already trying to lex as a regex literal.
 
 #### `/,` and `/]` as regex literal openings
 
@@ -156,8 +150,6 @@ func baz(_ x: S) -> Int {
 
 `foo(/, /)` is currently parsed as 2 unapplied operator arguments. `bar(/, 2) + bar(/, 3)` is currently parsed as two independent calls that each take an unapplied `/` operator reference. Both of these would become regex literals arguments, `/, /` and `/, 2) + bar(/` respectively (though the latter would produce a regex error).
 
-**TODO: Do we want to talk about a heuristic that looks for unbalanced parens? I'm kind of hesitant to implement that, as it would have edge cases and might screw with regex errors that should be diagnosed as invalid regex, rather than some cryptic Swift syntactic error. Which would also make it harder to explain to users.**
-
 To disambiguate these cases, users will need to surround at least the opening `/` with parentheses, e.g:
 
 ```swift
@@ -180,6 +172,8 @@ This takes advantage of the fact that a regex literal will not be parsed if the
 
 The obvious choice here would follow string literals and use `#/.../#`.
 
+**TODO: What backslash rules do we want?**
+
 ### Multi-line literals
 
 The obvious choice for a multi-line regex literal would be to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. But this signifies a (documentation) comment, so a different multi-line delimiter would be needed, with no obvious choice. However, it's not clear that we need multi-line regex literals. The existing literals can be used inside a regex builder DSL.
@@ -192,7 +186,7 @@ Allowing non-semantic whitespace and other features of the extended syntax would
 
 ### Pound slash `#/.../#`
 
-**TODO: This needs to be rewritten to say that it's a transition syntax**
+**TODO: This needs to be rewritten to say that it's a potential transition syntax**
 
 This would be less syntactically ambiguous than `/.../`, while retaining some of the term-of-art familiarity. It would also provide a natural path through which to introduce `/.../` in a new language mode, as users could drop the `#` characters once they upgrade.
 
@@ -211,19 +205,19 @@ let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)'
 
 The use of two letter prefix could potentially be used as a namespace for future literal types. It would also have obvious extensions to raw and multi-line literals using `re#'...'#` and `re'''...'''` respectively. However, it is unusual for a Swift literal to be prefixed in this way. We also feel that its similarity to a string literal might have users confuse it with a raw string literal.
 
-Also, there are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. However, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. Those could be required instead. If a raw regex literal were later added, the single quote syntax could also be used.
+Also, there are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. However, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`. Those could be required instead. A raw regex literal syntax e.g `re#'...'#` would also avoid this issue.
 
 ### Prefixed double quote `re"...."`
 
-This would be a double quoted version of `re'...'`, more similar to string literal syntax. This has the advantage that single quote regex syntax e.g `(?'name')` would continue to work without requiring the use of the alternative syntax or "raw syntax" delimiters. However it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote may be useful in expressing this difference.
+This would be a double quoted version of `re'...'`, more similar to string literal syntax. This has the advantage that single quote regex syntax e.g `(?'name')` would continue to work without requiring the use of the alternative syntax or raw literal syntax. However it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote may be useful in expressing this difference.
 
 ### Single letter prefixed quote `r'...'`
 
 This would be a slightly shorter version of `re'...'`. While it's more concise, it could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings.
 
 ### Single quotes `'...'`
 
-This would be an even more concise version of `re'...'` that drops the prefix entirely. However, given how close it is to string literal syntax, it may not be entirely clear to users that `'...'` denotes a regular expression as opposed to some different form of string literal (e.g some form of character literal, or a string literal with different escaping rules).
+This would be an even more concise version of `re'...'` that drops the prefix entirely. However, given how close it is to string literal syntax, it may not be entirely clear to users that `'...'` denotes a regex as opposed to some different form of string literal (e.g some form of character literal, or a string literal with different escaping rules).
 
 We could help distinguish it from a string literal by requiring e.g `'/.../'`, though it may not be clear that the `/` characters are part of the delimiters rather than part of the literal. Additionally, this would potentially rule out the use of `'...'` as a future literal kind.
 
@@ -233,7 +227,7 @@ We could opt for for a more explicitly spelled out literal syntax such as `#rege
 
 Such a syntax would require the containing regex to correctly balance parentheses for groups, otherwise the rest of the line might be incorrectly considered a regex. This could place additional cognitive burden on the user, and may lead to an awkward typing experience. For example, if the user is editing a previously written regex, the syntax highlighting for the rest of the line may change, and unhelpful spurious errors may be reported. With a different delimiter, the compiler would be able to detect and better diagnose unbalanced parentheses in the regex.
 
-We could avoid the parenthesis balancing issue by requiring an additional internal delimiter such as `#regex(/.../)`. However it is even more heavyweight, and it may be unclear that `/` is part of the delimiter rather than part of the literal. Alternatively, we could replace the internal delimiter with another character such as ```#regex`...` ```, `#regex{...}`, or `#regex/.../`. However those would be inconsistent with the existing `#literal(...)` syntax and the first two would overload the existing meanings for the ``` `` ``` and `{}` delimiters.
+We could avoid the parenthesis balancing issue by requiring an additional internal delimiter such as `#regex(/.../)`. However this is even more heavyweight, and it may be unclear that `/` is part of the delimiter rather than part of an argument. Alternatively, we could replace the internal delimiter with another character such as ```#regex`...` ```, `#regex{...}`, or `#regex/.../`. However those would be inconsistent with the existing `#literal(...)` syntax and the first two would overload the existing meanings for the ``` `` ``` and `{}` delimiters.
 
 It should also be noted that `#regex(...)` would introduce a syntactic inconsistency where the argument of a `#literal(...)` is no longer necessarily valid Swift syntax, despite being written in the form of an argument.
 
@@ -243,7 +237,7 @@ We could reduce the visual weight of `#regex(...)` by only requiring `#(...)`. H
 
 ### Reusing string literal syntax
 
-Instead of supporting a first-class literal kind for regular expressions, we could instead allow users to write a regular expression in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to an `ExpressibleByRegexLiteral` conforming type.
+Instead of supporting a first-class literal kind for regex, we could instead allow users to write a regex in a string literal, and parse, diagnose, and generate the appropriate code when it's coerced to the `Regex` type.
 
 ```swift
 let regex: Regex = #"([[:alpha:]]\w*) = ([0-9A-F]+)"#
 - We would not be able to easily apply custom syntax highlighting and other editor features for the regex syntax.
-- It would require an `ExpressibleByRegexLiteral` contextual type to be treated as a regex, otherwise it would be defaulted to `String`, which may be undesired.
+- It would require a `Regex` contextual type to be treated as a regex, otherwise it would be defaulted to `String`, which may be undesired.
 - In an overloaded context it may be ambiguous or unclear whether a string literal is meant to be interpreted as a literal string or regex.
 - Regex-specific escape sequences such as `\w` would likely require the use of raw string syntax `#"..."#`, as they are otherwise invalid in a string literal.
 - It wouldn't be compatible with other string literal features such as interpolations.
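The point about `\w` and raw string syntax can be illustrated with a rough sketch. It assumes Foundation's `NSRegularExpression` as a stand-in for the proposed `Regex` type, and the helper `parseAssignment` is a hypothetical name introduced only for this example:

```swift
import Foundation

// Assumption: NSRegularExpression stands in for the proposed `Regex` type,
// which this sketch does not rely on.
// The raw string literal keeps `\w` intact; in a regular string literal
// it would be rejected as an invalid escape sequence.
let pattern = #"([[:alpha:]]\w*) = ([0-9A-F]+)"#

// Hypothetical helper: returns the identifier and hex digits, or nil if
// the pattern fails to compile or does not match the input.
func parseAssignment(_ input: String) -> (name: String, hex: String)? {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }
    let text = NSString(string: input)
    let whole = NSRange(location: 0, length: text.length)
    guard let match = regex.firstMatch(in: input, options: [], range: whole) else { return nil }
    return (text.substring(with: match.range(at: 1)),
            text.substring(with: match.range(at: 2)))
}
```

For instance, `parseAssignment("count = 1F")` would be expected to yield the name `count` and the hex digits `1F`. Note that the pattern is only checked when the helper runs, which is one of the costs of a string-based approach compared with a dedicated literal.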
@@ -266,4 +260,4 @@ Instead of adding a custom regex literal, we could require users to explicitly w