Hello, we want to issue an update to [Regular Expression Literals](https://forums.swift.org/t/pitch-regular-expression-literals/52820) and prepare for a formal proposal. The great delimiter deliberation continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax _inside_ a regex literal. Additionally, this is the syntax accepted from a string used for run-time regex construction, so we're devoting an entire pitch/proposal to the topic of _regex syntax_, distinct from the result builder DSL or the choice of delimiters for literals.
@@ -16,21 +16,50 @@ The overall story is laid out in [Regex Type and Overview](https://github.com/ap
16
16
17
17
Swift aims to be a pragmatic programming language, striking a balance between familiarity, interoperability, and advancing the art. Swift's `String` presents a uniquely Unicode-forward model of string, but currently suffers from limited processing facilities.
18
18
19
-
<!--
20
-
... tools need run time construction
21
-
... ns regular expression operates over a fundamentally different model and has limited syntactic and semantic support
22
-
... we prpose a best-in-class treatment of familiar regex syntax
23
-
-->
19
+
`NSRegularExpression` can construct a processing pipeline from a string containing [ICU regular expression syntax][icu-syntax]. However, it is inherently tied to ICU's engine and thus it operates over a fundamentally different model of string than Swift's `String`. It is also limited in features and carries a fair amount of Objective-C baggage, such as the need to translate between `NSRange` and `Range`.
20
+
21
+
```swift
22
+
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
23
+
let nsRegEx = try! NSRegularExpression(pattern: pattern)
24
+
25
+
func processEntry(_ line: String) -> Transaction? {
26
+
let range = NSRange(line.startIndex..<line.endIndex, in: line)
27
+
guard let result = nsRegEx.firstMatch(in: line, range: range),
28
+
let kindRange = Range(result.range(at: 1), in: line),
29
+
let kind = Transaction.Kind(line[kindRange]),
30
+
let dateRange = Range(result.range(at: 2), in: line),
31
+
let date = try? Date(String(line[dateRange]), strategy: dateParser),
32
+
let accountRange = Range(result.range(at: 3), in: line),
33
+
let amountRange = Range(result.range(at: 4), in: line),
Fixing these fundamental limitations requires migrating to a completely different engine and type system representation. This is the path we're proposing with `Regex`, outlined in [Regex Type and Overview][overview]. Details on the semantic differences between ICU's string model and Swift's `String` are discussed in [Unicode for String Processing][pitches].
24
46
25
47
The full string processing effort includes a regex type with strongly typed captures, the ability to create a regex from a string at runtime, a compile-time literal, a result builder DSL, protocols for intermixing 3rd party industrial-strength parsers with regex declarations, and a slew of regex-powered algorithms over strings.
26
48
27
49
This proposal specifically homes in on the _familiarity_ aspect by providing a best-in-class treatment of familiar regex syntax.
28
50
29
51
## Proposed Solution
30
52
31
-
<!--
32
-
... regex compiling and existential match type
33
-
-->
53
+
We propose run-time construction of `Regex` from a best-in-class treatment of familiar regular expression syntax. A `Regex` is generic over its `Output`, which includes capture information. This may be an existential `AnyRegexOutput`, or a concrete type provided by the user.
54
+
55
+
```swift
56
+
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
57
+
let regex = try! Regex(pattern)
58
+
// regex: Regex<AnyRegexOutput>
59
+
60
+
let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
61
+
try! Regex(pattern)
62
+
```
34
63
35
64
### Syntax
36
65
@@ -51,11 +80,87 @@ Regex syntax will be part of Swift's source-compatibility story as well as its b
51
80
52
81
## Detailed Design
53
82
54
-
<!--
55
-
... init, dynamic match, conversion to static
56
-
-->
83
+
We propose initializers to declare and compile a regex from syntax. Upon failure, these initializers throw compilation errors, such as for syntax or type errors. API for retrieving error information is future work.
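As a rough sketch of how a tool might consume a throwing initializer, the snippet below wraps compilation in `do`/`catch` and surfaces the error text. It assumes the `Regex(_:)` spelling shown in this pitch, as available in the Swift 5.7 standard library; `compile(_:)` is a hypothetical helper for illustration, not proposed API.

```swift
// Sketch only: assumes `Regex.init(_:)` as available in Swift 5.7+.
// `compile(_:)` is a hypothetical helper, not part of the proposal.
func compile(_ pattern: String) -> Regex<AnyRegexOutput>? {
  do {
    return try Regex(pattern)
  } catch {
    // The thrown error describes why compilation failed.
    print("Failed to compile \(pattern): \(error)")
    return nil
  }
}

let valid = compile(#"(\w+)\s+(\d+)"#)   // well-formed pattern
let invalid = compile(#"(\w+"#)          // missing closing parenthesis
```

Returning `nil` here is only one option; a tool that shows patterns to its user would more likely forward the error itself.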
84
+
85
+
```swift
86
+
extension Regex {
87
+
/// Parse and compile `pattern`, resulting in a strongly-typed capture list.
/// The range over which a value was captured. `nil` for no-capture.
118
+
public var range: Range<String.Index>?
119
+
120
+
/// The slice of the input over which a value was captured. `nil` for no-capture.
121
+
public var substring: Substring?
122
+
123
+
/// The captured value. `nil` for no-capture.
124
+
public var value: Any?
125
+
}
126
+
127
+
// Trivial collection conformance requirements
128
+
129
+
public var startIndex: Int { get }
130
+
131
+
public var endIndex: Int { get }
132
+
133
+
public var count: Int { get }
134
+
135
+
public func index(after i: Int) -> Int
136
+
137
+
public func index(before i: Int) -> Int
138
+
139
+
public subscript(position: Int) -> Element
140
+
}
141
+
```
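The collection interface declared above could be used along the following lines. This is a minimal sketch that assumes the API as it is available in the Swift 5.7 standard library, which may differ in detail from this draft; `captureStrings(in:pattern:)` is a hypothetical helper.

```swift
// Sketch only: assumes the Swift 5.7+ standard library.
// `captureStrings(in:pattern:)` is a hypothetical helper for illustration.
func captureStrings(in input: String, pattern: String) throws -> [String?] {
  let regex = try Regex(pattern)                 // Regex<AnyRegexOutput>
  guard let match = input.firstMatch(of: regex) else { return [] }
  // Element 0 is the whole match; the captures follow in order.
  // `substring` is nil for a capture that did not participate.
  return match.output.map { element in
    element.substring.map { String($0) }
  }
}

let fields = try! captureStrings(in: "apples 42", pattern: #"(\w+)\s+(\d+)"#)
```

Because the number of captures is only known at run time, the result is an array rather than a tuple.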
142
+
143
+
We propose adding an API to `Regex<AnyRegexOutput>.Match` to cast the output type to a concrete one. A regex match will lazily create a `Substring` on demand, so casting the match itself saves ARC traffic vs extracting and casting the output.
144
+
145
+
```swift
146
+
extension Regex.Match where Output == AnyRegexOutput {
147
+
/// Creates a type-erased regex match from an existing match.
148
+
///
149
+
/// Use this initializer to fit a regex match with strongly typed captures into the
150
+
/// use site of a dynamic regex match, i.e. one that was created from a string.
151
+
public init<Output>(_ match: Regex<Output>.Match)
152
+
153
+
/// Returns a typed match by converting the underlying values to the specified
154
+
/// types.
155
+
///
156
+
/// - Parameter type: The expected output type.
157
+
/// - Returns: A match generic over the output type if the underlying values can be converted to the
The rest of this proposal will be a detailed and exhaustive definition of our proposed regex syntax.
59
164
60
165
<details><summary>Grammar Notation</summary>
61
166
@@ -827,6 +932,12 @@ We are deferring runtime support for callouts from regex literals as future work
827
932
828
933
## Alternatives Considered
829
934
935
+
### Failable inits
936
+
937
+
There are many ways for compilation to fail, from syntactic errors to unsupported features to type mismatches. In the general case, run-time compilation errors are not recoverable by a tool without modifying the user's input. Even then, the thrown errors contain valuable information as to why compilation failed. For example, SwiftPM presents any errors directly to the user.
938
+
939
+
As proposed, the errors thrown will be the same errors presented to the Swift compiler, tracking fine-grained source locations with specific reasons why compilation failed. Defining a rich error API is future work, as these errors are rapidly evolving and it is too early to lock in the ABI.
Documentation/Evolution/RegexTypeOverview.md
+22 -16 (22 additions & 16 deletions)
@@ -1,4 +1,3 @@
1
-
2
1
# Regex Type and Overview
3
2
4
3
- Authors: [Michael Ilseman](https://github.com/milseman) and the Standard Library Team
@@ -135,11 +134,11 @@ Regexes can be created at run time from a string containing familiar regex synta
135
134
136
135
```swift
137
136
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
138
-
let regex = try! Regex(compiling: pattern)
137
+
let regex = try! Regex(pattern)
139
138
// regex: Regex<AnyRegexOutput>
140
139
141
140
let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
142
-
try! Regex(compiling: pattern)
141
+
try! Regex(pattern)
143
142
```
144
143
145
144
*Note*: The syntax accepted and further details on run-time compilation, including `AnyRegexOutput` and extended syntaxes, are discussed in [Run-time Regex Construction][pitches].
The result builder allows for inline failable value construction, which participates in the overall string processing algorithm: returning `nil` signals a local failure and the engine backtracks to try an alternative. This not only relieves the use site from post-processing, it enables new kinds of processing algorithms, allows for search-space pruning, and enhances debuggability.
227
226
228
-
Swift regexes describe an unambiguous algorithm, were choice is ordered and effects can be reliably observed. For example, a `print()` statement inside the `TryCapture`'s transform function will run whenever the overall algorithm naturally dictates an attempt should be made. Optimizations can only elide such calls if they can prove it is behavior-preserving (e.g. "pure").
227
+
Swift regexes describe an unambiguous algorithm, where choice is ordered and effects can be reliably observed. For example, a `print()` statement inside the `TryCapture`'s transform function will run whenever the overall algorithm naturally dictates an attempt should be made. Optimizations can only elide such calls if they can prove it is behavior-preserving (e.g. "pure").
229
228
230
229
`CustomMatchingRegexComponent`, discussed in [String Processing Algorithms][pitches], allows industrial-strength parsers to be used as regex components. This allows us to drop the overly-permissive pre-parsing step:
*Note*: Details on how references work are discussed in [Regex Builders][pitches]. `Regex.Match` supports referring to _all_ captures by position (`match.1`, etc.) whether named or referenced or neither. Due to compiler limitations, result builders do not support forming labeled tuples for named captures.
279
278
280
279
281
-
### Algorithms, algorithms everywhere
280
+
### Regex-powered algorithms
282
281
283
282
Regexes can be used right out of the box with a variety of powerful and convenient algorithms, including trimming, splitting, and finding/replacing all matches within a string.
284
283
285
284
These algorithms are discussed in [String Processing Algorithms][pitches].
286
285
287
286
288
-
### Onward Unicode
287
+
### Unicode handling
289
288
290
289
A regex describes an algorithm to be run over some model of string, and Swift's `String` has a rather unique Unicode-forward model. `Character` is an [extended grapheme cluster](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) and equality is determined under [canonical equivalence](https://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence).
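To make the string model concrete, the short sketch below compares a precomposed and a decomposed spelling of the same character using only existing `String` API. It illustrates the model a regex would run over and makes no claim about regex matching behavior itself.

```swift
// Two spellings of "é": one scalar versus a base letter plus combining accent.
let precomposed = "\u{E9}"
let decomposed = "e\u{301}"

// String equality uses canonical equivalence, so the two compare equal.
let sameCharacter = precomposed == decomposed

// `count` counts extended grapheme clusters (Characters), not scalars.
let characterCount = decomposed.count
let scalarCount = decomposed.unicodeScalars.count
```

Here `sameCharacter` is `true`, `characterCount` is 1, and `scalarCount` is 2.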
// Note: this allows `.0` when `Match` is not a tuple.
361
360
362
361
}
@@ -482,6 +481,13 @@ We're also looking for more community discussion on what the default type system
482
481
483
482
The actual `Match` struct just stores ranges: the `Substrings` are lazily created on demand. This avoids unnecessary ARC traffic and memory usage.
484
483
484
+
485
+
### `Regex<Match, Captures>` instead of `Regex<Output>`
486
+
487
+
The generic parameter `Output` is proposed to contain both the whole match (the `.0` element if `Output` is a tuple) and captures. One alternative we have considered is separating `Output` into the entire match and the captures, i.e. `Regex<Match, Captures>`, and using `Void` for `Captures` when there are no captures.
488
+
489
+
The biggest issue with this alternative design is that the numbering of `Captures` elements misaligns with the numbering of captures in textual regexes, where backreference `\0` refers to the entire match and captures start at `\1`. This design would sacrifice familiarity and have the pitfall of introducing off-by-one errors.
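The alignment that the proposed `Regex<Output>` design preserves can be sketched as follows. The example assumes the `Regex(_:as:)` initializer as available in the Swift 5.7 standard library; the pattern and input are made up for illustration.

```swift
// Sketch only: assumes `Regex.init(_:as:)` from Swift 5.7+.
let regex = try! Regex(#"(\w+)@(\w+)"#,
                       as: (Substring, Substring, Substring).self)
let match = "user@example".firstMatch(of: regex)!

// Tuple positions line up with textual capture numbers.
let whole = match.output.0    // entire match, like `\0`
let first = match.output.1    // first capture, like `\1`
let second = match.output.2   // second capture, like `\2`
```

With a separate `Captures` tuple, the first capture would instead sit at position 0, which is the off-by-one hazard described above.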
490
+
485
491
### Future work: static optimization and compilation
486
492
487
493
Swift's support for static compilation is still developing, and future work here is leveraging that to compile regexes when profitable. Many regexes describe simple [DFAs](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) and can be statically compiled into very efficient programs. Full static compilation needs to be balanced with code size concerns, as a matching-specific bytecode is typically far smaller than a corresponding program (especially since the bytecode interpreter is shared).