
Commit b86ca70

milseman authored and Azoy committed
Update regex syntax pitch (swiftlang#258)
* Update regex syntax pitch
* Rename file
1 parent cc91315 commit b86ca70

2 files changed: +153 -13 lines changed


Documentation/Evolution/RegexSyntax.md renamed to Documentation/Evolution/RegexSyntaxRunTimeConstruction.md

Lines changed: 124 additions & 13 deletions
@@ -2,7 +2,7 @@
22
Hello, we want to issue an update to [Regular Expression Literals](https://forums.swift.org/t/pitch-regular-expression-literals/52820) and prepare for a formal proposal. The great delimiter deliberation continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax _inside_ a regex literal. Additionally, this is the syntax accepted from a string used for run-time regex construction, so we're devoting an entire pitch/proposal to the topic of _regex syntax_, distinct from the result builder DSL or the choice of delimiters for literals.
33
-->
44

5-
# Run-time Regex Construction
5+
# Regex Syntax and Run-time Construction
66

77
- Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman)
88

@@ -16,21 +16,50 @@ The overall story is laid out in [Regex Type and Overview](https://github.com/ap
1616

1717
Swift aims to be a pragmatic programming language, striking a balance between familiarity, interoperability, and advancing the art. Swift's `String` presents a uniquely Unicode-forward model of string, but currently suffers from limited processing facilities.
1818

19-
<!--
20-
... tools need run time construction
21-
... ns regular expression operates over a fundamentally different model and has limited syntactic and semantic support
22-
... we prpose a best-in-class treatment of familiar regex syntax
23-
-->
19+
`NSRegularExpression` can construct a processing pipeline from a string containing [ICU regular expression syntax][icu-syntax]. However, it is inherently tied to ICU's engine and thus it operates over a fundamentally different model of string than Swift's `String`. It is also limited in features and carries a fair amount of Objective-C baggage, such as the need to translate between `NSRange` and `Range`.
20+
21+
```swift
22+
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
23+
let nsRegEx = try! NSRegularExpression(pattern: pattern)
24+
25+
func processEntry(_ line: String) -> Transaction? {
26+
let range = NSRange(line.startIndex..<line.endIndex, in: line)
27+
guard let result = nsRegEx.firstMatch(in: line, range: range),
28+
let kindRange = Range(result.range(at: 1), in: line),
29+
let kind = Transaction.Kind(line[kindRange]),
30+
let dateRange = Range(result.range(at: 2), in: line),
31+
let date = try? Date(String(line[dateRange]), strategy: dateParser),
32+
let accountRange = Range(result.range(at: 3), in: line),
33+
let amountRange = Range(result.range(at: 4), in: line),
34+
let amount = try? Decimal(
35+
String(line[amountRange]), format: decimalParser)
36+
else {
37+
return nil
38+
}
39+
40+
return Transaction(
41+
kind: kind, date: date, account: String(line[accountRange]), amount: amount)
42+
}
43+
```
44+
45+
Fixing these fundamental limitations requires migrating to a completely different engine and type system representation. This is the path we're proposing with `Regex`, outlined in [Regex Type and Overview][overview]. Details on the semantic differences between ICU's string model and Swift's `String` are discussed in [Unicode for String Processing][pitches].
2446

2547
The full string processing effort includes a regex type with strongly typed captures, the ability to create a regex from a string at runtime, a compile-time literal, a result builder DSL, protocols for intermixing 3rd party industrial-strength parsers with regex declarations, and a slew of regex-powered algorithms over strings.
2648

2749
This proposal specifically homes in on the _familiarity_ aspect by providing a best-in-class treatment of familiar regex syntax.
2850

2951
## Proposed Solution
3052

31-
<!--
32-
... regex compiling and existential match type
33-
-->
53+
We propose run-time construction of `Regex` from a best-in-class treatment of familiar regular expression syntax. A `Regex` is generic over its `Output`, which includes capture information. This may be an existential `AnyRegexOutput`, or a concrete type provided by the user.
54+
55+
```swift
56+
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
57+
let regex = try! Regex(compiling: pattern)
58+
// regex: Regex<AnyRegexOutput>
59+
60+
let typedRegex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
61+
try! Regex(compiling: pattern)
62+
```
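
For readers on Swift 5.7 or later, the two forms of construction can be tried with the API that eventually shipped. This is a minimal sketch, not the pitched API: it assumes the shipped spellings `Regex(_:)` and `Regex(_:as:)` in place of `Regex(compiling:)`, and the function `parsePair`, its pattern, and its input format are hypothetical.

```swift
// Assumption: Swift 5.7+ standard library. Run-time construction shipped as
// `Regex(_:)` and `Regex(_:as:)`; the pitch above spells it `Regex(compiling:)`.

/// Parses a hypothetical "key=number" entry, e.g. "retries=3".
func parsePair(_ input: String) throws -> (key: String, value: Int)? {
  let pattern = #"(\w+)=(\d+)"#

  // Existential output: the capture structure is only discovered at run time.
  let dynamicRegex = try Regex(pattern)  // Regex<AnyRegexOutput>
  guard input.wholeMatch(of: dynamicRegex) != nil else { return nil }

  // Concrete output: whole match plus two captures, verified during compilation.
  let typedRegex = try Regex(pattern, as: (Substring, Substring, Substring).self)
  guard let match = input.wholeMatch(of: typedRegex),
        let value = Int(match.output.2) else { return nil }
  return (key: String(match.output.1), value: value)
}
```

The typed form gives positional access to captures (`match.output.1`), while the existential form defers everything to run time; both throw if the pattern itself is malformed.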
3463

3564
### Syntax
3665

@@ -51,11 +80,87 @@ Regex syntax will be part of Swift's source-compatibility story as well as its b
5180

5281
## Detailed Design
5382

54-
<!--
55-
... init, dynamic match, conversion to static
56-
-->
83+
We propose initializers to declare and compile a regex from syntax. Upon failure, these initializers throw compilation errors, such as for syntax or type errors. API for retrieving error information is future work.
84+
85+
```swift
86+
extension Regex {
87+
/// Parse and compile `pattern`, resulting in a strongly-typed capture list.
88+
public init(compiling pattern: String, as: Output.Type = Output.self) throws
89+
}
90+
extension Regex where Output == AnyRegexOutput {
91+
/// Parse and compile `pattern`, resulting in an existentially-typed capture list.
92+
public init(compiling pattern: String) throws
93+
}
94+
```
95+
96+
We propose `AnyRegexOutput` for capture types not known at compilation time, alongside casting API to convert to a strongly-typed capture list.
97+
98+
```swift
99+
/// A type-erased regex output
100+
public struct AnyRegexOutput {
101+
/// Creates a type-erased regex output from an existing output.
102+
///
103+
/// Use this initializer to fit a regex with strongly typed captures into the
104+
/// use site of a dynamic regex, i.e. one that was created from a string.
105+
public init<Output>(_ match: Regex<Output>.Match)
57106

58-
We propose the following syntax for regex.
107+
/// Returns a typed output by converting the underlying value to the specified
108+
/// type.
109+
///
110+
/// - Parameter type: The expected output type.
111+
/// - Returns: The output, if the underlying value can be converted to the
112+
/// output type, or nil otherwise.
113+
public func `as`<Output>(_ type: Output.Type) -> Output?
114+
}
115+
extension AnyRegexOutput: RandomAccessCollection {
116+
public struct Element {
117+
/// The range over which a value was captured. `nil` for no-capture.
118+
public var range: Range<String.Index>?
119+
120+
/// The slice of the input over which a value was captured. `nil` for no-capture.
121+
public var substring: Substring?
122+
123+
/// The captured value. `nil` for no-capture.
124+
public var value: Any?
125+
}
126+
127+
// Trivial collection conformance requirements
128+
129+
public var startIndex: Int { get }
130+
131+
public var endIndex: Int { get }
132+
133+
public var count: Int { get }
134+
135+
public func index(after i: Int) -> Int
136+
137+
public func index(before i: Int) -> Int
138+
139+
public subscript(position: Int) -> Element
140+
}
141+
```
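
As an illustration of the collection conformance, the following sketch walks the elements of a type-erased output. It assumes Swift 5.7 or later and the shipped `Regex(_:)` spelling of run-time construction; `describeCaptures` is a hypothetical helper, not part of the proposal.

```swift
// Assumption: Swift 5.7+; `Regex(_:)` is the shipped spelling of the pitched
// `Regex(compiling:)`. `describeCaptures` is a made-up helper for illustration.
func describeCaptures(of pattern: String, in input: String) throws -> [String] {
  let regex = try Regex(pattern)  // Regex<AnyRegexOutput>
  guard let match = input.firstMatch(of: regex) else { return [] }

  var lines: [String] = []
  // Element 0 is the whole match; the remaining elements are the captures in order.
  for (index, element) in match.output.enumerated() {
    if let substring = element.substring {
      lines.append("\(index): \(substring)")
    } else {
      // A capture that did not participate in the match has no substring.
      lines.append("\(index): <no capture>")
    }
  }
  return lines
}
```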
142+
143+
We propose adding an API to `Regex<AnyRegexOutput>.Match` to cast the output type to a concrete one. A regex match creates each `Substring` lazily, on demand, so casting the match itself avoids the ARC traffic of first extracting the output and then casting it.
144+
145+
```swift
146+
extension Regex.Match where Output == AnyRegexOutput {
147+
/// Creates a type-erased regex match from an existing match.
148+
///
149+
/// Use this initializer to fit a regex match with strongly typed captures into the
150+
/// use site of a dynamic regex match, i.e. one that was created from a string.
151+
public init<Output>(_ match: Regex<Output>.Match)
152+
153+
/// Returns a typed match by converting the underlying values to the specified
154+
/// types.
155+
///
156+
/// - Parameter type: The expected output type.
157+
/// - Returns: A match generic over the output type if the underlying values can be converted to the
158+
/// output type. Returns `nil` otherwise.
159+
public func `as`<Output>(_ type: Output.Type) -> Regex<Output>.Match?
160+
}
161+
```
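
The conversion from a dynamic to a strongly typed form can be sketched with the API available in Swift 5.7 and later. Note the assumption: instead of the `as(_:)` method pitched above, this uses the failable `Regex(_:as:)` initializer, which converts the regex rather than an individual match. `typedPair` and its pattern are hypothetical.

```swift
// Assumption: Swift 5.7+. The conversion here uses the failable initializer
// `Regex(_:as:)` on the regex itself, not the `as(_:)` method pitched above.
func typedPair(from input: String) throws -> (Substring, Substring)? {
  let dynamicRegex = try Regex(#"(\w+)@(\w+)"#)  // Regex<AnyRegexOutput>

  // The conversion returns nil when the requested tuple type does not match
  // the capture structure of the pattern.
  guard let typed = Regex(dynamicRegex, as: (Substring, Substring, Substring).self),
        let match = input.firstMatch(of: typed) else { return nil }
  return (match.output.1, match.output.2)
}
```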
162+
163+
The rest of this proposal will be a detailed and exhaustive definition of our proposed regex syntax.
59164

60165
<details><summary>Grammar Notation</summary>
61166

@@ -827,6 +932,12 @@ We are deferring runtime support for callouts from regex literals as future work
827932

828933
## Alternatives Considered
829934

935+
### Failable inits
936+
937+
There are many ways for compilation to fail, from syntactic errors to unsupported features to type mismatches. In the general case, run-time compilation errors are not recoverable by a tool without modifying the user's input. Even then, the thrown errors contain valuable information as to why compilation failed. For example, SwiftPM presents any errors directly to the user.
938+
939+
As proposed, the errors thrown will be the same errors presented to the Swift compiler, tracking fine-grained source locations with specific reasons why compilation failed. Defining a rich error API is future work, as these errors are rapidly evolving and it is too early to lock in the ABI.
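
A tool that forwards the thrown error to its user might look like the sketch below. It assumes Swift 5.7 or later and the shipped `Regex(_:)` initializer; `compilationDiagnostic` is a hypothetical helper, and the exact text of the error description is left unspecified because the error API is still future work.

```swift
// Assumption: Swift 5.7+; `Regex(_:)` stands in for the pitched `Regex(compiling:)`.
// Returns nil when `pattern` compiles, or a message to show the user otherwise.
func compilationDiagnostic(for pattern: String) -> String? {
  do {
    _ = try Regex(pattern)
    return nil
  } catch {
    // The error is opaque here; a tool can still present its description verbatim.
    return "invalid pattern '\(pattern)': \(error)"
  }
}
```

A throwing initializer keeps this information available, whereas a failable one would reduce every failure to `nil`.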
940+
830941

831942
### Skip the syntax
832943

Sources/_StringProcessing/Regex/AnyRegexOutput.swift

Lines changed: 29 additions & 0 deletions
@@ -37,6 +37,7 @@ extension Regex.Match where Output == AnyRegexOutput {
3737
}
3838
}
3939

40+
/// A type-erased regex output
4041
public struct AnyRegexOutput {
4142
let input: String
4243
fileprivate let _elements: [ElementRepresentation]
@@ -70,6 +71,7 @@ extension AnyRegexOutput {
7071

7172
/// Returns a typed output by converting the underlying value to the specified
7273
/// type.
74+
///
7375
/// - Parameter type: The expected output type.
7476
/// - Returns: The output, if the underlying value can be converted to the
7577
/// output type, or nil otherwise.
@@ -119,13 +121,20 @@ extension AnyRegexOutput: RandomAccessCollection {
119121
fileprivate let representation: ElementRepresentation
120122
let input: String
121123

124+
/// The range over which a value was captured. `nil` for no-capture.
122125
public var range: Range<String.Index>? {
123126
representation.bounds
124127
}
125128

129+
/// The slice of the input over which a value was captured. `nil` for no-capture.
126130
public var substring: Substring? {
127131
range.map { input[$0] }
128132
}
133+
134+
/// The captured value, `nil` for no-capture
135+
public var value: Any? {
136+
fatalError()
137+
}
129138
}
130139

131140
public var startIndex: Int {
@@ -152,3 +161,23 @@ extension AnyRegexOutput: RandomAccessCollection {
152161
.init(representation: _elements[position], input: input)
153162
}
154163
}
164+
165+
extension Regex.Match where Output == AnyRegexOutput {
166+
/// Creates a type-erased regex match from an existing match.
167+
///
168+
/// Use this initializer to fit a regex match with strongly typed captures into the
169+
/// use site of a dynamic regex match, i.e. one that was created from a string.
170+
public init<Output>(_ match: Regex<Output>.Match) {
171+
fatalError("FIXME: Not implemented")
172+
}
173+
174+
/// Returns a typed match by converting the underlying values to the specified
175+
/// types.
176+
///
177+
/// - Parameter type: The expected output type.
178+
/// - Returns: A match generic over the output type if the underlying values can be converted to the
179+
/// output type. Returns `nil` otherwise.
180+
public func `as`<Output>(_ type: Output.Type) -> Regex<Output>.Match? {
181+
fatalError("FIXME: Not implemented")
182+
}
183+
}
