Commit 115a937

authored
Merge pull request #298 from Azoy/da-api-mon
[5.7] Integrate API changes into release/5.7
2 parents b583909 + 2d9de48 commit 115a937

63 files changed: +3854 -2037 lines changed

Documentation/Evolution/RegexSyntax.md renamed to Documentation/Evolution/RegexSyntaxRunTimeConstruction.md

Lines changed: 124 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22
Hello, we want to issue an update to [Regular Expression Literals](https://forums.swift.org/t/pitch-regular-expression-literals/52820) and prepare for a formal proposal. The great delimiter deliberation continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax _inside_ a regex literal. Additionally, this is the syntax accepted from a string used for run-time regex construction, so we're devoting an entire pitch/proposal to the topic of _regex syntax_, distinct from the result builder DSL or the choice of delimiters for literals.
33
-->
44

5-
# Run-time Regex Construction
5+
# Regex Syntax and Run-time Construction
66

77
- Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman)
88

@@ -16,21 +16,50 @@ The overall story is laid out in [Regex Type and Overview](https://github.com/ap
1616

1717
Swift aims to be a pragmatic programming language, striking a balance between familiarity, interoperability, and advancing the art. Swift's `String` presents a uniquely Unicode-forward model of string, but currently suffers from limited processing facilities.
1818

19-
<!--
20-
... tools need run time construction
21-
... ns regular expression operates over a fundamentally different model and has limited syntactic and semantic support
22-
... we prpose a best-in-class treatment of familiar regex syntax
23-
-->
19+
`NSRegularExpression` can construct a processing pipeline from a string containing [ICU regular expression syntax][icu-syntax]. However, it is inherently tied to ICU's engine and thus it operates over a fundamentally different model of string than Swift's `String`. It is also limited in features and carries a fair amount of Objective-C baggage, such as the need to translate between `NSRange` and `Range`.
20+
21+
```swift
22+
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
23+
let nsRegEx = try! NSRegularExpression(pattern: pattern)
24+
25+
func processEntry(_ line: String) -> Transaction? {
26+
let range = NSRange(line.startIndex..<line.endIndex, in: line)
27+
guard let result = nsRegEx.firstMatch(in: line, range: range),
28+
let kindRange = Range(result.range(at: 1), in: line),
29+
let kind = Transaction.Kind(line[kindRange]),
30+
let dateRange = Range(result.range(at: 2), in: line),
31+
let date = try? Date(String(line[dateRange]), strategy: dateParser),
32+
let accountRange = Range(result.range(at: 3), in: line),
33+
let amountRange = Range(result.range(at: 4), in: line),
34+
let amount = try? Decimal(
35+
String(line[amountRange]), format: decimalParser)
36+
else {
37+
return nil
38+
}
39+
40+
return Transaction(
41+
kind: kind, date: date, account: String(line[accountRange]), amount: amount)
42+
}
43+
```
44+
45+
Fixing these fundamental limitations requires migrating to a completely different engine and type system representation. This is the path we're proposing with `Regex`, outlined in [Regex Type and Overview][overview]. Details on the semantic differences between ICU's string model and Swift's `String` are discussed in [Unicode for String Processing][pitches].
2446

2547
The full string processing effort includes a regex type with strongly typed captures, the ability to create a regex from a string at runtime, a compile-time literal, a result builder DSL, protocols for intermixing 3rd party industrial-strength parsers with regex declarations, and a slew of regex-powered algorithms over strings.
2648

2749
This proposal specifically homes in on the _familiarity_ aspect by providing a best-in-class treatment of familiar regex syntax.
2850

2951
## Proposed Solution
3052

31-
<!--
32-
... regex compiling and existential match type
33-
-->
53+
We propose run-time construction of `Regex` from a best-in-class treatment of familiar regular expression syntax. A `Regex` is generic over its `Output`, which includes capture information. This may be an existential `AnyRegexOutput`, or a concrete type provided by the user.
54+
55+
```swift
56+
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
57+
let regex = try! Regex(pattern)
58+
// regex: Regex<AnyRegexOutput>
59+
60+
let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
61+
try! Regex(pattern)
62+
```
3463

3564
### Syntax
3665

@@ -51,11 +80,87 @@ Regex syntax will be part of Swift's source-compatibility story as well as its b
5180

5281
## Detailed Design
5382

54-
<!--
55-
... init, dynamic match, conversion to static
56-
-->
83+
We propose initializers to declare and compile a regex from syntax. Upon failure, these initializers throw compilation errors, such as for syntax or type errors. API for retrieving error information is future work.
84+
85+
```swift
86+
extension Regex {
87+
/// Parse and compile `pattern`, resulting in a strongly-typed capture list.
88+
public init(compiling pattern: String, as: Output.Type = Output.self) throws
89+
}
90+
extension Regex where Output == AnyRegexOutput {
91+
/// Parse and compile `pattern`, resulting in an existentially-typed capture list.
92+
public init(compiling pattern: String) throws
93+
}
94+
```
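
The throwing contract described above can be exercised as in the following minimal sketch. It assumes a Swift 5.7 or later toolchain, where the initializer is spelled without the `compiling:` label (the `Regex(pattern)` spelling this commit adopts in `RegexTypeOverview.md`). The helpers `compileOrReport` and `yearAndMonth` are hypothetical names introduced only for illustration.

```swift
// Sketch only: assumes a Swift 5.7+ toolchain (on Apple platforms, an OS
// version that ships the `Regex` type in the standard library).

/// Hypothetical helper: compiles `pattern` at run time and returns `nil`
/// (after printing the thrown error) when compilation fails.
func compileOrReport(_ pattern: String) -> Regex<AnyRegexOutput>? {
    do {
        // With no contextual type, this produces a Regex<AnyRegexOutput>.
        let regex = try Regex(pattern)
        return regex
    } catch {
        print("Failed to compile \(pattern): \(error)")
        return nil
    }
}

let valid = compileOrReport(#"(\d{4})-(\d{2})"#)
let invalid = compileOrReport(#"(\d{4}-(\d{2})"#)  // unbalanced parenthesis

/// Hypothetical helper: when the capture structure is known up front, a
/// concrete output type can be requested via the `as:` parameter.
func yearAndMonth(in text: String) -> (year: Substring, month: Substring)? {
    guard
        let typed = try? Regex(
            #"(\d{4})-(\d{2})"#,
            as: (Substring, Substring, Substring).self),
        let match = try? typed.wholeMatch(in: text)
    else { return nil }
    // Element 0 of the output is the whole match; captures start at 1.
    return (match.output.1, match.output.2)
}
```

Because the error type is not yet public API, the sketch only prints the thrown error rather than inspecting it.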
95+
96+
We propose `AnyRegexOutput` for capture types not known at compilation time, alongside casting API to convert to a strongly-typed capture list.
97+
98+
```swift
99+
/// A type-erased regex output
100+
public struct AnyRegexOutput {
101+
/// Creates a type-erased regex output from an existing output.
102+
///
103+
/// Use this initializer to fit a regex with strongly typed captures into the
104+
/// use site of a dynamic regex, i.e. one that was created from a string.
105+
public init<Output>(_ match: Regex<Output>.Match)
57106

58-
We propose the following syntax for regex.
107+
/// Returns a typed output by converting the underlying value to the specified
108+
/// type.
109+
///
110+
/// - Parameter type: The expected output type.
111+
/// - Returns: The output, if the underlying value can be converted to the
112+
/// output type, or nil otherwise.
113+
public func `as`<Output>(_ type: Output.Type) -> Output?
114+
}
115+
extension AnyRegexOutput: RandomAccessCollection {
116+
public struct Element {
117+
/// The range over which a value was captured. `nil` for no-capture.
118+
public var range: Range<String.Index>?
119+
120+
/// The slice of the input over which a value was captured. `nil` for no-capture.
121+
public var substring: Substring?
122+
123+
/// The captured value. `nil` for no-capture.
124+
public var value: Any?
125+
}
126+
127+
// Trivial collection conformance requirements
128+
129+
public var startIndex: Int { get }
130+
131+
public var endIndex: Int { get }
132+
133+
public var count: Int { get }
134+
135+
public func index(after i: Int) -> Int
136+
137+
public func index(before i: Int) -> Int
138+
139+
public subscript(position: Int) -> Element
140+
}
141+
```
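
As a rough illustration of the collection view declared above, the sketch below walks the elements of an `AnyRegexOutput` and reports which capture groups took part in the match. It assumes a Swift 5.7 or later toolchain and the unlabeled `Regex(pattern)` initializer; `describeCaptures` is a hypothetical helper, not part of the proposal.

```swift
// Sketch only: assumes a Swift 5.7+ toolchain.

/// Hypothetical helper: describes each element of the first match's output,
/// using "nil" for capture groups that did not participate in the match.
func describeCaptures(of pattern: String, in input: String) -> [String] {
    guard
        let regex = try? Regex(pattern),
        let match = try? regex.firstMatch(in: input)
    else { return [] }

    // `match.output` is an `AnyRegexOutput`: element 0 is the whole match and
    // the following elements are the capture groups, in order.
    return match.output.map { element in
        element.substring.map { String($0) } ?? "nil"
    }
}

// Only the second alternative matches, so the first group is a no-capture.
print(describeCaptures(of: #"(\d+)|([a-z]+)"#, in: "abc"))
```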
142+
143+
We propose adding an API to `Regex<AnyRegexOutput>.Match` to cast the output type to a concrete one. A regex match will lazily create a `Substring` on demand, so casting the match itself saves ARC traffic vs extracting and casting the output.
144+
145+
```swift
146+
extension Regex.Match where Output == AnyRegexOutput {
147+
/// Creates a type-erased regex match from an existing match.
148+
///
149+
/// Use this initializer to fit a regex match with strongly typed captures into the
150+
/// use site of a dynamic regex match, i.e. one that was created from a string.
151+
public init<Output>(_ match: Regex<Output>.Match)
152+
153+
/// Returns a typed match by converting the underlying values to the specified
154+
/// types.
155+
///
156+
/// - Parameter type: The expected output type.
157+
/// - Returns: A match generic over the output type if the underlying values can be converted to the
158+
/// output type. Returns `nil` otherwise.
159+
public func `as`<Output>(_ type: Output.Type) -> Regex<Output>.Match?
160+
}
161+
```
162+
163+
The rest of this proposal will be a detailed and exhaustive definition of our proposed regex syntax.
59164

60165
<details><summary>Grammar Notation</summary>
61166

@@ -827,6 +932,12 @@ We are deferring runtime support for callouts from regex literals as future work
827932

828933
## Alternatives Considered
829934

935+
### Failable inits
936+
937+
There are many ways for compilation to fail, from syntactic errors to unsupported features to type mismatches. In the general case, run-time compilation errors are not recoverable by a tool without modifying the user's input. Even then, the thrown errors contain valuable information as to why compilation failed. For example, SwiftPM presents any errors directly to the user.
938+
939+
As proposed, the errors thrown will be the same errors presented to the Swift compiler, tracking fine-grained source locations with specific reasons why compilation failed. Defining a rich error API is future work, as these errors are rapidly evolving and it is too early to lock in the ABI.
940+
830941

831942
### Skip the syntax
832943

Documentation/Evolution/RegexTypeOverview.md

Lines changed: 22 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,3 @@
1-
21
# Regex Type and Overview
32

43
- Authors: [Michael Ilseman](https://github.com/milseman) and the Standard Library Team
@@ -135,11 +134,11 @@ Regexes can be created at run time from a string containing familiar regex synta
135134

136135
```swift
137136
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
138-
let regex = try! Regex(compiling: pattern)
137+
let regex = try! Regex(pattern)
139138
// regex: Regex<AnyRegexOutput>
140139

141140
let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
142-
try! Regex(compiling: pattern)
141+
try! Regex(pattern)
143142
```
144143

145144
*Note*: The syntax accepted and further details on run-time compilation, including `AnyRegexOutput` and extended syntaxes, are discussed in [Run-time Regex Construction][pitches].
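
Once a regex has been constructed at run time, it can be applied through the matching methods that this commit renames (`wholeMatch(in:)`, `prefixMatch(in:)`, `firstMatch(in:)`). The following is a small sketch of how the three differ, assuming a Swift 5.7 or later toolchain; `compareMatches` is a hypothetical helper.

```swift
// Sketch only: assumes a Swift 5.7+ toolchain.

/// Hypothetical helper: runs the three matching entry points on `input` and
/// returns the matched text for each, or `nil` where there is no match.
func compareMatches(pattern: String, input: String) throws
    -> (whole: String?, prefix: String?, first: String?)
{
    let regex = try Regex(pattern)  // Regex<AnyRegexOutput>

    // Must consume the entire input.
    let whole = try regex.wholeMatch(in: input).map { String(input[$0.range]) }
    // Must start at the beginning, but may stop early.
    let prefix = try regex.prefixMatch(in: input).map { String(input[$0.range]) }
    // May start anywhere in the input.
    let first = try regex.firstMatch(in: input).map { String(input[$0.range]) }

    return (whole, prefix, first)
}

if let result = try? compareMatches(pattern: #"\d+"#, input: "12cd") {
    print(result)
}
```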
@@ -225,7 +224,7 @@ func processEntry(_ line: String) -> Transaction? {
225224

226225
The result builder allows for inline failable value construction, which participates in the overall string processing algorithm: returning `nil` signals a local failure and the engine backtracks to try an alternative. This not only relieves the use site from post-processing, it enables new kinds of processing algorithms, allows for search-space pruning, and enhances debuggability.
227226

228-
Swift regexes describe an unambiguous algorithm, were choice is ordered and effects can be reliably observed. For example, a `print()` statement inside the `TryCapture`'s transform function will run whenever the overall algorithm naturally dictates an attempt should be made. Optimizations can only elide such calls if they can prove it is behavior-preserving (e.g. "pure").
227+
Swift regexes describe an unambiguous algorithm, where choice is ordered and effects can be reliably observed. For example, a `print()` statement inside the `TryCapture`'s transform function will run whenever the overall algorithm naturally dictates an attempt should be made. Optimizations can only elide such calls if they can prove it is behavior-preserving (e.g. "pure").
229228

230229
`CustomMatchingRegexComponent`, discussed in [String Processing Algorithms][pitches], allows industrial-strength parsers to be used as regex components. This allows us to drop the overly-permissive pre-parsing step:
231230

@@ -278,14 +277,14 @@ func processEntry(_ line: String) -> Transaction? {
278277
*Note*: Details on how references work are discussed in [Regex Builders][pitches]. `Regex.Match` supports referring to _all_ captures by position (`match.1`, etc.) whether named or referenced or neither. Due to compiler limitations, result builders do not support forming labeled tuples for named captures.
279278

280279

281-
### Algorithms, algorithms everywhere
280+
### Regex-powered algorithms
282281

283282
Regexes can be used right out of the box with a variety of powerful and convenient algorithms, including trimming, splitting, and finding/replacing all matches within a string.
284283

285284
These algorithms are discussed in [String Processing Algorithms][pitches].
286285

287286

288-
### Onward Unicode
287+
### Unicode handling
289288

290289
A regex describes an algorithm to be run over some model of string, and Swift's `String` has a rather unique Unicode-forward model. `Character` is an [extended grapheme cluster](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) and equality is determined under [canonical equivalence](https://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence).
291290

@@ -301,7 +300,7 @@ Regex targets [UTS\#18 Level 2](https://www.unicode.org/reports/tr18/#Extended_U
301300
```swift
302301
/// A regex represents a string processing algorithm.
303302
///
304-
/// let regex = try Regex(compiling: "a(.*)b")
303+
/// let regex = try Regex("a(.*)b")
305304
/// let match = "cbaxb".firstMatch(of: regex)
306305
/// print(match.0) // "axb"
307306
/// print(match.1) // "x"
@@ -310,12 +309,12 @@ public struct Regex<Output> {
310309
/// Match a string in its entirety.
311310
///
312311
/// Returns `nil` if no match and throws on abort
313-
public func matchWhole(_ s: String) throws -> Regex<Output>.Match?
312+
public func wholeMatch(in s: String) throws -> Regex<Output>.Match?
314313

315314
/// Match part of the string, starting at the beginning.
316315
///
317316
/// Returns `nil` if no match and throws on abort
318-
public func matchPrefix(_ s: String) throws -> Regex<Output>.Match?
317+
public func prefixMatch(in s: String) throws -> Regex<Output>.Match?
319318

320319
/// Find the first match in a string
321320
///
@@ -325,17 +324,17 @@ public struct Regex<Output> {
325324
/// Match a substring in its entirety.
326325
///
327326
/// Returns `nil` if no match and throws on abort
328-
public func matchWhole(_ s: Substring) throws -> Regex<Output>.Match?
327+
public func wholeMatch(in s: Substring) throws -> Regex<Output>.Match?
329328

330329
/// Match part of the string, starting at the beginning.
331330
///
332331
/// Returns `nil` if no match and throws on abort
333-
public func matchPrefix(_ s: Substring) throws -> Regex<Output>.Match?
332+
public func prefixMatch(in s: Substring) throws -> Regex<Output>.Match?
334333

335334
/// Find the first match in a substring
336335
///
337336
/// Returns `nil` if no match is found and throws on abort
338-
public func firstMatch(_ s: Substring) throws -> Regex<Output>.Match?
337+
public func firstMatch(in s: Substring) throws -> Regex<Output>.Match?
339338

340339
/// The result of matching a regex against a string.
341340
///
@@ -344,19 +343,19 @@ public struct Regex<Output> {
344343
@dynamicMemberLookup
345344
public struct Match {
346345
/// The range of the overall match
347-
public let range: Range<String.Index>
346+
public var range: Range<String.Index> { get }
348347

349348
/// The produced output from the match operation
350-
public var output: Output
349+
public var output: Output { get }
351350

352351
/// Lookup a capture by name or number
353-
public subscript<T>(dynamicMember keyPath: KeyPath<Output, T>) -> T
352+
public subscript<T>(dynamicMember keyPath: KeyPath<Output, T>) -> T { get }
354353

355354
/// Lookup a capture by number
356355
@_disfavoredOverload
357356
public subscript(
358357
dynamicMember keyPath: KeyPath<(Output, _doNotUse: ()), Output>
359-
) -> Output
358+
) -> Output { get }
360359
// Note: this allows `.0` when `Match` is not a tuple.
361360

362361
}
@@ -482,6 +481,13 @@ We're also looking for more community discussion on what the default type system
482481

483482
The actual `Match` struct just stores ranges: the `Substrings` are lazily created on demand. This avoids unnecessary ARC traffic and memory usage.
484483

484+
485+
### `Regex<Match, Captures>` instead of `Regex<Output>`
486+
487+
The generic parameter `Output` is proposed to contain both the whole match (the `.0` element if `Output` is a tuple) and captures. One alternative we have considered is separating `Output` into the entire match and the captures, i.e. `Regex<Match, Captures>`, and using `Void` for `Captures` when there are no captures.
488+
489+
The biggest issue with this alternative design is that the numbering of `Captures` elements misaligns with the numbering of captures in textual regexes, where backreference `\0` refers to the entire match and captures start at `\1`. This design would sacrifice familiarity and have the pitfall of introducing off-by-one errors.
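
To make the numbering argument concrete, here is a minimal sketch under the proposed `Regex<Output>` design, assuming a Swift 5.7 or later toolchain and the unlabeled `Regex(pattern, as:)` initializer. The helper `firstDoubledLetter` is hypothetical. The capture that the pattern refers to as `\1` is the same one read back as `.1` on the output tuple.

```swift
// Sketch only: assumes a Swift 5.7+ toolchain.

/// Hypothetical helper: finds the first doubled lowercase letter in `text`
/// using the backreference `\1`.
func firstDoubledLetter(in text: String) -> (whole: Substring, letter: Substring)? {
    guard
        let regex = try? Regex(#"([a-z])\1"#, as: (Substring, Substring).self),
        let match = try? regex.firstMatch(in: text)
    else { return nil }

    // `.0` is the whole match; `.1` is the capture that `\1` refers to.
    return (match.output.0, match.output.1)
}

if let doubled = firstDoubledLetter(in: "hello") {
    print(doubled.whole, doubled.letter)
}
```

With a separate `Captures` tuple, the same capture would sit at position 0 of `Captures`, one off from its `\1` spelling in the pattern.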
490+
485491
### Future work: static optimization and compilation
486492

487493
Swift's support for static compilation is still developing, and future work here is leveraging that to compile regexes when profitable. Many regexes describe simple [DFAs](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) and can be statically compiled into very efficient programs. Full static compilation needs to be balanced with code size concerns, as a matching-specific bytecode is typically far smaller than a corresponding program (especially since the bytecode interpreter is shared).

Documentation/Evolution/StringProcessingAlgorithms.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -187,7 +187,7 @@ public protocol CustomMatchingRegexComponent : RegexComponent {
187187
_ input: String,
188188
startingAt index: String.Index,
189189
in bounds: Range<String.Index>
190-
) -> (upperBound: String.Index, match: Match)?
190+
) throws -> (upperBound: String.Index, match: Match)?
191191
}
192192
```
193193
