Commit 9432324

- Restructure the pitch to this structure:
  - Motivation for adding algorithms
  - Motivation for `CustomRegexComponent`
  - Design for added algorithms
  - Design for `CustomRegexComponent`
- Add a few doc comments
1 parent fea786a commit 9432324

1 file changed

+117
-87
lines changed

Documentation/Evolution/StringProcessingAlgorithms.md

Lines changed: 117 additions & 87 deletions
@@ -4,54 +4,33 @@
44

55
The standard library is currently missing a large number of `String` algorithms that do exist in Foundation. We introduce a more coherent set of `Collection` algorithms with a focus on string processing, including support for regular expressions.
66

7-
## Motivation
8-
9-
TODO
107

11-
## Proposed solution
8+
## Motivation
129

13-
We introduce internal infrastructure that allows groups of `Collection` algorithms that perform the same operations on different types to share their implementation, leading to a more coherent set of public APIs. This allows us to more easily provide algorithms that work with `RegexProtocol` values, such as
10+
TODO: Motivation for adding both generic `<R: RegexProtocol>` and non-generic algorithm functions.
1411

15-
```swift
16-
extension BidirectionalCollection where SubSequence == Substring {
17-
public func ranges<R: RegexProtocol>(of regex: R) -> some Collection<Range<Index>>
18-
}
19-
```
2012

21-
We also introduce the `CustomRegexComponent` protocol that conveniently lets types from outside the standard library participate in regex builders and `RegexProtocol` algorithms:
13+
### Use custom parsers in regex builders and `RegexProtocol` algorithms
2214

23-
```swift
24-
public protocol CustomRegexComponent: RegexProtocol {
25-
/// Match the input string within the specified bounds, beginning at the given index, and return
26-
/// the end position (upper bound) of the match and the matched instance.
27-
/// - Parameters:
28-
/// - input: The string in which the match is performed.
29-
/// - index: An index of `input` at which to begin matching.
30-
/// - bounds: The bounds in `input` in which the match is performed.
31-
/// - Returns: The upper bound where the match terminates and a matched instance, or nil if
32-
/// there isn't a match.
33-
func match(
34-
_ input: String,
35-
startingAt index: String.Index,
36-
in bounds: Range<String.Index>
37-
) -> (upperBound: String.Index, match: Match)?
38-
}
39-
```
15+
We want to extend string processing to types from outside the standard library, so that one can incorporate custom parsers in regex builders and `RegexProtocol` algorithms seamlessly.
4016

4117
Consider parsing an HTTP header to capture the date field as a `Date` type:
4218

43-
```
19+
```swift
20+
let header = """
4421
HTTP/1.1 301 Redirect
4522
Date: Wed, 16 Feb 2022 23:53:19 GMT
4623
Connection: close
4724
Location: https://www.apple.com/
4825
Content-Type: text/html
4926
Content-Language: en
27+
"""
5028
```
5129

52-
You are likely going to match a substring that looks like a date string (`16 Feb 2022`), and parse the substring as a `Date` with one of Foundation's date parsers:
30+
You are likely going to match a substring that looks like a date string (`16 Feb 2022`), and parse the substring as a `Date` with one of the date parsers in the Foundation framework:
5331

5432
```swift
33+
let dateParser = Date.ParseStrategy(format: "\(day: .twoDigits) \(month: .abbreviated) \(year: .padded(4))", locale: Locale(identifier: "en_US"), timeZone: TimeZone(identifier: "GMT")!)
5534
let regex = Regex {
5635
capture {
5736
oneOrMore(.digit)
@@ -63,51 +42,24 @@ let regex = Regex {
6342
}
6443

6544
if let dateMatch = header.firstMatch(of: regex)?.0 {
66-
let date = try? Date(dateMatch, strategy: .fixed(format: "\(day: .twoDigits) \(month: .abbreviated) \(year: .padded(4))", timeZone: TimeZone(identifier: "GMT")!, locale: Locale(identifier: "en_US")))
45+
let date = try? Date(dateMatch, strategy: dateParser)
6746
}
6847
```
6948

7049
This works, but wouldn't it be much more approachable if you could use the date parser directly within the string match function?
7150

7251
```swift
7352
let regex = Regex {
74-
capture {
75-
.date(format: "\(day: .twoDigits) \(month: .abbreviated) \(year: .padded(4))", timeZone: TimeZone(identifier: "GMT")!, locale: Locale(identifier: "en_US"))
76-
}
53+
capture(dateParser)
7754
}
7855

79-
if let match = header.firstMatch(of: regex) {
80-
let string = match.0 // "16 Feb 2022"
81-
let date = match.1 // 2022-02-16 00:00:00 +0000
82-
}
56+
let date = header.firstMatch(of: regex).map(\.result.1)
57+
// A `Date` representing 2022-02-16 00:00:00 +0000
8358
```
8459

85-
You can do this because Foundation framework's `Date.ParseStrategy` conforms to `CustomRegexComponent`, defined above. You can also conform your custom parser to `CustomRegexComponent`. Conformance is simple: implement the `match` function to return the upper bound of the matched substring, and the type represented by the matched range. It inherits from `RegexProtocol`, so you will be able to use it with all of the string algorithms that take a `RegexProtocol` type.
86-
87-
Foundation framework's `Date.ParseStrategy` conforms to `CustomRegexComponent` this way. It also adds a static function `date(format:timeZone:locale:)` as a static member of `RegexProtocol`, so you can refer to it as `.date(format:...)` in the `Regex` result builder.
88-
89-
```swift
90-
extension Date.ParseStrategy : CustomRegexComponent {
91-
func match(
92-
_ input: String,
93-
startingAt index: String.Index,
94-
in bounds: Range<String.Index>
95-
) -> (upperBound: String.Index, match: Date)?
96-
}
97-
98-
extension RegexProtocol where Self == Date.ParseStrategy {
99-
public static func date(
100-
format: Date.FormatString,
101-
timeZone: TimeZone,
102-
locale: Locale? = nil
103-
) -> Self
104-
}
105-
```
106-
107-
Here's another example of how you can use `FloatingPointFormatStyle<Double>.Currency` to parse a bank statement and record all the monetary values:
60+
Or consider parsing a bank statement to record all the monetary values in the last column:
10861

10962
```swift
110-
11163
let statement = """
11264
CREDIT 04/06/2020 Paypal transfer $4.99
11365
DSLIP 04/06/2020 REMOTE ONLINE DEPOSIT $3,020.85
@@ -116,41 +68,38 @@ DEBIT 04/02/2020 ACH TRNSFR ($38.25)
11668
DEBIT 03/31/2020 Payment to BoA card ($27.44)
11769
DEBIT 03/24/2020 IRX tax payment ($52,249.98)
11870
"""
71+
```
11972

120-
let regex = Regex {
121-
capture {
122-
.currency(code: "USD").sign(strategy: .accounting)
123-
}
124-
}
73+
We have already seen that parsing a date string can be tricky since it could contain a localized month name (`"Feb"`, as seen above). Parsing a currency string such as `$3,020.85` with regex is not trivial either -- it can contain grouping separators, a decimal separator, and a currency symbol, all of which can be localized.
12574

126-
let amount = statement.matches(of: regex).map(\.1)
127-
// [4.99, 3020.85, 69.73, -38.25, -27.44, -52249.98]
128-
```
75+
The Foundation framework has various parsers for localized strings like these. Delegating this task to dedicated parsers alleviates the need to handle it yourself. In the second part of the pitch, we introduce the `CustomRegexComponent` protocol that conveniently lets types from outside the standard library participate in regex builders and `RegexProtocol` algorithms.
12976

130-
## Detailed design
77+
## Proposed solution
13178

132-
### `CustomRegexComponent` protocol
79+
We introduce internal infrastructure that allows groups of `Collection` algorithms that perform the same operations on different types to share their implementation, leading to a more coherent set of public APIs. This allows us to more easily provide algorithms that work with `RegexProtocol` values, such as
13380

134-
The `CustomRegexComponent` protocol inherits from `RegexProtocol` and satisfies its sole requirement. This enables the usage of types that conform to `CustomRegexComponent` in regex builders and `RegexProtocol` algorithms.
81+
```swift
82+
extension BidirectionalCollection where SubSequence == Substring {
83+
public func ranges<R: RegexProtocol>(of regex: R) -> some Collection<Range<Index>>
84+
}
85+
```
86+
87+
We also introduce the `CustomRegexComponent` protocol that conveniently lets types from outside the standard library participate in regex builders and `RegexProtocol` algorithms.
88+
89+
If Foundation's currency parser, `Foundation.FloatingPointFormatStyle<Double>.Currency`, conformed to `CustomRegexComponent`, you would be able to retrieve the currency from the bank statement above as a list of `Double` values this way:
13590

13691
```swift
137-
public protocol CustomRegexComponent: RegexProtocol {
138-
/// Match the input string within the specified bounds, beginning at the given index, and return
139-
/// the end position (upper bound) of the match and the matched instance.
140-
/// - Parameters:
141-
/// - input: The string in which the match is performed.
142-
/// - index: An index of `input` at which to begin matching.
143-
/// - bounds: The bounds in `input` in which the match is performed.
144-
/// - Returns: The upper bound where the match terminates and a matched instance, or nil if
145-
/// there isn't a match.
146-
func match(
147-
_ input: String,
148-
startingAt index: String.Index,
149-
in bounds: Range<String.Index>
150-
) -> (upperBound: String.Index, match: Match)?
92+
let regex = Regex {
93+
capture(.localizedCurrency(code: "USD").sign(strategy: .accounting))
15194
}
95+
96+
let amount = statement.matches(of: regex).map(\.result.1)
97+
// [4.99, 3020.85, 69.73, -38.25, -27.44, -52249.98]
15298
```
15399

100+
101+
## Detailed design
102+
154103
### Algorithms
155104

156105
The following algorithms are included in this pitch:
@@ -159,11 +108,21 @@ The following algorithms are included in this pitch:
159108

160109
```swift
161110
extension Collection where Element: Equatable {
111+
/// Returns a Boolean value indicating whether the collection contains the
112+
/// given sequence.
113+
/// - Parameter other: A sequence to search for within this collection.
114+
/// - Returns: `true` if the collection contains the specified sequence,
115+
/// otherwise `false`.
162116
public func contains<S: Sequence>(_ other: S) -> Bool
163117
where S.Element == Element
164118
}
165119

166120
extension BidirectionalCollection where SubSequence == Substring {
121+
/// Returns a Boolean value indicating whether the collection contains the
122+
/// given regex.
123+
/// - Parameter regex: A regex to search for within this collection.
124+
/// - Returns: `true` if the regex was found in the collection, otherwise
125+
/// `false`.
167126
public func contains<R: RegexProtocol>(_ regex: R) -> Bool
168127
}
169128
```
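As an illustration of the non-regex `contains(_:)` overload declared above, here is a naive sketch of the search it describes. This is not the pitch's implementation: the name `naiveContains` is hypothetical (chosen to avoid clashing with any existing `contains`), the quadratic scan is only for clarity, and treating an empty sequence as always contained is an assumption.

```swift
extension Collection where Element: Equatable {
    /// Hypothetical helper: returns `true` if `other` occurs as a
    /// contiguous run of elements somewhere in this collection.
    func naiveContains<S: Sequence>(_ other: S) -> Bool
    where S.Element == Element {
        let needle = Array(other)
        // Assumption: an empty sequence is contained in every collection.
        if needle.isEmpty { return true }

        // Try every start position and compare the suffix against the needle.
        var start = startIndex
        while start != endIndex {
            if self[start...].starts(with: needle) { return true }
            formIndex(after: &start)
        }
        return false
    }
}

let found = [1, 2, 3, 4].naiveContains([2, 3])     // true
let missing = [1, 2, 3, 4].naiveContains([3, 2])   // false
let inString = "hello".naiveContains("ell")        // true
```

A real implementation would likely use a smarter search than this start-by-start comparison; the sketch only pins down the intended semantics.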
@@ -172,6 +131,11 @@ extension BidirectionalCollection where SubSequence == Substring {
172131

173132
```swift
174133
extension BidirectionalCollection where SubSequence == Substring {
134+
/// Returns a Boolean value indicating whether the initial elements of the
135+
/// sequence are the same as the elements in the specified regex.
136+
/// - Parameter regex: A regex to compare to this sequence.
137+
/// - Returns: `true` if the initial elements of the sequence match the
138+
/// beginning of `regex`; otherwise, `false`.
175139
public func starts<R: RegexProtocol>(with regex: R) -> Bool
176140
}
177141
```
@@ -180,14 +144,31 @@ extension BidirectionalCollection where SubSequence == Substring {
180144

181145
```swift
182146
extension Collection {
147+
/// Returns a new collection of the same type by removing initial elements
148+
/// that satisfy the given predicate from the start.
149+
/// - Parameter predicate: A closure that takes an element of the sequence
150+
/// as its argument and returns a Boolean value indicating whether the
151+
/// element should be removed from the collection.
152+
/// - Returns: A collection containing the elements of the receiver that are
153+
/// not removed by `predicate`.
183154
public func trimmingPrefix(while predicate: (Element) -> Bool) -> SubSequence
184155
}
185156

186157
extension Collection where SubSequence == Self {
158+
/// Removes the initial elements that satisfy the given predicate from the
159+
/// start of the sequence.
160+
/// - Parameter predicate: A closure that takes an element of the sequence
161+
/// as its argument and returns a Boolean value indicating whether the
162+
/// element should be removed from the collection.
187163
public mutating func trimPrefix(while predicate: (Element) -> Bool)
188164
}
189165

190166
extension RangeReplaceableCollection {
167+
/// Removes the initial elements that satisfy the given predicate from the
168+
/// start of the sequence.
169+
/// - Parameter predicate: A closure that takes an element of the sequence
170+
/// as its argument and returns a Boolean value indicating whether the
171+
/// element should be removed from the collection.
191172
public mutating func trimPrefix(while predicate: (Element) -> Bool)
192173
}
193174

@@ -342,6 +323,55 @@ extension BidirectionalCollection where SubSequence == Substring {
342323
}
343324
```
344325

326+
### `CustomRegexComponent` protocol
327+
328+
The `CustomRegexComponent` protocol inherits from `RegexProtocol` and satisfies its sole requirement. This enables the usage of types that conform to `CustomRegexComponent` in regex builders and `RegexProtocol` algorithms.
329+
330+
```swift
331+
public protocol CustomRegexComponent: RegexProtocol {
332+
/// Match the input string within the specified bounds, beginning at the given index, and return
333+
/// the end position (upper bound) of the match and the matched instance.
334+
/// - Parameters:
335+
/// - input: The string in which the match is performed.
336+
/// - index: An index of `input` at which to begin matching.
337+
/// - bounds: The bounds in `input` in which the match is performed.
338+
/// - Returns: The upper bound where the match terminates and a matched instance, or nil if
339+
/// there isn't a match.
340+
func match(
341+
_ input: String,
342+
startingAt index: String.Index,
343+
in bounds: Range<String.Index>
344+
) -> (upperBound: String.Index, match: Match)?
345+
}
346+
```
347+
348+
You can conform your custom parser to `CustomRegexComponent`. Conformance is simple: implement the `match` function to return the upper bound of the matched substring and the value parsed from the matched range. Because `CustomRegexComponent` inherits from `RegexProtocol`, you will be able to use your parser with all of the string algorithms that take a `RegexProtocol` type.
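To make the shape of a conformance concrete, here is a minimal sketch of a custom parser. Because the pitched protocol is not assumed to be available, the block declares a stand-in protocol, `PitchedRegexComponent`, that copies only the `match` requirement shown above; both that name and the `UnsignedIntegerParser` type are hypothetical and not part of the pitch.

```swift
/// Hypothetical stand-in that mirrors the `match` requirement of the
/// pitched `CustomRegexComponent` (without the `RegexProtocol` inheritance).
protocol PitchedRegexComponent {
    associatedtype Match
    func match(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) -> (upperBound: String.Index, match: Match)?
}

/// Hypothetical parser: consumes a run of ASCII digits and yields an `Int`.
struct UnsignedIntegerParser: PitchedRegexComponent {
    func match(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) -> (upperBound: String.Index, match: Int)? {
        var end = index
        var value = 0
        // Consume digits until a non-digit or the end of the bounds.
        while end < bounds.upperBound,
              input[end].isASCII,
              let digit = input[end].wholeNumberValue {
            let (scaled, mulOverflow) = value.multipliedReportingOverflow(by: 10)
            let (next, addOverflow) = scaled.addingReportingOverflow(digit)
            // Treat a number too large for `Int` as "no match".
            if mulOverflow || addOverflow { return nil }
            value = next
            end = input.index(after: end)
        }
        // No digit consumed means there is no match at this position.
        guard end > index else { return nil }
        return (upperBound: end, match: value)
    }
}

let text = "Order 1234 shipped"
let start = text.index(text.startIndex, offsetBy: 6)
let parsed = UnsignedIntegerParser().match(
    text, startingAt: start, in: text.startIndex..<text.endIndex)
```

The returned `upperBound` tells the caller where matching should resume, and `match` carries the strongly typed value, which is what would let such a parser slot into a capture.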
349+
350+
Here, we use Foundation framework's `FloatingPointFormatStyle<Double>.Currency` as an example. `FloatingPointFormatStyle<Double>.Currency` would conform to `CustomRegexComponent` by implementing the `match` function with `Match` being a `Double`. It could also add a static function `.localizedCurrency(code:)` as a member of `RegexProtocol`, so you can refer to it as `.localizedCurrency(code:)` in the `Regex` result builder.
351+
352+
```swift
353+
extension FloatingPointFormatStyle<Double>.Currency : CustomRegexComponent {
354+
func match(
355+
_ input: String,
356+
startingAt index: String.Index,
357+
in bounds: Range<String.Index>
358+
) -> (upperBound: String.Index, match: Double)?
359+
}
360+
361+
extension RegexProtocol where Self == FloatingPointFormatStyle<Double>.Currency {
362+
public static func localizedCurrency(code: Locale.Currency) -> Self
363+
}
364+
```
365+
366+
Users could specify a pattern to match a localized currency amount such as `"$3,020.85"` simply with the following, and use it in any of the string matching algorithms introduced above.
367+
368+
```swift
369+
let regex = Regex {
370+
capture(.localizedCurrency(code: "USD"))
371+
}
372+
```
373+
374+
345375
## Alternatives considered
346376

347377
### Extend `Sequence` instead of `Collection`
