Commit fea786a

Tim Vermeulen authored
Add string processing algorithms pitch draft
1 parent ad4966c commit fea786a

1 file changed

+369
-0
lines changed

@@ -0,0 +1,369 @@
1+
# String processing algorithms
2+
3+
## Introduction
4+
5+
The standard library is currently missing a large number of `String` algorithms that do exist in Foundation. We introduce a more coherent set of `Collection` algorithms with a focus on string processing, including support for regular expressions.
6+
7+
## Motivation
8+
9+
TODO
10+
11+
## Proposed solution
12+
13+
We introduce internal infrastructure that allows groups of `Collection` algorithms that perform the same operations on different types to share their implementation, leading to a more coherent set of public APIs. This allows us to more easily provide algorithms that work with `RegexProtocol` values, such as:
14+
15+
```swift
16+
extension BidirectionalCollection where SubSequence == Substring {
17+
public func ranges<R: RegexProtocol>(of regex: R) -> some Collection<Range<Index>>
18+
}
19+
```
20+
21+
We also introduce the `CustomRegexComponent` protocol that conveniently lets types from outside the standard library participate in regex builders and `RegexProtocol` algorithms:
22+
23+
```swift
24+
public protocol CustomRegexComponent: RegexProtocol {
25+
/// Match the input string within the specified bounds, beginning at the given index, and return
26+
/// the end position (upper bound) of the match and the matched instance.
27+
/// - Parameters:
28+
/// - input: The string in which the match is performed.
29+
/// - index: An index of `input` at which to begin matching.
30+
/// - bounds: The bounds in `input` in which the match is performed.
31+
/// - Returns: The upper bound where the match terminates and a matched instance, or nil if
32+
/// there isn't a match.
33+
func match(
34+
_ input: String,
35+
startingAt index: String.Index,
36+
in bounds: Range<String.Index>
37+
) -> (upperBound: String.Index, match: Match)?
38+
}
39+
```
40+
41+
Consider parsing an HTTP header to capture the date field as a `Date` type:
42+
43+
```
44+
HTTP/1.1 301 Redirect
45+
Date: Wed, 16 Feb 2022 23:53:19 GMT
46+
Connection: close
47+
Location: https://www.apple.com/
48+
Content-Type: text/html
49+
Content-Language: en
50+
```
51+
52+
You are likely going to match a substring that looks like a date string (`16 Feb 2022`), and parse the substring as a `Date` with one of Foundation's date parsers:
53+
54+
```swift
55+
let regex = Regex {
56+
capture {
57+
oneOrMore(.digit)
58+
" "
59+
oneOrMore(.word)
60+
" "
61+
oneOrMore(.digit)
62+
}
63+
}
64+
65+
if let dateMatch = header.firstMatch(of: regex)?.0 {
66+
let date = try? Date(dateMatch, strategy: .fixed(format: "\(day: .twoDigits) \(month: .abbreviated) \(year: .padded(4))", timeZone: TimeZone(identifier: "GMT")!, locale: Locale(identifier: "en_US")))
67+
}
68+
```
69+
70+
This works, but wouldn't it be much more approachable if you could use the date parser directly within the string match function?
71+
72+
```swift
73+
let regex = Regex {
74+
capture {
75+
.date(format: "\(day: .twoDigits) \(month: .abbreviated) \(year: .padded(4))", timeZone: TimeZone(identifier: "GMT")!, locale: Locale(identifier: "en_US"))
76+
}
77+
}
78+
79+
if let match = header.firstMatch(of: regex) {
80+
let string = match.0 // "16 Feb 2022"
81+
let date = match.1 // 2022-02-16 00:00:00 +0000
82+
}
83+
```
84+
85+
You can do this because Foundation's `Date.ParseStrategy` conforms to `CustomRegexComponent`, defined above. You can also conform your own parser to `CustomRegexComponent`. Conformance is simple: implement the `match` function to return the upper bound of the matched substring and the value parsed from the matched range. Because `CustomRegexComponent` inherits from `RegexProtocol`, a conforming type can be used with all of the string algorithms that take a `RegexProtocol` value.
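
To make the shape of a conformance concrete, here is a minimal sketch of a parser that implements the `match` requirement. Since `RegexProtocol` is not available outside this pitch, the sketch uses a hypothetical stand-in protocol, `SketchMatcher`, that copies only the `match` signature; `DigitRunParser` is likewise an invented example type, not part of the proposal.

```swift
// Hypothetical stand-in for `CustomRegexComponent`: it mirrors only the
// `match` requirement and does not inherit from `RegexProtocol`.
protocol SketchMatcher {
    associatedtype Match
    func match(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) -> (upperBound: String.Index, match: Match)?
}

// Hypothetical parser: consumes a run of ASCII digits and yields an `Int`.
struct DigitRunParser: SketchMatcher {
    func match(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) -> (upperBound: String.Index, match: Int)? {
        var end = index
        // Advance while we stay inside the bounds and see ASCII digits.
        while end < bounds.upperBound, input[end].isASCII, input[end].isNumber {
            end = input.index(after: end)
        }
        // No digits consumed (or the value overflows `Int`): report no match.
        guard end > index, let value = Int(input[index..<end]) else { return nil }
        return (upperBound: end, match: value)
    }
}

let line = "Content-Length: 348"
let digitsStart = line.firstIndex(of: "3")!
let parsed = DigitRunParser().match(
    line,
    startingAt: digitsStart,
    in: line.startIndex..<line.endIndex
)
// parsed?.match is 348 and parsed?.upperBound is line.endIndex
```

The returned upper bound tells the caller where matching should resume, while the `match` value carries the strongly typed result.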
86+
87+
Foundation's `Date.ParseStrategy` conforms to `CustomRegexComponent` this way. It also adds a static function `date(format:timeZone:locale:)` as a static member of `RegexProtocol`, so you can refer to it as `.date(format:...)` in the `Regex` result builder.
88+
89+
```swift
90+
extension Date.ParseStrategy : CustomRegexComponent {
91+
func match(
92+
_ input: String,
93+
startingAt index: String.Index,
94+
in bounds: Range<String.Index>
95+
) -> (upperBound: String.Index, match: Date)?
96+
}
97+
98+
extension RegexProtocol where Self == Date.ParseStrategy {
99+
public static func date(
100+
format: Date.FormatString,
101+
timeZone: TimeZone,
102+
locale: Locale? = nil
103+
) -> Self
104+
}
105+
```
106+
107+
Here's another example of how you can use `FloatingPointFormatStyle<Double>.Currency` to parse a bank statement and record all the monetary values:
108+
109+
```swift
110+
111+
let statement = """
112+
CREDIT 04/06/2020 Paypal transfer $4.99
113+
DSLIP 04/06/2020 REMOTE ONLINE DEPOSIT $3,020.85
114+
CREDIT 04/03/2020 PAYROLL $69.73
115+
DEBIT 04/02/2020 ACH TRNSFR ($38.25)
116+
DEBIT 03/31/2020 Payment to BoA card ($27.44)
117+
DEBIT 03/24/2020 IRX tax payment ($52,249.98)
118+
"""
119+
120+
let regex = Regex {
121+
capture {
122+
.currency(code: "USD").sign(strategy: .accounting)
123+
}
124+
}
125+
126+
let amounts = statement.matches(of: regex).map(\.1)
127+
// [4.99, 3020.85, 69.73, -38.25, -27.44, -52249.98]
128+
```
129+
130+
## Detailed design
131+
132+
### `CustomRegexComponent` protocol
133+
134+
The `CustomRegexComponent` protocol inherits from `RegexProtocol` and satisfies its sole requirement. This enables the usage of types that conform to `CustomRegexComponent` in regex builders and `RegexProtocol` algorithms.
135+
136+
```swift
137+
public protocol CustomRegexComponent: RegexProtocol {
138+
/// Match the input string within the specified bounds, beginning at the given index, and return
139+
/// the end position (upper bound) of the match and the matched instance.
140+
/// - Parameters:
141+
/// - input: The string in which the match is performed.
142+
/// - index: An index of `input` at which to begin matching.
143+
/// - bounds: The bounds in `input` in which the match is performed.
144+
/// - Returns: The upper bound where the match terminates and a matched instance, or nil if
145+
/// there isn't a match.
146+
func match(
147+
_ input: String,
148+
startingAt index: String.Index,
149+
in bounds: Range<String.Index>
150+
) -> (upperBound: String.Index, match: Match)?
151+
}
152+
```
153+
154+
### Algorithms
155+
156+
The following algorithms are included in this pitch:
157+
158+
#### Contains
159+
160+
```swift
161+
extension Collection where Element: Equatable {
162+
public func contains<S: Sequence>(_ other: S) -> Bool
163+
where S.Element == Element
164+
}
165+
166+
extension BidirectionalCollection where SubSequence == Substring {
167+
public func contains<R: RegexProtocol>(_ regex: R) -> Bool
168+
}
169+
```
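
As an illustration of what the element-based overload is expected to do, here is a naive sketch. The name `sketchContains` is hypothetical, the argument is restricted to a `Collection` for simplicity, and treating an empty pattern as always contained is an assumption; the pitch's shared implementation is not shown in this document and may differ.

```swift
extension Collection where Element: Equatable {
    /// Hypothetical helper: naive O(n * m) subsequence search.
    func sketchContains<C: Collection>(_ other: C) -> Bool
        where C.Element == Element
    {
        // Assumption: an empty pattern is contained in every collection.
        if other.isEmpty { return true }
        var start = startIndex
        while start != endIndex {
            // Check whether `other` occurs at the current position.
            if self[start...].starts(with: other) { return true }
            formIndex(after: &start)
        }
        return false
    }
}

let hasRun = [1, 2, 3, 4].sketchContains([2, 3])          // true
let hasReversed = [1, 2, 3, 4].sketchContains([3, 2])     // false
let hasClose = "Connection: close".sketchContains("close") // true
```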
170+
171+
#### Starts with
172+
173+
```swift
174+
extension BidirectionalCollection where SubSequence == Substring {
175+
public func starts<R: RegexProtocol>(with regex: R) -> Bool
176+
}
177+
```
178+
179+
#### Trim prefix
180+
181+
```swift
182+
extension Collection {
183+
public func trimmingPrefix(while predicate: (Element) -> Bool) -> SubSequence
184+
}
185+
186+
extension Collection where SubSequence == Self {
187+
public mutating func trimPrefix(while predicate: (Element) -> Bool)
188+
}
189+
190+
extension RangeReplaceableCollection {
191+
public mutating func trimPrefix(while predicate: (Element) -> Bool)
192+
}
193+
194+
extension Collection where Element: Equatable {
195+
public func trimmingPrefix<Prefix: Collection>(_ prefix: Prefix) -> SubSequence
196+
where Prefix.Element == Element
197+
}
198+
199+
extension Collection where SubSequence == Self, Element: Equatable {
200+
public mutating func trimPrefix<Prefix: Collection>(_ prefix: Prefix)
201+
where Prefix.Element == Element
202+
}
203+
204+
extension RangeReplaceableCollection where Element: Equatable {
205+
public mutating func trimPrefix<Prefix: Collection>(_ prefix: Prefix)
206+
where Prefix.Element == Element
207+
}
208+
209+
extension BidirectionalCollection where SubSequence == Substring {
210+
public func trimmingPrefix<R: RegexProtocol>(_ regex: R) -> SubSequence
211+
}
212+
213+
extension RangeReplaceableCollection
214+
where Self: BidirectionalCollection, SubSequence == Substring
215+
{
216+
public mutating func trimPrefix<R: RegexProtocol>(_ regex: R)
217+
}
218+
```
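
The non-mutating, non-regex variants can be sketched on top of existing standard library methods. The `sketchTrimmingPrefix` names are hypothetical and chosen to avoid clashing with the proposed API; this is an illustration of the intended behavior, not the shared implementation the pitch describes.

```swift
extension Collection {
    /// Hypothetical helper: drops the longest prefix whose elements satisfy `predicate`.
    func sketchTrimmingPrefix(while predicate: (Element) -> Bool) -> SubSequence {
        return drop(while: predicate)
    }
}

extension Collection where Element: Equatable {
    /// Hypothetical helper: drops `prefix` if the collection starts with it,
    /// and otherwise returns the whole collection as a subsequence.
    func sketchTrimmingPrefix<Prefix: Collection>(_ prefix: Prefix) -> SubSequence
        where Prefix.Element == Element
    {
        guard starts(with: prefix) else { return self[startIndex...] }
        return dropFirst(prefix.count)
    }
}

let indented = "   Date: Wed".sketchTrimmingPrefix(while: { $0 == " " })
// "Date: Wed"
let version = "HTTP/1.1 301 Redirect".sketchTrimmingPrefix("HTTP/")
// "1.1 301 Redirect"
let unchanged = "Connection: close".sketchTrimmingPrefix("HTTP/")
// "Connection: close"
```

The mutating `trimPrefix` variants would presumably assign the trimmed result back to `self`, which is why they require either `SubSequence == Self` or `RangeReplaceableCollection`.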
219+
220+
#### First range
221+
222+
```swift
223+
extension Collection where Element: Equatable {
224+
public func firstRange<S: Sequence>(of sequence: S) -> Range<Index>?
225+
where S.Element == Element
226+
}
227+
228+
extension BidirectionalCollection where Element: Comparable {
229+
public func firstRange<S: Sequence>(of other: S) -> Range<Index>?
230+
where S.Element == Element
231+
}
232+
233+
extension BidirectionalCollection where SubSequence == Substring {
234+
public func firstRange<R: RegexProtocol>(of regex: R) -> Range<Index>?
235+
}
236+
```
237+
238+
#### Ranges
239+
240+
```swift
241+
extension Collection where Element: Equatable {
242+
public func ranges<S: Sequence>(of other: S) -> some Collection<Range<Index>>
243+
where S.Element == Element
244+
}
245+
246+
extension BidirectionalCollection where SubSequence == Substring {
247+
public func ranges<R: RegexProtocol>(of regex: R) -> some Collection<Range<Index>>
248+
}
249+
```
250+
251+
#### First match
252+
253+
```swift
254+
extension BidirectionalCollection where SubSequence == Substring {
255+
public func firstMatch<R: RegexProtocol>(of regex: R) -> RegexMatch<R.Match>?
256+
}
257+
```
258+
259+
#### Matches
260+
261+
```swift
262+
extension BidirectionalCollection where SubSequence == Substring {
263+
public func matches<R: RegexProtocol>(of regex: R) -> some Collection<RegexMatch<R.Match>>
264+
}
265+
```
266+
267+
#### Replace
268+
269+
```swift
270+
extension RangeReplaceableCollection where Element: Equatable {
271+
public func replacing<S: Sequence, Replacement: Collection>(
272+
_ other: S,
273+
with replacement: Replacement,
274+
subrange: Range<Index>,
275+
maxReplacements: Int = .max
276+
) -> Self where S.Element == Element, Replacement.Element == Element
277+
278+
public func replacing<S: Sequence, Replacement: Collection>(
279+
_ other: S,
280+
with replacement: Replacement,
281+
maxReplacements: Int = .max
282+
) -> Self where S.Element == Element, Replacement.Element == Element
283+
284+
public mutating func replace<S: Sequence, Replacement: Collection>(
285+
_ other: S,
286+
with replacement: Replacement,
287+
maxReplacements: Int = .max
288+
) where S.Element == Element, Replacement.Element == Element
289+
}
290+
291+
extension RangeReplaceableCollection where SubSequence == Substring {
292+
public func replacing<R: RegexProtocol, Replacement: Collection>(
293+
_ regex: R,
294+
with replacement: Replacement,
295+
subrange: Range<Index>,
296+
maxReplacements: Int = .max
297+
) -> Self where Replacement.Element == Element
298+
299+
public func replacing<R: RegexProtocol, Replacement: Collection>(
300+
_ regex: R,
301+
with replacement: Replacement,
302+
maxReplacements: Int = .max
303+
) -> Self where Replacement.Element == Element
304+
305+
public mutating func replace<R: RegexProtocol, Replacement: Collection>(
306+
_ regex: R,
307+
with replacement: Replacement,
308+
maxReplacements: Int = .max
309+
) where Replacement.Element == Element
310+
311+
public func replacing<R: RegexProtocol, Replacement: Collection>(
312+
_ regex: R,
313+
with replacement: (RegexMatch<R.Match>) throws -> Replacement,
314+
subrange: Range<Index>,
315+
maxReplacements: Int = .max
316+
) rethrows -> Self where Replacement.Element == Element
317+
318+
public func replacing<R: RegexProtocol, Replacement: Collection>(
319+
_ regex: R,
320+
with replacement: (RegexMatch<R.Match>) throws -> Replacement,
321+
maxReplacements: Int = .max
322+
) rethrows -> Self where Replacement.Element == Element
323+
324+
public mutating func replace<R: RegexProtocol, Replacement: Collection>(
325+
_ regex: R,
326+
with replacement: (RegexMatch<R.Match>) throws -> Replacement,
327+
maxReplacements: Int = .max
328+
) rethrows where Replacement.Element == Element
329+
}
330+
```
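
To show how `maxReplacements` is expected to limit the work, here is a naive sketch of the element-based, non-mutating overload without the `subrange` parameter. The name `sketchReplacing` is hypothetical, the pattern is restricted to a `Collection`, and leaving the input untouched for an empty pattern is an assumption.

```swift
extension RangeReplaceableCollection where Element: Equatable {
    /// Hypothetical helper: replaces up to `maxReplacements` non-overlapping
    /// occurrences of `other`, scanning from the front.
    func sketchReplacing<C: Collection, Replacement: Collection>(
        _ other: C,
        with replacement: Replacement,
        maxReplacements: Int = .max
    ) -> Self where C.Element == Element, Replacement.Element == Element {
        // Assumption: an empty pattern never matches.
        guard !other.isEmpty else { return self }
        var result = Self()
        var remaining = maxReplacements
        let length = other.count
        var position = startIndex
        while position != endIndex {
            if remaining > 0, self[position...].starts(with: other) {
                // Emit the replacement and skip past the matched elements.
                result.append(contentsOf: replacement)
                position = index(position, offsetBy: length)
                remaining -= 1
            } else {
                // Copy the current element through unchanged.
                result.append(self[position])
                formIndex(after: &position)
            }
        }
        return result
    }
}

let allReplaced = "a-b-c".sketchReplacing("-", with: "+")
// "a+b+c"
let firstOnly = "a-b-c".sketchReplacing("-", with: "+", maxReplacements: 1)
// "a+b-c"
let numbers = [1, 2, 1, 2].sketchReplacing([1, 2], with: [0])
// [0, 0]
```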
331+
332+
#### Split
333+
334+
```swift
335+
extension Collection where Element: Equatable {
336+
public func split<S: Sequence>(by separator: S) -> some Collection<SubSequence>
337+
where S.Element == Element
338+
}
339+
340+
extension BidirectionalCollection where SubSequence == Substring {
341+
public func split<R: RegexProtocol>(by separator: R) -> some Collection<Substring>
342+
}
343+
```
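
The element-based `split(by:)` differs from the existing `split(separator:)` in that the separator is a whole sequence of elements. A naive sketch follows; the name `sketchSplit` is hypothetical, the result is an eager array rather than an opaque collection, and keeping empty pieces is an assumption, since the pitch does not state how empty subsequences are handled.

```swift
extension Collection where Element: Equatable {
    /// Hypothetical helper: splits around each non-overlapping occurrence
    /// of `separator`, keeping empty pieces.
    func sketchSplit<C: Collection>(by separator: C) -> [SubSequence]
        where C.Element == Element
    {
        // Assumption: an empty separator yields the whole collection as one piece.
        guard !separator.isEmpty else { return [self[startIndex...]] }
        var pieces: [SubSequence] = []
        let length = separator.count
        var pieceStart = startIndex
        var position = startIndex
        while position != endIndex {
            if self[position...].starts(with: separator) {
                // Close the current piece and continue after the separator.
                pieces.append(self[pieceStart..<position])
                position = index(position, offsetBy: length)
                pieceStart = position
            } else {
                formIndex(after: &position)
            }
        }
        pieces.append(self[pieceStart..<endIndex])
        return pieces
    }
}

let languages = "en, fr, de".sketchSplit(by: ", ").map { String($0) }
// ["en", "fr", "de"]
let withEmpty = "a,,b".sketchSplit(by: ",").map { String($0) }
// ["a", "", "b"]
```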
344+
345+
## Alternatives considered
346+
347+
### Extend `Sequence` instead of `Collection`
348+
349+
All of the proposed algorithms are specific to the `Collection` protocol, without support for plain `Sequence`s. Types conforming to the `Sequence` protocol are not required to support multi-pass iteration, which makes a `Sequence` conformance insufficient for most of these algorithms. In light of this, the decision was made to have the underlying shared algorithm implementations work exclusively with `Collection`s.
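
The multi-pass limitation can be demonstrated with a small single-pass sequence. `Countdown` is a hypothetical type written only for this illustration: because its iteration state lives in a shared class instance, a second traversal sees nothing, which is exactly the situation a search algorithm that needs to revisit elements cannot tolerate.

```swift
// Hypothetical single-pass sequence: the sequence is its own iterator,
// so every traversal consumes the same shared state.
final class Countdown: Sequence, IteratorProtocol {
    private var current: Int

    init(from start: Int) {
        current = start
    }

    func next() -> Int? {
        guard current > 0 else { return nil }
        defer { current -= 1 }
        return current
    }
}

let countdown = Countdown(from: 3)
let firstPass = Array(countdown)   // [3, 2, 1]
let secondPass = Array(countdown)  // [] because the elements were already consumed
```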
350+
351+
## Future directions
352+
353+
### Backward algorithms
354+
355+
There are some unanswered questions about algorithms that operate from the back of a collection.
356+
357+
There is a subtle difference between finding the last non-overlapping range of a pattern in a string, and finding the first range of this pattern when searching from the back. `"aaaaa".ranges(of: "aa")` produces two non-overlapping ranges, splitting the string into the chunks `aa|aa|a`. It would not be completely unreasonable to expect `"aaaaa".lastRange(of: "aa")` to be shorthand for `"aaaaa".ranges(of: "aa").last`, i.e. to return the range that contains the third and fourth characters of the string. Yet, the first range of `"aa"` when searching from the back of the string yields the range that contains the fourth and fifth characters.
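
The two notions of "last" can be compared directly with a small sketch. Both functions below are hypothetical helpers that work on character offsets rather than `String.Index` ranges, purely to keep the output easy to read.

```swift
// Hypothetical helper: offsets of non-overlapping matches, scanning left to right.
func forwardMatchOffsets(of pattern: String, in text: String) -> [Int] {
    let characters = Array(text)
    let needle = Array(pattern)
    var offsets: [Int] = []
    var i = 0
    while !needle.isEmpty, i + needle.count <= characters.count {
        if characters[i..<(i + needle.count)].elementsEqual(needle) {
            offsets.append(i)
            i += needle.count  // skip the match so ranges never overlap
        } else {
            i += 1
        }
    }
    return offsets
}

// Hypothetical helper: offset of the first match found when scanning right to left.
func firstMatchOffsetFromBack(of pattern: String, in text: String) -> Int? {
    let characters = Array(text)
    let needle = Array(pattern)
    guard !needle.isEmpty, needle.count <= characters.count else { return nil }
    var i = characters.count - needle.count
    while i >= 0 {
        if characters[i..<(i + needle.count)].elementsEqual(needle) { return i }
        i -= 1
    }
    return nil
}

let forward = forwardMatchOffsets(of: "aa", in: "aaaaa")
// [0, 2]: the last forward match covers the third and fourth characters
let fromBack = firstMatchOffsetFromBack(of: "aa", in: "aaaaa")
// 3: the backward match covers the fourth and fifth characters
```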
358+
359+
It is not obvious whether both of these notions of what it means for a range to be the "last" range should be supported, or what names should be used in order to disambiguate them. It is also worth noting that some kinds of patterns do behave nicely and always produce the same results when searching forwards or backwards, e.g. `myInts.lastIndex(where: { $0 > 10 })` is unambiguous. These kinds of patterns might warrant special treatment when designing algorithms that process the collection in reverse.
360+
361+
Similar questions arise when trimming a string from both sides: `"ababa".trimming("aba")` can return either `"ba"` or `"ab"`, depending on whether the prefix or the suffix was trimmed first.
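
The order dependence is easy to reproduce with existing `String` methods. The two functions below are hypothetical helpers that trim at most one occurrence from each side and differ only in which side they trim first.

```swift
// Hypothetical helper: trim the prefix first, then the suffix.
func trimmingPrefixFirst(_ text: String, _ pattern: String) -> Substring {
    var result = text[...]
    if result.hasPrefix(pattern) { result = result.dropFirst(pattern.count) }
    if result.hasSuffix(pattern) { result = result.dropLast(pattern.count) }
    return result
}

// Hypothetical helper: trim the suffix first, then the prefix.
func trimmingSuffixFirst(_ text: String, _ pattern: String) -> Substring {
    var result = text[...]
    if result.hasSuffix(pattern) { result = result.dropLast(pattern.count) }
    if result.hasPrefix(pattern) { result = result.dropFirst(pattern.count) }
    return result
}

let prefixFirst = trimmingPrefixFirst("ababa", "aba")  // "ba"
let suffixFirst = trimmingSuffixFirst("ababa", "aba")  // "ab"
```

Because the prefix and suffix occurrences overlap in the middle `a`, removing one of them destroys the other, so the two orders cannot agree.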
362+
363+
### Throwing closures
364+
365+
The closure parameters of `trimPrefix(while:)` and `replace(_:with:)` aren't marked `throws` and the methods themselves aren't marked `rethrows`, because the shared implementations of these groups of related algorithms do not yet support error handling.
366+
367+
### Open up the shared algorithm implementations for user-defined types
368+
369+
At this point we have not settled on a final design for the protocol hierarchy that the shared algorithm implementations rely on, so we are not ready to expose this infrastructure and stabilize the entire ABI. We aim to eventually let users pass their own types to these `Collection` algorithms without having to go through the `RegexProtocol` overload, which creates an intermediate `Regex` instance.
