| 1 | +# String processing algorithms |
| 2 | + |
| 3 | +## Introduction |
| 4 | + |
| 5 | +The standard library currently lacks many `String` algorithms that exist in Foundation. We introduce a more coherent set of `Collection` algorithms with a focus on string processing, including support for regular expressions. |
| 6 | + |
| 7 | +## Motivation |
| 8 | + |
| 9 | +TODO |
| 10 | + |
| 11 | +## Proposed solution |
| 12 | + |
| 13 | +We introduce internal infrastructure that allows groups of `Collection` algorithms that perform the same operations on different types to share their implementation, leading to a more coherent set of public APIs. This makes it easier to provide algorithms that work with `RegexProtocol` values, such as: |
| 14 | + |
| 15 | +```swift |
| 16 | +extension BidirectionalCollection where SubSequence == Substring { |
| 17 | + public func ranges<R: RegexProtocol>(of regex: R) -> some Collection<Range<Index>> |
| 18 | +} |
| 19 | +``` |
| 20 | + |
| 21 | +We also introduce the `CustomRegexComponent` protocol that conveniently lets types from outside the standard library participate in regex builders and `RegexProtocol` algorithms: |
| 22 | + |
| 23 | +```swift |
| 24 | +public protocol CustomRegexComponent: RegexProtocol { |
| 25 | + /// Match the input string within the specified bounds, beginning at the given index, and return |
| 26 | + /// the end position (upper bound) of the match and the matched instance. |
| 27 | + /// - Parameters: |
| 28 | + /// - input: The string in which the match is performed. |
| 29 | + /// - index: An index of `input` at which to begin matching. |
| 30 | + /// - bounds: The bounds in `input` in which the match is performed. |
| 31 | + /// - Returns: The upper bound where the match terminates and a matched instance, or nil if |
| 32 | + /// there isn't a match. |
| 33 | + func match( |
| 34 | + _ input: String, |
| 35 | + startingAt index: String.Index, |
| 36 | + in bounds: Range<String.Index> |
| 37 | + ) -> (upperBound: String.Index, match: Match)? |
| 38 | +} |
| 39 | +``` |
| 40 | + |
| 41 | +Consider parsing an HTTP header to capture the date field as a `Date` type: |
| 42 | + |
| 43 | +``` |
| 44 | +HTTP/1.1 301 Redirect |
| 45 | +Date: Wed, 16 Feb 2022 23:53:19 GMT |
| 46 | +Connection: close |
| 47 | +Location: https://www.apple.com/ |
| 48 | +Content-Type: text/html |
| 49 | +Content-Language: en |
| 50 | +``` |
| 51 | + |
| 52 | +You would likely match a substring that looks like a date string (`16 Feb 2022`) and then parse that substring as a `Date` with one of Foundation's date parsers: |
| 53 | + |
| 54 | +```swift |
| 55 | +let regex = Regex { |
| 56 | + capture { |
| 57 | + oneOrMore(.digit) |
| 58 | + " " |
| 59 | + oneOrMore(.word) |
| 60 | + " " |
| 61 | + oneOrMore(.digit) |
| 62 | + } |
| 63 | +} |
| 64 | + |
| 65 | +if let dateMatch = header.firstMatch(of: regex)?.0 { |
| 66 | + let date = try? Date(dateMatch, strategy: .fixed(format: "\(day: .twoDigits) \(month: .abbreviated) \(year: .padded(4))", timeZone: TimeZone(identifier: "GMT")!, locale: Locale(identifier: "en_US"))) |
| 67 | +} |
| 68 | +``` |
| 69 | + |
| 70 | +This works, but wouldn't it be much more approachable if you could use the date parser directly within the string match function? |
| 71 | + |
| 72 | +```swift |
| 73 | +let regex = Regex { |
| 74 | + capture { |
| 75 | + .date(format: "\(day: .twoDigits) \(month: .abbreviated) \(year: .padded(4))", timeZone: TimeZone(identifier: "GMT")!, locale: Locale(identifier: "en_US")) |
| 76 | + } |
| 77 | +} |
| 78 | + |
| 79 | +if let match = header.firstMatch(of: regex) { |
| 80 | + let string = match.0 // "16 Feb 2022" |
| 81 | + let date = match.1 // 2022-02-16 00:00:00 +0000 |
| 82 | +} |
| 83 | +``` |
| 84 | + |
| 85 | +You can do this because the Foundation framework's `Date.ParseStrategy` conforms to `CustomRegexComponent`, defined above. You can also conform your own parser to `CustomRegexComponent`. Conformance is simple: implement the `match` function to return the upper bound of the matched substring together with an instance of the type that the matched range represents. Because the protocol inherits from `RegexProtocol`, a conforming type can be used with all of the string algorithms that take a `RegexProtocol` type. |
| 86 | + |
| 87 | +The Foundation framework's `Date.ParseStrategy` conforms to `CustomRegexComponent` this way. It also adds `date(format:timeZone:locale:)` as a static member of `RegexProtocol`, so you can refer to it as `.date(format:...)` in the `Regex` result builder. |
| 88 | + |
| 89 | +```swift |
| 90 | +extension Date.ParseStrategy : CustomRegexComponent { |
| 91 | + func match( |
| 92 | + _ input: String, |
| 93 | + startingAt index: String.Index, |
| 94 | + in bounds: Range<String.Index> |
| 95 | + ) -> (upperBound: String.Index, match: Date)? |
| 96 | +} |
| 97 | + |
| 98 | +extension RegexProtocol where Self == Date.ParseStrategy { |
| 99 | + public static func date( |
| 100 | + format: Date.FormatString, |
| 101 | + timeZone: TimeZone, |
| 102 | + locale: Locale? = nil |
| 103 | + ) -> Self |
| 104 | +} |
| 105 | +``` |
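
To illustrate what a custom conformance might involve, here is a minimal sketch. It is an assumption-laden stand-in, not part of the pitch: `CustomRegexComponentSketch` is a hypothetical local protocol that mirrors only the `match` requirement (it does not inherit from `RegexProtocol`), and `DecimalIntegerParser` is a hypothetical parser that reads a run of ASCII digits as an `Int`.

```swift
// Hypothetical stand-in for the proposed protocol. It copies only the `match`
// requirement so that the sketch compiles on its own.
protocol CustomRegexComponentSketch {
    associatedtype Match
    func match(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) -> (upperBound: String.Index, match: Match)?
}

// Hypothetical conformer: matches one or more ASCII digits and yields an Int.
struct DecimalIntegerParser: CustomRegexComponentSketch {
    func match(
        _ input: String,
        startingAt index: String.Index,
        in bounds: Range<String.Index>
    ) -> (upperBound: String.Index, match: Int)? {
        var end = index
        // Advance over digits without leaving the permitted bounds.
        while end < bounds.upperBound, input[end].isASCII, input[end].isNumber {
            end = input.index(after: end)
        }
        // No digits consumed, or the value overflows Int: report no match.
        guard end > index, let value = Int(input[index..<end]) else { return nil }
        return (end, value)
    }
}

let text = "port 8080 open"
let start = text.index(text.startIndex, offsetBy: 5)
let result = DecimalIntegerParser().match(
    text, startingAt: start, in: text.startIndex..<text.endIndex)
```

The returned upper bound tells the caller where matching should resume, and the `Int` is the strongly typed value that a capture would expose.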
| 106 | + |
| 107 | +Here's another example of how you can use `FloatingPointFormatStyle<Double>.Currency` to parse a bank statement and record all the monetary values: |
| 108 | + |
| 109 | +```swift |
| 110 | + |
| 111 | +let statement = """ |
| 112 | +CREDIT 04/06/2020 Paypal transfer $4.99 |
| 113 | +DSLIP 04/06/2020 REMOTE ONLINE DEPOSIT $3,020.85 |
| 114 | +CREDIT 04/03/2020 PAYROLL $69.73 |
| 115 | +DEBIT 04/02/2020 ACH TRNSFR ($38.25) |
| 116 | +DEBIT 03/31/2020 Payment to BoA card ($27.44) |
| 117 | +DEBIT 03/24/2020 IRX tax payment ($52,249.98) |
| 118 | +""" |
| 119 | + |
| 120 | +let regex = Regex { |
| 121 | + capture { |
| 122 | + .currency(code: "USD").sign(strategy: .accounting) |
| 123 | + } |
| 124 | +} |
| 125 | + |
| 126 | +let amount = statement.matches(of: regex).map(\.1) |
| 127 | +// [4.99, 3020.85, 69.73, -38.25, -27.44, -52249.98] |
| 128 | +``` |
| 129 | + |
| 130 | +## Detailed design |
| 131 | + |
| 132 | +### `CustomRegexComponent` protocol |
| 133 | + |
| 134 | +The `CustomRegexComponent` protocol inherits from `RegexProtocol` and satisfies its sole requirement. This enables the usage of types that conform to `CustomRegexComponent` in regex builders and `RegexProtocol` algorithms. |
| 135 | + |
| 136 | +```swift |
| 137 | +public protocol CustomRegexComponent: RegexProtocol { |
| 138 | + /// Match the input string within the specified bounds, beginning at the given index, and return |
| 139 | + /// the end position (upper bound) of the match and the matched instance. |
| 140 | + /// - Parameters: |
| 141 | + /// - input: The string in which the match is performed. |
| 142 | + /// - index: An index of `input` at which to begin matching. |
| 143 | + /// - bounds: The bounds in `input` in which the match is performed. |
| 144 | + /// - Returns: The upper bound where the match terminates and a matched instance, or nil if |
| 145 | + /// there isn't a match. |
| 146 | + func match( |
| 147 | + _ input: String, |
| 148 | + startingAt index: String.Index, |
| 149 | + in bounds: Range<String.Index> |
| 150 | + ) -> (upperBound: String.Index, match: Match)? |
| 151 | +} |
| 152 | +``` |
| 153 | + |
| 154 | +### Algorithms |
| 155 | + |
| 156 | +The following algorithms are included in this pitch: |
| 157 | + |
| 158 | +#### Contains |
| 159 | + |
| 160 | +```swift |
| 161 | +extension Collection where Element: Equatable { |
| 162 | + public func contains<S: Sequence>(_ other: S) -> Bool |
| 163 | + where S.Element == Element |
| 164 | +} |
| 165 | + |
| 166 | +extension BidirectionalCollection where SubSequence == Substring { |
| 167 | + public func contains<R: RegexProtocol>(_ regex: R) -> Bool |
| 168 | +} |
| 169 | +``` |
| 170 | + |
| 171 | +#### Starts with |
| 172 | + |
| 173 | +```swift |
| 174 | +extension BidirectionalCollection where SubSequence == Substring { |
| 175 | + public func starts<R: RegexProtocol>(with regex: R) -> Bool |
| 176 | +} |
| 177 | +``` |
| 178 | + |
| 179 | +#### Trim prefix |
| 180 | + |
| 181 | +```swift |
| 182 | +extension Collection { |
| 183 | + public func trimmingPrefix(while predicate: (Element) -> Bool) -> SubSequence |
| 184 | +} |
| 185 | + |
| 186 | +extension Collection where SubSequence == Self { |
| 187 | + public mutating func trimPrefix(while predicate: (Element) -> Bool) |
| 188 | +} |
| 189 | + |
| 190 | +extension RangeReplaceableCollection { |
| 191 | + public mutating func trimPrefix(while predicate: (Element) -> Bool) |
| 192 | +} |
| 193 | + |
| 194 | +extension Collection where Element: Equatable { |
| 195 | + public func trimmingPrefix<Prefix: Collection>(_ prefix: Prefix) -> SubSequence |
| 196 | + where Prefix.Element == Element |
| 197 | +} |
| 198 | + |
| 199 | +extension Collection where SubSequence == Self, Element: Equatable { |
| 200 | + public mutating func trimPrefix<Prefix: Collection>(_ prefix: Prefix) |
| 201 | + where Prefix.Element == Element |
| 202 | +} |
| 203 | + |
| 204 | +extension RangeReplaceableCollection where Element: Equatable { |
| 205 | + public mutating func trimPrefix<Prefix: Collection>(_ prefix: Prefix) |
| 206 | + where Prefix.Element == Element |
| 207 | +} |
| 208 | + |
| 209 | +extension BidirectionalCollection where SubSequence == Substring { |
| 210 | + public func trimmingPrefix<R: RegexProtocol>(_ regex: R) -> SubSequence |
| 211 | +} |
| 212 | + |
| 213 | +extension RangeReplaceableCollection |
| 214 | + where Self: BidirectionalCollection, SubSequence == Substring |
| 215 | +{ |
| 216 | + public mutating func trimPrefix<R: RegexProtocol>(_ regex: R) |
| 217 | +} |
| 218 | +``` |
| 219 | + |
| 220 | +#### First range |
| 221 | + |
| 222 | +```swift |
| 223 | +extension Collection where Element: Equatable { |
| 224 | + public func firstRange<S: Sequence>(of sequence: S) -> Range<Index>? |
| 225 | + where S.Element == Element |
| 226 | +} |
| 227 | + |
| 228 | +extension BidirectionalCollection where Element: Comparable { |
| 229 | + public func firstRange<S: Sequence>(of other: S) -> Range<Index>? |
| 230 | + where S.Element == Element |
| 231 | +} |
| 232 | + |
| 233 | +extension BidirectionalCollection where SubSequence == Substring { |
| 234 | + public func firstRange<R: RegexProtocol>(of regex: R) -> Range<Index>? |
| 235 | +} |
| 236 | +``` |
| 237 | + |
| 238 | +#### Ranges |
| 239 | + |
| 240 | +```swift |
| 241 | +extension Collection where Element: Equatable { |
| 242 | + public func ranges<S: Sequence>(of other: S) -> some Collection<Range<Index>> |
| 243 | + where S.Element == Element |
| 244 | +} |
| 245 | + |
| 246 | +extension BidirectionalCollection where SubSequence == Substring { |
| 247 | + public func ranges<R: RegexProtocol>(of regex: R) -> some Collection<Range<Index>> |
| 248 | +} |
| 249 | +``` |
| 250 | + |
| 251 | +#### First match |
| 252 | + |
| 253 | +```swift |
| 254 | +extension BidirectionalCollection where SubSequence == Substring { |
| 255 | + public func firstMatch<R: RegexProtocol>(of regex: R) -> RegexMatch<R.Match>? |
| 256 | +} |
| 257 | +``` |
| 258 | + |
| 259 | +#### Matches |
| 260 | + |
| 261 | +```swift |
| 262 | +extension BidirectionalCollection where SubSequence == Substring { |
| 263 | + public func matches<R: RegexProtocol>(of regex: R) -> some Collection<RegexMatch<R.Match>> |
| 264 | +} |
| 265 | +``` |
| 266 | + |
| 267 | +#### Replace |
| 268 | + |
| 269 | +```swift |
| 270 | +extension RangeReplaceableCollection where Element: Equatable { |
| 271 | + public func replacing<S: Sequence, Replacement: Collection>( |
| 272 | + _ other: S, |
| 273 | + with replacement: Replacement, |
| 274 | + subrange: Range<Index>, |
| 275 | + maxReplacements: Int = .max |
| 276 | + ) -> Self where S.Element == Element, Replacement.Element == Element |
| 277 | + |
| 278 | + public func replacing<S: Sequence, Replacement: Collection>( |
| 279 | + _ other: S, |
| 280 | + with replacement: Replacement, |
| 281 | + maxReplacements: Int = .max |
| 282 | + ) -> Self where S.Element == Element, Replacement.Element == Element |
| 283 | + |
| 284 | + public mutating func replace<S: Sequence, Replacement: Collection>( |
| 285 | + _ other: S, |
| 286 | + with replacement: Replacement, |
| 287 | + maxReplacements: Int = .max |
| 288 | + ) where S.Element == Element, Replacement.Element == Element |
| 289 | +} |
| 290 | + |
| 291 | +extension RangeReplaceableCollection where SubSequence == Substring { |
| 292 | + public func replacing<R: RegexProtocol, Replacement: Collection>( |
| 293 | + _ regex: R, |
| 294 | + with replacement: Replacement, |
| 295 | + subrange: Range<Index>, |
| 296 | + maxReplacements: Int = .max |
| 297 | + ) -> Self where Replacement.Element == Element |
| 298 | + |
| 299 | + public func replacing<R: RegexProtocol, Replacement: Collection>( |
| 300 | + _ regex: R, |
| 301 | + with replacement: Replacement, |
| 302 | + maxReplacements: Int = .max |
| 303 | + ) -> Self where Replacement.Element == Element |
| 304 | + |
| 305 | + public mutating func replace<R: RegexProtocol, Replacement: Collection>( |
| 306 | + _ regex: R, |
| 307 | + with replacement: Replacement, |
| 308 | + maxReplacements: Int = .max |
| 309 | + ) where Replacement.Element == Element |
| 310 | + |
| 311 | + public func replacing<R: RegexProtocol, Replacement: Collection>( |
| 312 | + _ regex: R, |
| 313 | + with replacement: (RegexMatch<R.Match>) throws -> Replacement, |
| 314 | + subrange: Range<Index>, |
| 315 | + maxReplacements: Int = .max |
| 316 | + ) rethrows -> Self where Replacement.Element == Element |
| 317 | + |
| 318 | + public func replacing<R: RegexProtocol, Replacement: Collection>( |
| 319 | + _ regex: R, |
| 320 | + with replacement: (RegexMatch<R.Match>) throws -> Replacement, |
| 321 | + maxReplacements: Int = .max |
| 322 | + ) rethrows -> Self where Replacement.Element == Element |
| 323 | + |
| 324 | + public mutating func replace<R: RegexProtocol, Replacement: Collection>( |
| 325 | + _ regex: R, |
| 326 | + with replacement: (RegexMatch<R.Match>) throws -> Replacement, |
| 327 | + maxReplacements: Int = .max |
| 328 | + ) rethrows where Replacement.Element == Element |
| 329 | +} |
| 330 | +``` |
| 331 | + |
| 332 | +#### Split |
| 333 | + |
| 334 | +```swift |
| 335 | +extension Collection where Element: Equatable { |
| 336 | + public func split<S: Sequence>(by separator: S) -> some Collection<SubSequence> |
| 337 | + where S.Element == Element |
| 338 | +} |
| 339 | + |
| 340 | +extension BidirectionalCollection where SubSequence == Substring { |
| 341 | + public func split<R: RegexProtocol>(by separator: R) -> some Collection<Substring> |
| 342 | +} |
| 343 | +``` |
| 344 | + |
| 345 | +## Alternatives considered |
| 346 | + |
| 347 | +### Extend `Sequence` instead of `Collection` |
| 348 | + |
| 349 | +All of the proposed algorithms are specific to the `Collection` protocol, without support for plain `Sequence`s. Types conforming to the `Sequence` protocol are not required to support multi-pass iteration, which makes a `Sequence` conformance insufficient for most of these algorithms. In light of this, the decision was made to have the underlying shared algorithm implementations work exclusively with `Collection`s. |
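
The multi-pass limitation can be seen in a small sketch. The names `counter`, `singlePass`, and `multiPass` are illustrative only; the example relies on `AnyIterator`, which is a `Sequence` whose traversal consumes the state captured by its closure.

```swift
// A single-pass sequence: the state lives in a captured variable, so a second
// traversal continues where the first one stopped.
var counter = 0
let singlePass = AnyIterator<Int> {
    guard counter < 3 else { return nil }
    counter += 1
    return counter
}

let firstPass = Array(singlePass)
let secondPass = Array(singlePass)

// A collection can be traversed repeatedly with the same result.
let multiPass = [1, 2, 3]
let firstCollectionPass = Array(multiPass)
let secondCollectionPass = Array(multiPass)

print(firstPass, secondPass)
print(firstCollectionPass, secondCollectionPass)
```

An algorithm that has to revisit earlier elements, for example to retry a match at the next position, cannot be written against the first kind of value without buffering.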
| 350 | + |
| 351 | +## Future directions |
| 352 | + |
| 353 | +### Backward algorithms |
| 354 | + |
| 355 | +There are some unanswered questions about algorithms that operate from the back of a collection. |
| 356 | + |
| 357 | +There is a subtle difference between finding the last non-overlapping range of a pattern in a string, and finding the first range of this pattern when searching from the back. `"aaaaa".ranges(of: "aa")` produces two non-overlapping ranges, splitting the string into the chunks `aa|aa|a`. It would not be completely unreasonable to expect `"aaaaa".lastRange(of: "aa")` to be shorthand for `"aaaaa".ranges(of: "aa").last`, i.e. to return the range that contains the third and fourth characters of the string. Yet, the first range of `"aa"` when searching from the back of the string yields the range that contains the fourth and fifth characters. |
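
The two notions can be compared with a small sketch. The helpers `forwardRanges(of:in:)`, `firstRangeFromBack(of:in:)`, and `offsets(_:in:)` are hypothetical and written naively for illustration; they are not the pitched implementations.

```swift
// Hypothetical helper: non-overlapping ranges of `pattern`, scanning front to back.
func forwardRanges(of pattern: String, in text: String) -> [Range<String.Index>] {
    guard !pattern.isEmpty else { return [] }
    var result: [Range<String.Index>] = []
    var position = text.startIndex
    while position < text.endIndex {
        if text[position...].starts(with: pattern) {
            let end = text.index(position, offsetBy: pattern.count)
            result.append(position..<end)
            position = end  // Resume after the match so ranges never overlap.
        } else {
            position = text.index(after: position)
        }
    }
    return result
}

// Hypothetical helper: the first range of `pattern` found when scanning back to front.
func firstRangeFromBack(of pattern: String, in text: String) -> Range<String.Index>? {
    guard !pattern.isEmpty else { return nil }
    var end = text.endIndex
    while let start = text.index(end, offsetBy: -pattern.count, limitedBy: text.startIndex) {
        if text[start..<end].elementsEqual(pattern) { return start..<end }
        end = text.index(before: end)
    }
    return nil
}

// Converts a range of string indices into character offsets for display.
func offsets(_ range: Range<String.Index>, in text: String) -> Range<Int> {
    let lower = text.distance(from: text.startIndex, to: range.lowerBound)
    let upper = text.distance(from: text.startIndex, to: range.upperBound)
    return lower..<upper
}

let text = "aaaaa"
let lastForward = forwardRanges(of: "aa", in: text).last.map { offsets($0, in: text) }
let firstBackward = firstRangeFromBack(of: "aa", in: text).map { offsets($0, in: text) }
print(lastForward as Any, firstBackward as Any)
```

With these helpers, the last forward range covers offsets 2 and 3 (the third and fourth characters), while the backward search covers offsets 3 and 4 (the fourth and fifth characters).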
| 358 | + |
| 359 | +It is not obvious whether both of these notions of what it means for a range to be the "last" range should be supported, or what names should be used in order to disambiguate them. It is also worth noting that some kinds of patterns do behave nicely and always produce the same results when searching forwards or backwards, e.g. `myInts.lastIndex(where: { $0 > 10 })` is unambiguous. These kinds of patterns might warrant special treatment when designing algorithms that process the collection in reverse. |
| 360 | + |
| 361 | +Similar questions arise when trimming a string from both sides: `"ababa".trimming("aba")` can return either `"ba"` or `"ab"`, depending on whether the prefix or the suffix was trimmed first. |
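
The order dependence can be reproduced with two hypothetical helpers, `trimmingPrefixOnce(_:from:)` and `trimmingSuffixOnce(_:from:)`, each of which removes at most one occurrence of the given affix. This is only a sketch of the ambiguity, not a proposed API.

```swift
// Hypothetical helper: drops `affix` from the front of `text` if it is present.
func trimmingPrefixOnce(_ affix: String, from text: Substring) -> Substring {
    text.starts(with: affix) ? text.dropFirst(affix.count) : text
}

// Hypothetical helper: drops `affix` from the back of `text` if it is present.
func trimmingSuffixOnce(_ affix: String, from text: Substring) -> Substring {
    text.suffix(affix.count).elementsEqual(affix) ? text.dropLast(affix.count) : text
}

let input: Substring = "ababa"

// Trim the prefix first, then the suffix of what remains.
let prefixFirst = trimmingSuffixOnce("aba", from: trimmingPrefixOnce("aba", from: input))

// Trim the suffix first, then the prefix of what remains.
let suffixFirst = trimmingPrefixOnce("aba", from: trimmingSuffixOnce("aba", from: input))

print(prefixFirst, suffixFirst)
```

Because the prefix and suffix occurrences overlap in the middle `a`, whichever side is trimmed first removes a character that the other side would have needed.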
| 362 | + |
| 363 | +### Throwing closures |
| 364 | + |
| 365 | +The closure parameters of `trimPrefix(while:)` and `replace(_:with:)` aren't marked `throws` and the methods themselves aren't marked `rethrows`, because the shared implementations of these groups of related algorithms do not yet support error handling. |
| 366 | + |
| 367 | +### Open up the shared algorithm implementations for user-defined types |
| 368 | + |
| 369 | +At this point we have not settled on a final design for the protocol hierarchy that the shared algorithm implementations rely on, so we are not ready to expose this infrastructure and stabilize the entire ABI. We aim to eventually let users pass their own types to these `Collection` algorithms without having to go through the `RegexProtocol` overload, which creates an intermediate `Regex` instance. |