| 1 | +# Regex lookbehind assertions |
| 2 | + |
| 3 | +* Proposal: [SE-0448](0448-regex-lookbehind-assertions.md) |
| 4 | +* Authors: [Jacob Hearst](https://github.com/JacobHearst), [Michael Ilseman](https://github.com/milseman) |
| 5 | +* Review Manager: [Steve Canon](https://github.com/stephentyrone) |
| 6 | +* Status: **Active review (September 17...October 1, 2024)** |
| 7 | +* Implementation: https://github.com/swiftlang/swift-experimental-string-processing/pull/760 |
| 8 | +* Review: ([pitch](https://github.com/swiftlang/swift-evolution/pull/2525))([review]) |
| 9 | + |
| 10 | + |
| 11 | +## Introduction |
| 12 | + |
| 13 | +Regex supports lookahead assertions, but does not currently support lookbehind assertions. We propose adding these. |
| 14 | + |
| 15 | +## Motivation |
| 16 | + |
| 17 | +Modern regular expression engines support lookbehind assertions, whether fixed-length (Perl, PCRE2, Python, Java) or arbitrary-length (.NET, JavaScript). |
| 18 | + |
| 19 | +## Proposed solution |
| 20 | + |
| 21 | +We propose supporting arbitrary-length lookbehind assertions, which can be implemented by performing matching in reverse. |
| 22 | + |
| 23 | +Like lookahead assertions, lookbehind assertions are _zero-width_, meaning they do not affect the current match position. |
| 24 | + |
| 25 | +Examples: |
| 26 | + |
| 27 | + |
| 28 | +```swift |
| 29 | +"abc".firstMatch(of: /a(?<=a)bc/) // matches "abc" |
| 30 | +"abc".firstMatch(of: /a(?<=b)c/) // no match |
| 31 | +"abc".firstMatch(of: /a(?<=.)./) // matches "ab" |
| 32 | +"abc".firstMatch(of: /ab(?<=a)c/) // no match |
| 33 | +"abc".firstMatch(of: /ab(?<=.a)c/) // no match |
| 34 | +"abc".firstMatch(of: /ab(?<=a.)c/) // matches "abc" |
| 35 | +``` |
| 36 | + |
| 37 | +Lookbehind assertions run in reverse, i.e. right-to-left, so the right-most eager quantifications have the opportunity to consume more of the input than the left-most ones. This does not affect whether an input matches, but it can affect the values of captures inside a lookbehind assertion: |
| 38 | + |
| 39 | +```swift |
| 40 | +"abcdefg".wholeMatch(of: /(.+)(.+)/) |
| 41 | +// Produces ("abcdefg", "abcdef", "g") |
| 42 | + |
| 43 | +"abcdefg".wholeMatch(of: /.*(?<=(.+)(.+))/) |
| 44 | +// Produces ("abcdefg", "a", "bcdefg") |
| 45 | +``` |
| 46 | + |
| 47 | +## Detailed design |
| 48 | + |
| 49 | + |
| 50 | +### Syntax |
| 51 | + |
| 52 | +Lookbehind assertion syntax is already supported in the existing [Regex syntax](https://github.com/swiftlang/swift-evolution/blob/main/proposals/0355-regex-syntax-run-time-construction.md#lookahead-and-lookbehind). |
| 53 | + |
| 54 | +The engine cannot currently execute them, so a compilation error is emitted: |
| 55 | + |
| 56 | +```swift |
| 57 | +let regex = /(?<=a)b/ |
| 58 | +// error: Cannot parse regular expression: lookbehind is not currently supported |
| 59 | +``` |
| 60 | + |
| 61 | +With this proposal, this restriction is lifted and the following syntactic forms will be accepted: |
| 62 | + |
| 63 | +```swift |
| 64 | +// Positive lookbehind |
| 65 | +/a(?<=b)c/ |
| 66 | +/a(*plb:b)c/ |
| 67 | +/a(*positive_lookbehind:b)c/ |
| 68 | + |
| 69 | +// Negative lookbehind |
| 70 | +/a(?<!b)c/ |
| 71 | +/a(*nlb:b)c/ |
| 72 | +/a(*negative_lookbehind:b)c/ |
| 73 | +``` |
| 74 | + |
| 75 | +### Regex builders |
| 76 | +This proposal adds support for both positive and negative lookbehind assertions when using the Regex builder, for example: |
| 77 | + |
| 78 | +```swift |
| 79 | +// Positive lookbehind |
| 80 | +Regex { |
| 81 | + "a" |
| 82 | + Lookbehind { "b" } |
| 83 | + "c" |
| 84 | +} |
| 85 | + |
| 86 | +// Negative lookbehind |
| 87 | +Regex { |
| 88 | + "a" |
| 89 | + NegativeLookbehind { "b" } |
| 90 | + "c" |
| 91 | +} |
| 92 | +``` |
| 93 | + |
| 94 | +## Source compatibility |
| 95 | + |
| 96 | +This proposal is additive and source-compatible with existing code. |
| 97 | + |
| 98 | +## ABI compatibility |
| 99 | + |
| 100 | +This proposal is additive and ABI-compatible with existing code. |
| 101 | + |
| 102 | +## Implications on adoption |
| 103 | + |
| 104 | +The additions described in this proposal require a new version of the standard library and runtime. |
| 105 | + |
| 106 | +## Future directions |
| 107 | + |
| 108 | +### Support PCRE's `\K` |
| 109 | + |
| 110 | +Future work includes supporting PCRE's `\K`, which resets the start of the currently reported match. |
| 111 | + |
| 112 | +### Reverse matching API |
| 113 | + |
| 114 | +Earlier versions of this pitch added API to run a regex in reverse, starting from the end of the string. However, we had difficulty communicating the nuances of reverse matching through API, and it is an obscure feature that isn't supported by mainstream languages. |
| 115 | + |
| 116 | +## Alternatives considered |
| 117 | + |
| 118 | +### Fixed-length lookbehind assertions only |
| 119 | + |
| 120 | +Fixed-length lookbehind assertions are easier to implement and retrofit onto existing engines. Python only supports a single fixed-width concatenation sequence, PCRE2 additionally supports alternations of fixed-width concatenations, and Java additionally supports bounded quantifications within. |
| 121 | + |
| 122 | +However, this would limit Swift's expressivity compared to JavaScript and .NET, and would be insufficient for a reverse-matching API. |
| 123 | + |
| 124 | + |
| 125 | +## Acknowledgments |
| 126 | + |
| 127 | +cherrycoke, bjhomer, Simulacroton, and rnantes provided use cases and rationale for lookbehind assertions. xwu provided feedback on the difficulties of communicating reverse matching in API. ksluder, nikolai.ruhe, and pyrtsa surfaced interesting examples and documentation needs. |
| 128 | + |
| 129 | + |
| 130 | + |
| 131 | + |
| 132 | + |
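The proposal describes lookbehind assertions as zero-width "like lookahead assertions" and contrasts right-to-left matching with ordinary left-to-right matching. As a rough illustration of the forward-direction half of both points, the sketch below uses only features that already exist without this proposal (lookahead and captures). The variable names are illustrative, and the extended `#/.../#` delimiters are used only so the snippet compiles without enabling bare-slash regex literals; it is not taken from the proposal's implementation.

```swift
// Sketch using only pre-existing Regex features (lookahead, captures).
// Variable names here are hypothetical and chosen for illustration.

// Zero-width: the lookahead checks that "b" follows, but does not consume it,
// so the reported match is just "a".
let lookahead = "abc".firstMatch(of: #/a(?=b)/#)
let lookaheadText = lookahead.map { String($0.output) }
print(lookaheadText ?? "no match")

// Left-to-right matching: the left-most eager quantifier consumes as much as
// it can, leaving a single character for the second capture.
let forward = "abcdefg".wholeMatch(of: #/(.+)(.+)/#)
let forwardCaptures = forward.map { (String($0.output.1), String($0.output.2)) }
print(forwardCaptures ?? ("", ""))
```

Under the proposal, the same pair of captures placed inside a lookbehind would be filled from the right instead, which is why the proposal's second capture example reports `"a"` and `"bcdefg"`.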