Documentation/Evolution/RegexLiterals.md (2 additions & 2 deletions)
@@ -12,7 +12,7 @@ In *[Regex Type and Overview][regex-type]* we introduced the `Regex` type, which
12
12
13
13
```swift
14
14
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
15
-
let regex = try! Regex(compiling: pattern)
15
+
let regex = try! Regex(pattern)
16
16
// regex: Regex<AnyRegexOutput>
17
17
```
18
18
@@ -366,7 +366,7 @@ However we decided against this because:
366
366
367
367
### No custom literal
368
368
369
-
Instead of adding a custom regex literal, we could require users to explicitly write `try! Regex(compiling: "[abc]+")`. This would be similar to `NSRegularExpression`, and loses all the benefits of parsing the literal at compile time. This would mean:
369
+
Instead of adding a custom regex literal, we could require users to explicitly write `try! Regex("[abc]+")`. This would be similar to `NSRegularExpression`, and loses all the benefits of parsing the literal at compile time. This would mean:
370
370
371
371
- No source tooling support (e.g. syntax highlighting, refactoring actions) would be available.
372
372
- Parse errors would be diagnosed at run time rather than at compile time.
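
To illustrate the run-time diagnosis point, here is a minimal sketch of a malformed pattern that is only rejected when the initializer actually runs. It assumes a Swift 5.7 or later toolchain in which `Regex` is available without an explicit import; the helper name `compileOrReport` is hypothetical and not part of the proposal.

```swift
// Hypothetical helper (not part of the proposal): compiles `pattern` at run
// time and reports a failure as a string instead of trapping like `try!`.
func compileOrReport(_ pattern: String) -> String {
    do {
        let _: Regex<AnyRegexOutput> = try Regex(pattern)
        return "ok"
    } catch {
        // The parse error only surfaces here, while the program is running.
        return "invalid pattern: \(error)"
    }
}

print(compileOrReport("[abc]+"))  // well-formed pattern
print(compileOrReport("[abc"))    // unterminated character class
```

With a literal, the second pattern would be expected to fail to compile instead of reaching the `catch` branch.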
We propose adding API to query and access captures by name in an existentially typed regex match:
165
+
166
+
```swift
167
+
extension Regex.Match where Output == AnyRegexOutput {
168
+
  /// If a named-capture with `name` is present, returns its value. Otherwise `nil`.
169
+
  public subscript(_ name: String) -> AnyRegexOutput.Element? { get }
170
+
}
171
+
172
+
extension AnyRegexOutput {
173
+
  /// If a named-capture with `name` is present, returns its value. Otherwise `nil`.
174
+
  public subscript(_ name: String) -> AnyRegexOutput.Element? { get }
175
+
}
176
+
```
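
As a rough usage sketch of the subscript above: the helper `namedCapture(_:in:pattern:)` is a hypothetical name introduced only for illustration. The code assumes a Swift 5.7 or later toolchain in which `Regex`, `wholeMatch(of:)`, and the string-keyed subscript are available as proposed.

```swift
// Hypothetical helper (not part of the proposal): looks up a named capture
// in an existentially typed match and returns the matched text, if any.
func namedCapture(_ name: String, in input: String, pattern: String) throws -> Substring? {
    let regex: Regex<AnyRegexOutput> = try Regex(pattern)
    guard let match = input.wholeMatch(of: regex) else { return nil }
    // The subscript returns nil when no capture is called `name`.
    return match[name]?.substring
}

let datePattern = #"(?<year>\d{4})-(?<month>\d{2})"#
let year = try? namedCapture("year", in: "2022-04", pattern: datePattern)
let day = try? namedCapture("day", in: "2022-04", pattern: datePattern)
print(year as Any, day as Any)
```

Because the capture names are only known at run time here, a misspelled name yields `nil` rather than a compile-time error.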
177
+
159
178
The rest of this proposal will be a detailed and exhaustive definition of our proposed regex syntax.
160
179
161
180
<details><summary>Grammar Notation</summary>
@@ -392,7 +411,7 @@ For non-Unicode properties, only a value is required. These include:
392
411
- The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`.
393
412
- The special Java properties `javaLowerCase`, `javaUpperCase`, `javaWhitespace`, `javaMirrored`.
394
413
395
-
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g. `[:script=Latin:]` as well as `\p{alnum}`.
414
+
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g. `[:script=Latin:]` as well as `\p{alnum}`. Both spellings may be used inside and outside of a custom character class.
396
415
397
416
#### `\K`
398
417
@@ -534,6 +553,7 @@ These operators have a lower precedence than the implicit union of members, e.g
534
553
535
554
To avoid ambiguity between .NET's subtraction syntax and range syntax, .NET specifies that a subtraction will only be parsed if the right-hand-side is a nested custom character class. We propose following this behavior.
536
555
556
+
Note that a custom character class may begin with the `:` character, and only becomes a POSIX character property if a closing `:]` is present. For example, `[:a]` is the character class of `:` and `a`.
537
557
538
558
### Matching options
539
559
@@ -863,7 +883,23 @@ PCRE supports `\N` meaning "not a newline", however there are engines that treat
863
883
864
884
### Extended character property syntax
865
885
866
-
ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties. We propose supporting this, though it is a purely additive feature, and therefore should not conflict with regex engines that implement a more limited POSIX syntax.
886
+
ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`. This has two effects:
887
+
888
+
- They share the same internal grammar, which allows the use of any Unicode character properties in addition to the POSIX properties.
889
+
- The POSIX syntax may be used outside of custom character classes, unlike in PCRE and Oniguruma.
890
+
891
+
We propose following both of these rules. The former is purely additive, and therefore should not conflict with regex engines that implement a more limited POSIX syntax. The latter does conflict with other engines, but we feel it is much more likely that a user would expect e.g. `[:space:]` to be a character property rather than the character class `[:aceps]`. We do, however, feel that a warning might be warranted in order to avoid confusion.
892
+
893
+
### POSIX character property disambiguation
894
+
895
+
PCRE, Oniguruma and ICU allow `[:` to be part of a custom character class if a closing `:]` is not present. For example, `[:a]` is the character class of `:` and `a`. However, they each have different rules for detecting the closing `:]`:
896
+
897
+
- PCRE will scan ahead until it hits either `:]`, `]`, or `[:`.
898
+
- Oniguruma will scan ahead until it hits either `:]`, `]`, or the length exceeds 20 characters.
899
+
- ICU will scan ahead until it hits a known escape sequence (e.g. `\a`, `\e`, `\Q`, ...), or `:]`. Note this excludes character class escapes, e.g. `\d`. It also excludes `]`, meaning that even `[:a][:]` is parsed as a POSIX character property.
900
+
901
+
We propose unifying these behaviors by scanning ahead until we hit either `[`, `]`, `:]`, or `\`. Additionally, we will stop on encountering `}` or a second occurrence of `=`. These fall out of the fact that they would be invalid contents of the alternative `\p{...}` syntax.
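
The proposed scan-ahead rule can be sketched as a small standalone function. This is only an illustration of the rule as described above, not the parser's actual implementation; the function name `looksLikePOSIXProperty` is hypothetical.

```swift
// Hypothetical sketch of the proposed rule: after an opening "[:", scan ahead
// and report whether a closing ":]" is reached before any stopping character.
func looksLikePOSIXProperty(_ pattern: String) -> Bool {
    guard pattern.hasPrefix("[:") else { return false }
    let body = Array(pattern.dropFirst(2))
    var equalsSeen = 0
    var i = 0
    while i < body.count {
        let c = body[i]
        // A closing ":]" confirms a POSIX character property.
        if c == ":" && i + 1 < body.count && body[i + 1] == "]" {
            return true
        }
        switch c {
        case "[", "]", "\\", "}":
            // Stopping characters: treat "[:" as an ordinary custom character class.
            return false
        case "=":
            // A second "=" would be invalid inside `\p{...}`, so stop there too.
            equalsSeen += 1
            if equalsSeen == 2 { return false }
        default:
            break
        }
        i += 1
    }
    // Ran out of input without finding ":]".
    return false
}

print(looksLikePOSIXProperty("[:space:]"))         // true
print(looksLikePOSIXProperty("[:script=Latin:]"))  // true
print(looksLikePOSIXProperty("[:a]"))              // false
print(looksLikePOSIXProperty("[:a][:]"))           // false
```

Under this sketch, `[:a][:]` is not treated as a POSIX character property, unlike the ICU behavior described above, because the scan stops at the first `]`.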
* Available in nightly toolchain snapshots with `import _StringProcessing`
4
9
5
10
## Introduction
6
11
@@ -134,11 +139,11 @@ Regexes can be created at run time from a string containing familiar regex synta
134
139
135
140
```swift
136
141
let pattern = #"(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)"#
137
-
let regex = try! Regex(compiling: pattern)
142
+
let regex = try! Regex(pattern)
138
143
// regex: Regex<AnyRegexOutput>
139
144
140
145
let regex: Regex<(Substring, Substring, Substring, Substring, Substring)> =
141
-
  try! Regex(compiling: pattern)
146
+
  try! Regex(pattern)
142
147
```
143
148
144
149
*Note*: The syntax accepted and further details on run-time compilation, including `AnyRegexOutput` and extended syntaxes, are discussed in [Run-time Regex Construction][pitches].
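
As a hedged sketch of how the two forms above might be used without `try!`: the function names `parseEntry` and `outputCount` are hypothetical, the pattern is a simplified stand-in for the one above, and the code assumes a Swift 5.7 or later toolchain in which `Regex` and `wholeMatch(of:)` are available.

```swift
// Hypothetical helper: run-time construction with a statically typed output.
// The tuple lists the whole match first, followed by one element per capture.
func parseEntry(_ input: String) throws -> (name: Substring, count: Substring)? {
    let typed: Regex<(Substring, Substring, Substring)> = try Regex(#"(\w+)\s+(\d+)"#)
    guard let match = input.wholeMatch(of: typed) else { return nil }
    return (match.output.1, match.output.2)
}

// Hypothetical helper: the same pattern with an existential output, where the
// number of captures is only known at run time.
func outputCount(_ input: String) throws -> Int {
    let regex: Regex<AnyRegexOutput> = try Regex(#"(\w+)\s+(\d+)"#)
    guard let match = input.wholeMatch(of: regex) else { return 0 }
    return match.output.count  // counts the whole match plus each capture
}

if let entry = try? parseEntry("apples 42") {
    print(entry.name, entry.count)
}
```

The typed form gives tuple access to captures, while the existential form trades that for flexibility when the pattern is not known until run time.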
let date = try? Date(String(match.date), strategy: dateParser),
213
218
let amount = try? Decimal(String(match.amount), format: decimalParser)
@@ -226,7 +231,7 @@ The result builder allows for inline failable value construction, which particip
226
231
227
232
Swift regexes describe an unambiguous algorithm, where choice is ordered and effects can be reliably observed. For example, a `print()` statement inside the `TryCapture`'s transform function will run whenever the overall algorithm naturally dictates an attempt should be made. Optimizations can only elide such calls if they can prove it is behavior-preserving (e.g. "pure").
228
233
229
-
`CustomMatchingRegexComponent`, discussed in [String Processing Algorithms][pitches], allows industrial-strength parsers to be used as regex components. This allows us to drop the overly-permissive pre-parsing step:
234
+
`CustomPrefixMatchRegexComponent`, discussed in [String Processing Algorithms][pitches], allows industrial-strength parsers to be used as regex components. This allows us to drop the overly-permissive pre-parsing step:
/// Parse and compile `pattern`, resulting in an existentially-typed capture list.
391
-
  public init(compiling pattern: String) throws
396
+
  public init(_ pattern: String) throws
392
397
}
393
398
```
394
399
400
+
### Cancellation
401
+
402
+
Regex is somewhat different from existing standard library operations in that regex processing can be a long-running task.
403
+
For this reason regex algorithms may check if the parent task has been cancelled and end execution.
404
+
395
405
### On severability and related proposals
396
406
397
407
The proposal split presented is meant to aid focused discussion, while acknowledging that each is interconnected. The boundaries between them are not completely cut-and-dry and could be refined as they enter proposal phase.
398
408
399
409
Accepting this proposal in no way implies that all related proposals must be accepted. They are severable and each should stand on their own merit.
400
410
401
-
402
411
## Source compatibility
403
412
404
413
Everything in this proposal is additive. Regex delimiters may have their own source compatibility impact, which is discussed in that proposal.
@@ -422,7 +431,7 @@ Regular expressions have a deservedly mixed reputation, owing to their historica
422
431
423
432
* "Regular expressions are bad because you should use a real parser"
424
433
- In other systems, you're either in or you're out, leading to a gravitational pull to stay in when... you should get out
425
-
- Our remedy is interoperability with real parsers via `CustomMatchingRegexComponent`
434
+
- Our remedy is interoperability with real parsers via `CustomPrefixMatchRegexComponent`
426
435
- Literals with refactoring actions provide an incremental off-ramp from regex syntax to result builders and real parsers
427
436
* "Regular expressions are bad because ugly unmaintainable syntax"
428
437
- We propose literals with source tools support, allowing for better syntax highlighting and analysis
@@ -488,6 +497,16 @@ The generic parameter `Output` is proposed to contain both the whole match (the
488
497
489
498
The biggest issue with this alternative design is that the numbering of `Captures` elements misaligns with the numbering of captures in textual regexes, where backreference `\0` refers to the entire match and captures start at `\1`. This design would sacrifice familiarity and have the pitfall of introducing off-by-one errors.
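
The numbering that the proposed design keeps (whole match at position 0, captures from position 1) can be seen in a small sketch. The helper name `outputElements` is hypothetical, and the code assumes a Swift 5.7 or later toolchain in which `Regex` and `AnyRegexOutput` are available.

```swift
// Hypothetical helper: returns the text of every output element of a whole
// match, in order, so the numbering can be inspected directly.
func outputElements(of pattern: String, in input: String) throws -> [Substring?] {
    let regex: Regex<AnyRegexOutput> = try Regex(pattern)
    guard let match = input.wholeMatch(of: regex) else { return [] }
    return match.output.map { $0.substring }
}

let parts = (try? outputElements(of: #"(\d+)-(\d+)"#, in: "12-34")) ?? []
// parts[0] is the entire match, like `\0`; parts[1] and parts[2] line up
// with `\1` and `\2` in the textual regex.
print(parts)
```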
490
499
500
+
### Encoding `Regex`es into the type system
501
+
502
+
During the initial review period the following comment was made:
503
+
504
+
> I think the goal should be that, at least for regex literals (and hopefully for the DSL to some extent), one day we might not even need a bytecode or interpreter. I think the ideal case is if each literal was its own function or type that gets generated and optimised as if you wrote it in Swift.
505
+
506
+
This is an approach that has been tried a few times in a few different languages (including by a few members of the Swift Standard Library and Core teams), and while it can produce attractive microbenchmarks, it has almost always proved to be a bad idea at the macro scale. In particular, even if we set aside witness tables and other associated Swift generics overhead, optimizing a fixed pipeline for each pattern you want to match causes significant code size expansion when there are multiple patterns in use, as compared to a more flexible bytecode interpreter. A bytecode interpreter makes better use of instruction caches and memory, and can also benefit from microarchitectural resources that are shared across different patterns. There is a tradeoff w.r.t. branch prediction resources, where separately compiled patterns may have more decisive branch history data, but a shared bytecode engine has much more data to use; this tradeoff tends to fall on the side of a bytecode engine, but it does not always do so.
507
+
508
+
It should also be noted that nothing prevents AOT or JIT compiling of the bytecode if we believe it will be advantageous, but compiling or interpreting arbitrary Swift code at runtime is rather more unattractive, since both the type system and language are undecidable. Even absent this rationale, we would probably not encode regex programs directly into the type system simply because it is unnecessarily complex.
509
+
491
510
### Future work: static optimization and compilation
492
511
493
512
Swift's support for static compilation is still developing, and future work here is leveraging that to compile regex when profitable. Many regex describe simple [DFAs](https://en.wikipedia.org/wiki/Deterministic_finite_automaton) and can be statically compiled into very efficient programs. Full static compilation needs to be balanced with code size concerns, as a matching-specific bytecode is typically far smaller than a corresponding program (especially since the bytecode interpreter is shared).
@@ -497,7 +516,7 @@ Regex are compiled into an intermediary representation and fairly simple analysi
497
516
498
517
### Future work: parser combinators
499
518
500
-
What we propose here is an incremental step towards better parsing support in Swift using parser-combinator style libraries. The underlying execution engine supports recursive function calls and mechanisms for library extensibility. `CustomMatchingRegexComponent`'s protocol requirement is effectively a [monadic parser](https://homepages.inf.ed.ac.uk/wadler/papers/marktoberdorf/baastad.pdf), meaning `Regex` provides a regex-flavored combinator-like system.
519
+
What we propose here is an incremental step towards better parsing support in Swift using parser-combinator style libraries. The underlying execution engine supports recursive function calls and mechanisms for library extensibility. `CustomPrefixMatchRegexComponent`'s protocol requirement is effectively a [monadic parser](https://homepages.inf.ed.ac.uk/wadler/papers/marktoberdorf/baastad.pdf), meaning `Regex` provides a regex-flavored combinator-like system.
501
520
502
521
An issue with traditional parser combinator libraries is the compilation barriers between call-site and definition, resulting in excessive and overly-cautious backtracking traffic. These can be eliminated through better [compilation techniques](https://core.ac.uk/download/pdf/148008325.pdf). As mentioned above, Swift's support for custom static compilation is still under development.
503
522
@@ -546,7 +565,7 @@ Regexes are often used for tokenization and tokens can be represented with Swift
546
565
547
566
### Future work: baked-in localized processing
548
567
549
-
- `CustomMatchingRegexComponent` gives an entry point for localized processors
568
+
- `CustomPrefixMatchRegexComponent` gives an entry point for localized processors
550
569
- Future work includes (sub?)protocols to communicate localization intent