[Integration] main (a0999f3) -> swift/main #244

Merged 22 commits on Apr 4, 2022

Commits
d3bd6ad
Error on unknown escape sequences
hamishknight Mar 21, 2022
5a52d53
Allow certain escape sequences in character class ranges
hamishknight Mar 21, 2022
692f0fd
Remove obsolete CharacterClass model computation
hamishknight Mar 21, 2022
cdf98c5
Forbid empty character classes
hamishknight Mar 21, 2022
c5ec8be
Remove extra const from gestScriptExtensions
etcwilde Mar 31, 2022
9889ae7
Merge pull request #226 from hamishknight/esc
hamishknight Mar 31, 2022
365f5b5
Merge pull request #236 from etcwilde/ewilde/fix-extraneous-const
etcwilde Mar 31, 2022
0108e22
DSL support for atomic groups (#238)
milseman Mar 31, 2022
692237f
Rename BacktrackingScope to Local (#239)
milseman Mar 31, 2022
096d39d
Better filter trivia in dumps
hamishknight Apr 1, 2022
c6dc547
Formalize non-semantic whitespace matching
hamishknight Apr 1, 2022
a96648b
Rename endOfString -> unterminated
hamishknight Apr 1, 2022
120ffc9
Fix end-of-line-comment lexing
hamishknight Apr 1, 2022
4944fbe
Lex extended pound delimiters
hamishknight Apr 1, 2022
9f42ea4
Introduce a multi-line literal mode
hamishknight Apr 1, 2022
556bca0
Disable unused delimiters
hamishknight Apr 1, 2022
c45450f
Merge pull request #242 from hamishknight/unlimited-delimiters
hamishknight Apr 1, 2022
820ab38
Regex Type and Overview V2 and accompanying tests/changes (#241)
milseman Apr 1, 2022
43a78e8
Update escaping rules in RegexSyntax.md
hamishknight Apr 1, 2022
2aa67f8
Update Documentation/Evolution/RegexSyntax.md
hamishknight Apr 4, 2022
a0999f3
Merge pull request #243 from hamishknight/syntax-pitch-tweak
hamishknight Apr 4, 2022
a989eae
Merge branch 'main' into main-merge
hamishknight Apr 4, 2022
35 changes: 29 additions & 6 deletions Documentation/Evolution/RegexSyntax.md
@@ -2,27 +2,38 @@
Hello, we want to issue an update to [Regular Expression Literals](https://forums.swift.org/t/pitch-regular-expression-literals/52820) and prepare for a formal proposal. The great delimiter deliberation continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax _inside_ a regex literal. Additionally, this is the syntax accepted from a string used for run-time regex construction, so we're devoting an entire pitch/proposal to the topic of _regex syntax_, distinct from the result builder DSL or the choice of delimiters for literals.
-->

# Regex Syntax
# Run-time Regex Construction

- Authors: Hamish Knight, Michael Ilseman
- Authors: [Hamish Knight](https://github.com/hamishknight), [Michael Ilseman](https://github.com/milseman)

## Introduction

A regex declares a string processing algorithm using syntax familiar across a variety of languages and tools throughout programming history. Regexes can be created from a string at run time or from a literal at compile time. The contents of that run-time string, or the contents in-between the compile-time literal's delimiters, uses regex syntax. We present a detailed and comprehensive treatment of regex syntax.

This is part of a larger effort in supporting regex literals, which in turn is part of a larger effort towards better string processing using regex. See [Pitch and Proposal Status](https://github.com/apple/swift-experimental-string-processing/issues/107), which tracks each relevant piece. This proposal regards _syntactic_ support, and does not necessarily mean that everything that can be written will be supported by Swift's runtime engine in the initial release. Support for more obscure features may appear over time, see [MatchingEngine Capabilities and Roadmap](https://github.com/apple/swift-experimental-string-processing/issues/99) for status.
A regex declares a string processing algorithm using syntax familiar across a variety of languages and tools throughout programming history. We propose the ability to create a regex at run time from a string containing regex syntax (detailed here), API for accessing the match and captures, and a means to convert between an existential capture representation and concrete types.

The overall story is laid out in [Regex Type and Overview](https://github.com/apple/swift-experimental-string-processing/blob/main/Documentation/Evolution/RegexTypeOverview.md) and each individual component is tracked in [Pitch and Proposal Status](https://github.com/apple/swift-experimental-string-processing/issues/107).

## Motivation

Swift aims to be a pragmatic programming language, striking a balance between familiarity, interoperability, and advancing the art. Swift's `String` presents a uniquely Unicode-forward model of string, but currently suffers from limited processing facilities.

<!--
... tools need run time construction
... ns regular expression operates over a fundamentally different model and has limited syntactic and semantic support
... we propose a best-in-class treatment of familiar regex syntax
-->

The full string processing effort includes a regex type with strongly typed captures, the ability to create a regex from a string at runtime, a compile-time literal, a result builder DSL, protocols for intermixing 3rd party industrial-strength parsers with regex declarations, and a slew of regex-powered algorithms over strings.

This proposal specifically homes in on the _familiarity_ aspect by providing a best-in-class treatment of familiar regex syntax.

## Proposed Solution

<!--
... regex compiling and existential match type
-->

### Syntax

We propose accepting a syntactic "superset" of the following existing regular expression engines:

- [PCRE 2][pcre2-syntax], an "industry standard" and a rough superset of Perl, Python, etc.
@@ -40,6 +51,10 @@ Regex syntax will be part of Swift's source-compatibility story as well as its b

## Detailed Design

<!--
... init, dynamic match, conversion to static
-->

We propose the following syntax for regex.

<details><summary>Grammar Notation</summary>
@@ -145,7 +160,7 @@ Atom -> Anchor
| '\'? <Character>
```

Atoms are the smallest units of regex syntax. They include escape sequences, metacharacters, backreferences, etc. The most basic form of atom is a literal character. A metacharacter may be treated as literal by preceding it with a backslash. Other literal characters may also be preceded with a backslash, but it has no effect if they are unknown escape sequences, e.g. `\I` is literal `I`.
Atoms are the smallest units of regex syntax. They include escape sequences, metacharacters, backreferences, etc. The most basic form of atom is a literal character. A metacharacter may be treated as literal by preceding it with a backslash. Other literal characters may also be preceded by a backslash, in which case the backslash has no effect, e.g. `\%` is literal `%`. However, this does not apply to non-whitespace Unicode characters or to unknown ASCII letter escapes, e.g. `\I` is invalid and produces an error.
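
One reading of the revised escaping rule can be sketched in Swift as follows. The `EscapeResult` type, the `classifyEscape` function, and the short `knownEscapes` set are hypothetical names chosen for illustration; they are not the parser's implementation, and the set of known escapes is deliberately incomplete. Treating "Unicode characters" as "non-ASCII characters" is also an assumption.

```swift
/// Outcome of a backslash followed by a single character (illustrative only).
enum EscapeResult: Equatable {
    case literal(Character)      // the backslash has no effect, e.g. `\%` or `\*`
    case knownEscape(Character)  // a recognized escape such as `\d` or `\w`
    case invalid(Character)      // rejected, e.g. `\I`
}

// Hypothetical, incomplete set of recognized letter escapes.
let knownEscapes: Set<Character> = ["d", "D", "w", "W", "s", "S", "b", "B", "n", "t", "r"]

func classifyEscape(_ c: Character) -> EscapeResult {
    if knownEscapes.contains(c) { return .knownEscape(c) }
    if c.isASCII {
        // Digits begin backreferences or octal escapes in the real grammar;
        // that is not modeled here, so they are simply reported as known.
        if c.isNumber { return .knownEscape(c) }
        // Unknown ASCII letters are errors; other ASCII characters
        // (metacharacters, punctuation, whitespace) are taken literally.
        return c.isLetter ? .invalid(c) : .literal(c)
    }
    // Assumed reading: outside ASCII, only whitespace may be escaped.
    return c.isWhitespace ? .literal(c) : .invalid(c)
}

print(classifyEscape("%"))  // literal
print(classifyEscape("I"))  // invalid
```

The point of rejecting unknown letter escapes is that a letter such as `\I` can later be given a meaning without silently changing the behavior of existing patterns.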

#### Anchors

@@ -832,6 +847,14 @@ Regex syntax will become part of Swift's source and binary-compatibility story,
Even though it is more work up-front and creates a longer proposal, it is less risky to support the full intended syntax. The proposed superset maximizes the familiarity benefit of regex syntax.


<!--

### TODO: Semantic capabilities

This proposal regards _syntactic_ support, and does not necessarily mean that everything that can be parsed will be supported by Swift's engine in the initial release. Support for more obscure features may appear over time, see [MatchingEngine Capabilities and Roadmap](https://github.com/apple/swift-experimental-string-processing/issues/99) for status.

-->

[pcre2-syntax]: https://www.pcre.org/current/doc/html/pcre2syntax.html
[oniguruma-syntax]: https://github.com/kkos/oniguruma/blob/master/doc/RE
[icu-syntax]: https://unicode-org.github.io/icu/userguide/strings/regexp.html
58 changes: 41 additions & 17 deletions Documentation/Evolution/RegexTypeOverview.md
@@ -149,11 +149,11 @@ Type mismatches and invalid regex syntax are diagnosed at construction time by `
When the pattern is known at compile time, regexes can be created from a literal containing the same regex syntax, allowing the compiler to infer the output type. Regex literals enable source tools, e.g. syntax highlighting and actions to refactor into a result builder equivalent.

```swift
let regex = re'(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)'
let regex = /(\w+)\s\s+(\S+)\s\s+((?:(?!\s\s).)*)\s\s+(.*)/
// regex: Regex<(Substring, Substring, Substring, Substring, Substring)>
```

*Note*: Regex literals, most notably the choice of delimiter, are discussed in [Regex Literals][pitches]. For this example, I used the less technically-problematic option of `re'...'`.
*Note*: Regex literals, most notably the choice of delimiter, are discussed in [Regex Literals][pitches].

This same regex can be created from a result builder, a refactoring-friendly representation:

@@ -193,13 +193,13 @@ A `Regex<Output>.Match` contains the result of a match, surfacing captures by nu

```swift
func processEntry(_ line: String) -> Transaction? {
let regex = re'''
(?x) # Ignore whitespace and comments
// Multiline literal implies `(?x)`, i.e. non-semantic whitespace with line-ending comments
let regex = #/
(?<kind> \w+) \s\s+
(?<date> \S+) \s\s+
(?<account> (?: (?!\s\s) . )+) \s\s+
(?<amount> .*)
'''
/#
// regex: Regex<(
// Substring,
// kind: Substring,
@@ -291,7 +291,7 @@ A regex describes an algorithm to be run over some model of string, and Swift's

Calling `dropFirst()` will not drop a leading byte or `Unicode.Scalar`, but rather a full `Character`. Similarly, a `.` in a regex will match any extended grapheme cluster. A regex will match canonical equivalents by default, strengthening the connection between regex and the equivalent `String` operations.

Additionally, word boundaries (`\b`) follow [UTS\#29 Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries), meaning contractions ("don't") and script changes are detected and separated, without incurring significant binary size costs associated with language dictionaries.
Additionally, word boundaries (`\b`) follow [UTS\#29 Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries). Contractions ("don't") are correctly detected and script changes are separated, without incurring significant binary size costs associated with language dictionaries.

Regex targets [UTS\#18 Level 2](https://www.unicode.org/reports/tr18/#Extended_Unicode_Support) by default, but provides options to switch to scalar-level processing as well as compatibility character classes. Detailed rules on how we infer necessary grapheme cluster breaks inside regexes, as well as options and other concerns, are discussed in [Unicode for String Processing][pitches].

@@ -300,18 +300,47 @@ Regex targets [UTS\#18 Level 2](https://www.unicode.org/reports/tr18/#Extended_U

```swift
/// A regex represents a string processing algorithm.
///
/// let regex = try Regex(compiling: "a(.*)b")
/// let match = "cbaxb".firstMatch(of: regex)
/// print(match.0) // "axb"
/// print(match.1) // "x"
///
public struct Regex<Output> {
/// Match a string in its entirety.
///
/// Returns `nil` if no match and throws on abort
public func matchWhole(_: String) throws -> Match?
public func matchWhole(_ s: String) throws -> Regex<Output>.Match?

/// Match at the front of a string
/// Match part of the string, starting at the beginning.
///
/// Returns `nil` if no match and throws on abort
public func matchFront(_: String) throws -> Match?
public func matchPrefix(_ s: String) throws -> Regex<Output>.Match?

/// Find the first match in a string
///
/// Returns `nil` if no match is found and throws on abort
public func firstMatch(in s: String) throws -> Regex<Output>.Match?

/// Match a substring in its entirety.
///
/// Returns `nil` if no match and throws on abort
public func matchWhole(_ s: Substring) throws -> Regex<Output>.Match?

/// Match part of the string, starting at the beginning.
///
/// Returns `nil` if no match and throws on abort
public func matchPrefix(_ s: Substring) throws -> Regex<Output>.Match?

/// Find the first match in a substring
///
/// Returns `nil` if no match is found and throws on abort
public func firstMatch(_ s: Substring) throws -> Regex<Output>.Match?

/// The result of matching a regex against a string.
///
/// A `Match` forwards API to the `Output` generic parameter,
/// providing direct access to captures.
@dynamicMemberLookup
public struct Match {
/// The range of the overall match
@@ -320,7 +349,7 @@ public struct Regex<Output> {
/// The produced output from the match operation
public var output: Output

/// Lookup a capture by number
/// Lookup a capture by name or number
public subscript<T>(dynamicMember keyPath: KeyPath<Output, T>) -> T

/// Lookup a capture by number
@@ -342,11 +371,6 @@ public struct Regex<Output> {
extension Regex: RegexComponent {
public var regex: Regex<Output> { self }

/// Create a regex out of a single component
public init<Content: RegexComponent>(
_ content: Content
) where Content.Output == Output

/// Result builder interface
public init<Content: RegexComponent>(
@RegexComponentBuilder _ content: () -> Content
@@ -360,11 +384,11 @@ extension Regex.Match {

// Run-time compilation interfaces
extension Regex {
/// Parse and compile `pattern`.
/// Parse and compile `pattern`, resulting in a strongly-typed capture list.
public init(compiling pattern: String, as: Output.Type = Output.self) throws
}
extension Regex where Output == AnyRegexOutput {
/// Parse and compile `pattern`.
/// Parse and compile `pattern`, resulting in an existentially-typed capture list.
public init(compiling pattern: String) throws
}
```
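
The two run-time initializers in the listing above can be tried with a short sketch. It assumes a Swift 5.7 or later toolchain and uses the spellings that shipped there (`Regex(_:)`, `Regex(_:as:)`, `firstMatch(of:)`) rather than the pitch-era names shown in this diff (`Regex(compiling:)`). The helper names `anyCaptures`, `typedCaptures`, and `isValidPattern` are hypothetical.

```swift
// Assumes Swift 5.7+, where `Regex` is part of the standard library.

/// Compile `pattern` with existentially-typed captures and return the
/// substrings of the first match (index 0 is the whole match).
func anyCaptures(_ pattern: String, in input: String) throws -> [Substring?] {
    let regex = try Regex(pattern)  // Regex<AnyRegexOutput>
    guard let match = input.firstMatch(of: regex) else { return [] }
    return match.output.map { $0.substring }
}

/// Compile the same pattern with a strongly-typed capture list.
/// Construction throws if the pattern's captures do not fit the tuple type.
func typedCaptures(in input: String) throws -> (Substring, Substring)? {
    let regex = try Regex("a(.*)b", as: (Substring, Substring).self)
    return input.firstMatch(of: regex)?.output
}

/// Invalid syntax is reported when the regex is constructed.
func isValidPattern(_ pattern: String) -> Bool {
    (try? Regex(pattern)) != nil
}

if let captures = try? anyCaptures("a(.*)b", in: "cbaxb") {
    print(captures)
}
print(isValidPattern("a("))  // false: unbalanced parenthesis
```

The existential form suits patterns that are only known at run time, such as user input in a search field; the typed form restores static capture types when the caller knows the pattern's shape in advance.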
4 changes: 2 additions & 2 deletions Sources/Exercises/Participants/RegexParticipant.swift
@@ -63,7 +63,7 @@ private func graphemeBreakPropertyData<RP: RegexComponent>(
forLine line: String,
using regex: RP
) -> GraphemeBreakEntry? where RP.Output == (Substring, Substring, Substring?, Substring) {
line.match(regex).map(\.output).flatMap(extractFromCaptures)
line.matchWhole(regex).map(\.output).flatMap(extractFromCaptures)
}

private func graphemeBreakPropertyDataLiteral(
@@ -80,7 +80,7 @@ private func graphemeBreakPropertyDataLiteral(
private func graphemeBreakPropertyData(
forLine line: String
) -> GraphemeBreakEntry? {
line.match {
line.matchWhole {
TryCapture(OneOrMore(.hexDigit)) { Unicode.Scalar(hex: $0) }
Optionally {
".."
12 changes: 12 additions & 0 deletions Sources/RegexBuilder/DSL.swift
@@ -235,6 +235,18 @@ public struct TryCapture<Output>: _BuiltinRegexComponent {
// Note: Public initializers are currently gyb'd. See Variadics.swift.
}

// MARK: - Groups

/// An atomic group, i.e. opens a local backtracking scope which, upon successful exit,
/// discards any remaining backtracking points from within the scope
public struct Local<Output>: _BuiltinRegexComponent {
public var regex: Regex<Output>

internal init(_ regex: Regex<Output>) {
self.regex = regex
}
}
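
The effect of a local backtracking scope can be seen with the textual atomic-group syntax `(?>...)`, which is what `Local` corresponds to in the DSL. The sketch below is an illustration only: it assumes a Swift 5.7 or later toolchain with the shipped `Regex(_:)` and `wholeMatch(of:)` spellings and an engine that supports atomic groups. The helper name `wholeMatches` is hypothetical.

```swift
// Assumes Swift 5.7+ with atomic-group support in the matching engine.

/// Returns true if `pattern` matches `input` in its entirety.
func wholeMatches(_ pattern: String, _ input: String) throws -> Bool {
    let regex = try Regex(pattern)
    return input.wholeMatch(of: regex) != nil
}

do {
    // Ordinary group: `a+` first takes "aaa", then gives one "a" back
    // so that the trailing "ab" can match.
    print(try wholeMatches("a+ab", "aaab"))
    // Atomic group: once `a+` has taken "aaa", its backtracking points are
    // discarded on exit, so the trailing "ab" can never match.
    print(try wholeMatches("(?>a+)ab", "aaab"))
} catch {
    print("pattern rejected: \(error)")
}
```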

// MARK: - Backreference

public struct Reference<Capture>: RegexComponent {
20 changes: 16 additions & 4 deletions Sources/RegexBuilder/Match.swift
@@ -12,17 +12,29 @@
import _StringProcessing

extension String {
public func match<R: RegexComponent>(
public func matchWhole<R: RegexComponent>(
@RegexComponentBuilder _ content: () -> R
) -> Regex<R.Output>.Match? {
match(content())
matchWhole(content())
}

public func matchPrefix<R: RegexComponent>(
@RegexComponentBuilder _ content: () -> R
) -> Regex<R.Output>.Match? {
matchPrefix(content())
}
}

extension Substring {
public func match<R: RegexComponent>(
public func matchWhole<R: RegexComponent>(
@RegexComponentBuilder _ content: () -> R
) -> Regex<R.Output>.Match? {
matchWhole(content())
}

public func matchPrefix<R: RegexComponent>(
@RegexComponentBuilder _ content: () -> R
) -> Regex<R.Output>.Match? {
match(content())
matchPrefix(content())
}
}
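
The difference between whole, prefix, and first matching can be sketched as follows. The sketch assumes a Swift 5.7 or later toolchain and therefore uses the names that shipped there (`wholeMatch(of:)`, `prefixMatch(of:)`, `firstMatch(of:)`) in place of the `matchWhole` and `matchPrefix` spellings in this diff. The helper `matchKinds` is a hypothetical name.

```swift
// Assumes Swift 5.7+; the method names differ from those in this diff.

/// Runs the three kinds of match and returns the matched text for each,
/// or nil where that kind of match fails.
func matchKinds(
    _ pattern: String, in input: String
) throws -> (whole: Substring?, prefix: Substring?, first: Substring?) {
    let regex = try Regex(pattern)
    return (
        whole: input.wholeMatch(of: regex).map { input[$0.range] },    // must cover all of `input`
        prefix: input.prefixMatch(of: regex).map { input[$0.range] },  // must start at the beginning
        first: input.firstMatch(of: regex).map { input[$0.range] }     // may start anywhere
    )
}

if let result = try? matchKinds("[0-9]+", in: "abc123") {
    // Only the unanchored search finds the digits.
    print(result.whole == nil, result.prefix == nil, result.first == "123")
}
```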