Start of Regex Syntax Pitch

hamishknight · hamishknight · commit a34ffb348265 · 2022-02-16T13:48:38.000Z
diff --git a/Documentation/Evolution/RegexSyntax.md b/Documentation/Evolution/RegexSyntax.md
@@ -0,0 +1,245 @@
+# Regular Expression Syntax
+
+- Authors: Hamish Knight, Michael Ilseman
+
+## Introduction
+
+We aim to parse a superset of the syntax accepted by a variety of popular regular expression engines.
+
+**TODO: Elaborate**
+
+## Engines supported
+
+We aim to implement a syntactic superset of:
+
+- [PCRE 2][pcre2-syntax], an "industry standard" of sorts, and a rough superset of Perl, Python, etc.
+- [Oniguruma][oniguruma-syntax], an internationalization-oriented engine with some modern features
+- [ICU][icu-syntax], a Unicode-focused engine used by NSRegularExpression.
+- [.NET][.net-syntax]'s regular expressions, which support delimiter-balancing and some interesting minor details on conditional patterns.
+- **TODO: List Java here? It doesn't really add any more syntax than the above other than `\p{javaLowerCase}`**
+
+We also intend to achieve at least Level 1 (**TODO: do we want to promise Level 2?**) [UTS#18][uts18] conformance, which specifies regular expression matching semantics without mandating any particular syntax. However, we can infer syntactic feature sets from its guidance.
+
+## Regex syntax supported
+
+### General syntax
+
+The following syntax is supported by all of the above engines.
+
+#### Alternation
+
+```
+Regex       -> '' | Alternation
+Alternation -> Concatenation ('|' Concatenation)*
+```
+
+This is the operator with the lowest precedence in a regular expression, and checks if any of its branches match the input.
+
+#### Concatenation
+
+```
+Concatenation   -> (!'|' !')' ConcatComponent)*
+ConcatComponent -> Trivia | Quote | Quantification
+```
+
+Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression patterns. This has a higher precedence than an alternation, so e.g. `abc|def` matches against `abc` or `def`. The `ConcatComponent` token varies across engines, but at least matches some form of trivia (e.g. comments), quoted sequences (e.g. `\Q...\E`), and quantified expressions.
+
+#### Quantification
+
+```
+Quantification -> QuantOperand Quantifier?
+Quantifier     -> ('*' | '+' | '?' | '{' Range '}') QuantKind?
+QuantKind      -> '?' | '+'
+```
+
+Specifies that the operand may be matched against a certain number of times.
+
+#### Groups
+
+```
+GroupStart    -> '(?' GroupKind | '('
+GroupKind     -> ':' | '|' | '>' | '=' | '!' | '*' | '<=' | '<!' | '<*'
+               | NamedGroup | MatchingOptionSeq (':' | ')')
+            
+NamedGroup    -> 'P<' GroupNameBody '>'
+               | '<' GroupNameBody '>'
+               | "'" GroupNameBody "'"
+
+GroupNameBody -> Identifier
+```
+
+Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced: some may capture the nested match, some may match against the input without advancing, some may change the matching options set in the new scope, etc.
+
+#### Anchors
+
+```
+Anchor -> '^' | '$' | '\b'
+```
+
+Anchors match against a certain position in the input rather than on a particular character of the input.
+
+#### Unicode scalars
+
+
+
+#### Builtin character classes
+
+
+
+#### Custom character classes
+
+```
+CustomCharClass -> Start Set (SetOp Set)* ']'
+Start           -> '[' '^'?
+Set             -> Member+
+Member          -> CustomCharClass | !']' !SetOp (Range | Atom)
+Range           -> Atom '-' Atom
+```
+
+Custom character classes introduce their own language, in which most regular expression metacharacters become literal.
+
+
+#### Character properties
+
+### PCRE-specific syntax
+
+#### Callouts
+
+### Oniguruma-specific syntax
+
+#### Custom reference syntax
+
+#### Callout syntax
+
+#### Absent functions
+
+### ICU-specific syntax
+
+
+
+### .NET-specific syntax
+
+#### Balancing groups
+
+```
+GroupNameBody -> Identifier | Identifier? '-' Identifier
+```
+
+.NET supports the ability for a group to reference a prior group, causing the prior group to be deleted, and any intermediate matched input to become the capture of the current group.
+
+#### Character class subtraction with `-`
+
+
+
+## Syntactic differences between engines
+
+### Conflicting differences
+
+#### Character class set operations
+
+In a custom character class, some engines allow for binary set operations that take two character class inputs and produce a new character class output. However, which set operations are supported, and the spellings used for them, differ by engine.
+
+| PCRE | ICU | UTS#18 | Oniguruma | .NET | Java |
+|------|-----|--------|-----------|------|------|
+| ❌ | Intersection `&&`, Subtraction `--` | Intersection & Subtraction | Intersection `&&` | Subtraction via `-` | Intersection `&&` |
+
+[UTS#18][uts18] requires intersection and subtraction, and uses the operation spellings `&&` and `--` in its examples, though it doesn't mandate a particular spelling. In particular, conforming implementations could spell the subtraction `[[x]--[y]]` as `[[x]&&[^y]]`. UTS#18 also suggests a symmetric difference operator `~~`, and uses an explicit `||` operator in examples, though it doesn't require either operation.
+
+These differences are conflicting, as engines that don't support a particular operator treat it as literal. For example, `[x&&y]` in PCRE is the character class of `["x", "&", "y"]` rather than an intersection.
+
+We intend to support the operators `&&`, `--`, `-`, and `~~`. This means that any regex literal that contains these sequences in a custom character class, but was written for an engine that doesn't support the operation, will have a different semantic meaning in our engine. This ought not to be a common occurrence, though, as specifying a character multiple times in a custom character class is redundant. We also intend to provide a strict compatibility mode that may be used to emulate the behavior of a particular engine (**TODO: all engines, or just PCRE?**).
+
+#### Nested custom character classes
+
+This allows e.g. `[[a]b[c]]`, which is interpreted the same as `[abc]`.
+
+| PCRE | ICU | UTS#18 | Oniguruma | .NET | Java |
+|------|-----|--------|-----------|------|------|
+| ❌ | ✅ | 💡 | ✅ | **TODO** | ✅ |
+
+UTS#18 doesn't require this, though it does suggest it as a way to clarify precedence for chains of character class set operations, e.g. `[\w--\d&&\s]`, which the user could write as `[[\w--\d]&&\s]`.
+
+PCRE does not support this feature, and as such treats the first `]` as the closing character of the custom character class. Therefore `[[a]b[c]]` is interpreted as the character class `["[", "a"]`, followed by literal `b`, then the character class `["c"]`, followed by literal `]`.
+
+We aim to support nested custom character classes, with a strict PCRE mode for emulating the PCRE behavior if desired.
+
+#### `\U`
+
+In PCRE, if `PCRE2_ALT_BSUX` or `PCRE2_EXTRA_ALT_BSUX` is specified, `\U` matches literal `U`. However, in ICU, `\Uhhhhhhhh` denotes the character with the given eight-digit hex value.
+
+#### `{,n}`
+
+This quantifier is supported by Oniguruma, but in PCRE it matches the literal characters.
+
+#### `\0DDD`
+
+In ICU, `DDD` are interpreted as an octal code. In PCRE, only the first two digits are interpreted as octal, the last is literal.
+
+#### `\x`
+
+In PCRE, a bare `\x` denotes the NUL character (`U+0000`). In Oniguruma, it denotes literal `x`.
+
+#### Whitespace in ranges
+
+In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 to 4 times. However if whitespace is introduced in the range, it becomes invalid and is then treated as the literal characters. We find this behavior to be unintuitive, and therefore intend to parse any intermixed whitespace in the range, but will emit a warning telling users that we're doing so (**TODO: how would they silence? move to modern syntax?**).
+
+#### Implicitly-scoped matching option scopes
+
+PCRE and Oniguruma both support changing the active matching options through the `(?i)` expression. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
+
+These sound similar, but have different semantics around alternations. For example, for `a(?i)b|c|d`, in Oniguruma this becomes `a(?i:b|c|d)`, where `a` is no longer part of the alternation. In PCRE, however, it becomes `a(?i:b)|(?i:c)|(?i:d)`, where `a` remains a child of the alternation.
+
+We aim to support the Oniguruma behavior by default, with a strict-PCRE mode that emulates the PCRE behavior. **TODO: The PCRE behavior is more complex for the parser, but seems less surprising, maybe that should become the default?**
+
+#### Backreference condition kinds
+
+PCRE and .NET allow for conditional patterns to reference a group by its name, e.g:
+
+```
+(?<group1>x)?(?(group1)y)
+```
+
+where `y` will only be matched if `(?<group1>x)` was matched. PCRE will always treat such syntax as a backreference condition, whereas .NET will only treat it as such if a group with that name exists somewhere in the regex (including after the conditional). Otherwise, .NET interprets `group1` as an arbitrary regular expression condition to try to match against.
+
+We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?('group1')y)` if they want a backreference condition. This more explicit syntax is supported by PCRE.
+
+### Non-conflicting differences
+
+#### `\N`
+
+- PCRE supports `\N` meaning "not a newline"
+- PCRE also supports `\N{U+hhhh}`
+- ICU supports `\N{UNICODE CHAR NAME}` only
+
+#### Extended character property syntax
+
+**TODO: Can this be conflicting?**
+
+ICU (**TODO: any others?**) unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties.
+
+## Canonical representations
+
+Many engines have different spellings for the same regex features, and as such we need to decide on a preferred canonical syntax.
+
+### Backreferences
+
+There are a variety of backreference spellings accepted by different engines:
+
+```
+Backreference -> '\g{' NameOrNumberRef '}'
+               | '\g' NumberRef
+               | '\k<' Identifier '>'
+               | "\k'" Identifier "'"
+               | '\k{' Identifier '}'
+               | '\' [1-9] [0-9]+
+               | '(?P=' Identifier ')'
+```
+
+The least intuitive spelling is `'\' [1-9] [0-9]+`, as it can be a backreference or an octal sequence depending on the number of prior groups. We plan on choosing the canonical spelling *TODO: decide*.
+
+
+[pcre2-syntax]: https://www.pcre.org/current/doc/html/pcre2syntax.html
+[oniguruma-syntax]: https://github.com/kkos/oniguruma/blob/master/doc/RE
+[icu-syntax]: https://unicode-org.github.io/icu/userguide/strings/regexp.html
+[uts18]: https://www.unicode.org/reports/tr18/
+[.net-syntax]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expressions