Commit a34ffb3

Start of Regex Syntax Pitch
1 parent b7a0196 commit a34ffb3

1 file changed

+245
-0
lines changed

@@ -0,0 +1,245 @@
1+
# Regular Expression Syntax
2+
3+
- Authors: Hamish Knight, Michael Ilseman
4+
5+
## Introduction
6+
7+
We aim to parse a superset of the syntax accepted by a variety of popular regular expression engines.
8+
9+
**TODO: Elaborate**
10+
11+
## Engines supported
12+
13+
We aim to implement a syntactic superset of:
14+
15+
- [PCRE 2][pcre2-syntax], an "industry standard" of sorts, and a rough superset of Perl, Python, etc.
16+
- [Oniguruma][oniguruma-syntax], an internationalization-oriented engine with some modern features
17+
- [ICU][icu-syntax], a Unicode-focused engine used by `NSRegularExpression`.
18+
- [.NET][.net-syntax]'s regular expressions, which support delimiter-balancing and some interesting minor details on conditional patterns.
19+
- **TODO: List Java here? It doesn't really add any more syntax than the above other than `\p{javaLowerCase}`**
20+
21+
We also intend to achieve at least Level 1 (**TODO: do we want to promise Level 2?**) [UTS#18][uts18] conformance. UTS#18 specifies regular expression matching semantics without mandating any particular syntax. However, we can infer syntactic feature sets from its guidance.
22+
23+
## Regex syntax supported
24+
25+
### General syntax
26+
27+
The following syntax is supported by all the above engines.
28+
29+
#### Alternation
30+
31+
```
32+
Regex -> '' | Alternation
33+
Alternation -> Concatenation ('|' Concatenation)*
34+
```
35+
36+
This is the operator with the lowest precedence in a regular expression, and checks if any of its branches match the input.
37+
38+
#### Concatenation
39+
40+
```
41+
Concatenation -> (!'|' !')' ConcatComponent)*
42+
ConcatComponent -> Trivia | Quote | Quantification
43+
```
44+
45+
Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression patterns. This has a higher precedence than an alternation, so e.g. `abc|def` matches against `abc` or `def`. The `ConcatComponent` token varies across engines, but at least matches some form of trivia (e.g. comments), quoted sequences (e.g. `\Q...\E`), and a quantified expression.
46+
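The precedence rule can be checked in any engine that accepts this common syntax. The sketch below uses Python's `re` module purely as a convenient stand-in (it is not one of the engines listed above), and the variable names are illustrative.

```python
import re

# Concatenation binds tighter than alternation:
# "abc|def" parses as (abc)|(def), not as ab(c|d)ef.
loose = re.compile(r"abc|def")
grouped = re.compile(r"ab(?:c|d)ef")

print(bool(loose.fullmatch("abc")))      # a whole branch of the alternation
print(bool(loose.fullmatch("abdef")))    # not a branch, so no match
print(bool(grouped.fullmatch("abdef")))  # explicit grouping changes the structure
```

An explicit group is needed to get the `ab(c|d)ef` reading.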
47+
#### Quantification
48+
49+
```
50+
Quantification -> QuantOperand Quantifier?
51+
Quantifier -> ('*' | '+' | '?' | '{' Range '}') QuantKind?
52+
QuantKind -> '?' | '+'
53+
```
54+
55+
Specifies that the operand may be matched against a certain number of times.
56+
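As a rough illustration of `Quantifier` and `QuantKind`, the following sketch again uses Python's `re` module as a stand-in engine. Only the `?` (reluctant) kind is shown; the `+` (possessive) kind is left out because older Python versions do not accept it.

```python
import re

text = "aaaa"

# No QuantKind: greedy, takes as many repetitions as the range allows.
greedy = re.match(r"a{2,3}", text).group()

# QuantKind '?': reluctant, takes as few repetitions as the range allows.
lazy = re.match(r"a{2,3}?", text).group()

# '?' used as the quantifier itself means zero or one.
optional = re.fullmatch(r"colou?r", "color") is not None

print(greedy, lazy, optional)
```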
57+
#### Groups
58+
59+
```
60+
GroupStart -> '(?' GroupKind | '('
61+
GroupKind -> ':' | '|' | '>' | '=' | '!' | '*' | '<=' | '<!' | '<*'
62+
| NamedGroup | MatchingOptionSeq (':' | ')')
63+
64+
NamedGroup -> 'P<' GroupNameBody '>'
65+
| '<' GroupNameBody '>'
66+
| "'" GroupNameBody "'"
67+
68+
GroupNameBody -> Identifier
69+
```
70+
71+
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced: some may capture the nested match, some may match against the input without advancing, some may change the matching options set in the new scope, etc.
72+
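A few of these group kinds can be sketched with Python's `re` module, used here only as a stand-in engine. Note that Python spells named groups with the `P<` form from the grammar above; the sample input and variable names are illustrative.

```python
import re

date = "2021-11-05"

# Capturing groups '(' and named groups '(?P<name>' record the nested match.
m = re.match(r"(?P<year>\d{4})-(\d{2})", date)
year, month = m.group("year"), m.group(2)

# A non-capturing group '(?:' scopes the pattern without recording it.
n = re.match(r"(?:\d{4})-(\d{2})", date)
only_group = n.groups()

# A lookahead '(?=' matches against the input without advancing.
la = re.match(r"\d{4}(?=-)", date)
lookahead_end = la.end()

# Matching options '(?i:' apply only inside the new scope.
scoped = re.fullmatch(r"(?i:abc)d", "ABCd") is not None
unscoped = re.fullmatch(r"(?i:abc)d", "ABCD") is not None
```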
73+
#### Anchors
74+
75+
```
76+
Anchor -> '^' | '$' | '\b'
77+
```
78+
79+
Anchors match against a certain position in the input rather than on a particular character of the input.
80+
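The position-based nature of anchors can be seen in a small sketch using Python's `re` module as a stand-in engine; the sample text is arbitrary.

```python
import re

text = "cat concat cat"

# '^' and '$' match positions at the start and end of the input.
starts = re.search(r"^cat", text).span()
ends = re.search(r"cat$", text).span()

# '\b' matches at a word boundary, so the "cat" inside "concat" is skipped.
words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# An anchor on its own consumes no characters.
empty = re.search(r"\b", text).group()
```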
81+
#### Unicode scalars
82+
83+
84+
85+
#### Builtin character classes
86+
87+
88+
89+
#### Custom character classes
90+
91+
```
92+
CustomCharClass -> Start Set (SetOp Set)* ']'
93+
Start -> '[' '^'?
94+
Set -> Member+
95+
Member -> CustomCharClass | !']' !SetOp (Range | Atom)
96+
Range -> Atom '-' Atom
97+
```
98+
99+
Custom character classes introduce their own language, in which most regular expression metacharacters become literal.
100+
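A minimal sketch of this, using Python's `re` module as a stand-in engine: metacharacters lose their special meaning inside the brackets, while `^` after the opening bracket and `-` between two atoms keep the roles given in the grammar above.

```python
import re

# Inside [...] the metacharacters '.', '*', '+', '(' and '|' are literals.
literal_meta = re.findall(r"[.*+(|]", "a.b*c+d(e|f")

# '^' directly after '[' inverts the class.
not_digits = re.findall(r"[^0-9]", "a1b2")

# '-' between two atoms forms a range.
hex_digits = re.findall(r"[0-9a-f]", "x3fz")
```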
101+
102+
#### Character properties
103+
104+
### PCRE-specific syntax
105+
106+
#### Callouts
107+
108+
### Oniguruma-specific syntax
109+
110+
#### Custom reference syntax
111+
112+
#### Callout syntax
113+
114+
#### Absent functions
115+
116+
### ICU-specific syntax
117+
118+
119+
120+
### .NET-specific syntax
121+
122+
#### Balancing groups
123+
124+
```
125+
GroupNameBody -> Identifier | Identifier? '-' Identifier
126+
```
127+
128+
.NET supports the ability for a group to reference a prior group, causing the prior group to be deleted, and any intermediate matched input to become the capture of the current group.
129+
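The extended `GroupNameBody` rule could be parsed along the following lines. This is only a sketch: `GroupName`, `parse_group_name_body`, and the identifier rule in `_IDENT` are hypothetical names and assumptions made for illustration, not part of any engine's API.

```python
import re
from typing import NamedTuple, Optional


class GroupName(NamedTuple):
    name: Optional[str]   # name of the new group, if any
    prior: Optional[str]  # prior group being balanced against, if any


# Assumed identifier rule; the real rule is engine-specific.
_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
_BODY = re.compile(rf"(?P<name>{_IDENT})?(?:-(?P<prior>{_IDENT}))?")


def parse_group_name_body(body: str) -> GroupName:
    """Parse `Identifier | Identifier? '-' Identifier`."""
    m = _BODY.fullmatch(body)
    if m is None or (m["name"] is None and m["prior"] is None):
        raise ValueError(f"invalid group name body: {body!r}")
    return GroupName(m["name"], m["prior"])


print(parse_group_name_body("Close-Open"))
print(parse_group_name_body("-Open"))
```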
130+
#### Character class subtraction with `-`
131+
132+
133+
134+
## Syntactic differences between engines
135+
136+
### Conflicting differences
137+
138+
#### Character class set operations
139+
140+
In a custom character class, some engines allow for binary set operations that take two character class inputs and produce a new character class output. However, the set operations supported and the spellings used differ by engine.
141+
142+
| PCRE | ICU | UTS#18 | Oniguruma | .NET | Java |
143+
|------|-----|--------|-----------|------|------|
144+
| | Intersection `&&`, Subtraction `--` | Intersection & Subtraction | Intersection `&&` | Subtraction via `-` | Intersection `&&` |
145+
146+
[UTS#18][uts18] requires intersection and subtraction, and uses the operation spellings `&&` and `--` in its examples, though it doesn't mandate a particular spelling. In particular, conforming implementations could spell the subtraction `[[x]--[y]]` as `[[x]&&[^y]]`. UTS#18 also suggests a symmetric difference operator `~~`, and uses an explicit `||` operator in examples, though it requires neither operation.
147+
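The equivalence between `[[x]--[y]]` and `[[x]&&[^y]]` is ordinary set algebra. The sketch below models character classes as Python sets over a small, assumed universe of characters; the member values are arbitrary.

```python
# Model a character class as a set of characters over a small, assumed universe.
universe = set("abcdefxyz0123456789")

x = set("abcxyz")
y = set("xyz012")

subtraction = x - y                    # [[x]--[y]]
via_intersection = x & (universe - y)  # [[x]&&[^y]]
symmetric_difference = x ^ y           # [[x]~~[y]]
union = x | y                          # [[x]||[y]]

print(sorted(subtraction), sorted(via_intersection))
```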
148+
These differences are conflicting, as engines that don't support a particular operator treat it as literal, e.g. `[x&&y]` in PCRE is the character class of `["x", "&", "y"]` rather than an intersection.
149+
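The literal reading can be observed in Python's `re` module, which is used here only as a stand-in: it is not PCRE, but it likewise has no set operations and reads `&&` as two literal characters. It emits a `FutureWarning` about a possible set operation, which the sketch suppresses.

```python
import re
import warnings

with warnings.catch_warnings():
    # Python flags '&&' inside a class as a possible future set operation.
    warnings.simplefilter("ignore", FutureWarning)
    pattern = re.compile(r"[x&&y]")

# Each of 'x', '&' and 'y' is an ordinary member of the class.
matched = pattern.findall("x&y-z")
print(matched)
```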
150+
We intend to support the operators `&&`, `--`, `-`, and `~~`. This means that a regex literal that contains one of these sequences in a custom character class, but was written for an engine that does not support that operation, will have a different meaning in our engine. This ought not to be a common occurrence, as specifying a character multiple times in a custom character class is redundant. Nevertheless, we intend to provide a strict compatibility mode that may be used to emulate the behavior of a particular engine (**TODO: all engines, or just PCRE?**).
151+
152+
#### Nested custom character classes
153+
154+
Nesting allows e.g. `[[a]b[c]]`, which is interpreted the same as `[abc]`.
155+
156+
| PCRE | ICU | UTS#18 | Oniguruma | .NET | Java |
157+
|------|-----|--------|-----------|------|------|
158+
| | | 💡 | | **TODO** | |
159+
160+
UTS#18 doesn't require this, though it does suggest it as a way to clarify precedence for chains of character class set operations, e.g. `[\w--\d&&\s]`, which the user could write as `[[\w--\d]&&\s]`.
161+
162+
PCRE does not support this feature, and as such treats `]` as the closing character of the custom character class. Therefore `[[a]b[c]]` is interpreted as the character class `["[", "a"]`, followed by literal `b`, and then the character class `["c"]`, followed by literal `]`.
163+
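Python's `re` module also lacks nested classes and parses the pattern the same flat way, so it can serve as a stand-in to illustrate the reading described above (it is not PCRE itself). Python emits a `FutureWarning` about a possible nested set, which the sketch suppresses.

```python
import re
import warnings

with warnings.catch_warnings():
    # Python warns about a possible nested set but still parses it flat.
    warnings.simplefilter("ignore", FutureWarning)
    flat = re.compile(r"[[a]b[c]]")

# Flat reading: class {'[', 'a'}, literal 'b', class {'c'}, literal ']'.
print(bool(flat.fullmatch("abc]")))
print(bool(flat.fullmatch("[bc]")))
print(bool(flat.fullmatch("a")))
```

Under the nested reading, the same pattern would instead match any single one of `a`, `b`, or `c`.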
164+
We aim to support nested custom character classes, with a strict PCRE mode for emulating the PCRE behavior if desired.
165+
166+
#### `\U`
167+
168+
In PCRE, if `PCRE2_ALT_BSUX` or `PCRE2_EXTRA_ALT_BSUX` is specified, `\U` matches literal `U`. However, in ICU, `\Uhhhhhhhh` matches the character with the hexadecimal value `hhhhhhhh`.
169+
170+
#### `{,n}`
171+
172+
This quantifier is supported by Oniguruma, but in PCRE it matches the literal characters.
173+
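Python's `re` module happens to take the quantifier reading, treating a missing lower bound as zero, so it can be used to sketch what the two interpretations accept. It stands in for neither engine exactly.

```python
import re

# Quantifier reading: '{,2}' means "zero to two repetitions".
as_quantifier = re.fullmatch(r"x{,2}", "xx") is not None

# An engine using the literal reading would accept this input instead.
as_literal = re.fullmatch(r"x{,2}", "x{,2}") is not None

print(as_quantifier, as_literal)
```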
174+
#### `\0DDD`
175+
176+
In ICU, all three digits `DDD` are interpreted as an octal code. In PCRE, only the first two digits are interpreted as octal; the third is literal.
177+
178+
#### `\x`
179+
180+
In PCRE, a bare `\x` denotes the NUL character (`U+0000`). In Oniguruma, it denotes literal `x`.
181+
182+
#### Whitespace in ranges
183+
184+
In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 to 4 times. However, if whitespace is introduced in the range, it becomes invalid and is then treated as the literal characters. We find this behavior to be unintuitive, and therefore intend to parse any intermixed whitespace in the range, but will emit a warning telling users that we're doing so (**TODO: how would they silence? move to modern syntax?**).
185+
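A lenient range parser of the kind described could look roughly like this. It is a sketch under stated assumptions: `parse_range` and `_RANGE` are hypothetical names, the accepted forms (`{n}`, `{n,}`, `{,m}`, `{n,m}`) are an assumption, and the warning text is illustrative.

```python
import re
import warnings
from typing import Optional, Tuple

# Accepts '{n}', '{n,}', '{,m}' and '{n,m}', with optional whitespace
# around each part.
_RANGE = re.compile(r"\{\s*(\d*)\s*(?:(,)\s*(\d*)\s*)?\}")


def parse_range(text: str) -> Tuple[int, Optional[int]]:
    """Return (lower, upper) bounds; upper is None when unbounded."""
    m = _RANGE.fullmatch(text)
    if m is None or (m[1] == "" and not m[3]):
        raise ValueError(f"not a range quantifier: {text!r}")
    if any(ch.isspace() for ch in text):
        # Tell the user that the whitespace was tolerated.
        warnings.warn(f"whitespace ignored in range quantifier {text!r}")
    lower = int(m[1]) if m[1] else 0
    if m[2] is None:  # '{n}': exact count
        return lower, lower
    upper = int(m[3]) if m[3] else None
    return lower, upper


print(parse_range("{2,4}"))
```

Calling `parse_range("{2, 4}")` returns the same bounds but also raises a warning, which mirrors the proposed behavior.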
186+
#### Implicitly-scoped matching option scopes
187+
188+
PCRE and Oniguruma both support changing the active matching options through the `(?i)` expression. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
189+
190+
These sound similar, but have different semantics around alternations. For example, given `a(?i)b|c|d`, in Oniguruma this becomes `a(?i:b|c|d)`, where `a` is no longer part of the alternation. However, in PCRE it becomes `a(?i:b)|(?i:c)|(?i:d)`, where `a` remains a child of the alternation.
191+
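The difference shows up once the two desugared forms are run against the same inputs. The sketch below writes both readings out explicitly and evaluates them with Python's `re` module, which is used only as a neutral evaluator of the scoped `(?i:...)` groups.

```python
import re

# The two readings of 'a(?i)b|c|d', written with explicit scoped groups.
oniguruma_reading = re.compile(r"a(?i:b|c|d)")
pcre_reading = re.compile(r"a(?i:b)|(?i:c)|(?i:d)")

for s in ["aB", "aC", "C"]:
    print(s,
          bool(oniguruma_reading.fullmatch(s)),
          bool(pcre_reading.fullmatch(s)))
```

Both readings accept `aB`, but only the first accepts `aC` and only the second accepts a bare `C`.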
192+
We aim to support the Oniguruma behavior by default, with a strict-PCRE mode that emulates the PCRE behavior. **TODO: The PCRE behavior is more complex for the parser, but seems less surprising, maybe that should become the default?**
193+
194+
#### Backreference condition kinds
195+
196+
PCRE and .NET allow for conditional patterns to reference a group by its name, e.g:
197+
198+
```
199+
(?<group1>x)?(?(group1)y)
200+
```
201+
202+
where `y` will only be matched if `(?<group1>x)` was matched. PCRE will always treat such syntax as a backreference condition. However, .NET will only treat it as such if a group with that name exists somewhere in the regex (including after the conditional). Otherwise, .NET interprets `group1` as an arbitrary regular expression condition to try to match against.
203+
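The backreference-condition reading can be tried out in Python's `re` module, used here as a stand-in engine. Python spells the named group `(?P<group1>...)`, but the conditional itself has the same shape as in the example above.

```python
import re

# 'y' is required only when the optional group 'group1' took part in the match.
cond = re.compile(r"(?P<group1>x)?(?(group1)y)")

for s in ["xy", "", "x", "y"]:
    print(repr(s), bool(cond.fullmatch(s)))
```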
204+
We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?('group1')y)` if they want a backreference condition. This more explicit syntax is supported by PCRE.
205+
206+
### Non-conflicting differences
207+
208+
#### `\N`
209+
210+
- PCRE supports `\N` meaning "not a newline"
211+
- PCRE also supports `\N{U+hhhh}`
212+
- ICU supports `\N{UNICODE CHAR NAME}` only
213+
214+
#### Extended character property syntax
215+
216+
**TODO: Can this be conflicting?**
217+
218+
ICU (**TODO: any others?**) unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties.
219+
220+
## Canonical representations
221+
222+
Many engines have different spellings for the same regex features, and as such we need to decide on a preferred canonical syntax.
223+
224+
### Backreferences
225+
226+
There are a variety of backreference spellings accepted by different engines:
227+
228+
```
229+
Backreference -> '\g{' NameOrNumberRef '}'
230+
| '\g' NumberRef
231+
| '\k<' Identifier '>'
232+
| "\k'" Identifier "'"
233+
| '\k{' Identifier '}'
234+
                 | '\' [1-9] [0-9]*
235+
| '(?P=' Identifier ')'
236+
```
237+
238+
The least intuitive spelling is the bare numeric form, a backslash followed by digits, as it can be either a backreference or an octal sequence depending on the number of prior groups. We plan on choosing the canonical spelling **TODO: decide**.
239+
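Two of these spellings, the bare numeric form and `(?P=name)`, are accepted by Python's `re` module, so they can be compared directly. This is only a sketch with a stand-in engine; the other spellings are not covered.

```python
import re

numbered = re.compile(r"(\w)\1")          # '\' followed by a group number
named = re.compile(r"(?P<ch>\w)(?P=ch)")  # '(?P=' Identifier ')'

text = "aabbcd"
doubled_numbered = [m.group() for m in numbered.finditer(text)]
doubled_named = [m.group() for m in named.finditer(text)]

print(doubled_numbered, doubled_named)
```

Both spellings refer to the same capture and find the same doubled characters.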
240+
241+
[pcre2-syntax]: https://www.pcre.org/current/doc/html/pcre2syntax.html
242+
[oniguruma-syntax]: https://github.com/kkos/oniguruma/blob/master/doc/RE
243+
[icu-syntax]: https://unicode-org.github.io/icu/userguide/strings/regexp.html
244+
[uts18]: https://www.unicode.org/reports/tr18/
245+
[.net-syntax]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expressions
