Custom characters classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though only a few atoms are considered valid:
399
+
Custom character classes introduce their own sublanguage, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though only some atoms are considered valid:
400
400
401
401
- Builtin character classes, except for `.`, `\R`, `\O`, `\X`, `\C`, and `\N`.
402
402
- Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary).
@@ -796,114 +796,23 @@ We intend on matching the PCRE behavior where groups are numbered purely based o
796
796
797
797
The proposed syntactic superset means there will be multiple ways to write the same thing. Below we discuss what Swift's preferred spelling could be, a "Swift canonical syntax".
798
798
799
-
We are not formally proposing this as a distinct syntax or concept, rather it is useful for considering compiler features such as fixits, pretty-printing, and refactoring actions.
799
+
We are not formally proposing this as a distinct syntax or concept; rather, it is useful for considering compiler features such as fix-its, pretty-printing, and refactoring actions. We're hoping for further discussion with the community here. Useful criteria include how well the choice fits in with the rest of Swift, whether there's an existing common practice, and whether one choice is less confusing in the context of others.
800
800
801
-
*TODO Hamish*: We're not proposing any actual action or language-level representation. So I feel like this section is more of an advisory section and good for discussion. It guides tooling decisions more than it is a formal addition to the Swift programming language. Rather than say, e.g., "we intend on canonicalizing to `\u{...}`", we could say "we consider `\u{...}` to be Swift's preferred spelling, in line with string literals". I think we can be a bit briefer too, perhaps collapsing multiple sub-sections together.
801
+
Unicode scalar literals can be spelled in many ways (*TODO*: intra-doc link). We propose treating Swift's string literal syntax of `\u{HexDigit{1...}}` as the preferred spelling.
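As a minimal sketch of the string literal spelling this preference borrows from, the snippet below builds the same one-scalar string from a `\u{...}` escape and from an explicit `Unicode.Scalar`. The constant names are illustrative only and do not come from the proposal.

```swift
// The string literal escape \u{...} takes 1 to 8 hex digits naming one Unicode scalar.
let fromLiteral = "\u{E9}"                  // U+00E9 LATIN SMALL LETTER E WITH ACUTE

// The same scalar built explicitly (illustrative names, not part of the proposal).
let scalar = Unicode.Scalar(UInt8(0xE9))
let fromScalar = String(Character(scalar))

// Leading zeros do not change which scalar is named.
let padded = "\u{00E9}"
```

All three constants hold the same single-scalar string, which is the property a canonical regex spelling would inherit if it follows string literals.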
802
802
803
-
### Unicode scalars
803
+
Character properties can be spelled `\p{...}` or `[:...:]`. We recommend preferring `\p{...}` as the bracket syntax historically meant POSIX-defined character classes, and still has that connotation in some engines. The spelling of properties themselves can be fuzzy (*TODO*: intra doc link) and we (weakly) recommend the shortest spelling (no opinion on casing yet). For script extensions, we (weakly) recommend e.g. `\p{Greek}` instead of `\p{Script_Extensions=Greek}`. We would like more discussion with the community here.
804
804
805
-
```
806
-
UnicodeScalar -> '\u{' HexDigit{1...} '}'
807
-
| '\u' HexDigit{4}
808
-
| '\x{' HexDigit{1...} '}'
809
-
| '\x' HexDigit{0...2}
810
-
| '\U' HexDigit{8}
811
-
| '\o{' OctalDigit{1...} '}'
812
-
| '\0' OctalDigit{0...3}
813
-
814
-
HexDigit -> [0-9a-zA-Z]
815
-
OctalDigit -> [0-7]
816
-
817
-
NamedScalar -> '\N{' ScalarName '}'
818
-
ScalarName -> 'U+' HexDigit{1...8} | [\s\w-]+
819
-
```
820
-
821
-
There are multiple equivalent ways of spelling the same the Unicode scalar value, in either hex, octal, or by spelling the name explicitly. String literals already provide a `\u{...}` syntax that allow a hex sequence for a Unicode scalar. As this is Swift's existing preferred spelling for such a sequence, we consider it to be the preferred spelling in this case too. There may however be value in preserving scalars that are explicitly spelled by name with `\N{...}` for clarity.
822
-
823
-
### Character properties
824
-
825
-
Character properties `\p{...}` have a variety of alternative spellings due to fuzzy matching, Unicode aliases, and shorthand syntax for common Unicode properties. They also may be written using POSIX syntax e.g `[:gc=Whitespace:]`.
826
-
827
-
**TODO: Should we suggest canonicalizing on e.g `\p{Script_Extensions=Greek}`? Or prefer the shorthand where we can? Or just avoid canonicalizing?**
828
-
829
-
### Groups
830
-
831
-
Named groups may be specified with a few different delimiters:
832
-
833
-
```
834
-
NamedGroup -> 'P<' GroupNameBody '>'
835
-
| '<' GroupNameBody '>'
836
-
| "'" GroupNameBody "'"
837
-
```
838
-
839
-
The preferable spelling here will likely be influenced by the regex literal delimiter choice. `(?'...')` seems a reasonable preferred spelling in isolation, however not so much if `re'...'` is chosen as the delimiter. To reduce possible confusion for the parser as well as the user, `(?<...>)` would seem the more preferable syntax in that case. This would also likely affect the preferred syntax for references.
840
-
841
-
#### Lookaheads and lookbehinds
842
-
843
-
These have both shorthand spellings as well as more explicit PCRE2 spellings. While the more explicit spellings are definitely clearer, they can feel quite verbose. The short-form spellings e.g `(?=` seem more preferable due to their familiarity.
844
-
845
-
### Backreferences
846
-
847
-
```
848
-
Backreference -> '\g{' NamedOrNumberRef '}'
849
-
| '\g' NumberRef
850
-
| '\k<' NamedOrNumberRef '>'
851
-
| "\k'" NamedOrNumberRef "'"
852
-
| '\k{' NamedRef '}'
853
-
| '\' [1-9] [0-9]+
854
-
| '(?P=' NamedRef ')'
855
-
```
856
-
857
-
For absolute numeric references, `\DDD` seems to be a strong candidate for the preferred syntax due to its familiarity. For relative numbered references, as well as named references, `\k<...>` or `\k'...'` seem like the ideal choice (depending on the syntax chosen for named groups). This avoids the confusion between `\g{...}` and `\g<...>` referring to a backreference and subpattern respectively. It additionally avoids confusion with group syntax.
805
+
Lookaround assertions have common shorthand spellings, while PCRE2 introduced longer, more explicit spellings (*TODO*: doc link). We are (very weakly) recommending the common shorthand syntax of e.g. `(?=...)` as that's more widespread. We are interested in more discussion with the community here.
858
806
859
-
There may be value in choosing `\k` as the single unified syntax for backreferences (instead of `\DDD` for absolute numeric references), though there may be value in preserving the familiarity of `\DDD`.
807
+
Named groups may be specified with a few different delimiters: `(?<name>...)`, `(?P<name>...)`, `(?'name'...)`. We (weakly) recommend `(?<name>...)`, but the final preference may be influenced by choice of delimiter for the regex itself. We'd appreciate any insight from the community.
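To make the weakly recommended `(?<name>...)` spelling concrete, here is a small sketch. It assumes a Swift 5.7 or later toolchain whose run-time `Regex(_:)` initializer accepts this syntax; the helper `namedCapture` is hypothetical and exists only for illustration.

```swift
// Hypothetical helper: returns the text captured by the group called `name`, if any.
// Assumes Swift 5.7+, where Regex(_:) parses the pattern string at run time.
func namedCapture(_ name: String, in input: String, pattern: String) throws -> String? {
    let regex = try Regex(pattern)
    guard let match = input.firstMatch(of: regex),
          let capture = match.output[name],      // look the capture up by group name
          let text = capture.substring else {    // nil if the group did not participate
        return nil
    }
    return String(text)
}

// The recommended delimiter style: (?<name>...)
let datePattern = #"(?<year>\d{4})-(?<month>\d{2})"#
let year = try namedCapture("year", in: "2022-03", pattern: datePattern)
```

The `(?P<name>...)` and `(?'name'...)` spellings are described above as equivalent; only the angle-bracket form is exercised here.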
860
808
861
-
### Subpatterns
809
+
References and backreferences (*TODO*: intra-doc link) have multiple spellings. For absolute numeric references, `\DDD` seems to be a strong candidate for the preferred syntax due to its familiarity. For relative numbered references, as well as named references, either `\k<...>` or `\k'...'` seems like the better choice, depending on the syntax chosen for named groups. This avoids the confusion between `\g{...}` and `\g<...>` referring to backreferences and subpatterns respectively, as well as any confusion with group syntax.
862
810
863
-
```
864
-
Subpattern -> '\g<' NamedOrNumberRef '>'
865
-
| "\g'" NamedOrNumberRef "'"
866
-
| '(?' GroupLikeSubpatternBody ')'
867
-
868
-
GroupLikeSubpatternBody -> 'P>' NamedRef
869
-
| '&' NamedRef
870
-
| 'R'
871
-
| NumberRef
872
-
```
811
+
For subpatterns, we recommend either `\g<...>` or `\g'...'`, depending on the choice of named group syntax. We're unsure whether to prefer `(?R)` over e.g. `\g<0>`: `(?R)` is more widely used and understood, but less consistent with other subpatterns.
873
812
874
-
To avoid confusion with groups, `\g<...>` or `\g'...'` seem like the ideal preferred spellings (depending on the syntax chosen for named groups). There may however be value in preserving the `(?R)` spelling where it is used, instead of preferring e.g `\g<0>`.
875
-
876
-
### Conditional references
877
-
878
-
```
879
-
KnownCondition -> 'R'
880
-
| 'R' NumberRef
881
-
| 'R&' NamedRef
882
-
| '<' NamedOrNumberRef '>'
883
-
| "'" NamedOrNumberRef "'"
884
-
| 'DEFINE'
885
-
| 'VERSION' VersionCheck
886
-
| NumberRef
887
-
```
888
-
889
-
For named references in a group condition, there is a choice between `(?('name'))` and `(?(<name>))`. The preferred syntax in this case would likely reflect the syntax chosen for named groups.
890
-
891
-
### PCRE Callouts
892
-
893
-
```
894
-
PCRECallout -> '(?C' CalloutBody ')'
895
-
PCRECalloutBody -> '' | <Number>
896
-
| '`' <String> '`'
897
-
| "'" <String> "'"
898
-
| '"' <String> '"'
899
-
| '^' <String> '^'
900
-
| '%' <String> '%'
901
-
| '#' <String> '#'
902
-
| '$' <String> '$'
903
-
| '{' <String> '}'
904
-
```
813
+
Conditional references (*TODO*: intra-doc link) have a choice between `(?('name'))` and `(?(<name>))`. The preferred syntax in this case would likely reflect the syntax chosen for named groups.
905
814
906
-
PCRE accepts a number of alternative delimiters for callout string arguments. The `(?C"...")` syntax seems preferable due to its consistency with string literal syntax. However it may be necessary to prefer `(?C'...')` depending on whether the regex literal delimiter ends up involving double quotes e.g `re"..."`.
815
+
We are deferring runtime support for callouts from regex literals as future work, though we will correctly parse their contents. We have no current recommendation for a preference of PCRE-style callout syntax (*TODO*: intra-doc link), and would like to discuss with the community whether we should have one.