Commit d933548

jensmaurer authored and tkoeppe committed
P2736R2 Referencing The Unicode Standard
In [lex.name], the paper missed a change of the original term "character classes" to "character properties"; that change is included. Fixes NB FR 133, FR 013 (C++23 CD).
1 parent 4991a88 commit d933548

7 files changed: 56 additions and 199 deletions

source/back.tex

Lines changed: 0 additions & 18 deletions
@@ -18,24 +18,6 @@ \chapter{Bibliography}
 Programming languages, their environments, and system software interfaces ---
 Floating-point extensions for C --- Part 3: Interchange and extended types}
 % Other international standards.
-\item
-%%% Format for the following entry is based on that specified at
-%%% http://www.iec.ch/standardsdev/resources/draftingpublications/directives/principles/referencing.htm
-The Unicode Consortium. Unicode Standard Annex, \UAX{29},
-\doccite{Unicode Text Segmentation} [online].
-Edited by Mark Davis. Revision 35; issued for Unicode 12.0.0. 2019-02-15 [viewed 2020-02-23].
-Available from: \url{http://www.unicode.org/reports/tr29/tr29-35.html}
-\item
-The Unicode Consortium. Unicode Standard Annex, \UAX{31},
-\doccite{Unicode Identifier and Pattern Syntax} [online].
-Edited by Mark Davis. Revision 33; issued for Unicode 13.0.0.
-2020-02-13 [viewed 2021-06-08].
-Available from: \url{https://www.unicode.org/reports/tr31/tr31-33.html}
-\item
-The Unicode Standard Version 14.0,
-\doccite{Core Specification}.
-Unicode Consortium, ISBN 978-1-936213-29-0, copyright \copyright 2021 Unicode, Inc.
-Available from: \url{https://www.unicode.org/versions/Unicode14.0.0/UnicodeStandard-14.0.pdf}
 \item
 IANA Time Zone Database.
 Available from: \url{https://www.iana.org/time-zones}

source/future.tex

Lines changed: 6 additions & 11 deletions
@@ -2045,6 +2045,10 @@
 If \tcode{(Mode \& little_endian)}, the facet shall generate a
 multibyte sequence in little-endian order,
 as opposed to the default big-endian order.
+\item
+UCS-2 is the same encoding as UTF-16,
+except that it encodes scalar values in the range
+\ucode{0000}--\ucode{ffff} (Basic Multilingual Plane) only.
 \end{itemize}

 \pnum
@@ -2055,8 +2059,7 @@
 \begin{itemize}
 \item
 The facet shall convert between UTF-8 multibyte sequences
-and UCS-2 or UTF-32 (depending on the size of \tcode{Elem})
-within the program.
+and UCS-2 or UTF-32 (depending on the size of \tcode{Elem}).
 \item
 Endianness shall not affect how multibyte sequences are read or written.
 \item
@@ -2071,8 +2074,7 @@
 \begin{itemize}
 \item
 The facet shall convert between UTF-16 multibyte sequences
-and UCS-2 or UTF-32 (depending on the size of \tcode{Elem})
-within the program.
+and UCS-2 or UTF-32 (depending on the size of \tcode{Elem}).
 \item
 Multibyte sequences shall be read or written
 according to the \tcode{Mode} flag, as set out above.
@@ -2095,13 +2097,6 @@
 The multibyte sequences may be written as either a text or a binary file.
 \end{itemize}

-\pnum
-The encoding forms UTF-8, UTF-16, and UTF-32 are specified in ISO/IEC 10646.
-The encoding form UCS-2 is specified in ISO/IEC 10646:2003.
-\begin{footnote}
-Cancelled and replaced by ISO/IEC 10646:2017.
-\end{footnote}
-
 \rSec1[depr.conversions]{Deprecated convenience conversion interfaces}

 \rSec2[depr.conversions.general]{General}
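
Illustrative note, not part of the commit: the new UCS-2 bullet in source/future.tex and the scalar-value note in source/lex.tex can be read as simple range checks. The following is a minimal C++ sketch under that reading. The function names `is_scalar_value` and `fits_ucs2` are hypothetical and do not appear in the standard or in this change.

```cpp
#include <cstdint>

// Hypothetical helper: a Unicode scalar value is any code point in
// [0, 10FFFF] that is not a surrogate code point in [D800, DFFF].
constexpr bool is_scalar_value(std::uint32_t v) {
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}

// Hypothetical helper: per the added wording, UCS-2 encodes only
// scalar values in the range 0000-FFFF (Basic Multilingual Plane).
constexpr bool fits_ucs2(std::uint32_t v) {
    return is_scalar_value(v) && v <= 0xFFFF;
}

static_assert(fits_ucs2(0x20AC), "U+20AC is in the BMP");
static_assert(!fits_ucs2(0x1F514), "U+1F514 is outside the BMP");
```

A value such as U+1F514 is a scalar value but is not representable in UCS-2, which is the only difference from UTF-16 that the added bullet describes.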

source/intro.tex

Lines changed: 2 additions & 22 deletions
@@ -51,14 +51,6 @@
 Operating System Interface (POSIX), Technical Corrigendum 1}
 \item ISO/IEC/IEEE 9945:2009/Cor 2:2017, \doccite{Information Technology --- Portable
 Operating System Interface (POSIX), Technical Corrigendum 2}
-\item ISO/IEC 10646, \doccite{Information technology ---
-Universal Coded Character Set (UCS)}
-\item ISO/IEC 10646:2003,
-\begin{footnote}
-Cancelled and replaced by ISO/IEC 10646:2017.
-\end{footnote}
-\doccite{Information technology ---
-Universal Multiple-Octet Coded Character Set (UCS)}
 \item ISO/IEC/IEEE 60559:2020, \doccite{Information technology ---
 Microprocessor Systems --- Floating-Point arithmetic}
 \item ISO 80000-2:2009, \doccite{Quantities and units ---
@@ -75,14 +67,8 @@
 Language Specification},
 Standard Ecma-262, third edition, 1999.
 \item
-The Unicode Consortium.
-Unicode Standard Annex, \UAX{44}, \doccite{Unicode Character Database}.
-Edited by Ken Whistler and Lauren\c{t}iu Iancu.
-Available from: \url{http://www.unicode.org/reports/tr44/}
-\item
-The Unicode Consortium.
-The Unicode Standard, \doccite{Derived Core Properties}.
-Available from: \url{https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt}
+The Unicode Consortium. \doccite{The Unicode Standard}.
+Available from: \url{https://www.unicode.org/versions/latest/}
 \end{itemize}

 \pnum
@@ -104,12 +90,6 @@
 hereinafter called \defn{ECMA-262}.
 \indextext{references!normative|)}

-\pnum
-\begin{note}
-References to ISO/IEC 10646:2003 are used only
-to support deprecated features\iref{depr.locale.stdcvt}.
-\end{note}
-
 \rSec0[intro.defs]{Terms and definitions}

 \pnum

source/iostreams.tex

Lines changed: 2 additions & 2 deletions
@@ -6850,7 +6850,7 @@
 if invoking the native Unicode API requires transcoding,
 implementations should substitute invalid code units
 with \unicode{fffd}{replacement character} per
-The Unicode Standard Version 14.0 - Core Specification, Chapter 3.9.
+the Unicode Standard, Chapter 3.9 \ucode{fffd} Substitution in Conversion.
 \end{itemdescr}

 \rSec3[ostream.unformatted]{Unformatted output functions}
@@ -7786,7 +7786,7 @@
 If invoking the native Unicode API requires transcoding,
 implementations should substitute invalid code units
 with \unicode{fffd}{replacement character} per
-The Unicode Standard Version 14.0 - Core Specification, Chapter 3.9.
+the Unicode Standard, Chapter 3.9 \ucode{fffd} Substitution in Conversion.
 \end{itemdescr}

 \indexlibraryglobal{vprint_nonunicode}%

source/lex.tex

Lines changed: 31 additions & 130 deletions
@@ -80,8 +80,10 @@
 \end{note}
 If an input file is determined to be a UTF-8 file,
 then it shall be a well-formed UTF-8 code unit sequence and
-it is decoded to produce a sequence of UCS scalar values
-that constitutes the sequence of elements of the translation character set.
+it is decoded to produce a sequence of Unicode scalar values.
+A sequence of translation character set elements is then formed
+by mapping each Unicode scalar value
+to the corresponding translation character set element.
 In the resulting sequence,
 each pair of characters in the input sequence consisting of
 \unicode{000d}{carriage return} followed by \unicode{000a}{line feed},
@@ -244,18 +246,17 @@
 The \defnadj{translation}{character set} consists of the following elements:
 \begin{itemize}
 \item
-each character named by ISO/IEC 10646,
-as identified by its unique UCS scalar value, and
+each abstract character assigned a code point in the Unicode codespace, and
 \item
-a distinct character for each UCS scalar value
-where no named character is assigned.
+a distinct character for each Unicode scalar value
+not assigned to an abstract character.
 \end{itemize}
 \begin{note}
-ISO/IEC 10646 code points are integers
+Unicode code points are integers
 in the range $[0, \mathrm{10FFFF}]$ (hexadecimal).
 A surrogate code point is a value
 in the range $[\mathrm{D800}, \mathrm{DFFF}]$ (hexadecimal).
-A UCS scalar value is any code point that is not a surrogate code point.
+A Unicode scalar value is any code point that is not a surrogate code point.
 \end{note}

 \pnum
@@ -355,126 +356,27 @@
 \tcode{\textbackslash U} \grammarterm{hex-quad} \grammarterm{hex-quad}, or
 \tcode{\textbackslash u\{\grammarterm{simple-hexadecimal-digit-sequence}\}}
 designates the character in the translation character set
-whose UCS scalar value is the hexadecimal number represented by
+whose Unicode scalar value is the hexadecimal number represented by
 the sequence of \grammarterm{hexadecimal-digit}s
 in the \grammarterm{universal-character-name}.
-The program is ill-formed if that number is not a UCS scalar value.
+The program is ill-formed if that number is not a Unicode scalar value.

 \pnum
 A \grammarterm{universal-character-name}
 that is a \grammarterm{named-universal-character}
-designates the character named by its \grammarterm{n-char-sequence}.
-A character is so named if the \grammarterm{n-char-sequence} is equal to
-\begin{itemize}
-\item
-the associated character name or associated character name alias
-specified in ISO/IEC 10646 subclause ``Code charts and lists of character names''
-or
-\item
-the control code alias given in \tref{lex.charset.ucn}.
+designates the corresponding character
+in the Unicode Standard (chapter 4.8 Name)
+if the \grammarterm{n-char-sequence} is equal
+to its character name or
+to one of its character name aliases of
+type ``control'', ``correction'', or ``alternate'';
+otherwise, the program is ill-formed.
 \begin{note}
-The aliases in \tref{lex.charset.ucn} are provided for control characters
-which otherwise have no associated character name or character name alias.
-These names are derived from
+These aliases are listed in
 the Unicode Character Database's \tcode{NameAliases.txt}.
-For historical reasons, control characters are formally unnamed.
-\end{note}
-\end{itemize}
-\begin{note}
-None of the associated character names,
-associated character name aliases, or
-control code aliases
-have leading or trailing spaces.
+None of these names or aliases have leading or trailing spaces.
 \end{note}

-\begin{multicolfloattable}{Control code aliases}{lex.charset.ucn}{ll}
-\unicode{0000}{null} \\
-\unicode{0001}{start of heading} \\
-\unicode{0002}{start of text} \\
-\unicode{0003}{end of text} \\
-\unicode{0004}{end of transmission} \\
-\unicode{0005}{enquiry} \\
-\unicode{0006}{acknowledge} \\
-\unicode{0007}{alert} \\
-\unicode{0008}{backspace} \\
-\unicode{0009}{character tabulation} \\
-\unicode{0009}{horizontal tabulation} \\
-\unicode{000a}{line feed} \\
-\unicode{000a}{new line} \\
-\unicode{000a}{end of line} \\
-\unicode{000b}{line tabulation} \\
-\unicode{000b}{vertical tabulation} \\
-\unicode{000c}{form feed} \\
-\unicode{000d}{carriage return} \\
-\unicode{000e}{shift out} \\
-\unicode{000e}{locking-shift one} \\
-\unicode{000f}{shift in} \\
-\unicode{000f}{locking-shift zero} \\
-\unicode{0010}{data link escape} \\
-\unicode{0011}{device control one} \\
-\unicode{0012}{device control two} \\
-\unicode{0013}{device control three} \\
-\unicode{0014}{device control four} \\
-\unicode{0015}{negative acknowledge} \\
-\unicode{0016}{synchronous idle} \\
-\unicode{0017}{end of transmission block} \\
-\unicode{0018}{cancel} \\
-\unicode{0019}{end of medium} \\
-\unicode{001a}{substitute} \\
-\unicode{001b}{escape} \\
-\unicode{001c}{information separator four} \\
-\unicode{001c}{file separator} \\
-\unicode{001d}{information separator three} \\
-\unicode{001d}{group separator} \\
-\unicode{001e}{information separator two} \\
-\unicode{001e}{record separator} \\
-\unicode{001f}{information separator one} \\
-\unicode{001f}{unit separator} \\
-\columnbreak
-\unicode{007f}{delete} \\
-\unicode{0082}{break permitted here} \\
-\unicode{0083}{no break here} \\
-\unicode{0084}{index} \\
-\unicode{0085}{next line} \\
-\unicode{0086}{start of selected area} \\
-\unicode{0087}{end of selected area} \\
-\unicode{0088}{character tabulation set} \\
-\unicode{0088}{horizontal tabulation set} \\
-\unicode{0089}{character tabulation with justification} \\
-\unicode{0089}{horizontal tabulation with justification} \\
-\unicode{008a}{line tabulation set} \\
-\unicode{008a}{vertical tabulation set} \\
-\unicode{008b}{partial line forward} \\
-\unicode{008b}{partial line down} \\
-\unicode{008c}{partial line backward} \\
-\unicode{008c}{partial line up} \\
-\unicode{008d}{reverse line feed} \\
-\unicode{008d}{reverse index} \\
-\unicode{008e}{single shift two} \\
-\unicode{008e}{single-shift-2} \\
-\unicode{008f}{single shift three} \\
-\unicode{008f}{single-shift-3} \\
-\unicode{0090}{device control string} \\
-\unicode{0091}{private use one} \\
-\unicode{0091}{private use-1} \\
-\unicode{0092}{private use two} \\
-\unicode{0092}{private use-2} \\
-\unicode{0093}{set transmit state} \\
-\unicode{0094}{cancel character} \\
-\unicode{0095}{message waiting} \\
-\unicode{0096}{start of guarded area} \\
-\unicode{0096}{start of protected area} \\
-\unicode{0097}{end of guarded area} \\
-\unicode{0097}{end of protected area} \\
-\unicode{0098}{start of string} \\
-\unicode{009a}{single character introducer} \\
-\unicode{009b}{control sequence introducer} \\
-\unicode{009c}{string terminator} \\
-\unicode{009d}{operating system command} \\
-\unicode{009e}{privacy message} \\
-\unicode{009f}{application program command} \\
-\end{multicolfloattable}
-
 \pnum
 If a \grammarterm{universal-character-name} outside
 the \grammarterm{c-char-sequence}, \grammarterm{s-char-sequence}, or
@@ -493,10 +395,6 @@
 The \defnadj{basic literal}{character set} consists of
 all characters of the basic character set,
 plus the control characters specified in \tref{lex.charset.literal}.
-\begin{note}
-The alias \uname{bell} for \ucode{0007} shown in ISO 10646
-is ambiguous with \unicode{1f514}{bell}.
-\end{note}

 \begin{floattable}{Additional control characters in the basic literal character set}{lex.charset.literal}{ll}
 \topline
@@ -546,9 +444,10 @@
 \indextext{UTF-16}%
 \indextext{UTF-32}%
 For a UTF-8, UTF-16, or UTF-32 literal,
-the UCS scalar value
+the Unicode scalar value
 corresponding to each character of the translation character set
-is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
+is encoded as specified in the Unicode Standard
+for the respective Unicode encoding form.
 \indextext{character set|)}

 \rSec1[lex.pptoken]{Preprocessing tokens}
@@ -889,14 +788,14 @@
 \begin{bnf}
 \nontermdef{identifier-start}\br
 nondigit\br
-\textnormal{an element of the translation character set of class XID_Start}
+\textnormal{an element of the translation character set with the Unicode property XID_Start}
 \end{bnf}

 \begin{bnf}
 \nontermdef{identifier-continue}\br
 digit\br
 nondigit\br
-\textnormal{an element of the translation character set of class XID_Continue}
+\textnormal{an element of the translation character set with the Unicode property XID_Continue}
 \end{bnf}

 \begin{bnf}
@@ -915,8 +814,9 @@
 \pnum
 \indextext{name!length of}%
 \indextext{name}%
-The character classes XID_Start and XID_Continue
-are Derived Core Properties as described by \UAX{44}.
+\begin{note}
+The character properties XID_Start and XID_Continue are Derived Core Properties
+as described by \UAX{44} of the Unicode Standard.
 \begin{footnote}
 On systems in which linkers cannot accept extended
 characters, an encoding of the \grammarterm{universal-character-name} can be used in
@@ -927,9 +827,10 @@
 place a translation limit on significant characters for external
 identifiers.
 \end{footnote}
+\end{note}
 The program is ill-formed
 if an \grammarterm{identifier} does not conform to
-Normalization Form C as specified in ISO/IEC 10646.
+Normalization Form C as specified in the Unicode Standard.
 \begin{note}
 Identifiers are case-sensitive.
 \end{note}
@@ -2102,7 +2003,7 @@
 \impldef{code unit sequence for non-representable \grammarterm{string-literal}}
 code unit sequence is encoded.
 \begin{note}
-No character lacks representation in any of the UCS encoding forms.
+No character lacks representation in any Unicode encoding form.
 \end{note}
 When encoding a stateful character encoding,
 implementations should encode the first such sequence
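
Illustrative note, not part of the commit: the revised [lex.charset] wording says each Unicode scalar value is encoded "as specified in the Unicode Standard for the respective Unicode encoding form". As a rough sketch of what that means for UTF-16, the code below encodes one scalar value into one or two code units. The type `Utf16Units` and the function `encode_utf16` are hypothetical names chosen for this example. It also shows why the note "No character lacks representation in any Unicode encoding form" holds for values above FFFF.

```cpp
#include <cstdint>

// Hypothetical result type: up to two UTF-16 code units.
struct Utf16Units {
    char16_t units[2];
    int count;  // 0 means the input was not a Unicode scalar value
};

// Hypothetical helper: encode one Unicode scalar value as UTF-16.
constexpr Utf16Units encode_utf16(std::uint32_t v) {
    // Reject values outside [0, 10FFFF] and surrogate code points.
    if (v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) {
        return {{0, 0}, 0};
    }
    // Basic Multilingual Plane: a single code unit.
    if (v <= 0xFFFF) {
        return {{static_cast<char16_t>(v), 0}, 1};
    }
    // Supplementary planes: a high and a low surrogate code unit.
    const std::uint32_t offset = v - 0x10000;
    return {{static_cast<char16_t>(0xD800 + (offset >> 10)),
             static_cast<char16_t>(0xDC00 + (offset & 0x3FF))},
            2};
}

// U+1F514 (the character cited in the removed note) needs a surrogate pair.
static_assert(encode_utf16(0x1F514).count == 2, "two code units expected");
```

The surrogate range is rejected as input because surrogate code points are not scalar values, which matches the ill-formedness rule for a \grammarterm{universal-character-name} in the same section.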
