Commit d0f899f

More UTF documentation
1 parent 4d9c159 commit d0f899f

5 files changed

+76
-38
lines changed

doc/html/pcre2_jit_match.html

Lines changed: 12 additions & 2 deletions
@@ -32,15 +32,25 @@ <h1>pcre2_jit_match man page</h1>
 processed by the JIT compiler against a given subject string, using a matching
 algorithm that is similar to Perl's. It is a "fast path" interface to JIT, and
 it bypasses some of the sanity checks that <b>pcre2_match()</b> applies.
-Its arguments are exactly the same as for
+</P>
+<P>
+In UTF mode, the subject string is not checked for UTF validity. Unless
+PCRE2_MATCH_INVALID_UTF was set when the pattern was compiled, passing an
+invalid UTF string results in undefined behaviour. Your program may crash or
+loop or give wrong results. In the absence of PCRE2_MATCH_INVALID_UTF you
+should only call <b>pcre2_jit_match()</b> in UTF mode if you are sure the
+subject is valid.
+</P>
+<P>
+The arguments for <b>pcre2_jit_match()</b> are exactly the same as for
 <a href="pcre2_match.html"><b>pcre2_match()</b>,</a>
 except that the subject string must be specified with a length;
 PCRE2_ZERO_TERMINATED is not supported.
 </P>
 <P>
 The supported options are PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
 PCRE2_NOTEMPTY_ATSTART, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Unsupported
-options are ignored. The subject string is not checked for UTF validity.
+options are ignored.
 </P>
 <P>
 The return values are the same as for <b>pcre2_match()</b> plus

doc/html/pcre2jit.html

Lines changed: 6 additions & 3 deletions
@@ -445,7 +445,10 @@ <h1>pcre2jit man page</h1>
 the subject pointer is NULL but the length is non-zero, an immediate error is
 given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
 for validity. In the interests of speed, these checks do not happen on the JIT
-fast path, and if invalid data is passed, the result is undefined.
+fast path, and if invalid UTF data is passed, the result is undefined. The
+program may crash or loop or give wrong results. In the absence of
+PCRE2_MATCH_INVALID_UTF you should only call <b>pcre2_jit_match()</b> in UTF
+mode if you are sure the subject is valid.
 </P>
 <P>
 Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
@@ -466,9 +469,9 @@ <h1>pcre2jit man page</h1>
 </P>
 <br><a name="SEC14" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 30 November 2021
+Last updated: 20 January 2023
 <br>
-Copyright &copy; 1997-2021 University of Cambridge.
+Copyright &copy; 1997-2023 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.

doc/html/pcre2unicode.html

Lines changed: 11 additions & 3 deletions
@@ -432,6 +432,14 @@ <h1>pcre2unicode man page</h1>
 valid UTF string.
 </P>
 <P>
+If you do not set PCRE2_MATCH_INVALID_UTF when calling <b>pcre2_compile</b>, and
+you are not certain that your subject strings are valid UTF sequences, you
+should not make use of the JIT "fast path" function <b>pcre2_jit_match()</b>
+because it bypasses sanity checks, including the one for UTF validity. An
+invalid string may cause undefined behaviour, including looping, crashing, or
+giving the wrong answer.
+</P>
+<P>
 Setting PCRE2_MATCH_INVALID_UTF does not affect what <b>pcre2_compile()</b>
 generates, but if <b>pcre2_jit_compile()</b> is subsequently called, it does
 generate different code. If JIT is not used, the option affects the behaviour
@@ -473,9 +481,9 @@ <h1>pcre2unicode man page</h1>
 can be useful when searching for UTF text in executable or other binary files.
 </P>
 <P>
-Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as
-sequences of uint16_t or uint32_t code points. They cannot find valid UTF
-sequences within an arbitrary string of bytes unless such sequences are
+Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as
+sequences of uint16_t or uint32_t code points. They cannot find valid UTF
+sequences within an arbitrary string of bytes unless such sequences are
 suitably aligned.
 </P>
 <br><b>
<br><b>

doc/pcre2.txt

Lines changed: 37 additions & 27 deletions
@@ -5557,10 +5557,13 @@ JIT FAST PATH API
 ple, if the subject pointer is NULL but the length is non-zero, an im-
 mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF
 subject string is tested for validity. In the interests of speed, these
-checks do not happen on the JIT fast path, and if invalid data is
-passed, the result is undefined.
+checks do not happen on the JIT fast path, and if invalid UTF data is
+passed, the result is undefined. The program may crash or loop or give
+wrong results. In the absence of PCRE2_MATCH_INVALID_UTF you should
+only call pcre2_jit_match() in UTF mode if you are sure the subject is
+valid.
 
-Bypassing the sanity checks and the pcre2_match() wrapping can give
+Bypassing the sanity checks and the pcre2_match() wrapping can give
 speedups of more than 10%.
 
 
@@ -5578,8 +5581,8 @@ AUTHOR
 
 REVISION
 
-Last updated: 30 November 2021
-Copyright (c) 1997-2021 University of Cambridge.
+Last updated: 20 January 2023
+Copyright (c) 1997-2023 University of Cambridge.
 ------------------------------------------------------------------------------
 
 
@@ -11544,47 +11547,54 @@ MATCHING IN INVALID UTF STRINGS
 set, it forces PCRE2_UTF to be set as well. Note, however, that the
 pattern itself must be a valid UTF string.
 
-Setting PCRE2_MATCH_INVALID_UTF does not affect what pcre2_compile()
-generates, but if pcre2_jit_compile() is subsequently called, it does
+If you do not set PCRE2_MATCH_INVALID_UTF when calling pcre2_compile,
+and you are not certain that your subject strings are valid UTF se-
+quences, you should not make use of the JIT "fast path" function
+pcre2_jit_match() because it bypasses sanity checks, including the one
+for UTF validity. An invalid string may cause undefined behaviour, in-
+cluding looping, crashing, or giving the wrong answer.
+
+Setting PCRE2_MATCH_INVALID_UTF does not affect what pcre2_compile()
+generates, but if pcre2_jit_compile() is subsequently called, it does
 generate different code. If JIT is not used, the option affects the be-
 haviour of the interpretive code in pcre2_match(). When PCRE2_MATCH_IN-
-VALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is ignored at
+VALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is ignored at
 match time.
 
-In this mode, an invalid code unit sequence in the subject never
-matches any pattern item. It does not match dot, it does not match
-\p{Any}, it does not even match negative items such as [^X]. A lookbe-
-hind assertion fails if it encounters an invalid sequence while moving
-the current point backwards. In other words, an invalid UTF code unit
+In this mode, an invalid code unit sequence in the subject never
+matches any pattern item. It does not match dot, it does not match
+\p{Any}, it does not even match negative items such as [^X]. A lookbe-
+hind assertion fails if it encounters an invalid sequence while moving
+the current point backwards. In other words, an invalid UTF code unit
 sequence acts as a barrier which no match can cross.
 
 You can also think of this as the subject being split up into fragments
-of valid UTF, delimited internally by invalid code unit sequences. The
-pattern is matched fragment by fragment. The result of a successful
-match, however, is given as code unit offsets in the entire subject
+of valid UTF, delimited internally by invalid code unit sequences. The
+pattern is matched fragment by fragment. The result of a successful
+match, however, is given as code unit offsets in the entire subject
 string in the usual way. There are a few points to consider:
 
-The internal boundaries are not interpreted as the beginnings or ends
-of lines and so do not match circumflex or dollar characters in the
+The internal boundaries are not interpreted as the beginnings or ends
+of lines and so do not match circumflex or dollar characters in the
 pattern.
 
-If pcre2_match() is called with an offset that points to an invalid
-UTF-sequence, that sequence is skipped, and the match starts at the
+If pcre2_match() is called with an offset that points to an invalid
+UTF-sequence, that sequence is skipped, and the match starts at the
 next valid UTF character, or the end of the subject.
 
 At internal fragment boundaries, \b and \B behave in the same way as at
-the beginning and end of the subject. For example, a sequence such as
-\bWORD\b would match an instance of WORD that is surrounded by invalid
+the beginning and end of the subject. For example, a sequence such as
+\bWORD\b would match an instance of WORD that is surrounded by invalid
 UTF code units.
 
-Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
-trary data, knowing that any matched strings that are returned are
+Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
+trary data, knowing that any matched strings that are returned are
 valid UTF. This can be useful when searching for UTF text in executable
 or other binary files.
 
-Note, however, that the 16-bit and 32-bit PCRE2 libraries process
-strings as sequences of uint16_t or uint32_t code points. They cannot
-find valid UTF sequences within an arbitrary string of bytes unless
+Note, however, that the 16-bit and 32-bit PCRE2 libraries process
+strings as sequences of uint16_t or uint32_t code points. They cannot
+find valid UTF sequences within an arbitrary string of bytes unless
 such sequences are suitably aligned.
 
 

doc/pcre2unicode.3

Lines changed: 10 additions & 3 deletions
@@ -409,6 +409,13 @@ not by \fBpcre2_dfa_match()\fP. When PCRE2_MATCH_INVALID_UTF is set, it forces
 PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
 valid UTF string.
 .P
+If you do not set PCRE2_MATCH_INVALID_UTF when calling \fBpcre2_compile\fP, and
+you are not certain that your subject strings are valid UTF sequences, you
+should not make use of the JIT "fast path" function \fBpcre2_jit_match()\fP
+because it bypasses sanity checks, including the one for UTF validity. An
+invalid string may cause undefined behaviour, including looping, crashing, or
+giving the wrong answer.
+.P
 Setting PCRE2_MATCH_INVALID_UTF does not affect what \fBpcre2_compile()\fP
 generates, but if \fBpcre2_jit_compile()\fP is subsequently called, it does
 generate different code. If JIT is not used, the option affects the behaviour
@@ -443,9 +450,9 @@ Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
 data, knowing that any matched strings that are returned are valid UTF. This
 can be useful when searching for UTF text in executable or other binary files.
 .P
-Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as
-sequences of uint16_t or uint32_t code points. They cannot find valid UTF
-sequences within an arbitrary string of bytes unless such sequences are
+Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as
+sequences of uint16_t or uint32_t code points. They cannot find valid UTF
+sequences within an arbitrary string of bytes unless such sequences are
 suitably aligned.
 .
 .
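
The paragraphs added in this commit advise calling pcre2_jit_match() in UTF mode only when the subject is known to be valid, unless PCRE2_MATCH_INVALID_UTF was set at compile time. One way an application using the 8-bit library might establish that is to validate the buffer itself before taking the fast path. The following is a minimal sketch in C, assuming UTF-8 subjects. The function subject_is_valid_utf8() is a hypothetical helper written for illustration, not a PCRE2 function, and it follows the well-formedness rules of RFC 3629 rather than reproducing PCRE2's own checking code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE2): returns true if the len bytes
   at s are well-formed UTF-8 per RFC 3629, i.e. no overlong encodings,
   no surrogates (U+D800..U+DFFF), and nothing above U+10FFFF. */
bool subject_is_valid_utf8(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);   /* a NULL pointer is only allowed when empty */

    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;                        /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the first one */

        if (c <= 0x7F) { i++; continue; }    /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; }
        else if (c == 0xE0) { extra = 2; lo = 0xA0; }   /* rejects overlongs */
        else if (c >= 0xE1 && c <= 0xEC) { extra = 2; }
        else if (c == 0xED) { extra = 2; hi = 0x9F; }   /* rejects surrogates */
        else if (c >= 0xEE && c <= 0xEF) { extra = 2; }
        else if (c == 0xF0) { extra = 3; lo = 0x90; }   /* rejects overlongs */
        else if (c >= 0xF1 && c <= 0xF3) { extra = 3; }
        else if (c == 0xF4) { extra = 3; hi = 0x8F; }   /* rejects > U+10FFFF */
        else return false;   /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

        if (len - i - 1 < extra) return false;          /* truncated sequence */
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (size_t k = 2; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

If such a check fails, a caller could fall back to pcre2_match(), which tests a UTF subject for validity itself unless PCRE2_NO_UTF_CHECK is set, or compile the pattern with PCRE2_MATCH_INVALID_UTF so that invalid sequences are tolerated.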
