More UTF documentation
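
The documentation added here says pcre2_jit_match() should only be called in
UTF mode when the subject is known to be valid, unless the pattern was compiled
with PCRE2_MATCH_INVALID_UTF. One way a caller could establish that for the
8-bit library is to validate the subject before taking the fast path. The
following is a minimal sketch under stated assumptions, not PCRE2 code:
subject_is_valid_utf8() is a hypothetical helper name, and it applies the
RFC 3629 well-formedness rules, which are assumed (not confirmed here) to be at
least as strict as the check that pcre2_match() performs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: returns true if the len bytes at s
   form well-formed UTF-8 as defined by RFC 3629. */
bool subject_is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;       /* continuation bytes expected after the lead byte */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) {     /* ASCII: a single byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;   /* stray continuation byte, or 0xF8..0xFF */
        }

        if (len - i - 1 < extra)
            return false;   /* sequence is truncated by the end of the subject */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return false;               /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return false;                   /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;                   /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return false;                   /* beyond the Unicode range */

        i += extra + 1;
    }
    return true;
}
```

A caller might use pcre2_jit_match() only when such a check succeeds and fall
back to pcre2_match() otherwise, since pcre2_match() tests the subject itself
unless PCRE2_NO_UTF_CHECK is set. A separate validation pass costs time, so it
may cancel part of the fast-path gain; compiling the pattern with
PCRE2_MATCH_INVALID_UTF is the alternative described in the text below.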

PhilipHazel · PhilipHazel · commit d0f899f4d3ca · 2023-01-20T17:08:34.000Z
diff --git a/doc/html/pcre2_jit_match.html b/doc/html/pcre2_jit_match.html
@@ -32,15 +32,25 @@ <h1>pcre2_jit_match man page</h1>
 processed by the JIT compiler against a given subject string, using a matching
 algorithm that is similar to Perl's. It is a "fast path" interface to JIT, and
 it bypasses some of the sanity checks that <b>pcre2_match()</b> applies.
-Its arguments are exactly the same as for
+</P>
+<P>
+In UTF mode, the subject string is not checked for UTF validity. Unless
+PCRE2_MATCH_INVALID_UTF was set when the pattern was compiled, passing an
+invalid UTF string results in undefined behaviour. Your program may crash or
+loop or give wrong results. In the absence of PCRE2_MATCH_INVALID_UTF you
+should only call <b>pcre2_jit_match()</b> in UTF mode if you are sure the
+subject is valid.
+</P>
+<P>
+The arguments for <b>pcre2_jit_match()</b> are exactly the same as for
 <a href="pcre2_match.html"><b>pcre2_match()</b>,</a>
 except that the subject string must be specified with a length;
 PCRE2_ZERO_TERMINATED is not supported.
 </P>
 <P>
 The supported options are PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
 PCRE2_NOTEMPTY_ATSTART, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. Unsupported
-options are ignored. The subject string is not checked for UTF validity.
+options are ignored.
 </P>
 <P>
 The return values are the same as for <b>pcre2_match()</b> plus
diff --git a/doc/html/pcre2jit.html b/doc/html/pcre2jit.html
@@ -445,7 +445,10 @@ <h1>pcre2jit man page</h1>
 the subject pointer is NULL but the length is non-zero, an immediate error is
 given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
 for validity. In the interests of speed, these checks do not happen on the JIT
-fast path, and if invalid data is passed, the result is undefined.
+fast path, and if invalid UTF data is passed, the result is undefined. The
+program may crash or loop or give wrong results. In the absence of
+PCRE2_MATCH_INVALID_UTF you should only call <b>pcre2_jit_match()</b> in UTF
+mode if you are sure the subject is valid.
 </P>
 <P>
 Bypassing the sanity checks and the <b>pcre2_match()</b> wrapping can give
@@ -466,9 +469,9 @@ <h1>pcre2jit man page</h1>
 </P>
 <br><a name="SEC14" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 30 November 2021
+Last updated: 20 January 2023
 <br>
-Copyright &copy; 1997-2021 University of Cambridge.
+Copyright &copy; 1997-2023 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2unicode.html b/doc/html/pcre2unicode.html
@@ -432,6 +432,14 @@ <h1>pcre2unicode man page</h1>
 valid UTF string.
 </P>
 <P>
+If you do not set PCRE2_MATCH_INVALID_UTF when calling <b>pcre2_compile()</b>, and
+you are not certain that your subject strings are valid UTF sequences, you
+should not make use of the JIT "fast path" function <b>pcre2_jit_match()</b>
+because it bypasses sanity checks, including the one for UTF validity. An
+invalid string may cause undefined behaviour, including looping, crashing, or
+giving the wrong answer.
+</P>
+<P>
 Setting PCRE2_MATCH_INVALID_UTF does not affect what <b>pcre2_compile()</b>
 generates, but if <b>pcre2_jit_compile()</b> is subsequently called, it does
 generate different code. If JIT is not used, the option affects the behaviour
@@ -473,9 +481,9 @@ <h1>pcre2unicode man page</h1>
 can be useful when searching for UTF text in executable or other binary files.
 </P>
 <P>
-Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as 
-sequences of uint16_t or uint32_t code points. They cannot find valid UTF 
-sequences within an arbitrary string of bytes unless such sequences are 
+Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as
+sequences of uint16_t or uint32_t code points. They cannot find valid UTF
+sequences within an arbitrary string of bytes unless such sequences are
 suitably aligned.
 </P>
 <br><b>
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
@@ -5557,10 +5557,13 @@ JIT FAST PATH API
        ple,  if the subject pointer is NULL but the length is non-zero, an im-
        mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set,  a  UTF
        subject string is tested for validity. In the interests of speed, these
-       checks do not happen on the JIT fast  path,  and  if  invalid  data  is
-       passed, the result is undefined.
+       checks do not happen on the JIT fast path, and if invalid UTF  data  is
+       passed,  the result is undefined. The program may crash or loop or give
+       wrong results. In the absence  of  PCRE2_MATCH_INVALID_UTF  you  should
+       only  call pcre2_jit_match() in UTF mode if you are sure the subject is
+       valid.
 
-       Bypassing  the  sanity  checks  and the pcre2_match() wrapping can give
+       Bypassing the sanity checks and the  pcre2_match()  wrapping  can  give
        speedups of more than 10%.
 
 
@@ -5578,8 +5581,8 @@ AUTHOR
 
 REVISION
 
-       Last updated: 30 November 2021
-       Copyright (c) 1997-2021 University of Cambridge.
+       Last updated: 20 January 2023
+       Copyright (c) 1997-2023 University of Cambridge.
 ------------------------------------------------------------------------------
  
  
@@ -11544,47 +11547,54 @@ MATCHING IN INVALID UTF STRINGS
        set,  it  forces  PCRE2_UTF  to be set as well. Note, however, that the
        pattern itself must be a valid UTF string.
 
-       Setting PCRE2_MATCH_INVALID_UTF does not  affect  what  pcre2_compile()
-       generates,  but  if pcre2_jit_compile() is subsequently called, it does
+       If you do not set PCRE2_MATCH_INVALID_UTF when calling pcre2_compile(),
+       and  you  are  not  certain that your subject strings are valid UTF se-
+       quences, you should not make  use  of  the  JIT  "fast  path"  function
+       pcre2_jit_match()  because it bypasses sanity checks, including the one
+       for UTF validity. An invalid string may cause undefined behaviour,  in-
+       cluding looping, crashing, or giving the wrong answer.
+
+       Setting  PCRE2_MATCH_INVALID_UTF  does  not affect what pcre2_compile()
+       generates, but if pcre2_jit_compile() is subsequently called,  it  does
        generate different code. If JIT is not used, the option affects the be-
        haviour of the interpretive code in pcre2_match(). When PCRE2_MATCH_IN-
-       VALID_UTF is set at compile  time,  PCRE2_NO_UTF_CHECK  is  ignored  at
+       VALID_UTF  is  set  at  compile  time, PCRE2_NO_UTF_CHECK is ignored at
        match time.
 
-       In  this  mode,  an  invalid  code  unit  sequence in the subject never
-       matches any pattern item. It does not match  dot,  it  does  not  match
-       \p{Any},  it does not even match negative items such as [^X]. A lookbe-
-       hind assertion fails if it encounters an invalid sequence while  moving
-       the  current  point backwards. In other words, an invalid UTF code unit
+       In this mode, an invalid  code  unit  sequence  in  the  subject  never
+       matches  any  pattern  item.  It  does not match dot, it does not match
+       \p{Any}, it does not even match negative items such as [^X]. A  lookbe-
+       hind  assertion fails if it encounters an invalid sequence while moving
+       the current point backwards. In other words, an invalid UTF  code  unit
        sequence acts as a barrier which no match can cross.
 
        You can also think of this as the subject being split up into fragments
-       of  valid UTF, delimited internally by invalid code unit sequences. The
-       pattern is matched fragment by fragment. The  result  of  a  successful
-       match,  however,  is  given  as code unit offsets in the entire subject
+       of valid UTF, delimited internally by invalid code unit sequences.  The
+       pattern  is  matched  fragment  by fragment. The result of a successful
+       match, however, is given as code unit offsets  in  the  entire  subject
        string in the usual way. There are a few points to consider:
 
-       The internal boundaries are not interpreted as the beginnings  or  ends
-       of  lines  and  so  do not match circumflex or dollar characters in the
+       The  internal  boundaries are not interpreted as the beginnings or ends
+       of lines and so do not match circumflex or  dollar  characters  in  the
        pattern.
 
-       If pcre2_match() is called with an offset that  points  to  an  invalid
-       UTF-sequence,  that  sequence  is  skipped, and the match starts at the
+       If  pcre2_match()  is  called  with an offset that points to an invalid
+       UTF-sequence, that sequence is skipped, and the  match  starts  at  the
        next valid UTF character, or the end of the subject.
 
        At internal fragment boundaries, \b and \B behave in the same way as at
-       the  beginning  and end of the subject. For example, a sequence such as
-       \bWORD\b would match an instance of WORD that is surrounded by  invalid
+       the beginning and end of the subject. For example, a sequence  such  as
+       \bWORD\b  would match an instance of WORD that is surrounded by invalid
        UTF code units.
 
-       Using  PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
-       trary data, knowing that any matched  strings  that  are  returned  are
+       Using PCRE2_MATCH_INVALID_UTF, an application can run matches on  arbi-
+       trary  data,  knowing  that  any  matched strings that are returned are
        valid UTF. This can be useful when searching for UTF text in executable
        or other binary files.
 
-       Note, however, that the  16-bit  and  32-bit  PCRE2  libraries  process
-       strings  as  sequences of uint16_t or uint32_t code points. They cannot
-       find valid UTF sequences within an arbitrary  string  of  bytes  unless
+       Note,  however,  that  the  16-bit  and  32-bit PCRE2 libraries process
+       strings as sequences of uint16_t or uint32_t code points.  They  cannot
+       find  valid  UTF  sequences  within an arbitrary string of bytes unless
        such sequences are suitably aligned.
 
 
diff --git a/doc/pcre2unicode.3 b/doc/pcre2unicode.3
@@ -409,6 +409,13 @@ not by \fBpcre2_dfa_match()\fP. When PCRE2_MATCH_INVALID_UTF is set, it forces
 PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
 valid UTF string.
 .P
+If you do not set PCRE2_MATCH_INVALID_UTF when calling \fBpcre2_compile()\fP, and
+you are not certain that your subject strings are valid UTF sequences, you
+should not make use of the JIT "fast path" function \fBpcre2_jit_match()\fP
+because it bypasses sanity checks, including the one for UTF validity. An
+invalid string may cause undefined behaviour, including looping, crashing, or
+giving the wrong answer.
+.P
 Setting PCRE2_MATCH_INVALID_UTF does not affect what \fBpcre2_compile()\fP
 generates, but if \fBpcre2_jit_compile()\fP is subsequently called, it does
 generate different code. If JIT is not used, the option affects the behaviour
@@ -443,9 +450,9 @@ Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
 data, knowing that any matched strings that are returned are valid UTF. This
 can be useful when searching for UTF text in executable or other binary files.
 .P
-Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as 
-sequences of uint16_t or uint32_t code points. They cannot find valid UTF 
-sequences within an arbitrary string of bytes unless such sequences are 
+Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as
+sequences of uint16_t or uint32_t code points. They cannot find valid UTF
+sequences within an arbitrary string of bytes unless such sequences are
 suitably aligned.
 .
 .