@@ -5557,10 +5557,13 @@ JIT FAST PATH API
  ple, if the subject pointer is NULL but the length is non-zero, an im-
  mediate error is given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF
  subject string is tested for validity. In the interests of speed, these
- checks do not happen on the JIT fast path, and if invalid data is
- passed, the result is undefined.
+ checks do not happen on the JIT fast path, and if invalid UTF data is
+ passed, the result is undefined. The program may crash or loop or give
+ wrong results. In the absence of PCRE2_MATCH_INVALID_UTF you should
+ only call pcre2_jit_match() in UTF mode if you are sure the subject is
+ valid.

- Bypassing the sanity checks and the pcre2_match() wrapping can give
+ Bypassing the sanity checks and the pcre2_match() wrapping can give
  speedups of more than 10%.


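Review note on the hunk above: one way to be "sure the subject is valid" before taking the fast path is to validate it in the application. The following is a minimal sketch for the 8-bit library only. The helper name utf8_is_valid() is hypothetical and is not part of PCRE2; it rejects overlong encodings, surrogates, values above U+10FFFF, and truncated sequences, and PCRE2's own check may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): returns 1 if the first
   len bytes of s are well-formed UTF-8, otherwise 0. */
int utf8_is_valid(const uint8_t *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t b = s[i];
        size_t need;    /* continuation bytes expected after the lead byte */
        uint32_t cp;    /* code point being decoded */
        uint32_t min;   /* smallest code point allowed for this length */

        if (b < 0x80) { i++; continue; }                 /* ASCII */
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;                                   /* invalid lead byte */

        if (len - i - 1 < need) return 0;                /* truncated sequence */

        for (size_t k = 1; k <= need; k++) {
            uint8_t c = s[i + k];
            if ((c & 0xC0) != 0x80) return 0;            /* not a continuation */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }

        if (cp < min) return 0;                          /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;      /* surrogate */
        if (cp > 0x10FFFF) return 0;                     /* beyond Unicode */

        i += need + 1;
    }
    return 1;
}
```

If such a check fails, the application could fall back to pcre2_match(), which performs its own validity test unless PCRE2_NO_UTF_CHECK is set, or compile the pattern with PCRE2_MATCH_INVALID_UTF. Note that validating every subject costs time, so this only pays off when the same subject is matched repeatedly.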
@@ -5578,8 +5581,8 @@ AUTHOR

  REVISION

- Last updated: 30 November 2021
- Copyright (c) 1997-2021 University of Cambridge.
+ Last updated: 20 January 2023
+ Copyright (c) 1997-2023 University of Cambridge.
  ------------------------------------------------------------------------------


@@ -11544,47 +11547,54 @@ MATCHING IN INVALID UTF STRINGS
  set, it forces PCRE2_UTF to be set as well. Note, however, that the
  pattern itself must be a valid UTF string.

- Setting PCRE2_MATCH_INVALID_UTF does not affect what pcre2_compile()
- generates, but if pcre2_jit_compile() is subsequently called, it does
+ If you do not set PCRE2_MATCH_INVALID_UTF when calling pcre2_compile,
+ and you are not certain that your subject strings are valid UTF se-
+ quences, you should not make use of the JIT "fast path" function
+ pcre2_jit_match() because it bypasses sanity checks, including the one
+ for UTF validity. An invalid string may cause undefined behaviour, in-
+ cluding looping, crashing, or giving the wrong answer.
+
+ Setting PCRE2_MATCH_INVALID_UTF does not affect what pcre2_compile()
+ generates, but if pcre2_jit_compile() is subsequently called, it does
  generate different code. If JIT is not used, the option affects the be-
  haviour of the interpretive code in pcre2_match(). When PCRE2_MATCH_IN-
- VALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is ignored at
+ VALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is ignored at
  match time.

- In this mode, an invalid code unit sequence in the subject never
- matches any pattern item. It does not match dot, it does not match
- \p{Any}, it does not even match negative items such as [^X]. A lookbe-
- hind assertion fails if it encounters an invalid sequence while moving
- the current point backwards. In other words, an invalid UTF code unit
+ In this mode, an invalid code unit sequence in the subject never
+ matches any pattern item. It does not match dot, it does not match
+ \p{Any}, it does not even match negative items such as [^X]. A lookbe-
+ hind assertion fails if it encounters an invalid sequence while moving
+ the current point backwards. In other words, an invalid UTF code unit
  sequence acts as a barrier which no match can cross.

  You can also think of this as the subject being split up into fragments
- of valid UTF, delimited internally by invalid code unit sequences. The
- pattern is matched fragment by fragment. The result of a successful
- match, however, is given as code unit offsets in the entire subject
+ of valid UTF, delimited internally by invalid code unit sequences. The
+ pattern is matched fragment by fragment. The result of a successful
+ match, however, is given as code unit offsets in the entire subject
  string in the usual way. There are a few points to consider:

- The internal boundaries are not interpreted as the beginnings or ends
- of lines and so do not match circumflex or dollar characters in the
+ The internal boundaries are not interpreted as the beginnings or ends
+ of lines and so do not match circumflex or dollar characters in the
  pattern.

- If pcre2_match() is called with an offset that points to an invalid
- UTF-sequence, that sequence is skipped, and the match starts at the
+ If pcre2_match() is called with an offset that points to an invalid
+ UTF-sequence, that sequence is skipped, and the match starts at the
  next valid UTF character, or the end of the subject.

  At internal fragment boundaries, \b and \B behave in the same way as at
- the beginning and end of the subject. For example, a sequence such as
- \bWORD\b would match an instance of WORD that is surrounded by invalid
+ the beginning and end of the subject. For example, a sequence such as
+ \bWORD\b would match an instance of WORD that is surrounded by invalid
  UTF code units.

- Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
- trary data, knowing that any matched strings that are returned are
+ Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi-
+ trary data, knowing that any matched strings that are returned are
  valid UTF. This can be useful when searching for UTF text in executable
  or other binary files.

- Note, however, that the 16-bit and 32-bit PCRE2 libraries process
- strings as sequences of uint16_t or uint32_t code points. They cannot
- find valid UTF sequences within an arbitrary string of bytes unless
+ Note, however, that the 16-bit and 32-bit PCRE2 libraries process
+ strings as sequences of uint16_t or uint32_t code points. They cannot
+ find valid UTF sequences within an arbitrary string of bytes unless
  such sequences are suitably aligned.


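Review note on the hunk above: the "fragments of valid UTF" picture can be illustrated outside PCRE2 with a small scanner over a UTF-8 byte buffer. This is only a conceptual sketch, not how the library implements the option. Both function names are hypothetical, and treating every byte that does not begin a well-formed sequence as a one-byte barrier is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length in bytes of the well-formed UTF-8 sequence
   at s[0], or 0 if none starts there within the n available bytes. */
size_t utf8_seq_len(const uint8_t *s, size_t n)
{
    uint32_t cp, min;
    size_t need;

    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;
    else if ((s[0] & 0xE0) == 0xC0) { need = 1; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 2; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 3; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;

    if (n - 1 < need) return 0;                      /* truncated */
    for (size_t k = 1; k <= need; k++) {
        if ((s[k] & 0xC0) != 0x80) return 0;         /* not a continuation */
        cp = (cp << 6) | (uint32_t)(s[k] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF) return 0;         /* overlong or too large */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;      /* surrogate */
    return need + 1;
}

/* Hypothetical helper: finds the next maximal run of valid UTF-8 at or
   after pos. On success stores its byte offsets in *start and *end
   (end exclusive, relative to the whole buffer) and returns 1.
   Returns 0 when no valid fragment remains. */
int next_valid_fragment(const uint8_t *s, size_t len, size_t pos,
                        size_t *start, size_t *end)
{
    size_t n;

    /* Skip the barrier: bytes that do not begin a valid sequence. */
    while (pos < len && utf8_seq_len(s + pos, len - pos) == 0) pos++;
    if (pos >= len) return 0;

    *start = pos;
    /* Extend the fragment over consecutive valid sequences. */
    while (pos < len && (n = utf8_seq_len(s + pos, len - pos)) != 0) pos += n;
    *end = pos;
    return 1;
}
```

Calling next_valid_fragment() repeatedly, passing the previous *end as pos, walks the fragments in order. The offsets stay relative to the whole buffer, which mirrors the statement that match results are given as code unit offsets in the entire subject string.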