Commit 8a59998

avar authored and gitster committed
grep: stress test PCRE v2 on invalid UTF-8 data
Since my b65abca ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01) we've been dying on invalid UTF-8 data when grepping for fixed strings if the following are all true:

 * The subject string is non-ASCII (e.g. "ævar")
 * We're under an is_utf8_locale(), e.g. "en_US.UTF-8", not "C"
 * We compiled with PCRE v2
 * That PCRE v2 did not have JIT support

The last of those is why this wasn't caught earlier, per pcre2jit(3):

    "unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the interests of speed, these checks do not happen on the JIT fast path, and if invalid data is passed, the result is undefined."

I.e. the subject being matched against our pattern was invalid, but we were lucky and got away with it on the JIT path; the non-JIT path is stricter.

This patch does nothing to fix that. Instead, we sneak in support for fixed patterns starting with "(*NO_JIT)". This disables the PCRE v2 JIT with implicit fixed-string matching for testing; see pcre2syntax(3) for the syntax.

This is technically a change in behavior, but it's so obscure that I figured it was OK. We'd previously consider this an invalid regular expression, as regcomp() would die on it; now we feed it to the PCRE v2 fixed-string path. I thought this was better than introducing yet another GIT_TEST_* environment variable.

We're also relying on a behavior of PCRE v2 that technically could change, but I think the test coverage is worth dipping our toe into some somewhat undefined behavior.

Signed-off-by: Ævar Arnfjörð Bjarmason <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
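To illustrate why the test data in this commit counts as invalid UTF-8, here is a minimal sketch of a UTF-8 validity scan. It is not PCRE v2's checker and not part of Git; the name `first_invalid_utf8` is hypothetical, and the code only assumes the standard UTF-8 encoding rules.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: return the offset of the
 * first byte that does not begin a well-formed UTF-8 sequence, or -1 if
 * the whole buffer is valid.
 */
static long first_invalid_utf8(const unsigned char *s, size_t len)
{
	size_t i = 0;

	assert(s != NULL || len == 0);
	while (i < len) {
		unsigned char c = s[i];
		unsigned long cp;
		size_t need, k;

		if (c < 0x80) {			/* plain ASCII */
			i++;
			continue;
		} else if (c >= 0xC2 && c <= 0xDF) {
			need = 1; cp = c & 0x1F;	/* 2-byte sequence */
		} else if (c >= 0xE0 && c <= 0xEF) {
			need = 2; cp = c & 0x0F;	/* 3-byte sequence */
		} else if (c >= 0xF0 && c <= 0xF4) {
			need = 3; cp = c & 0x07;	/* 4-byte sequence */
		} else {
			return (long)i;	/* stray continuation or bad lead byte */
		}

		if (len - i <= need)
			return (long)i;	/* sequence is truncated */
		for (k = 1; k <= need; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return (long)i;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		/* reject overlong forms, surrogates and out-of-range values */
		if ((need == 2 && cp < 0x800) || (need == 3 && cp < 0x10000) ||
		    (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
			return (long)i;
		i += need + 1;
	}
	return -1;
}
```

The test file in this commit starts with the byte 0x80 (written as "\200"), a continuation byte with no lead byte before it, so a scan like this one stops at offset 0. The "ævar" line on its own is valid, since "æ" encodes as the two bytes 0xC3 0xA6.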
1 parent 09872f6 commit 8a59998

2 files changed: +38 -0 lines changed

grep.c

Lines changed: 10 additions & 0 deletions
@@ -615,6 +615,16 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
 	p->is_fixed = is_fixed(p->pattern, p->patternlen);
+#ifdef USE_LIBPCRE2
+	if (!p->fixed && !p->is_fixed) {
+		const char *no_jit = "(*NO_JIT)";
+		const int no_jit_len = strlen(no_jit);
+		if (starts_with(p->pattern, no_jit) &&
+		    is_fixed(p->pattern + no_jit_len,
+			     p->patternlen - no_jit_len))
+			p->is_fixed = 1;
+	}
+#endif
 	if (p->fixed || p->is_fixed) {
 #ifdef USE_LIBPCRE2
 		opt->pcre2 = 1;
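The prefix check in this hunk can be tried outside of Git with a small standalone sketch. The helpers below are hypothetical stand-ins, not Git's real `is_fixed()` or `starts_with()`: `is_fixed_sketch` simply assumes that a pattern is "fixed" when it contains none of a short list of regex metacharacters, and the sketch ignores the separate `p->fixed` flag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for Git's is_fixed(): treat a pattern as a fixed
 * string when it contains no common regex metacharacter. The real
 * function may use a different character set.
 */
static int is_fixed_sketch(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (s[i] != '\0' && strchr("$()*+.?[\\^{|", s[i]) != NULL)
			return 0;
	return 1;
}

/*
 * Mirror the logic of the hunk above: a pattern takes the fixed-string
 * path if it is fixed outright, or if it is "(*NO_JIT)" followed by a
 * fixed string.
 */
static int takes_fixed_path_sketch(const char *pattern, size_t len)
{
	static const char no_jit[] = "(*NO_JIT)";
	const size_t no_jit_len = sizeof(no_jit) - 1;

	assert(pattern != NULL);
	if (is_fixed_sketch(pattern, len))
		return 1;
	if (len >= no_jit_len &&
	    memcmp(pattern, no_jit, no_jit_len) == 0 &&
	    is_fixed_sketch(pattern + no_jit_len, len - no_jit_len))
		return 1;
	return 0;
}
```

Under these assumptions, "var" and "(*NO_JIT)var" both take the fixed-string path, while "v.r" and "(*NO_JIT)v.r" do not, because the "." after the prefix is still a metacharacter.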

t/t7812-grep-icase-non-ascii.sh

Lines changed: 28 additions & 0 deletions
@@ -53,4 +53,32 @@ test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
 	test_cmp expected actual
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' '
+	printf "\\200\\n" >invalid-0x80 &&
+	echo "ævar" >expected &&
+	cat expected >>invalid-0x80 &&
+	git add invalid-0x80
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UTF-8 data' '
+	git grep -h "var" invalid-0x80 >actual &&
+	test_cmp expected actual &&
+	git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' '
+	test_might_fail git grep -h "æ" invalid-0x80 >actual &&
+	test_cmp expected actual &&
+	test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
+	test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
+	test_cmp expected actual &&
+	test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
+	test_cmp expected actual
+'
+
 test_done