bpo-35859: re module, fix three bugs about capturing groups #11756

Closed
wants to merge 5 commits into from
@ghost ghost commented Feb 4, 2019

Fix wrong capturing groups in rare cases: the span of a capturing group may be lost when backtracking.
These bugs have existed since Python 2.

  • macro MARK_PUSH(lastmark) didn't protect MARK 0 if it was the only available mark
  • jump JUMP_MIN_UNTIL_3 needs LASTMARK_SAVE() and MARK_PUSH()
  • jump JUMP_ASSERT_NOT needs LASTMARK_SAVE() and MARK_PUSH()

Please read the review guide in issue35859:

https://bugs.python.org/issue35859
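
The three fixes listed above share one idea: before the matcher tries a path that may fail, it saves the group marks so they can be restored when it backtracks. The following is a toy Python sketch of that idea only. It is not the C code of the re engine, and the names `Marks`, `push`, `pop_restore`, `pop_discard`, and `try_path` are hypothetical.

```python
from typing import Callable, List, Optional


class Marks:
    """Toy group-mark storage with an explicit save stack (hypothetical names)."""

    def __init__(self) -> None:
        self.marks: List[Optional[int]] = []          # mark positions recorded so far
        self._saved: List[List[Optional[int]]] = []   # snapshots for backtracking

    def push(self) -> None:
        # Save a snapshot of every mark, even when only one mark exists.
        self._saved.append(list(self.marks))

    def pop_restore(self) -> None:
        # Backtracking: throw away marks written on the failed path.
        self.marks = self._saved.pop()

    def pop_discard(self) -> None:
        # The path succeeded: keep the current marks, drop the snapshot.
        self._saved.pop()


def try_path(marks: Marks, path: Callable[[Marks], bool]) -> bool:
    """Run `path`; if it fails, undo any marks it wrote."""
    marks.push()
    if path(marks):
        marks.pop_discard()
        return True
    marks.pop_restore()
    return False


def failing_path(m: Marks) -> bool:
    m.marks = [0, 2]   # pretend group 1 captured s[0:2] ...
    return False       # ... but the rest of the pattern did not match


def succeeding_path(m: Marks) -> bool:
    m.marks = [0, 1]   # group 1 captured s[0:1]
    return True


state = Marks()
try_path(state, failing_path)
after_failure = list(state.marks)   # marks from the failed path are gone
try_path(state, succeeding_path)
after_success = list(state.marks)   # marks from the successful path are kept
```

Without the `push` before the attempt, the marks written by `failing_path` would leak into the final result, which is the kind of stale capture this PR describes.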

@ghost
Author

ghost commented Feb 8, 2019

This fix is not correct. I'll update this PR when I can use my computer.

@ghost changed the title from "bpo-35859: in re module, save marks before JUMP_MIN_UNTIL_3 jump" to "bpo-35859: re module, fix wrong capturing groups in rare cases" on Feb 9, 2019
@serhiy-storchaka self-requested a review on February 18, 2019 14:23
@serhiy-storchaka added the "type-bug" and "needs backport to 2.7" labels on Feb 18, 2019
@ghost
Author

ghost commented Feb 18, 2019

@serhiy-storchaka
I'm afraid this PR can't be merged into the 2.7 branch automatically. I will create a PR for the 2.7 branch tomorrow, along with the patch in #11546.

    def test_bug_35859(self):
        # Capture behavior depends on the order of an alternation
        s = 'ab'
        self.assertEqual(re.search(r'(ab|a)*?b', s).groups(), ('a',))
Member

Why is search() used instead of match() or fullmatch()?

ab|a is equivalent to ab?. Is there a reason to use the former? If there is a difference, it is better to use .b|a instead, because ab|a may be transformed to ab? by the RE compiler in future versions.
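
The equivalence claimed above can be checked by brute force. This is a minimal sketch that only compares which strings each pattern accepts as a full match; it says nothing about capture behavior or alternation order. The helper names `strings_over` and `same_language` are hypothetical.

```python
import re
from itertools import product


def strings_over(alphabet, max_len):
    """Yield every string over `alphabet` with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def same_language(p1, p2, alphabet="ab", max_len=4):
    """True if both patterns full-match exactly the same short strings."""
    r1, r2 = re.compile(p1), re.compile(p2)
    return all(
        (r1.fullmatch(s) is None) == (r2.fullmatch(s) is None)
        for s in strings_over(alphabet, max_len)
    )


agree = same_language(r"ab|a", r"ab?")    # both accept only 'a' and 'ab'
differ = same_language(r"ab|a", r"a")     # 'ab' is accepted by one pattern only
```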

Author

Nice catch, this PR doesn't fix the problem.

>>> re.match(r'(ab?)*?b', 'ab').groups()
('',)

The correct output should be:

>>> regex.match(r'(ab?)*?b', 'ab').groups()
('a',)

I will recheck the patch tomorrow.
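
The check above can be repeated with only the standard library. This is a minimal sketch; the helper name `match_and_group` is hypothetical. The value of group 1 depends on the interpreter build: the comment above reports `''` with this PR applied, while `'a'` is the result the author expects.

```python
import re
import sys


def match_and_group(pattern, text):
    """Return (whole match, group 1), or (None, None) if there is no match."""
    m = re.match(pattern, text)
    if m is None:
        return None, None
    return m.group(0), m.group(1)


whole, group1 = match_and_group(r"(ab?)*?b", "ab")

# The whole match is 'ab'; group 1 is the value under discussion in this PR.
print(sys.version_info[:2], repr(whole), repr(group1))
```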

Contributor

@davisjam davisjam Feb 22, 2019

@serhiy-storchaka

Why is search() used instead of match() or fullmatch()?
ab|a is equivalent to ab?. Is there a reason to use the former?

"Because it can be". (ab|a)*?b is a reduced test case for a real regex I found during my research.

        s = 'ab'
        self.assertEqual(re.search(r'(ab|a)*?b', s).groups(), ('a',))
        self.assertEqual(re.search(r'(ab|a)+?b', s).groups(), ('a',))
        self.assertEqual(re.search(r'(ab|a){0,}?b', s).groups(), ('a',))
Member

X{0,}? is equivalent to X*?, so this test is redundant.
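
A quick way to see the equivalence stated above is to compare match spans for both spellings over many short inputs. This is a sketch only: it uses a non-capturing group so that the capture bug discussed in this PR cannot affect the comparison, and the helper name `spans` is hypothetical.

```python
import re
from itertools import product


def spans(pattern, texts):
    """Return the search() span for each text, or None where nothing matches."""
    rx = re.compile(pattern)
    out = []
    for t in texts:
        m = rx.search(t)
        out.append(m.span() if m else None)
    return out


# Every string over {'a', 'b'} with length 0..5.
texts = ["".join(p) for n in range(6) for p in product("ab", repeat=n)]

star = spans(r"(?:ab|a)*?b", texts)
braces = spans(r"(?:ab|a){0,}?b", texts)
same_spans = star == braces
```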

@ghost changed the title from "bpo-35859: re module, fix wrong capturing groups in rare cases" to "[WIP] bpo-35859: re module, fix wrong capturing groups in rare cases" on Feb 22, 2019
wjssz added 4 commits March 1, 2019 18:14
Show the wrong behaviors before this fix.
MARK_PUSH(lastmark) macro didn't protect MARK-0 if it was the only available mark.
@ghost changed the title from "[WIP] bpo-35859: re module, fix wrong capturing groups in rare cases" to "bpo-35859: re module, fix three bugs about capturing groups" on Mar 1, 2019
before this fix, the test-case returns ('c', 'c', 'c')

No need to save state->repeat in ctx->u.rep.
SRE_OP_BRANCH saves state->repeat in ctx->u.rep because state->repeat may be NULL after JUMP_BRANCH.
SRE_OP_ASSERT_NOT doesn't have this problem.
@bedevere-bot

GH-12134 is a backport of this pull request to the 2.7 branch.

@ghost
Author

ghost commented Mar 19, 2019

See PR 12427

@ghost closed this on Mar 19, 2019
@ghost deleted the issue35859 branch on April 4, 2022 06:52
This pull request was closed.
Labels: awaiting review, type-bug (An unexpected behavior, bug, or error)
5 participants