Skip to content

bpo-37723: fix performance regression on regular expression parsing #15030

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 3 additions & 5 deletions Lib/sre_parse.py
Original file line number Diff line number Diff line change
Expand Up @@ -432,11 +432,9 @@ def _escape(source, escape, state):
def _uniq(items):
if len(set(items)) == len(items):
return items
newitems = []
for item in items:
if item not in newitems:
newitems.append(item)
return newitems
seen_items = set()
seen_items_add = seen_items.add
return [item for item in items if not (item in seen_items or seen_items_add(item))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this?

def _uniq(items):
    return list(dict.fromkeys(items))

Copy link

@alvations alvations Jul 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since dict.fromkeys() already jumbled up the item unless there's a need to return a list, returning a set might be more optimal when checking for unique, i.e.

def _uniq(items):
    return set(dict.fromkeys(items))

But it seems like the original _uniq() function returns a list. Is there a reason to do that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set(dict.fromkeys(X)) is totally useless. When you need set, set(X) is enoguh.
But in this case, we need list, not set. So I used list(dict.fromkeys(X)).

Copy link
Contributor Author

@yannvgn yannvgn Jul 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We indeed need a list. And I guess that preserving the order is important, looking at the initial implementation.

I'm fine with list(dict.fromkeys(items)) (and it's much prettier), but I have 2 minor concerns:

  • list(dict.fromkeys(items)) might be a bit slower (though I didn't check, and the difference is negligible I think)
  • we have to make sure this won't be backported to python < 3.7 as the dict order is not necessarily the insert order for python < 3.7 (but again, the bug only affects >=3.7 and the function does not exist on <3.7, so it won't be backported anyway)

What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • list(dict.fromkeys(items)) might be a bit slower (though I didn't check, and the difference is negligeible I think)

I believe list(dict.fromkeys(items)) is significantly faster than [it for it in items if it not in seen or seen.add(it)]

  • we have to make sure this won't be backported to python < 3.7 as the dict order is not necessarily the insert order for python < 3.7 (but again, the bug only affects >=3.7 and the function does not exist on <3.7, so it won't be backported anyway)

You said "sre_parse._uniq function, which does not exist in Python <= 3.6."
So no need to think about 3.6.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@methane you were right, list(dict.fromkeys(items)) seems to be a bit faster. Changed ✅.


def _parse_sub(source, state, verbose, nested):
# parse an alternation: a|b|c
Expand Down