Skip to content

bpo-45508: Specialize INPLACE_ADD #29024

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from
Closed

Conversation

sweeneyde
Copy link
Member

@sweeneyde sweeneyde commented Oct 18, 2021

@sweeneyde sweeneyde added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Oct 18, 2021
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @sweeneyde for commit 62c1d38 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Oct 18, 2021
@sweeneyde
Copy link
Member Author

The s390x RHEL7 LTO PR failure is https://bugs.python.org/issue45484

@markshannon
Copy link
Member

You can remove unicode_concatenate

@markshannon markshannon self-assigned this Oct 18, 2021
@markshannon
Copy link
Member

Any performance numbers or specialization stats for this?

@markshannon
Copy link
Member

Maybe I should read the first comment 🙂

@markshannon
Copy link
Member

I'd like to hold off merging this until we have a better way to handle classes like decimal.
I'll look into a fix for BINARY_ADD that we then apply here.

@sweeneyde
Copy link
Member Author

I merged with main and ran some microbenchmarks, and it seems decimal += decimal is not that bad.

Microbenchmark program:

from pyperf import Runner

runner = Runner()

runner.timeit("int+=int",
    setup="from itertools import repeat",
    stmt="x = 0\n"
         "for y in repeat(1, 10_000):\n"
         "    x += y; x += y; x += y; x += y; x += y"
    )
runner.timeit("float+=float",
    setup="from itertools import repeat",
    stmt="x = 0.0\n"
         "for y in repeat(1.0, 10_000):\n"
         "    x += y; x += y; x += y; x += y; x += y"
    )
runner.timeit("str+=str",
    setup="from itertools import repeat",
    stmt="for y in repeat('a', 10_000):\n"
         "    x = ''; x += y; x += y; x += y; x += y; x += y"
    )
runner.timeit("list[0]+=str",
    setup="from itertools import repeat",
    stmt="x = [None]\n"
         "for y in repeat('a', 10_000):\n"
         "    x[0] = ''; x[0] += y; x[0] += y; x[0] += y; x[0] += y; x[0] += y"
    )
runner.timeit("float+=int",
    setup="from itertools import repeat",
    stmt="x = 0.0\n"
         "for y in repeat(1, 10_000):\n"
         "    x += y; x += y; x += y; x += y; x += y"
    )
runner.timeit("decimal+=decimal",
    setup="from itertools import repeat; from decimal import Decimal as D",
    stmt="x = D(0)\n"
         "for y in repeat(D(1), 10_000):\n"
         "    x += y; x += y; x += y; x += y; x += y"
    )
runner.timeit("list[0]+=1",
    setup="from itertools import repeat; from collections import defaultdict",
    stmt="dd = [0]\n"
         "for y in repeat(1, 10_000):\n"
         "    dd[0] += y; dd[0] += y; dd[0] += y; dd[0] += y; dd[0] += y",
    )
runner.timeit("defaultdict(int)[0]+=1",
    setup="from itertools import repeat; from collections import defaultdict",
    stmt="dd = defaultdict(int)\n"
         "for y in repeat(1, 10_000):\n"
         "    dd[0] += y; dd[0] += y; dd[0] += y; dd[0] += y; dd[0] += y",
    )

Results from PGO on MSVC:

Benchmark main_inplace_add_micro specialized_inplace_add_micro
float+=float 1.18 ms 988 us: 1.19x faster
int+=int 1.34 ms 1.16 ms: 1.15x faster
str+=str 2.08 ms 1.80 ms: 1.15x faster
list[0]+=1 2.47 ms 2.39 ms: 1.03x faster
list[0]+=str 3.62 ms 3.67 ms: 1.01x slower
defaultdict(int)[0]+=1 3.29 ms 3.43 ms: 1.04x slower
float+=int 1.64 ms 1.74 ms: 1.06x slower
Geometric mean (ref) 1.05x faster

Benchmark hidden because not significant (1): decimal+=decimal

Results from PGO on GCC (WSL):

Benchmark main_inplace_add_micro2 specialized_inplace_add_micro2
float+=float 885 us 672 us: 1.32x faster
int+=int 1.14 ms 987 us: 1.16x faster
defaultdict(int)[0]+=1 2.61 ms 2.49 ms: 1.05x faster
list[0]+=1 2.03 ms 1.96 ms: 1.03x faster
float+=int 1.36 ms 1.32 ms: 1.03x faster
str+=str 1.39 ms 1.45 ms: 1.04x slower
list[0]+=str 2.75 ms 2.89 ms: 1.05x slower
Geometric mean (ref) 1.06x faster

Benchmark hidden because not significant (1): decimal+=decimal

@markshannon
Copy link
Member

Once #29482 is merged this will need to reworked for the Nth (and hopefully last) time. Sorry about that.
On the bright side, it should save some boilerplate.

@sweeneyde
Copy link
Member Author

I think PR 29482 will just make this obsolete -- it manages to re-use *_ADD opcodes for INPLACE_ADD opcodes (nice!):

(In specialize.c)
        case NB_ADD:
        case NB_INPLACE_ADD:
            if (PyUnicode_CheckExact(lhs)) {
                if (_Py_OPCODE(instr[1]) == STORE_FAST && Py_REFCNT(lhs) == 2) {
                    *instr = _Py_MAKECODEUNIT(BINARY_OP_INPLACE_ADD_UNICODE,
                                              _Py_OPARG(*instr));
                    goto success;
                }
                *instr = _Py_MAKECODEUNIT(BINARY_OP_ADD_UNICODE,
                                          _Py_OPARG(*instr));
                goto success;
            }

@sweeneyde sweeneyde closed this Nov 11, 2021
@sweeneyde sweeneyde deleted the inplace_add branch December 19, 2021 05:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants