Skip to content

bpo-45636: Simpler design for BINARY_OP/INPLACE_OP #29418

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from

Conversation

brandtbucher
Copy link
Member

@brandtbucher brandtbucher commented Nov 5, 2021

No perf impact:

Benchmark main binary-op-alt
nbody 118 ms 113 ms: 1.05x faster
logging_silent 118 ns 114 ns: 1.04x faster
xml_etree_iterparse 109 ms 106 ms: 1.03x faster
go 172 ms 167 ms: 1.02x faster
richards 57.8 ms 56.8 ms: 1.02x faster
dulwich_log 67.7 ms 66.7 ms: 1.01x faster
telco 6.33 ms 6.24 ms: 1.01x faster
pickle_pure_python 369 us 364 us: 1.01x faster
scimark_sparse_mat_mult 4.74 ms 4.69 ms: 1.01x faster
hexiom 7.53 ms 7.44 ms: 1.01x faster
meteor_contest 107 ms 106 ms: 1.01x faster
xml_etree_parse 155 ms 154 ms: 1.01x faster
pickle_list 4.58 us 4.56 us: 1.01x faster
regex_compile 145 ms 144 ms: 1.00x faster
nqueens 88.5 ms 88.1 ms: 1.00x faster
sympy_sum 170 ms 169 ms: 1.00x faster
python_startup 8.53 ms 8.52 ms: 1.00x faster
python_startup_no_site 5.90 ms 5.91 ms: 1.00x slower
pickle 9.90 us 9.94 us: 1.00x slower
pidigits 194 ms 195 ms: 1.00x slower
fannkuch 417 ms 419 ms: 1.01x slower
json_dumps 12.5 ms 12.5 ms: 1.01x slower
chaos 76.5 ms 76.9 ms: 1.01x slower
unpickle 13.4 us 13.5 us: 1.01x slower
pyflate 534 ms 538 ms: 1.01x slower
mako 11.9 ms 12.0 ms: 1.01x slower
json_loads 25.4 us 25.7 us: 1.01x slower
logging_format 6.73 us 6.79 us: 1.01x slower
logging_simple 6.06 us 6.13 us: 1.01x slower
chameleon 7.53 ms 7.62 ms: 1.01x slower
scimark_fft 347 ms 352 ms: 1.01x slower
regex_dna 216 ms 219 ms: 1.01x slower
pathlib 18.7 ms 19.0 ms: 1.02x slower
unpickle_list 4.95 us 5.05 us: 1.02x slower
spectral_norm 107 ms 110 ms: 1.03x slower
scimark_monte_carlo 78.0 ms 80.4 ms: 1.03x slower
regex_v8 23.2 ms 24.2 ms: 1.04x slower
pickle_dict 27.5 us 28.7 us: 1.04x slower
regex_effbot 3.18 ms 3.34 ms: 1.05x slower
unpack_sequence 44.0 ns 48.8 ns: 1.11x slower
Geometric mean (ref) 1.00x slower

https://bugs.python.org/issue45636

@brandtbucher
Copy link
Member Author

Currently re-running benchmarks, for my own sanity…

@brandtbucher
Copy link
Member Author

I'm glad I did. It seems that that single change contributes to the slowdown we saw on the other PR:

Slower (37):
- crypto_pyaes: 87.5 ms +- 0.6 ms -> 91.3 ms +- 1.1 ms: 1.04x slower
- pickle_dict: 26.8 us +- 0.1 us -> 27.9 us +- 4.1 us: 1.04x slower
- go: 165 ms +- 1 ms -> 171 ms +- 2 ms: 1.04x slower
- richards: 56.9 ms +- 0.9 ms -> 58.8 ms +- 0.7 ms: 1.03x slower
- chameleon: 7.48 ms +- 0.10 ms -> 7.71 ms +- 0.07 ms: 1.03x slower
- regex_compile: 144 ms +- 2 ms -> 148 ms +- 1 ms: 1.03x slower
- spectral_norm: 107 ms +- 1 ms -> 110 ms +- 2 ms: 1.02x slower
- tornado_http: 109 ms +- 3 ms -> 111 ms +- 3 ms: 1.02x slower
- scimark_monte_carlo: 77.5 ms +- 0.8 ms -> 79.1 ms +- 1.4 ms: 1.02x slower
- chaos: 76.7 ms +- 0.7 ms -> 78.3 ms +- 2.5 ms: 1.02x slower
- deltablue: 4.85 ms +- 0.06 ms -> 4.95 ms +- 0.07 ms: 1.02x slower
- nbody: 115 ms +- 2 ms -> 118 ms +- 3 ms: 1.02x slower
- sympy_expand: 501 ms +- 5 ms -> 510 ms +- 5 ms: 1.02x slower
- float: 80.5 ms +- 0.9 ms -> 81.8 ms +- 0.9 ms: 1.02x slower
- json_loads: 25.4 us +- 0.2 us -> 25.8 us +- 0.2 us: 1.02x slower
- pickle_pure_python: 367 us +- 3 us -> 373 us +- 4 us: 1.02x slower
- regex_dna: 212 ms +- 1 ms -> 215 ms +- 2 ms: 1.01x slower
- pyflate: 539 ms +- 9 ms -> 545 ms +- 4 ms: 1.01x slower
- unpickle_pure_python: 272 us +- 3 us -> 275 us +- 3 us: 1.01x slower
- sqlite_synth: 2.74 us +- 0.04 us -> 2.77 us +- 0.07 us: 1.01x slower
- sympy_integrate: 22.0 ms +- 0.2 ms -> 22.2 ms +- 0.1 ms: 1.01x slower
- django_template: 37.2 ms +- 0.5 ms -> 37.6 ms +- 0.4 ms: 1.01x slower
- scimark_sparse_mat_mult: 4.84 ms +- 0.08 ms -> 4.89 ms +- 0.14 ms: 1.01x slower
- xml_etree_iterparse: 107 ms +- 3 ms -> 108 ms +- 3 ms: 1.01x slower
- pickle_list: 4.48 us +- 0.04 us -> 4.52 us +- 0.03 us: 1.01x slower
- sympy_str: 303 ms +- 4 ms -> 306 ms +- 3 ms: 1.01x slower
- meteor_contest: 106 ms +- 2 ms -> 106 ms +- 2 ms: 1.01x slower
- unpickle_list: 5.05 us +- 0.04 us -> 5.08 us +- 0.07 us: 1.01x slower
- xml_etree_process: 59.1 ms +- 0.6 ms -> 59.5 ms +- 0.4 ms: 1.01x slower
- xml_etree_generate: 80.4 ms +- 0.7 ms -> 80.9 ms +- 1.0 ms: 1.01x slower
- 2to3: 271 ms +- 1 ms -> 273 ms +- 1 ms: 1.01x slower
- hexiom: 7.53 ms +- 0.12 ms -> 7.57 ms +- 0.07 ms: 1.01x slower
- sympy_sum: 170 ms +- 2 ms -> 171 ms +- 2 ms: 1.01x slower
- dulwich_log: 67.6 ms +- 0.3 ms -> 67.9 ms +- 0.4 ms: 1.00x slower
- regex_v8: 23.2 ms +- 0.1 ms -> 23.3 ms +- 0.2 ms: 1.00x slower
- python_startup: 8.51 ms +- 0.01 ms -> 8.54 ms +- 0.01 ms: 1.00x slower
- python_startup_no_site: 5.90 ms +- 0.00 ms -> 5.91 ms +- 0.00 ms: 1.00x slower

Faster (6):
- scimark_fft: 355 ms +- 10 ms -> 346 ms +- 2 ms: 1.03x faster
- fannkuch: 427 ms +- 4 ms -> 417 ms +- 2 ms: 1.02x faster
- pidigits: 190 ms +- 0 ms -> 187 ms +- 0 ms: 1.01x faster
- json_dumps: 12.6 ms +- 0.1 ms -> 12.5 ms +- 0.1 ms: 1.01x faster
- pathlib: 19.0 ms +- 0.3 ms -> 18.8 ms +- 0.2 ms: 1.01x faster
- mako: 11.9 ms +- 0.1 ms -> 11.9 ms +- 0.1 ms: 1.00x faster

Benchmark hidden because not significant (15): logging_format, logging_silent, logging_simple, nqueens, pickle, raytrace, regex_effbot, scimark_lu, scimark_sor, sqlalchemy_declarative, sqlalchemy_imperative, telco, unpack_sequence, unpickle, xml_etree_parse

Geometric mean: 1.01x slower

It's not clear to me why. Perhaps int is that much faster than uint16_t?

@python python deleted a comment from HannahGraham23 Nov 6, 2021
@python python deleted a comment from HannahGraham23 Nov 6, 2021
@arhadthedev
Copy link
Member

@brandtbucher Maybe the change created a branching point with random targets, so CPU branch predictor just gives up with prefetching. A profiler will help to find exact lines with the longest execution time.

@markshannon markshannon self-assigned this Nov 8, 2021
Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect the awkward size of nb_name_info may be responsible for most of the slowdown.

#undef NB_NAME_INFO

PyObject *
_PyNumber_Op(PyObject *o1, PyObject *o2, unsigned op)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just forwards to binary_op. You might want to consider:

  1. Make this the function that does the work (effectively inlining binary_op here).
  2. Make the old API PyNumber_TrueDivide, etc. call _PyNumber_Op
  3. Delete binary_op.

}

PyObject *
_PyNumber_InPlaceOp(PyObject *o1, PyObject *o2, unsigned op)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for _PyNumber_Op.

@bedevere-bot
Copy link

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@brandtbucher
Copy link
Member Author

See my new (preferred) approach in #29482.

@brandtbucher
Copy link
Member Author

You were right @markshannon: adjusting the alignment fixes the slowdown. Nice catch.

Slower (25):
- pickle_dict: 26.8 us +- 0.1 us -> 28.0 us +- 0.1 us: 1.04x slower
- go: 165 ms +- 1 ms -> 172 ms +- 2 ms: 1.04x slower
- regex_effbot: 3.15 ms +- 0.03 ms -> 3.25 ms +- 0.04 ms: 1.03x slower
- regex_dna: 212 ms +- 1 ms -> 219 ms +- 1 ms: 1.03x slower
- scimark_lu: 138 ms +- 3 ms -> 143 ms +- 5 ms: 1.03x slower
- float: 80.5 ms +- 0.9 ms -> 82.8 ms +- 1.1 ms: 1.03x slower
- logging_format: 6.73 us +- 0.09 us -> 6.89 us +- 0.08 us: 1.02x slower
- chameleon: 7.48 ms +- 0.10 ms -> 7.64 ms +- 0.09 ms: 1.02x slower
- regex_v8: 23.2 ms +- 0.1 ms -> 23.7 ms +- 0.4 ms: 1.02x slower
- tornado_http: 109 ms +- 3 ms -> 111 ms +- 3 ms: 1.02x slower
- scimark_monte_carlo: 77.5 ms +- 0.8 ms -> 79.0 ms +- 1.3 ms: 1.02x slower
- nqueens: 90.1 ms +- 0.7 ms -> 91.3 ms +- 1.0 ms: 1.01x slower
- django_template: 37.2 ms +- 0.5 ms -> 37.7 ms +- 0.6 ms: 1.01x slower
- sympy_expand: 501 ms +- 5 ms -> 507 ms +- 3 ms: 1.01x slower
- sqlalchemy_imperative: 18.6 ms +- 0.5 ms -> 18.8 ms +- 0.6 ms: 1.01x slower
- richards: 56.9 ms +- 0.9 ms -> 57.6 ms +- 1.4 ms: 1.01x slower
- regex_compile: 144 ms +- 2 ms -> 146 ms +- 2 ms: 1.01x slower
- unpack_sequence: 43.7 ns +- 0.9 ns -> 44.2 ns +- 1.1 ns: 1.01x slower
- xml_etree_generate: 80.4 ms +- 0.7 ms -> 81.2 ms +- 0.8 ms: 1.01x slower
- json_loads: 25.4 us +- 0.2 us -> 25.6 us +- 0.2 us: 1.01x slower
- mako: 11.9 ms +- 0.1 ms -> 12.0 ms +- 0.0 ms: 1.01x slower
- pickle_list: 4.48 us +- 0.04 us -> 4.52 us +- 0.04 us: 1.01x slower
- scimark_sparse_mat_mult: 4.84 ms +- 0.08 ms -> 4.88 ms +- 0.05 ms: 1.01x slower
- logging_simple: 6.15 us +- 0.08 us -> 6.19 us +- 0.07 us: 1.01x slower
- unpickle_list: 5.05 us +- 0.04 us -> 5.07 us +- 0.05 us: 1.00x slower

Faster (15):
- scimark_fft: 355 ms +- 10 ms -> 345 ms +- 2 ms: 1.03x faster
- fannkuch: 427 ms +- 4 ms -> 419 ms +- 4 ms: 1.02x faster
- xml_etree_iterparse: 107 ms +- 3 ms -> 105 ms +- 2 ms: 1.02x faster
- telco: 6.38 ms +- 0.25 ms -> 6.28 ms +- 0.11 ms: 1.02x faster
- pickle: 9.97 us +- 0.16 us -> 9.84 us +- 0.11 us: 1.01x faster
- pidigits: 190 ms +- 0 ms -> 188 ms +- 0 ms: 1.01x faster
- xml_etree_parse: 155 ms +- 3 ms -> 153 ms +- 1 ms: 1.01x faster
- crypto_pyaes: 87.5 ms +- 0.6 ms -> 86.6 ms +- 1.1 ms: 1.01x faster
- hexiom: 7.53 ms +- 0.12 ms -> 7.47 ms +- 0.06 ms: 1.01x faster
- raytrace: 331 ms +- 2 ms -> 329 ms +- 1 ms: 1.01x faster
- json_dumps: 12.6 ms +- 0.1 ms -> 12.5 ms +- 0.1 ms: 1.01x faster
- spectral_norm: 107 ms +- 1 ms -> 107 ms +- 1 ms: 1.01x faster
- python_startup: 8.51 ms +- 0.01 ms -> 8.49 ms +- 0.01 ms: 1.00x faster
- dulwich_log: 67.6 ms +- 0.3 ms -> 67.4 ms +- 0.6 ms: 1.00x faster
- python_startup_no_site: 5.90 ms +- 0.00 ms -> 5.89 ms +- 0.00 ms: 1.00x faster

Benchmark hidden because not significant (18): 2to3, chaos, deltablue, logging_silent, meteor_contest, nbody, pathlib, pickle_pure_python, pyflate, scimark_sor, sqlalchemy_declarative, sqlite_synth, sympy_integrate, sympy_sum, sympy_str, unpickle, unpickle_pure_python, xml_etree_process

Geometric mean: 1.00x slower

@brandtbucher
Copy link
Member Author

Closing in favor of #29482.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants