Skip to content

bpo-46420: Use NOTRACE_DISPATCH() in specialized opcodes #30652

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 25, 2022

Conversation

sweeneyde
Copy link
Member

@sweeneyde sweeneyde commented Jan 17, 2022

Some benchmarks from GCC with --enable-optimizations --with-lto on WSL:

Slower (6):

  • pickle_dict: 27.2 us +- 0.4 us -> 28.9 us +- 0.4 us: 1.06x slower
  • unpickle_list: 5.01 us +- 0.11 us -> 5.12 us +- 0.10 us: 1.02x slower
  • json_dumps: 12.8 ms +- 0.2 ms -> 13.1 ms +- 0.3 ms: 1.02x slower
  • regex_dna: 208 ms +- 4 ms -> 212 ms +- 2 ms: 1.02x slower
  • django_template: 41.2 ms +- 2.6 ms -> 42.0 ms +- 0.9 ms: 1.02x slower
  • pickle_list: 4.57 us +- 0.13 us -> 4.65 us +- 0.16 us: 1.02x slower

Faster (34):

  • pickle: 12.5 us +- 1.2 us -> 11.3 us +- 1.3 us: 1.10x faster
  • nbody: 100 ms +- 2 ms -> 92.7 ms +- 2.5 ms: 1.08x faster
  • deltablue: 4.55 ms +- 0.06 ms -> 4.29 ms +- 0.16 ms: 1.06x faster
  • spectral_norm: 98.5 ms +- 2.4 ms -> 93.8 ms +- 1.1 ms: 1.05x faster
  • json_loads: 27.2 us +- 1.1 us -> 26.0 us +- 1.8 us: 1.05x faster
  • chaos: 80.5 ms +- 0.9 ms -> 77.1 ms +- 1.8 ms: 1.04x faster
  • hexiom: 7.01 ms +- 0.07 ms -> 6.72 ms +- 0.26 ms: 1.04x faster
  • nqueens: 87.3 ms +- 1.9 ms -> 83.7 ms +- 2.8 ms: 1.04x faster
  • crypto_pyaes: 91.0 ms +- 2.2 ms -> 88.0 ms +- 1.3 ms: 1.03x faster
  • scimark_lu: 112 ms +- 2 ms -> 108 ms +- 4 ms: 1.03x faster
  • pidigits: 187 ms +- 1 ms -> 182 ms +- 1 ms: 1.03x faster
  • dulwich_log: 83.8 ms +- 1.4 ms -> 81.5 ms +- 1.1 ms: 1.03x faster
  • logging_format: 8.19 us +- 0.10 us -> 7.98 us +- 0.14 us: 1.03x faster
  • logging_simple: 7.24 us +- 0.17 us -> 7.07 us +- 0.13 us: 1.02x faster
  • go: 152 ms +- 3 ms -> 149 ms +- 2 ms: 1.02x faster
  • regex_compile: 150 ms +- 3 ms -> 146 ms +- 2 ms: 1.02x faster
  • scimark_sor: 123 ms +- 3 ms -> 120 ms +- 2 ms: 1.02x faster
  • fannkuch: 406 ms +- 9 ms -> 398 ms +- 6 ms: 1.02x faster
  • float: 81.2 ms +- 1.3 ms -> 79.7 ms +- 1.4 ms: 1.02x faster
  • raytrace: 332 ms +- 8 ms -> 326 ms +- 7 ms: 1.02x faster
  • tornado_http: 191 ms +- 8 ms -> 188 ms +- 7 ms: 1.02x faster
  • mako: 11.7 ms +- 0.2 ms -> 11.6 ms +- 0.1 ms: 1.02x faster
  • scimark_monte_carlo: 72.5 ms +- 1.8 ms -> 71.4 ms +- 0.9 ms: 1.01x faster
  • unpickle_pure_python: 263 us +- 6 us -> 259 us +- 6 us: 1.01x faster
  • sympy_expand: 570 ms +- 14 ms -> 562 ms +- 17 ms: 1.01x faster
  • pickle_pure_python: 351 us +- 7 us -> 346 us +- 8 us: 1.01x faster
  • sympy_str: 342 ms +- 8 ms -> 338 ms +- 12 ms: 1.01x faster
  • regex_v8: 23.7 ms +- 0.6 ms -> 23.5 ms +- 0.3 ms: 1.01x faster
  • pathlib: 20.9 ms +- 0.6 ms -> 20.6 ms +- 0.5 ms: 1.01x faster
  • regex_effbot: 3.11 ms +- 0.08 ms -> 3.08 ms +- 0.03 ms: 1.01x faster
  • xml_etree_generate: 79.9 ms +- 1.0 ms -> 79.1 ms +- 1.0 ms: 1.01x faster
  • chameleon: 7.19 ms +- 0.13 ms -> 7.13 ms +- 0.10 ms: 1.01x faster
  • 2to3: 272 ms +- 2 ms -> 269 ms +- 2 ms: 1.01x faster
  • xml_etree_process: 57.1 ms +- 0.8 ms -> 56.7 ms +- 0.7 ms: 1.01x faster

Benchmark hidden because not significant (15): logging_silent, meteor_contest, pyflate, python_startup, python_startup_no_site, richards, scimark_fft, scimark_sparse_mat_mult, sympy_integrate, sympy_sum, telco, unpack_sequence, unpickle, xml_etree_parse, xml_etree_iterparse

Geometric mean: 1.01x faster

https://bugs.python.org/issue46420

@markshannon
Copy link
Member

It is good to see that removing many tracing checks can produce a measurable speedup.
Unfortunately, I don't think this is correct. It is possible for tracing to be turned on during several of these bytecodes.

Most obviously, CALL_NO_KW_BUILTIN_CLASS_1 and CALL_NO_KW_BUILTIN_FAST could end up calling sys.settrace.
More subtly, any allocation can trigger the cycle collector, which can execute arbitrary code.

The approach I want to take, although it needs more work, is to only check for tracing in the RESUME instruction by using instrumentation instead.

See faster-cpython/ideas#112 and PEP 669.
Hopefully, using instrumentation should provide at least the speedup you have observed, and reduce the performance loss when tracing is used.

@sweeneyde
Copy link
Member Author

Since STORE_FAST__STORE_FAST calls SETLOCAL, which can call Py_DECREF, which can execute arbitrary __del__ methods, does that mean that existing usage of NOTRACE_DISPATCH is mildly incorrect as well?

@markshannon
Copy link
Member

Perhaps.
I think it is OK, because the __del__ is not specified to occur at any specific time, or in any particular order.
So, the behavior is the same as if we checked for tracing on every dispatch.

Most of your changes are also probably safe. Probably. It is hard to be sure.
Although I do think the changes to CALL specializations are incorrect.

The instrumentation approach is much more robust. At least, I think it is.

…UILTIN_CLASS_1,

CALL_NO_KW_BUILTIN_O, CALL_NO_KW_BUILTIN_FAST, CALL_NO_KW_LEN, CALL_NO_KW_ISINSTANCE.

Use NOTRACE_DISPATCH on CALL_NO_KW_LIST_APPEND.
@markshannon
Copy link
Member

Looks good. All the new uses of NOTRACE_DISPATCH() are consistent with existing usage.

@markshannon markshannon merged commit 96bf84d into python:main Jan 25, 2022
@sweeneyde sweeneyde deleted the notrace branch January 25, 2022 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage skip news
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants