Support Python 3 in the benchmark suite #30085


Merged
merged 1 commit into swiftlang:master on Apr 3, 2020

Conversation

broadwaylamb
Contributor

@broadwaylamb commented Feb 26, 2020

I've tested this patch with both Python 2 and 3, and it seems to work as expected.

cc @gottesmm @palimondo

@compnerd
Member

@swift-ci please benchmark

@palimondo
Contributor

Could we please hold on this for a moment? It builds on #29719 which I’d like to back out in order to land #26462 first.

@swift-ci
Contributor

Performance: -O

Regression OLD NEW DELTA RATIO
FlattenListLoop 2717 3257 +19.9% 0.83x (?)
FlattenListFlatMap 5259 5949 +13.1% 0.88x (?)
 
Improvement OLD NEW DELTA RATIO
ArrayPlusEqualSingleElementCollection 470 423 -10.0% 1.11x (?)

Code size: -O

Performance: -Osize

Code size: -Osize

Performance: -Onone

Code size: -swiftlibs

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@broadwaylamb
Contributor Author

Could we please hold on this for a moment? It builds on #29719 which I’d like to back out in order to land #26462 first.

Sure! I’ll try to do the same for other Python scripts for the time being.

@gottesmm
Contributor

@palimondo I don't think it is reasonable to block the Python 2 -> Python 3 transition because of an out-of-tree patch that will not land promptly. The last substantive comment on that other PR is from November, and the patch stream is from September. It is almost March; that is too long to stop progress here IMO, especially since Python 2 is EOL and Swift needs to finish transitioning ASAP.

@compnerd
Member

@palimondo, the measurement improvements are extremely valuable and look like a move in the right direction. I think the Python 2 -> Python 3 conversion is extremely important as well, though. Python 2 has been EOL'd, and there are other places where the conversion is absolutely critical (e.g. lldb requires Python 3 support, but parts of the build are still python2-dependent). I realize that you have been working on the patch set for quite some time, and it does seem to be nearing where you want it to be. Perhaps you could split the work up so that some of the earlier pieces can be rebased and merged into the tree standalone, allowing both sides to make progress? I believe we need to continue to make progress here in order to keep both the build and the debugging scenarios healthy.

@compnerd
Member

@swift-ci please smoke test

@compnerd
Member

@swift-ci please test Windows platform

@palimondo
Contributor

palimondo commented Feb 27, 2020

@gottesmm @compnerd How about you give me this weekend to finish the patch?

@gottesmm
Contributor

@palimondo How big of a change are we talking about? We depend on this infrastructure, so we can't just throw things in. We need to be careful.

@gottesmm
Contributor

@eeckstein I noticed that you are the original reviewer of this code. What are your thoughts?

@palimondo
Contributor

@gottesmm did you mean to ask @eeckstein the above in #26462?

@broadwaylamb
Contributor Author

The tests have failed because the linter complained. I'll update the patch.

@broadwaylamb
Contributor Author

It should be fixed now, could anyone please trigger the CI again?

@compnerd
Member

@swift-ci please smoke test

Comment on lines -490 to +507
runtime = min(
    [
        (result.samples.min - correction)
        for i_series in [
            BenchmarkDoctor._select(measurements, num_iters=i)
            for correction in [(setup / i) for i in [1, 2]]
        ]
        for result in i_series
    ]
)

runtimes = []
for i in range(1, 3):
    correction = setup / i
    i_series = BenchmarkDoctor._select(measurements, num_iters=i)
    for result in i_series:
        runtimes.append(result.samples.min - correction)
runtime = min(runtimes)
Contributor Author

@broadwaylamb commented Feb 27, 2020


I'd like whoever reviews this to pay attention to this particular change.

The old version relied on a Python 2-only feature where the variables defined in a list comprehension leak into the enclosing function scope. But I believe this was erroneous: since list comprehensions in Python 2 are eagerly evaluated, the value of i was always 2 by the time the BenchmarkDoctor._select(measurements, num_iters=i) expression was evaluated. Similarly, when evaluating (result.samples.min - correction), the value of correction was always the same, namely the last value it took while evaluating this expression:

[
    BenchmarkDoctor._select(measurements, num_iters=i)
    for correction in [(setup / i) for i in [1, 2]]
]

Here, I've rewritten it to be a stupid old for loop, which IMHO reads way better, and also seems to behave correctly.

Contributor


Hmmm… I’ll need to check this.

Contributor


I’ve corrected the Python 3 incompatibility by explicitly propagating the variables from the nested scopes. I don’t see any issue of incorrect evaluation. Here’s a standalone, simplified test case based on the original source code, before the automatic formatter made a mess of it in #29719:

$ python
Python 2.7.10 (default, Feb 22 2019, 21:55:15) 
[GCC 4.2.1 Compatible Apple LLVM 10.0.1 (clang-1001.0.37.14)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> setup = 134
>>> measurements = {1: [[324], [326], [325]], 2: [[257], [259], [258]]}
>>> 
>>> 
>>> def select(measurements, num_iters):
...     return measurements[num_iters]
... 
>>> 
>>> runtimesP3 = [
...     (result[0] - correction) for correction, i_series in
...     [(correction, select(measurements, num_iters=i))
...      for i, correction in [(i, setup // i) for i in [1, 2]]
...      ] for result in i_series]
>>> 
>>> runtimes = [
...     (result[0] - correction) for i_series in
...     [select(measurements, num_iters=i)
...      for correction in [(setup / i) for i in [1, 2]]
...      ] for result in i_series]
>>> 
>>> runtimes == runtimesP3
True
>>> runtimesP3
[190, 192, 191, 190, 192, 191]

It works correctly in Python 3, too:

$ python3
Python 3.6.5 (default, May 24 2018, 10:57:11) 
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> setup = 134
>>> measurements = {1: [[324], [326], [325]], 2: [[257], [259], [258]]}
>>> 
>>> 
>>> def select(measurements, num_iters):
...     return measurements[num_iters]
... 
>>> 
>>> runtimesP3 = [
...     (result[0] - correction) for correction, i_series in
...     [(correction, select(measurements, num_iters=i))
...      for i, correction in [(i, setup // i) for i in [1, 2]]
...      ] for result in i_series]
>>> 
>>> runtimesP3
[190, 192, 191, 190, 192, 191]

I’ll try to incorporate the Python 3 changes in #26462 (it already has a fix for the assertEqualtats typo…).

Comment on lines -575 to +585
setup = min([result.setup for result in BenchmarkDoctor._select(measurements)])
setup = min(
    [result.setup or 0 for result in BenchmarkDoctor._select(measurements)]
)
Contributor Author


None compares less than anything in Python 2. In Python 3, None is not comparable at all, so I needed to provide a reasonable default value. 0 seems okay, since this function only has an effect if setup is greater than 200000.
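
A quick standalone illustration (not from the patch) of the difference:

    # Python 2: None sorts below every number, so min() can silently return it.
    min([None, 5, 3])    # Python 2: None

    # Python 3: ordering comparisons involving None raise a TypeError instead:
    # min([None, 5, 3])  # TypeError: '<' not supported ...

    # Substituting a default before comparing mirrors the `or 0` fix above:
    print(min(s or 0 for s in [None, 5, 3]))  # 0 on both versions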

Comment on lines +140 to +142
return DTraceResult(
    test_name, int(not foundInstability), results, self.csv_output
)
Contributor Author


DTraceResult's initializer takes exactly 4 arguments. I don't know why it worked before.

@compnerd
Member

@swift-ci please benchmark

@swift-ci
Contributor

Performance: -O

Regression OLD NEW DELTA RATIO
ObjectiveCBridgeFromNSArrayAnyObjectForced 4880 5420 +11.1% 0.90x (?)
 
Improvement OLD NEW DELTA RATIO
FlattenListFlatMap 9367 8518 -9.1% 1.10x (?)

Code size: -O

Performance: -Osize

Improvement OLD NEW DELTA RATIO
ObjectiveCBridgeStubFromNSStringRef 174 158 -9.2% 1.10x (?)
ObjectiveCBridgeFromNSArrayAnyObjectForced 5440 5080 -6.6% 1.07x (?)

Code size: -Osize

Performance: -Onone

Improvement OLD NEW DELTA RATIO
TypeFlood 217 199 -8.3% 1.09x (?)
DictionaryBridgeToObjC_Access 1077 1002 -7.0% 1.07x (?)

Code size: -swiftlibs

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@palimondo
Contributor

@swift-ci test

@palimondo
Contributor

palimondo commented Mar 1, 2020

I’m porting these changes into my branch, but I got stuck a bit on the differences between open in Python 2 and 3 (the need to use encoding='utf-8' on the latter) when running test_compare_perf_tests.py. Looks like there’s no elegant solution except explicit version checks… @broadwaylamb, did you work around that in some other way?

@broadwaylamb
Contributor Author

@palimondo can you give an example? What issue are you experiencing, and why do you need to use encoding='utf-8'?

@palimondo
Contributor

palimondo commented Mar 1, 2020

@broadwaylamb The way I’m testing the Python 3 port, as well as continued support for Python 2.7, is by adjusting the shebang between #!/usr/bin/env python and #!/usr/bin/env python3 and running the unit tests, in this case test_compare_perf_tests.py. There is a read/write pair of open calls between compare_perf_tests.py’s main function and execute_main_with_format on the Test_compare_perf_tests_main class in test_compare_perf_tests.py. Switching to python3 with the current code was causing encoding (UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 1064: ordinal not in range(128)) and decoding (UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1064: ordinal not in range(128)) issues.

So far I have this solution and was wondering if that’s the best it can get:

    if args.output:
        if sys.version_info < (3, 0):
            with open(args.output, 'w') as f:
                f.write(report)
        else:
            with open(args.output, 'w', encoding='utf-8') as f:
                f.write(report)

Other solutions I found on Stack Overflow weren’t much better, IMO.
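
One version-agnostic alternative (a sketch, not necessarily what landed in the PR) would be io.open, which accepts encoding= on both Python 2 and 3; on Python 3 it is simply the builtin open. The caveat is that in text mode on Python 2 it expects unicode strings, so report may need decoding first:

    import io

    # io.open takes encoding= on Python 2.6+ and Python 3 alike,
    # avoiding the explicit version check above.
    if args.output:
        with io.open(args.output, 'w', encoding='utf-8') as f:
            f.write(report)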

@eeckstein
Contributor

@gottesmm I'm in favor of landing this change and not waiting for #26462.

@broadwaylamb
Contributor Author

@palimondo sorry, I haven't been able to reproduce this. It's all working for me with Python 3 even without encoding='utf-8'.

@broadwaylamb
Contributor Author

By the way, this PR doesn't change shebangs to use python3, the CI builders will still be using python.

This makes it easy to accidentally break Python 3 compatibility. What do you people think about switching the CI to use Python 3 for the benchmarking suite?

@eeckstein
Contributor

What do you people think about switching the CI to use Python 3 for the benchmarking suite?

@shahmishal Any thoughts on this?

@palimondo
Contributor

palimondo commented Mar 2, 2020

@eeckstein Is there a deadline for Python 3 compatibility you are trying to meet?
I think I could split #26462 into two parts:

@shahmishal
Member

What do you people think about switching the CI to use Python 3 for the benchmarking suite?

@shahmishal Any thoughts on this?

I don't mind if you switch to using python3; however, python3 will only be available via xcrun python3 on macOS nodes.

@eeckstein
Contributor

eeckstein commented Mar 3, 2020

Can we do something like that inside the script (pseudocode):

if running on python2 on mac:
  invoke this script with 'xcrun python3'
  exit

With "script", I mean e.g. 'run_smoke_bench'

@palimondo
Contributor

Or… we could rely on @swift-ci’s test command to catch the problems, if we modified test/benchmark/benchmark-scripts.test-sh from

// RUN: %{python} -m unittest discover -s %swift_src_root/benchmark/scripts/

to

// RUN: %{xcrun python3} -m unittest discover -s %swift_src_root/benchmark/scripts/

@broadwaylamb
Contributor Author

Anyway, if we do this (switch to Python 3), it'd probably make sense to make this change in a separate PR so it can easily be reverted.

What's the plan here? Are we waiting for #30149 and #30158 to land first?

@compnerd
Member

compnerd commented Mar 4, 2020

@swift-ci please test Windows platform

@palimondo
Contributor

@compnerd Does the Windows test run the benchmarks? If so, what version is used to run those python scripts? (I assume that python isn’t shipped on Windows and you control the configuration?)

@compnerd
Member

compnerd commented Mar 5, 2020

@palimondo no, the current scripts don't run it - Windows uses Python 2.7 and Python 3.7.5 IIRC - those are the versions shipped as part of VS2019. Nearly the entirety of the build/test infrastructure is whatever Visual Studio is shipping. Unfortunately, that becomes really slow, so the one special case is CMake, which is 3.16.x rather than the Visual Studio version (3.15.x).

@broadwaylamb
Contributor Author

What would be the plan regarding this change?

@compnerd
Member

compnerd commented Apr 3, 2020

CC: @drexin I expect that this should be safe to merge, thoughts?

@drexin
Contributor

drexin commented Apr 3, 2020

No objections from my end. @shahmishal Are you okay with merging this?

@shahmishal merged commit df53c20 into swiftlang:master on Apr 3, 2020
@shahmishal
Member

Thanks @broadwaylamb!

@broadwaylamb deleted the benchmark-python-3 branch on April 4, 2020 at 08:33