
Simplify vector_norm() main loop and eliminate branch misprediction costs #9006


Merged: 6 commits merged into python:master on Aug 31, 2018

Conversation


@rhettinger rhettinger commented Aug 29, 2018

Summary

Instead of searching for and skipping over the value equal to
max, just start the accumulation at 1.0 and subtract it
back out at the end.

This eliminates the contortion of swapping max with the last
element, making the loop body simpler and branchless. Timings show
that the cost of an extra loop iteration is more than offset by
avoiding branch misprediction costs.

Since the loop body no longer has to search for values equal to max,
the vector values no longer need to be restricted to non-negative
values.
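The idea can be sketched in pure Python (a model of the patched C routine's logic, not the implementation itself; the name vector_norm here is illustrative):

```python
from math import sqrt

def vector_norm(vec):
    # Scale by the largest magnitude to avoid overflow/underflow.
    max_ = max(abs(x) for x in vec)
    if max_ == 0.0:
        return 0.0
    # Seed the accumulator with 1.0 instead of skipping the max entry;
    # the max entry then contributes exactly (max/max)**2 == 1.0 in the
    # loop, and the extra 1.0 is subtracted back out at the end.
    csum, frac = 1.0, 0.0
    for x in vec:
        t = x / max_
        t *= t
        # Compensated summation: frac captures the rounding error of
        # each addition into csum.
        old = csum
        csum += t
        frac += (old - csum) + t
    return max_ * sqrt(csum - 1.0 + frac)
```

Because the loop squares each scaled value, negative inputs are handled naturally, which is why the non-negativity restriction can be dropped.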

Main loop disassembly

The main loop is much cleaner now (one memory read per loop, memory
accesses are for consecutive memory locations, and there are no branches,
register spills, or reloads).

L379:
    movsd   (%rax), %xmm2
    addq    $8, %rax
    cmpq    %rax, %rdx
    divsd   %xmm1, %xmm2
    mulsd   %xmm2, %xmm2
    movapd  %xmm2, %xmm3
    addsd   %xmm0, %xmm3
    subsd   %xmm3, %xmm0
    addsd   %xmm0, %xmm2
    movapd  %xmm3, %xmm0
    addsd   %xmm2, %xmm4
    jne L379
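One iteration of that loop can be read as the compensated-summation update. Roughly (an informal register reading, not authoritative: %xmm1 holds max, %xmm0 the running csum, %xmm4 the error accumulator frac):

```python
def loop_body(x, max_, csum, frac):
    # Informal Python rendering of one iteration of the assembly above.
    t = x / max_             # divsd  %xmm1, %xmm2
    t *= t                   # mulsd  %xmm2, %xmm2
    new = csum + t           # movapd + addsd     -> %xmm3
    err = (csum - new) + t   # subsd + addsd: rounding error of the add
    return new, frac + err   # movapd %xmm3, %xmm0 / addsd %xmm2, %xmm4
```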

Timings

Performance is the same or slightly improved for various numbers of arguments to hypot():

$ py time_hypot_max_location.py    # baseline 1
0.19610544399999985
$ py time_hypot_max_location.py    # baseline 2
0.2929759919999999
$ py time_hypot_max_location.py    # baseline 3
0.329982003
$ py time_hypot_max_location.py    # baseline 4
0.3638649850000002
$ py time_hypot_max_location.py    # baseline 5
0.398908402
$ py time_hypot_max_location.py    # baseline 10
0.5381930719999999
$ py time_hypot_max_location.py    # baseline 15
0.714894589
$ py time_hypot_max_location.py    # baseline 30
1.4070674050000012
$ py time_hypot_max_location.py    # baseline 60
2.381592201
$ py time_hypot_max_location.py    # baseline 120
5.435193673
$ py time_hypot_max_location.py    # baseline 240
9.103425692999991
------------------------

$ py time_hypot_max_location.py    # patched 1
0.18985156299999995
$ py time_hypot_max_location.py    # patched 2
0.2807232770000001
$ py time_hypot_max_location.py    # patched 3
0.31244054500000007
$ py time_hypot_max_location.py    # patched 4
0.34701948800000004
$ py time_hypot_max_location.py    # patched 5
0.381604936
$ py time_hypot_max_location.py    # patched 10
0.5410756759999997
$ py time_hypot_max_location.py    # patched 15
0.7458159609999999
$ py time_hypot_max_location.py    # patched 30
1.3642926619999995
$ py time_hypot_max_location.py    # patched 60
2.397398169999999
$ py time_hypot_max_location.py    # patched 120
5.419441274999997
$ py time_hypot_max_location.py    # patched 240
9.029506676999997

Timing Code

'Make sure branch misprediction gets included in timings'

from random import randrange
from itertools import starmap
from collections import deque
from math import hypot
from timeit import repeat

def consume(it):
    deque(it, maxlen=0)

def one_row(n, zero=0.0, one=1.0):
    'Return an n-length tuple with the max value in a random position'
    vec = [zero] * n
    vec[randrange(n)] = one
    return tuple(vec)

def trial(coordinates):
    consume(starmap(hypot, coordinates))

setup = '''
from __main__ import one_row, trial

trials = 10_000
n = 1
coordinates = [one_row(n) for i in range(trials)]
'''

stmts = '''
trial(coordinates)
'''

print(min(repeat(stmts, setup, repeat=9, number=1_000)))

Member

@tim-one tim-one left a comment


Well, I don't know that starting with 1.0 and subtracting 1.0 at the end is always wholly harmless, but it sure can't matter much. While it's true that subtracting 1.0 at the end is exact, that can move the result into a smaller binade, forcing the last bit to 0 when it otherwise may not have been.

Concretely, consider the two arguments 1.0 and 2**-26. Square both and add and you get 1.0 + 2**-52, the smallest representable double larger than 1.0. But if the sum starts at 1, then 1 + 1 + 2**-52 (in any order) rounds to exactly 2.0, so subtracting 1 at the end gives exactly 1.0. In that specific example, it doesn't matter after the last step, because sqrt(1.0) and sqrt(1 + 2**-52) both round to 1.0.

But given that we're also doing compensated summation, in that case, while csum ends up as 2.0, frac will contain the otherwise-lost 2**-52, and csum - 1.0 + frac ends up as 1.0 + 2**-52 anyway.
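That recovery can be checked directly (a quick verification sketch in Python):

```python
# Verify the 1.0 and 2**-26 scenario: seeding the sum with 1.0 rounds
# csum up to exactly 2.0, but the compensation term recovers the
# otherwise-lost 2**-52.
t1, t2 = 1.0, 2**-26
true_sumsq = 1.0 + 2**-52            # exact sum of squares
assert t1*t1 + t2*t2 == true_sumsq

csum, frac = 1.0, 0.0                # seed with 1.0
for t in (t1*t1, t2*t2):
    old = csum
    csum += t
    frac += (old - csum) + t         # capture the rounding error
```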

Which is a long way of saying "sure, ship it!" ;-)

Contributor Author

rhettinger commented Aug 30, 2018

FWIW, I'm not confident this is the best way to go. The code looks simpler with the patch, but the concept of "add 1.0 at the beginning and remove it at the end" feels like more of a trick than "skip over the max entry".

Also, I don't really like having the extra loop iteration, so that the n==2 case runs two divide-square-add steps rather than one. OTOH, I do like being able to relax the requirement that the input vector consist of non-negative values, and I like that the main loop becomes branchless.

Which do you prefer, the current baseline version or the subtract-one-at-the-end version?

@rhettinger rhettinger merged commit 745c0f3 into python:master Aug 31, 2018
@rhettinger rhettinger deleted the hypot_no_max_test branch August 31, 2018 18:22
Labels: performance (Performance or resource usage), skip issue, skip news