
Simplify vector_norm() main loop and eliminate branch misprediction costs #9006


Merged: 6 commits merged into python:master on Aug 31, 2018

Conversation


@rhettinger rhettinger commented Aug 29, 2018

Summary

Instead of searching for and skipping over the value equal to
max, just start the accumulation at 1.0 and subtract it
back out at the end.

This eliminates the contortion of swapping max with the last
element, making the loop body simpler and branchless. Timings show
that the cost of an extra loop iteration is more than offset by
avoiding branch misprediction costs.

Since the loop body no longer has to search for values equal to max,
the vector values no longer need to be restricted to non-negative
values.
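The idea can be sketched in pure Python (a model of the patched C routine's logic, not the implementation itself; the name vector_norm here is illustrative):

```python
from math import sqrt

def vector_norm(vec):
    # Scale by the largest magnitude to avoid overflow/underflow.
    max_ = max(abs(x) for x in vec)
    if max_ == 0.0:
        return 0.0
    # Seed the accumulator with 1.0 instead of skipping the max entry;
    # the max entry then contributes exactly (max/max)**2 == 1.0 in the
    # loop, and the extra 1.0 is subtracted back out at the end.
    csum, frac = 1.0, 0.0
    for x in vec:
        t = x / max_
        t *= t
        # Compensated summation: frac captures the rounding error of
        # each addition into csum.
        old = csum
        csum += t
        frac += (old - csum) + t
    return max_ * sqrt(csum - 1.0 + frac)
```

Because the loop squares each scaled value, negative inputs are handled naturally, which is why the non-negativity restriction can be dropped.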

Main loop disassembly

The main loop is much cleaner now (one memory read per loop, memory
accesses are for consecutive memory locations, and there are no branches,
register spills, or reloads).

L379:
    movsd   (%rax), %xmm2
    addq    $8, %rax
    cmpq    %rax, %rdx
    divsd   %xmm1, %xmm2
    mulsd   %xmm2, %xmm2
    movapd  %xmm2, %xmm3
    addsd   %xmm0, %xmm3
    subsd   %xmm3, %xmm0
    addsd   %xmm0, %xmm2
    movapd  %xmm3, %xmm0
    addsd   %xmm2, %xmm4
    jne L379
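One iteration of that loop can be read as the compensated-summation update. Roughly (an informal register reading, not authoritative: %xmm1 holds max, %xmm0 the running csum, %xmm4 the error accumulator frac):

```python
def loop_body(x, max_, csum, frac):
    # Informal Python rendering of one iteration of the assembly above.
    t = x / max_             # divsd  %xmm1, %xmm2
    t *= t                   # mulsd  %xmm2, %xmm2
    new = csum + t           # movapd + addsd     -> %xmm3
    err = (csum - new) + t   # subsd + addsd: rounding error of the add
    return new, frac + err   # movapd %xmm3, %xmm0 / addsd %xmm2, %xmm4
```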

Timings

Performance is the same or slightly improved for various numbers of arguments to hypot():

$ py time_hypot_max_location.py    # baseline 1
0.19610544399999985
$ py time_hypot_max_location.py    # baseline 2
0.2929759919999999
$ py time_hypot_max_location.py    # baseline 3
0.329982003
$ py time_hypot_max_location.py    # baseline 4
0.3638649850000002
$ py time_hypot_max_location.py    # baseline 5
0.398908402
$ py time_hypot_max_location.py    # baseline 10
0.5381930719999999
$ py time_hypot_max_location.py    # baseline 15
0.714894589
$ py time_hypot_max_location.py    # baseline 30
1.4070674050000012
$ py time_hypot_max_location.py    # baseline 60
2.381592201
$ py time_hypot_max_location.py    # baseline 120
5.435193673
$ py time_hypot_max_location.py    # baseline 240
9.103425692999991
------------------------

$ py time_hypot_max_location.py    # patched 1
0.18985156299999995
$ py time_hypot_max_location.py    # patched 2
0.2807232770000001
$ py time_hypot_max_location.py    # patched 3
0.31244054500000007
$ py time_hypot_max_location.py    # patched 4
0.34701948800000004
$ py time_hypot_max_location.py    # patched 5
0.381604936
$ py time_hypot_max_location.py    # patched 10
0.5410756759999997
$ py time_hypot_max_location.py    # patched 15
0.7458159609999999
$ py time_hypot_max_location.py    # patched 30
1.3642926619999995
$ py time_hypot_max_location.py    # patched 60
2.397398169999999
$ py time_hypot_max_location.py    # patched 120
5.419441274999997
$ py time_hypot_max_location.py    # patched 240
9.029506676999997

Timing Code

'Make sure branch misprediction gets included in timings'

from random import randrange
from itertools import starmap
from collections import deque
from math import hypot
from timeit import repeat

def consume(it):
    deque(it, maxlen=0)

def one_row(n, zero=0.0, one=1.0):
    'Return an n-length tuple with the max value in a random position'
    vec = [zero] * n
    vec[randrange(n)] = one
    return tuple(vec)

def trial(coordinates):
    consume(starmap(hypot, coordinates))

setup = '''
from __main__ import one_row, trial

trials = 10_000
n = 1
coordinates = [one_row(n) for i in range(trials)]
'''

stmts = '''
trial(coordinates)
'''

print(min(repeat(stmts, setup, repeat=9, number=1_000)))

Member

@tim-one tim-one left a comment


Well, I don't know that starting with 1.0 and subtracting 1.0 at the end is always wholly harmless, but it sure can't matter much. While it's true that subtracting 1.0 at the end is exact, that can move the result into a smaller binade, forcing the last bit to 0 when it otherwise may not have been.

Concretely, consider the two arguments 1.0 and 2**-26. Square both and add and you get 1.0 + 2**-52, the smallest representable double larger than 1.0. But if the sum starts at 1, then 1 + 1 + 2**-52 (in any order) rounds to exactly 2.0, so subtracting 1 at the end gives exactly 1.0. In that specific example, it doesn't matter after the last step, because sqrt(1.0) and sqrt(1 + 2**-52) both round to 1.0.

But given that we're also doing compensated summation, in that case, while csum ends up as 2.0, frac will contain the otherwise-lost 2**-52, and csum - 1.0 + frac ends up as 1.0 + 2**-52 anyway.
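That recovery can be checked directly (a quick verification sketch in Python):

```python
# Verify the 1.0 and 2**-26 scenario: seeding the sum with 1.0 rounds
# csum up to exactly 2.0, but the compensation term recovers the
# otherwise-lost 2**-52.
t1, t2 = 1.0, 2**-26
true_sumsq = 1.0 + 2**-52            # exact sum of squares
assert t1*t1 + t2*t2 == true_sumsq

csum, frac = 1.0, 0.0                # seed with 1.0
for t in (t1*t1, t2*t2):
    old = csum
    csum += t
    frac += (old - csum) + t         # capture the rounding error
```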

Which is a long way of saying "sure, ship it!" ;-)

Contributor Author

rhettinger commented Aug 30, 2018

FWIW, I'm not confident this is the best way to go. The code looks simpler with the patch, but the concept of "add 1.0 at the beginning and remove it at the end" feels like more of a trick than "skip over the max entry".

Also, I don't really like having the extra loop iteration, so that the n==2 case runs two divide-square-add steps rather than one. OTOH, I do like being able to relax the requirement that the input vector consist of non-negative values, and I like that the main loop becomes branchless.

Which do you prefer, the current baseline version or the subtract-one-at-the-end version?

@rhettinger rhettinger merged commit 745c0f3 into python:master Aug 31, 2018
@rhettinger rhettinger deleted the hypot_no_max_test branch August 31, 2018 18:22
Labels: performance (Performance or resource usage), skip issue, skip news