
bpo-35696: Remove unnecessary operation in long_compare() #16146


Merged
6 commits merged into python:master on Sep 18, 2019

Conversation

@hongweipeng (Contributor) commented Sep 14, 2019

$ ./python -m pyperf timeit -s "a = 100; b = 99" "a == b" --compare-to=../cpython-master/python --duplicate=1000
/home/weapon/workspace/cpython-master/python: ..................... 13.0 ns +- 0.3 ns
/home/weapon/workspace/cpython/python: ..................... 12.7 ns +- 0.1 ns

Mean +- std dev: [/home/weapon/workspace/cpython-master/python] 13.0 ns +- 0.3 ns -> [/home/weapon/workspace/cpython/python] 12.7 ns +- 0.1 ns: 1.02x faster (-2%)

$ ./python -m pyperf timeit -s "a = 2 ** 100 + 1; b = 2 ** 100" "a == b" --compare-to=../cpython-master/python --duplicate=1000
/home/weapon/workspace/cpython-master/python: ..................... 14.1 ns +- 0.2 ns
/home/weapon/workspace/cpython/python: ..................... 13.6 ns +- 0.1 ns

Mean +- std dev: [/home/weapon/workspace/cpython-master/python] 14.1 ns +- 0.2 ns -> [/home/weapon/workspace/cpython/python] 13.6 ns +- 0.1 ns: 1.04x faster (-4%)

long_compare

https://bugs.python.org/issue35696

@shihai1991 (Member) left a comment

Your PR looks OK without a test case; maybe someday the performance benchmarks could become a test suite too ;)


@eduardo-elizondo (Contributor)

Only using pyperf will not give a full picture of why the performance is better. For instance, maybe your change created a better code alignment for your particular architecture, resulting in a slight speedup. However, that doesn't mean that it will be a perf win for all other systems.

Also, it's not clear what your build environment is. How are you building? Is this a debug or release build? Are you using LTO and PGO?

If you want to keep working on this, I would suggest:

  • Try to build an environment to get a stable benchmark. Follow @vstinner's guide: https://vstinner.github.io/journey-to-stable-benchmark-system.html
  • Clearly outline your build steps and your environment in the summary
  • Give more information than just time. For instance, what is the instruction count? Branch misses? Instruction cache misses?
  • Following up on the previous point, also gather that data on a simulated machine, for instance with Valgrind's cachegrind.

Ideally this should all be handled by a common set of benchmark machines that could be triggered by the PR. Unfortunately that's not available yet. So doing this will make it really clear if this is indeed a perf win.

Cheers,
Eddie

@eduardo-elizondo (Contributor) left a comment

Requesting changes to get more data points

@sir-sigurd (Contributor) left a comment

It looks like it doesn't work correctly when Py_SIZE(a) - Py_SIZE(b) is out of int range.


@aeros aeros added the performance Performance or resource usage label Sep 17, 2019
@hongweipeng (Contributor, Author)

The optimization comes from removing unnecessary operations. I am not good at benchmarking; maybe changing the title from "improving performance" to "removing unnecessary operations" would be better.

It looks like it doesn't work correctly when Py_SIZE(a) - Py_SIZE(b) is out of int range.

Good eye! Maybe a new PR can be submitted to avoid the overflow.

If a == 0 and b == 0, the index is out of bounds:

This case is already handled upstream:

long_richcompare(PyObject *self, PyObject *other, int op)
{
    int result;
    CHECK_BINOP(self, other);
    if (self == other)
        result = 0;
    else
        result = long_compare((PyLongObject*)self, (PyLongObject*)other);
    Py_RETURN_RICHCOMPARE(result, 0, op);
}

But it does have hidden dangers, for example if small_ints is disabled. The other places where long_compare() is called also need to be checked.

@aeros (Contributor) left a comment

Thanks for the PR @hongweipeng.

I would highly recommend following through with the suggestions from @eduardo-elizondo, to gather more well-rounded benchmarking data and verify that the optimizations are present across other environments.

Also, I have a minor formatting suggestion for the news entry.

@ghost commented Sep 17, 2019

It looks like it doesn't work correctly when Py_SIZE(a) - Py_SIZE(b) is out of int range.

Good eye! Maybe a new PR can be submitted to avoid the overflow.

There should be no problem.

In a 32-bit build, the entire memory space is 0~0xFFFFFFFF, so the memory used by two allocated int objects together cannot exceed this value.

Py_SIZE(a) is the number of digit elements, and a digit is a 2-byte unsigned short, so abs(Py_SIZE(a)) + abs(Py_SIZE(b)) is always < 0xFFFFFFFF/2.

This means that Py_SIZE(a) - Py_SIZE(b) will never overflow.
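
A back-of-the-envelope version of this bound, as a standalone sketch; the 2-byte digit size and 4 GiB address space come from the statement above, everything else is hypothetical:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* On a 32-bit build the whole address space is 4 GiB and each digit
       occupies 2 bytes, so even if every byte of memory held digits, two
       objects together could not have more than 2^31 of them. */
    uint64_t address_space_bytes = UINT64_C(1) << 32;
    uint64_t digit_size_bytes    = 2;
    uint64_t max_combined_digits = address_space_bytes / digit_size_bytes;

    /* Object headers and the rest of the process take space too, so the
       real combined count is well below 2^31, and Py_SIZE(a) - Py_SIZE(b)
       fits in a 32-bit int on such a build. */
    printf("max combined digit count: %llu\n",
           (unsigned long long)max_combined_digits);
    return 0;
}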

If a == 0 and b == 0, the index is out of bounds:

This case is already handled upstream:

long_richcompare(PyObject *self, PyObject *other, int op)
{
    int result;
    CHECK_BINOP(self, other);
    if (self == other)
        result = 0;
    else
        result = long_compare((PyLongObject*)self, (PyLongObject*)other);
    Py_RETURN_RICHCOMPARE(result, 0, op);
}

But it does have hidden dangers, for example if small_ints is disabled. The other places where long_compare() is called also need to be checked.

Other long_compare() call sites in longobject.c don't check (self == other)

@sir-sigurd (Contributor)

There should be no problem.

In a 32-bit build, the entire memory space is 0~0xFFFFFFFF, so the memory used by two allocated int objects together cannot exceed this value.

Py_SIZE(a) is the number of digit elements, and a digit is a 2-byte unsigned short, so abs(Py_SIZE(a)) + abs(Py_SIZE(b)) is always < 0xFFFFFFFF/2.

This means that Py_SIZE(a) - Py_SIZE(b) will never overflow.

long_compare() returns int. On a 64-bit build Py_SIZE(a) - Py_SIZE(b) can be out of the range of int, can't it? What will happen in such a case, for example when Py_SIZE(a) - Py_SIZE(b) == (1LL << 32)?
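
To make the failure mode concrete, a standalone sketch (hypothetical values; long long stands in for Py_ssize_t on a 64-bit build):

#include <stdio.h>

int main(void)
{
    long long size_a = 1LL << 32;   /* stands in for Py_SIZE(a) */
    long long size_b = 0;           /* stands in for Py_SIZE(b) */

    long long diff = size_a - size_b;

    /* Converting a value that does not fit into int is implementation-
       defined; common compilers simply truncate, so 2^32 becomes 0 and
       two unequal numbers would compare as equal. */
    int truncated = (int)diff;

    printf("diff = %lld, as int = %d\n", diff, truncated);
    return 0;
}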

@ghost commented Sep 17, 2019

long_compare() returns int. On a 64-bit build Py_SIZE(a) - Py_SIZE(b) can be out of the range of int, can't it? What will happen in such a case, for example when Py_SIZE(a) - Py_SIZE(b) == (1LL << 32)?

You are right. We can return Py_ssize_t instead of int.
In addition, maybe we can use an inline function here?

@hongweipeng (Contributor, Author)

Other long_compare() call sites in longobject.c don't check (self == other)

But they won't call it with both a and b being 0.

There are four places that call long_compare() in longobject.c:

1. In long_richcompare().

Already discussed, unless small_ints is disabled.

2. In long_invmod():

if (long_compare(a, (PyLongObject *)_PyLong_One)) {

_PyLong_One is the int 1.

3. In _PyLong_DivmodNear(PyObject *a, PyObject *b):

cmp = long_compare((PyLongObject *)twice_rem, (PyLongObject *)b);

And the caller of _PyLong_DivmodNear() is long_round(), which sets the parameter b to 10:

long_round(PyObject *self, PyObject *args)
{
    result = PyLong_FromLong(10L);
    ...
    temp = _PyLong_DivmodNear(self, result);
}
4. In _PyLong_GCD().

If a and b are both 0, it uses a simple path instead of calling long_compare():

if (-2 <= size_a && size_a <= 2 && -2 <= size_b && size_b <= 2) {
    Py_INCREF(a);
    Py_INCREF(b);
    goto simple;
}

If I missed any other place, please tell me. I am learning how to benchmark and it will take some time; if anyone can help, that would be great.

@sir-sigurd (Contributor) commented Sep 17, 2019

You are right. We can return Py_ssize_t instead of int.

Yes, but we should check that the result is not cast to int at the call sites.

In addition, maybe we can use an inline function here?

I don't think it makes sense.

@hongweipeng (Contributor, Author)

I think it's easier to replace Py_SIZE(a) - Py_SIZE(b) with return Py_SIZE(a) > Py_SIZE(b) ? 1 : -1.
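
A standalone sketch of that alternative (hypothetical helper; ptrdiff_t stands in for Py_ssize_t); the result stays in {-1, 0, 1}, so no overflow is possible:

#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: derive the sign from a direct comparison of the
   two sizes instead of subtracting them. */
static int
compare_sizes(ptrdiff_t size_a, ptrdiff_t size_b)
{
    if (size_a == size_b) {
        return 0;   /* the caller would fall through to the digit loop */
    }
    return size_a > size_b ? 1 : -1;
}

int main(void)
{
    printf("%d %d %d\n",
           compare_sizes(3, 2), compare_sizes(2, 3), compare_sizes(5, 5));
    return 0;
}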

@hongweipeng (Contributor, Author)

Oh, it would be better to return Py_ssize_t instead of int.

@sir-sigurd (Contributor)

@hongweipeng
I'd rewrite it like long_compare1() here: https://godbolt.org/z/djHSPX.

@hongweipeng (Contributor, Author)

@sir-sigurd Thank you very much!

@vstinner (Member)

I'm not sure of the purpose of the whole change since it doesn't seem to make comparison way faster. It's around 1% faster or slower...

cc @serhiy-storchaka

@hongweipeng (Contributor, Author)

The form

while (--i >= 0 && !(diff = (sdigit)a->ob_digit[i] - (sdigit)b->ob_digit[i]));

produces fewer instructions than

while (--i > 0 && a->ob_digit[i] == b->ob_digit[i]);
diff = (sdigit)a->ob_digit[i] - (sdigit)b->ob_digit[i];
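
For illustration, a standalone sketch contrasting the two shapes (digit and sdigit here are stand-ins for CPython's types, the arrays stand in for ob_digit, and this is not the exact PR diff):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef uint16_t digit;    /* CPython digits are 15 or 30 bits wide */
typedef int32_t  sdigit;

/* Shape with a separate subtraction after the scan (roughly the older shape). */
static sdigit
diff_two_steps(const digit *a, const digit *b, ptrdiff_t i)
{
    while (--i >= 0 && a[i] == b[i])
        ;
    return i < 0 ? 0 : (sdigit)a[i] - (sdigit)b[i];
}

/* Shape with the subtraction folded into the loop condition, as quoted above. */
static sdigit
diff_one_step(const digit *a, const digit *b, ptrdiff_t i)
{
    sdigit diff = 0;
    while (--i >= 0 && !(diff = (sdigit)a[i] - (sdigit)b[i]))
        ;
    return diff;
}

int main(void)
{
    digit a[] = {7, 5, 3};    /* least significant digit first, as in CPython */
    digit b[] = {9, 5, 3};
    printf("%d %d\n", diff_two_steps(a, b, 3), diff_one_step(a, b, 3));
    return 0;
}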

@sir-sigurd Can I use your code in the rewrite? I will leave your name in the NEWS.

@sir-sigurd (Contributor)

@sir-sigurd Can I use your code in the rewrite? I will leave your name in the NEWS.

Sure.

@vstinner (Member) left a comment

LGTM if you remove the NEWS entry.

@serhiy-storchaka: I let you decide if this change is worth it or not ;-)

@hongweipeng hongweipeng changed the title bpo-35696:Slightly improve perfomance of long_compare. bpo-35696:Remove unnecessary operation in long_compare(). Sep 18, 2019
@vstinner (Member) left a comment

LGTM.

For whoever is going to merge the change, be careful with the commit message generated by GitHub. I suggest reusing the current PR title:

"bpo-35696: Remove unnecessary operation in long_compare() (GH-16146)"

@serhiy-storchaka, @methane: Would you mind double-checking this change? I prefer not to pretend that it optimizes Python, but it should not make it slower :-)

@methane methane changed the title bpo-35696:Remove unnecessary operation in long_compare(). bpo-35696:Remove unnecessary operation in long_compare() Sep 18, 2019
@methane methane changed the title bpo-35696:Remove unnecessary operation in long_compare() bpo-35696: remove unnecessary operation in long_compare() Sep 18, 2019
@methane methane changed the title bpo-35696: remove unnecessary operation in long_compare() bpo-35696: Remove unnecessary operation in long_compare() Sep 18, 2019
@ghost commented Sep 18, 2019

Will diff overflow?

        sdigit diff = 0;
        ...
        diff = (sdigit) a->ob_digit[i] - (sdigit) b->ob_digit[i];
        ...
        sign = Py_SIZE(a) < 0 ? -diff : diff;

Edit:
It seems it will not overflow.
In a 32-bit build, the minimum possible value of the subtraction is -32766 = 1 - ((1 << 15) - 1).
FYI:

#define PyLong_BASE     ((digit)1 << PyLong_SHIFT)
#define PyLong_MASK     ((digit)(PyLong_BASE - 1))

sdigit (signed short, -32768 <= x <= 32767) just won't overflow.

This code looks fragile, but it works.
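
To make the bound concrete, a standalone sketch (assumes the 15-bit digits used when sdigit is a 16-bit signed type):

#include <stdio.h>

int main(void)
{
    /* With 15-bit digits, PyLong_MASK == 32767 and every digit lies in
       0..32767, so the difference of two digits lies in -32767..32767,
       which fits in a 16-bit signed sdigit (-32768..32767). */
    int pylong_shift = 15;
    int pylong_mask  = (1 << pylong_shift) - 1;

    printf("digit range: 0..%d, diff range: %d..%d\n",
           pylong_mask, -pylong_mask, pylong_mask);
    return 0;
}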

@methane methane merged commit 42acb7b into python:master Sep 18, 2019
@hongweipeng hongweipeng deleted the imporve_long_compare branch September 19, 2019 01:14
@vstinner (Member)

This code looks fragile, but it works.

Would removing the diff variable and reusing sign (Py_ssize_t) instead be less fragile? If so, maybe propose a PR.

@aeros (Contributor) commented Sep 20, 2019

@vstinner:

Would removing the diff variable and reusing sign (Py_ssize_t) instead be less fragile? If so, maybe propose a PR.

In what way could it be considered fragile? The main criticism seemed to be the overflow issue, but that doesn't seem to be a problem.

Also, I'm not certain that I understand how you would get the same functionality if diff was removed, particularly from the while loop:

...
        while (--i >= 0) {
            diff = (sdigit) a->ob_digit[i] - (sdigit) b->ob_digit[i];
            if (diff) {
                break;
            }
        }
...

Note: I'm not experienced with C, so I'm asking the above questions only for learning purposes, not as a criticism of the suggestion.

@ghost commented Sep 20, 2019

@aeros167, please look at this code.

static Py_ssize_t
long_compare(PyLongObject *a, PyLongObject *b)
{
    Py_ssize_t sign = Py_SIZE(a) - Py_SIZE(b);
    if (sign == 0) {
        Py_ssize_t i = Py_ABS(Py_SIZE(a));
        while (--i >= 0) {
            sign = (Py_ssize_t) a->ob_digit[i] - b->ob_digit[i];
            if (sign) {
                break;
            }
        }
        sign = Py_SIZE(a) < 0 ? -sign : sign;
    }
    return sign;
}

Honestly speaking, this PR reduces readability a bit.

@aeros (Contributor) commented Sep 20, 2019

@animalize:

@aeros167, please look at this code.

Thanks for clearing that up! For some reason, I thought an sdigit was still needed and wasn't sure how sign could fill that role (since it's of type Py_ssize_t). I'm gradually improving my knowledge of the C-API (mainly for the purpose of improving my reviews), but it's definitely not an area that I'm experienced in.

@aeros (Contributor) commented Sep 20, 2019

@animalize:

sign = (Py_ssize_t) a->ob_digit[i] - b->ob_digit[i];

Is the second cast to Py_ssize_t not needed?

            sign = (Py_ssize_t) a->ob_digit[i] - (Py_ssize_t) b->ob_digit[i];

@ghost commented Sep 20, 2019

@aeros167

Is the second cast to Py_ssize_t not needed?

The key is applying it to the first operand. IIRC, the cast for the second operand can be omitted.
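
A standalone sketch of why casting only the first operand is enough (hypothetical values; uint32_t stands in for a 30-bit digit and int64_t for Py_ssize_t on a 64-bit build):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t x = 3, y = 30000;   /* stand-ins for two digit values */

    /* With only the first operand cast, the usual arithmetic conversions
       also convert the second operand to the wider signed type, so the
       subtraction is done in int64_t and yields -29997. */
    int64_t d1 = (int64_t)x - y;

    /* Casting both operands is equivalent here, just more explicit. */
    int64_t d2 = (int64_t)x - (int64_t)y;

    printf("%lld %lld\n", (long long)d1, (long long)d2);
    return 0;
}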

@aeros (Contributor) commented Sep 20, 2019

@animalize:

IIRC, the cast for the second operand can be omitted.

Ah, I see. I've seen it used for multiple operands before, but that might be more of a styling preference to make it more explicit. I have no idea what is preferred for the C-API though; maybe @vstinner would know. Typically it seems that we tend to lean towards being explicit for the most part, but it might be different in this case. Good to know though, thanks. (:

@vstinner (Member)

Ah, I see. I've seen it used for multiple operands before, but that might be more of a styling preference to make it more explicit.

Nobody understands C standards. I prefer to keep an explicit cast on both operands. Does anyone want to propose a PR? This change introduced a new warning: https://bugs.python.org/issue35696#msg352875

@hongweipeng (Contributor, Author)

This change introduced a new warning:

The warning seems to be caused by PYLONG_FROM_UINT(unsigned long, ival), not by long_compare().

@hongweipeng hongweipeng restored the imporve_long_compare branch December 6, 2019 09:46
@hongweipeng hongweipeng deleted the imporve_long_compare branch October 14, 2020 10:04
Labels: performance (Performance or resource usage), skip news