Skip to content

bpo-36957: Speed up math.isqrt #13405

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
May 19, 2019
Merged

Conversation

mdickinson
Copy link
Member

@mdickinson mdickinson commented May 18, 2019

Speed up math.isqrt by:

  • Adding a fast path for inputs smaller than 2**64
  • Performing the first 5 iterations entirely in C integer arithmetic rather than Python long integer arithmetic for larger inputs.

https://bugs.python.org/issue36957

@mdickinson
Copy link
Member Author

No news entry needed, since math.isqrt hasn't yet appeared in any released version of Python.

@mdickinson mdickinson added the performance Performance or resource usage label May 18, 2019

/* The first 4 steps can be performed entirely in 32-bit arithmetic;
the last needs 64-bit arithmetic. */
u = 1U + (uint32_t)(m >> 62);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this code be shared?

if (fastpath) {
    // init fast path
}
else {
    // init general path
}
// common code
if (fastpath) {
    // return fast path
}
// general path

With new shifting functions this may be clearer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I considered this, but in the end I decided it was cleaner to keep the two paths separate.

I'll try the refactoring and see how it looks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sharing is a bit ugly, because the last line has to be different in the two cases:

u = (u << 15) + (uint32_t)((m >> 17) / u);

versus

v = (u << 15) + ((m >> 17) / u);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to use always the second one? In any case you cast u to 64-bit for the final check. The overhead is unlikely noticeable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to share via a helper function.

uint64_t
_approximate_isqrt(uint64_t n)
{
uint32_t u = 1U + (n >> 62);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u = 1U + (uint32_t)(n >> 62)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the uint32_t casts on the basis that it makes the code look cleaner and the compiler will likely optimise them away anyway, given that the target type is uint32_t.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checked: at least on my machine (macOS 10.14.5, clang from the OS), with -O3, the generated assembly includes one divb, two divls and one divq, so it looks as though the compiler's smart enough to optimise those divisions.

I have to confess I was a bit surprised by the divb.

And the second division could actually be done using 16-bit arithmetic. I don't know whether it's worth trying to persuade the compiler to do that. In theory these division instructions could be a bottleneck, but in practise I suspect that all the overhead of the function call, argument passing, etc. will likely outweigh any advantage from optimising one division.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely agree!

if (m == (uint64_t)(-1) && PyErr_Occurred()) {
return NULL;
}
u = _approximate_isqrt(m);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above 5 lines (or 4 if exclude Py_DECREF) could be shared too. Not sure that they should.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I think I'd rather keep _approximate_isqrt as a pure uint64_t to uint64_t function, though.

mdickinson and others added 2 commits May 19, 2019 14:31
Add missing `static` declaration to helper function.

Co-Authored-By: Serhiy Storchaka <[email protected]>
@mdickinson
Copy link
Member Author

@serhiy-storchaka This PR has changed substantially since you approved it. Are you still okay with me merging it?

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you would not create this PR I would create a similar PR myself. 😀

@mdickinson mdickinson merged commit 5c08ce9 into python:master May 19, 2019
@bedevere-bot
Copy link

@mdickinson: Please replace # with GH- in the commit message next time. Thanks!

@mdickinson mdickinson deleted the math-isqrt-fast-path branch May 19, 2019 16:52
@mdickinson
Copy link
Member Author

@serhiy-storchaka Again, thank you for the thorough review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage skip news
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants