[libc] Move printf long double to simple calc #75414
Merged: michaelrj-google merged 12 commits into llvm:main from michaelrj-google:libcPrintfLongDoublePrecalc on Jan 25, 2024.
Changes from all commits (12 commits):
aa4c0f9  [libc] Move printf long double to simple calc
72fb569  cleanup, adjust docs, update build systems
635931b  Finish rebase
d627e41  address simple comments
cab9c77  address more comments
75469ac  reorganize and comment to clarify design
187c90b  clean up code style
8ec9f26  Rebase and update docs
746a300  move FloatProp to FPBits
4e860d7  fix more nits
2735cfc  cleanup and address comments
e179810  Fix to match new fpbits

All commits are by michaelrj-google.
@@ -27,10 +27,17 @@ namespace LIBC_NAMESPACE::cpp {

 template <size_t Bits, bool Signed> struct BigInt {

+  // This being hardcoded as 64 is okay because we're using uint64_t as our
+  // internal type which will always be 64 bits.
+  using word_type = uint64_t;
+  LIBC_INLINE_VAR static constexpr size_t WORD_SIZE =
+      sizeof(word_type) * CHAR_BIT;
+
+  // TODO: Replace references to 64 with WORD_SIZE, and uint64_t with word_type.
   static_assert(Bits > 0 && Bits % 64 == 0,
                 "Number of bits in BigInt should be a multiple of 64.");
   LIBC_INLINE_VAR static constexpr size_t WORDCOUNT = Bits / 64;
-  cpp::array<uint64_t, WORDCOUNT> val{};
+  cpp::array<word_type, WORDCOUNT> val{};

   LIBC_INLINE_VAR static constexpr uint64_t MASK32 = 0xFFFFFFFFu;

Review thread on the added word_type lines:
Reviewer: Can we also add …
michaelrj-google: I've done that, though I'm leaving the large scale refactoring for a followup patch.

@@ -448,6 +455,8 @@ template <size_t Bits, bool Signed> struct BigInt {
     // pos is the index of the current 64-bit chunk that we are processing.
     size_t pos = WORDCOUNT;

+    // TODO: look into if constexpr(Bits > 256) skip leading zeroes.
+
     for (size_t q_pos = WORDCOUNT - lower_pos; q_pos > 0; --q_pos) {
       // q_pos is 1 + the index of the current 64-bit chunk of the quotient
       // being processed.
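For readers outside the libc codebase, here is a minimal stand-alone sketch of how the new word_type constants relate to the existing WORDCOUNT. It uses std::array instead of cpp::array and omits LIBC_INLINE_VAR, so it is only an illustration of the hunk above, not the actual header.

#include <array>
#include <climits>
#include <cstddef>
#include <cstdint>

template <size_t Bits> struct ToyBigInt {
  // The internal word is fixed at 64 bits, as in the hunk above.
  using word_type = uint64_t;
  static constexpr size_t WORD_SIZE = sizeof(word_type) * CHAR_BIT; // 64
  static_assert(Bits > 0 && Bits % 64 == 0,
                "Number of bits should be a multiple of 64.");
  static constexpr size_t WORDCOUNT = Bits / 64;
  std::array<word_type, WORDCOUNT> val{};
};

static_assert(ToyBigInt<256>::WORDCOUNT == 4, "256 bits -> four 64-bit words");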
Reviewer: Is this 10% or 10 times? 10x seems like a lot.
michaelrj-google: 10 times. It's a lot, but only sort of. I separated out the decimal (%f) long double tests and ran them both with the new specialization and with the mega table, at -O3 in release mode. On my workstation the mega table version took ~400 microseconds and the new version took ~8000 microseconds.

The reason I think this is a worthwhile tradeoff is that the mega table takes up a lot of space. Just the converter (found in build/libc/src/stdio/printf_core/CMakeFiles/libc.src.stdio.printf_core.converter.dir/) takes up ~5 megabytes with the mega table, but only ~150 kilobytes with the specialization.

The other option I was considering for saving space was integer calculation (INT_CALC), which takes up more space (~200 KB) and more time (~7,500,000 microseconds), so in comparison the new version looks quite good.
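For context, the conversion these figures measure is ordinary decimal formatting of a long double. A minimal, illustrative example (standard C I/O, not the benchmark itself):

#include <cstdio>

int main() {
  char buf[64];
  // %Lf is the decimal (%f) conversion for long double; this is the path
  // whose table-size/speed tradeoff is discussed above.
  snprintf(buf, sizeof(buf), "%.30Lf", 0.1L);
  puts(buf);
  return 0;
}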
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cache wise the 5MiB table seems pretty bad for a function that can be called frequently on multiple cores.
The sheer size is an issue in itself but cache invalidation is also a concern, libc instruction and data footprint should be small to allow for more application state to stay in the cache. How is the table accessed by the way? Are the indices about always the same or is it completely random? Is there a way to compress the table? Can we come up with a smaller table and interpolate between values without losing precision?
michaelrj-google: The table is accessed sequentially. It's a flattened two-dimensional table, where the POW10_OFFSET array gives you the starting index of each value in the larger POW10_SPLIT array. The negative indices have a similar but slightly different layout. Each number maps to one of the POW10_OFFSET values, then has an offset from that into the larger array, which is incremented or decremented (depending on whether this is the positive or negative exponent table).
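As a rough illustration of that flattened layout, here is a minimal sketch with made-up values; the identifiers mirror POW10_OFFSET and POW10_SPLIT but these are not the actual tables or the actual libc lookup code.

#include <cstddef>
#include <cstdint>

// Tiny stand-ins for the real constant tables: POW10_OFFSET stores where each
// logical "row" begins inside the single flat POW10_SPLIT array.
constexpr size_t POW10_OFFSET[] = {0, 2, 5, 9};
constexpr uint64_t POW10_SPLIT[] = {1, 2,        // row 0
                                    3, 4, 5,     // row 1
                                    6, 7, 8, 9}; // row 2

// Fetch the chunk `block` places into the row selected by `idx`; subsequent
// lookups step `block` up or down, so access stays sequential.
constexpr uint64_t table_chunk(size_t idx, size_t block) {
  return POW10_SPLIT[POW10_OFFSET[idx] + block];
}

static_assert(table_chunk(1, 2) == 5, "row 1, third chunk");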
The 5 MB table is already compressed. If you look at the top of ryu_long_double_constants.h you'll see TABLE_SHIFT_CONST, IDX_SIZE, and MID_INT_SIZE. These are the constants we can adjust to try to shrink the table. There's a more comprehensive explanation of what they mean in utils/mathtools/ryu_tablegen.py, but here's the short version:

MID_INT_SIZE is the size of each entry in the POW10_SPLIT array. It needs to be at least TABLE_SHIFT_CONST + IDX_SIZE + sizeof(BLOCK_INT) so that it can actually fit the values.

TABLE_SHIFT_CONST (called CONSTANT in ryu_tablegen.py) adjusts the precision of each entry in the array, with higher being more precise. It's 120 here because anything lower tends to cause errors to get into the result.

IDX_SIZE is the compression factor. It's the number of exponents that can be mapped onto one entry in POW10_OFFSET. From my testing it only works when it's a power of 2. Pushing this higher would require increasing MID_INT_SIZE a lot, which would significantly reduce the actual size savings and would also make the calculations slower.

I talked a bit with Ulf Adams (the original designer of the Ryu algorithm) and he suggested that it might be possible to compress the table further by approximating the next value by multiplying by 5**9. In my testing this worked, but it required a higher TABLE_SHIFT_CONST since you lose some precision with each approximation. In the end this only shrank the table a bit, so I decided it wasn't worthwhile.

In conclusion, I believe the table is already compressed to within an order of magnitude of its minimum size, and that makes it too large to be practical. Long doubles are rarely used, so carrying a large table all the time to speed them up seems like a bad idea.
Reviewer: Thx for the long explanation. I really appreciate it. I agree.