[libc] Move printf long double to simple calc #75414

Merged
2 changes: 1 addition & 1 deletion libc/config/config.json
@@ -13,7 +13,7 @@
"doc": "Disable handling of %n in printf format string."
},
"LIBC_CONF_PRINTF_FLOAT_TO_STR_USE_MEGA_LONG_DOUBLE_TABLE": {
"value": true,
"value": false,
"doc": "Use large table for better printf long double performance."
}
},
23 changes: 17 additions & 6 deletions libc/docs/dev/printf_behavior.rst
@@ -13,7 +13,7 @@ C standard and POSIX standard. If any behavior is not mentioned here, it should
be assumed to follow the behavior described in those standards.

The LLVM-libc codebase is under active development, and may change. This
document was last updated [August 18, 2023] by [michaelrj] and may
document was last updated [January 8, 2024] by [michaelrj] and may
not be accurate after this point.

The behavior of LLVM-libc's printf is heavily influenced by compile-time flags.
@@ -87,14 +87,26 @@ are not recommended to be adjusted except by persons familiar with the Printf
Ryu Algorithm. Additionally they have no effect when float conversions are
disabled.

LIBC_COPT_FLOAT_TO_STR_NO_SPECIALIZE_LD
---------------------------------------
This flag disables the separate long double conversion implementation. It is
not based on the Ryu algorithm, instead generating the digits by
multiplying/dividing the written-out number by 10^9 to get blocks. It's
significantly faster than INT_CALC, only about 10x slower than MEGA_TABLE,
Contributor:

Is this 10% or 10 times? 10x seems like a lot.

Contributor Author:

10 times. It is a lot, but only sort of. I separated out the decimal (%f) long double tests and ran them with both the new specialization and the mega table, compiled with -O3 in release mode. On my workstation the mega table version took ~400 microseconds and the new version took ~8000 microseconds.

The reason I think this is a worthwhile tradeoff is that the mega table takes up a lot of space. Just the converter (found in build/libc/src/stdio/printf_core/CMakeFiles/libc.src.stdio.printf_core.converter.dir/) takes up ~5 megabytes with the mega table, but only ~150 kilobytes with the specialization.

The other option I was considering for saving space was using integer calculation (INT_CALC), which takes up more space (~200 kilobytes) and more time (~7,500,000 microseconds), so in comparison the new version looks quite good.

Contributor:

Cache-wise, the 5 MiB table seems pretty bad for a function that can be called frequently on multiple cores.

The sheer size is an issue in itself, but cache invalidation is also a concern; the libc instruction and data footprint should be small to allow more application state to stay in the cache. How is the table accessed, by the way? Are the indices usually about the same, or is the access completely random? Is there a way to compress the table? Can we come up with a smaller table and interpolate between values without losing precision?

Contributor Author:

The table is accessed sequentially. It's a flattened two-dimensional table, where the POW10_OFFSET array gives the starting index of each value in the larger POW10_SPLIT array. The negative indices have a similar but slightly different layout. Each number maps to one of the POW10_OFFSET values, then gets an offset from that point into the larger array, which is incremented or decremented (depending on whether this is the positive or negative exponent table).

The 5MB table is already compressed. If you look at the top of ryu_long_double_constants.h you'll see TABLE_SHIFT_CONST, IDX_SIZE, and MID_INT_SIZE. These are the constants we can adjust to try to shrink the table. There's a more comprehensive explanation of what they mean in utils/mathtools/ryu_tablegen.py, but here's the short version:

MID_INT_SIZE is the size of each entry in the POW10_SPLIT array. This needs to be at least TABLE_SHIFT_CONST + IDX_SIZE + sizeof(BLOCK_INT) so that it can actually fit the values.

TABLE_SHIFT_CONST (called CONSTANT in ryu_tablegen.py) adjusts the precision of each entry in the array, with higher being more precise. It's 120 here because anything lower tends to introduce errors into the result.

IDX_SIZE is the compression factor. It's the number of exponents that can be mapped onto one entry in POW10_OFFSET. From my testing it only works when it's a power of 2. Pushing this higher would require increasing MID_INT_SIZE a lot, which would significantly reduce the actual size savings, and would also make the calculations slower.

I talked a bit with Ulf Adams (the original designer of the Ryu algorithm) and he suggested that it might be possible to also compress the table by approximating the next value by multiplying by 5**9. In my testing this worked, but required a higher TABLE_SHIFT_CONST since you lose some precision with each approximation. In the end this only shrank the table a bit, and so I decided it wasn't worthwhile.

In conclusion, I believe the table is already compressed within an order of magnitude of its minimum size, and that makes it too large to be practical. Long doubles are rarely used, so carrying a large table all the time to speed them up seems like a bad idea.
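
For readers less familiar with this layout, here is a minimal editorial sketch of a flattened two-dimensional table driven by an offsets array. The data and the TOY_* names are placeholders invented for illustration, not the contents of ryu_long_double_constants.h; only the indexing scheme mirrors the description above.

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Toy data: the real POW10_OFFSET / POW10_SPLIT arrays are far larger and are
// generated by utils/mathtools/ryu_tablegen.py.
constexpr uint16_t TOY_POW10_OFFSET[] = {0, 2, 5, 9}; // row i starts at TOY_POW10_OFFSET[i]
constexpr uint64_t TOY_POW10_SPLIT[] = {10, 11, 20, 21, 22, 30, 31, 32, 33};

// Fetch block `block_index` of row `row` from the flattened table.
uint64_t get_block(size_t row, size_t block_index) {
  return TOY_POW10_SPLIT[TOY_POW10_OFFSET[row] + block_index];
}

int main() {
  // Row 2 starts at flat index 5, so block 1 of row 2 is TOY_POW10_SPLIT[6] == 31.
  std::printf("%llu\n", static_cast<unsigned long long>(get_block(2, 1)));
  return 0;
}

In the real table, a single row additionally covers IDX_SIZE consecutive exponents, which is the compression factor described above.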

Contributor:

Thanks for the long explanation. I really appreciate it.

> In conclusion, I believe the table is already compressed within an order of magnitude of its minimum size, and that makes it too large to be practical. Long doubles are rarely used, so carrying a large table all the time to speed them up seems like a bad idea.

I agree.

and is small in binary size. Its downside is that it always calculates all
of the digits above the decimal point, making it slightly inefficient for %e
calls with large exponents. This is the default. This specialization overrides
other flags, so this flag must be set for other flags to affect the long double
behavior.
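
As an editorial aside, here is a minimal sketch of the block idea described above. It is an illustration only, not the LLVM-libc converter, and it uses a plain uint64_t where the real code works on a much wider integer type; the point is just that repeatedly dividing the written-out value by 10^9 peels off nine decimal digits at a time.

#include <cstdint>
#include <cstdio>

// Split `value` into base-10^9 blocks, least significant block first, by
// repeated division. A uint64_t never needs more than three such blocks.
int to_blocks(uint64_t value, uint32_t blocks[3]) {
  constexpr uint64_t BLOCK_DIV = 1000000000; // 10^9
  int count = 0;
  do {
    blocks[count++] = static_cast<uint32_t>(value % BLOCK_DIV);
    value /= BLOCK_DIV;
  } while (value != 0);
  return count;
}

int main() {
  uint32_t blocks[3];
  int count = to_blocks(12345678901234567890ULL, blocks);
  // Print the most significant block unpadded, then each lower block as
  // exactly nine digits, giving: 12 345678901 234567890
  std::printf("%u", blocks[count - 1]);
  for (int i = count - 2; i >= 0; --i)
    std::printf(" %09u", blocks[i]);
  std::printf("\n");
  return 0;
}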

LIBC_COPT_FLOAT_TO_STR_USE_MEGA_LONG_DOUBLE_TABLE
-------------------------------------------------
When set, the float to string decimal conversion algorithm will use a larger
table to accelerate long double conversions. This larger table is around 5MB of
size when compiled. This flag is enabled by default in the CMake.
size when compiled.

LIBC_COPT_FLOAT_TO_STR_USE_DYADIC_FLOAT(_LD)
--------------------------------------------
LIBC_COPT_FLOAT_TO_STR_USE_DYADIC_FLOAT
---------------------------------------
When set, the float to string decimal conversion algorithm will use dyadic
floats instead of a table when performing floating point conversions. This
results in ~50 digits of accuracy in the result, then zeroes for the remaining
@@ -107,8 +119,7 @@ LIBC_COPT_FLOAT_TO_STR_USE_INT_CALC
When set, the float to string decimal conversion algorithm will use wide
integers instead of a table when performing floating point conversions. This
gives the same results as the table, but is very slow at the extreme ends of
the long double range. If no flags are set this is the default behavior for
long double conversions.
the long double range.

LIBC_COPT_FLOAT_TO_STR_NO_TABLE
-------------------------------
11 changes: 10 additions & 1 deletion libc/src/__support/UInt.h
@@ -27,10 +27,17 @@ namespace LIBC_NAMESPACE::cpp {

template <size_t Bits, bool Signed> struct BigInt {

// This being hardcoded as 64 is okay because we're using uint64_t as our
Contributor:

Can we also add

using word_type = uint64_t;
LIBC_INLINE_VAR static constexpr size_t WORD_SIZE = sizeof(word_type) * CHAR_BIT;
...
cpp::array<word_type, WORDCOUNT> val{};

Contributor Author:

I've done that, though I'm leaving the large-scale refactoring for a follow-up patch.

// internal type which will always be 64 bits.
using word_type = uint64_t;
LIBC_INLINE_VAR static constexpr size_t WORD_SIZE =
sizeof(word_type) * CHAR_BIT;

// TODO: Replace references to 64 with WORD_SIZE, and uint64_t with word_type.
static_assert(Bits > 0 && Bits % 64 == 0,
"Number of bits in BigInt should be a multiple of 64.");
LIBC_INLINE_VAR static constexpr size_t WORDCOUNT = Bits / 64;
cpp::array<uint64_t, WORDCOUNT> val{};
cpp::array<word_type, WORDCOUNT> val{};

LIBC_INLINE_VAR static constexpr uint64_t MASK32 = 0xFFFFFFFFu;

@@ -448,6 +455,8 @@ template <size_t Bits, bool Signed> struct BigInt {
// pos is the index of the current 64-bit chunk that we are processing.
size_t pos = WORDCOUNT;

// TODO: look into if constexpr(Bits > 256) skip leading zeroes.

for (size_t q_pos = WORDCOUNT - lower_pos; q_pos > 0; --q_pos) {
// q_pos is 1 + the index of the current 64-bit chunk of the quotient
// being processed.
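
As a small editorial illustration of how word_type, WORD_SIZE, and WORDCOUNT fit together, here is a self-contained toy modeled on the snippet above; it is a sketch for this discussion, not LLVM-libc code.

#include <array>
#include <climits>
#include <cstddef>
#include <cstdint>

template <size_t Bits> struct ToyBigInt {
  using word_type = uint64_t;
  static constexpr size_t WORD_SIZE = sizeof(word_type) * CHAR_BIT; // 64 bits per word
  static_assert(Bits > 0 && Bits % WORD_SIZE == 0,
                "Number of bits should be a multiple of the word size.");
  static constexpr size_t WORDCOUNT = Bits / WORD_SIZE;
  std::array<word_type, WORDCOUNT> val{};
};

// With 64-bit words: 128 bits -> two words, 320 bits -> five words.
static_assert(ToyBigInt<128>::WORDCOUNT == 2, "two 64-bit words");
static_assert(ToyBigInt<320>::WORDCOUNT == 5, "five 64-bit words");

int main() { return 0; }

Deriving WORD_SIZE from sizeof(word_type) * CHAR_BIT keeps WORDCOUNT correct if the internal word type ever changes, which is the direction the TODOs in the diff point toward.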