Skip to content

[libc][math] Implement atan2f correctly rounded to all rounding modes. #86716

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 1, 2024

Conversation

lntue
Copy link
Contributor

@lntue lntue commented Mar 26, 2024

We compute atan2f(y, x) in 2 stages:

  • Fast step: perform computations in double precision , with relative errors < 2^-50
  • Accurate step: if the result from the Fast step fails Ziv's rounding test, then we perform computations in double-double precision, with relative errors < 2^-100.

On Ryzen 5900X, worst-case latency is ~ 200 clocks, compared to average latency ~ 60 clocks, and average reciprocal throughput ~ 20 clocks.

@lntue
Copy link
Contributor Author

lntue commented Mar 26, 2024

@zimmermann6

@llvmbot
Copy link
Member

llvmbot commented Mar 26, 2024

@llvm/pr-subscribers-libc

Author: None (lntue)

Changes

We compute atan2f(y, x) in 2 stages:

  • Fast step: perform computations in double precision , with relative errors < 2^-50
  • Accurate step: if the result from the Fast step fails Ziv's rounding test, then we perform computations in double-double precision, with relative errors < 2^-100.

On Ryzen 5900X, worst-case latency is ~ 200 clocks, compared to average latency ~ 60 clocks, and average reciprocal throughput ~ 20 clocks.


Patch is 39.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/86716.diff

22 Files Affected:

  • (modified) libc/config/baremetal/arm/entrypoints.txt (+1)
  • (modified) libc/config/baremetal/riscv/entrypoints.txt (+1)
  • (modified) libc/config/darwin/arm/entrypoints.txt (+1)
  • (modified) libc/config/linux/aarch64/entrypoints.txt (+1)
  • (modified) libc/config/linux/arm/entrypoints.txt (+1)
  • (modified) libc/config/linux/riscv/entrypoints.txt (+1)
  • (modified) libc/config/linux/x86_64/entrypoints.txt (+1)
  • (modified) libc/config/windows/entrypoints.txt (+1)
  • (modified) libc/docs/math/index.rst (+2-1)
  • (modified) libc/spec/stdc.td (+4)
  • (modified) libc/src/math/generic/CMakeLists.txt (+19)
  • (added) libc/src/math/generic/atan2f.cpp (+297)
  • (modified) libc/src/math/generic/atanf.cpp (+3-9)
  • (modified) libc/src/math/generic/inv_trigf_utils.cpp (+55-52)
  • (modified) libc/src/math/generic/inv_trigf_utils.h (+3-2)
  • (modified) libc/test/src/math/CMakeLists.txt (+13)
  • (added) libc/test/src/math/atan2f_test.cpp (+132)
  • (modified) libc/test/src/math/atanf_test.cpp (+4-1)
  • (modified) libc/test/src/math/smoke/CMakeLists.txt (+12)
  • (added) libc/test/src/math/smoke/atan2f_test.cpp (+50)
  • (modified) libc/utils/MPFRWrapper/MPFRUtils.cpp (+8)
  • (modified) libc/utils/MPFRWrapper/MPFRUtils.h (+1)
diff --git a/libc/config/baremetal/arm/entrypoints.txt b/libc/config/baremetal/arm/entrypoints.txt
index 17ce56e228a6ac..9e21f5c20d9207 100644
--- a/libc/config/baremetal/arm/entrypoints.txt
+++ b/libc/config/baremetal/arm/entrypoints.txt
@@ -207,6 +207,7 @@ set(TARGET_LIBM_ENTRYPOINTS
     libc.src.math.acoshf
     libc.src.math.asinf
     libc.src.math.asinhf
+    libc.src.math.atan2f
     libc.src.math.atanf
     libc.src.math.atanhf
     libc.src.math.ceil
diff --git a/libc/config/baremetal/riscv/entrypoints.txt b/libc/config/baremetal/riscv/entrypoints.txt
index 39756e1ee29f54..7664937da0f6e0 100644
--- a/libc/config/baremetal/riscv/entrypoints.txt
+++ b/libc/config/baremetal/riscv/entrypoints.txt
@@ -207,6 +207,7 @@ set(TARGET_LIBM_ENTRYPOINTS
     libc.src.math.acoshf
     libc.src.math.asinf
     libc.src.math.asinhf
+    libc.src.math.atan2f
     libc.src.math.atanf
     libc.src.math.atanhf
     libc.src.math.ceil
diff --git a/libc/config/darwin/arm/entrypoints.txt b/libc/config/darwin/arm/entrypoints.txt
index 02a09256606956..6b89ce55d72b65 100644
--- a/libc/config/darwin/arm/entrypoints.txt
+++ b/libc/config/darwin/arm/entrypoints.txt
@@ -118,6 +118,7 @@ set(TARGET_LIBM_ENTRYPOINTS
     libc.src.math.acoshf
     libc.src.math.asinf
     libc.src.math.asinhf
+    libc.src.math.atan2f
     libc.src.math.atanf
     libc.src.math.atanhf
     libc.src.math.copysign
diff --git a/libc/config/linux/aarch64/entrypoints.txt b/libc/config/linux/aarch64/entrypoints.txt
index 78da7f0b334b1f..4ba2d8387b07eb 100644
--- a/libc/config/linux/aarch64/entrypoints.txt
+++ b/libc/config/linux/aarch64/entrypoints.txt
@@ -330,6 +330,7 @@ set(TARGET_LIBM_ENTRYPOINTS
     libc.src.math.acoshf
     libc.src.math.asinf
     libc.src.math.asinhf
+    libc.src.math.atan2f
     libc.src.math.atanf
     libc.src.math.atanhf
     libc.src.math.copysign
diff --git a/libc/config/linux/arm/entrypoints.txt b/libc/config/linux/arm/entrypoints.txt
index 6e63e270280e7a..04baa4c1cf93ab 100644
--- a/libc/config/linux/arm/entrypoints.txt
+++ b/libc/config/linux/arm/entrypoints.txt
@@ -198,6 +198,7 @@ set(TARGET_LIBM_ENTRYPOINTS
     libc.src.math.acoshf
     libc.src.math.asinf
     libc.src.math.asinhf
+    libc.src.math.atan2f
     libc.src.math.atanf
     libc.src.math.atanhf
     libc.src.math.ceil
diff --git a/libc/config/linux/riscv/entrypoints.txt b/libc/config/linux/riscv/entrypoints.txt
index 5aae4e246cfb3c..25745513b920ba 100644
--- a/libc/config/linux/riscv/entrypoints.txt
+++ b/libc/config/linux/riscv/entrypoints.txt
@@ -338,6 +338,7 @@ set(TARGET_LIBM_ENTRYPOINTS
     libc.src.math.acoshf
     libc.src.math.asinf
     libc.src.math.asinhf
+    libc.src.math.atan2f
     libc.src.math.atanf
     libc.src.math.atanhf
     libc.src.math.copysign
diff --git a/libc/config/linux/x86_64/entrypoints.txt b/libc/config/linux/x86_64/entrypoints.txt
index 5b428e51aee620..56ac08ab23c609 100644
--- a/libc/config/linux/x86_64/entrypoints.txt
+++ b/libc/config/linux/x86_64/entrypoints.txt
@@ -348,6 +348,7 @@ set(TARGET_LIBM_ENTRYPOINTS
     libc.src.math.acoshf
     libc.src.math.asinf
     libc.src.math.asinhf
+    libc.src.math.atan2f
     libc.src.math.atanf
     libc.src.math.atanhf
     libc.src.math.canonicalize
diff --git a/libc/config/windows/entrypoints.txt b/libc/config/windows/entrypoints.txt
index f4456f561ec017..c38125a6462272 100644
--- a/libc/config/windows/entrypoints.txt
+++ b/libc/config/windows/entrypoints.txt
@@ -116,6 +116,7 @@ set(TARGET_LIBM_ENTRYPOINTS
     libc.src.math.acoshf
     libc.src.math.asinf
     libc.src.math.asinhf
+    libc.src.math.atan2f
     libc.src.math.atanf
     libc.src.math.atanhf
     libc.src.math.copysign
diff --git a/libc/docs/math/index.rst b/libc/docs/math/index.rst
index 080b6a4427f511..b7f1b8739648ca 100644
--- a/libc/docs/math/index.rst
+++ b/libc/docs/math/index.rst
@@ -422,7 +422,7 @@ Higher Math Functions
 +------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
 | atan2      |         |         |         |         |         |         |         |         |         |         |         |         |
 +------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
-| atan2f     |         |         |         |         |         |         |         |         |         |         |         |         |
+| atan2f     | |check| | |check| | |check| | |check| | |check| |         |         | |check| | |check| | |check| |         |         |
 +------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
 | atan2l     |         |         |         |         |         |         |         |         |         |         |         |         |
 +------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+
@@ -591,6 +591,7 @@ acosh          |check|
 asin           |check|
 asinh          |check|
 atan           |check|
+atan2          |check|
 atanh          |check|
 cos            |check|          large
 cosh           |check|
diff --git a/libc/spec/stdc.td b/libc/spec/stdc.td
index ac6e1d1801ba55..719bb9aa18cb0a 100644
--- a/libc/spec/stdc.td
+++ b/libc/spec/stdc.td
@@ -620,10 +620,14 @@ def StdC : StandardSpec<"stdc"> {
           FunctionSpec<"tanhf", RetValSpec<FloatType>, [ArgSpec<FloatType>]>,
 
           FunctionSpec<"acosf", RetValSpec<FloatType>, [ArgSpec<FloatType>]>,
+
           FunctionSpec<"asinf", RetValSpec<FloatType>, [ArgSpec<FloatType>]>,
           FunctionSpec<"asin", RetValSpec<DoubleType>, [ArgSpec<DoubleType>]>,
+
           FunctionSpec<"atanf", RetValSpec<FloatType>, [ArgSpec<FloatType>]>,
 
+          FunctionSpec<"atan2f", RetValSpec<FloatType>, [ArgSpec<FloatType>, ArgSpec<FloatType>]>,
+
           FunctionSpec<"acoshf", RetValSpec<FloatType>, [ArgSpec<FloatType>]>,
           FunctionSpec<"asinhf", RetValSpec<FloatType>, [ArgSpec<FloatType>]>,
           FunctionSpec<"atanhf", RetValSpec<FloatType>, [ArgSpec<FloatType>]>,
diff --git a/libc/src/math/generic/CMakeLists.txt b/libc/src/math/generic/CMakeLists.txt
index 4d9b91150d0200..7e9c9ce7bb2748 100644
--- a/libc/src/math/generic/CMakeLists.txt
+++ b/libc/src/math/generic/CMakeLists.txt
@@ -2854,6 +2854,25 @@ add_entrypoint_object(
     -O3
 )
 
+add_entrypoint_object(
+  atan2f
+  SRCS
+    atan2f.cpp
+  HDRS
+    ../atan2f.h
+  COMPILE_OPTIONS
+    -O3
+  DEPENDS
+    .inv_trigf_utils
+    libc.src.__support.FPUtil.except_value_utils
+    libc.src.__support.FPUtil.fp_bits
+    libc.src.__support.FPUtil.multiply_add
+    libc.src.__support.FPUtil.nearest_integer
+    libc.src.__support.FPUtil.polyeval
+    libc.src.__support.FPUtil.rounding_mode
+    libc.src.__support.macros.optimization
+)
+
 add_entrypoint_object(
   scalbn
   SRCS
diff --git a/libc/src/math/generic/atan2f.cpp b/libc/src/math/generic/atan2f.cpp
new file mode 100644
index 00000000000000..089e6c61984bea
--- /dev/null
+++ b/libc/src/math/generic/atan2f.cpp
@@ -0,0 +1,297 @@
+//===-- Single-precision atan2f function ----------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "src/math/atan2f.h"
+#include "inv_trigf_utils.h"
+#include "src/__support/FPUtil/FPBits.h"
+#include "src/__support/FPUtil/PolyEval.h"
+#include "src/__support/FPUtil/double_double.h"
+#include "src/__support/FPUtil/except_value_utils.h"
+#include "src/__support/FPUtil/multiply_add.h"
+#include "src/__support/FPUtil/nearest_integer.h"
+#include "src/__support/FPUtil/rounding_mode.h"
+#include "src/__support/macros/optimization.h" // LIBC_UNLIKELY
+
+namespace LIBC_NAMESPACE {
+
+namespace {
+
+// Look up tables for accurate pass:
+
+// atan(i/16) with i = 0..16, generated by Sollya with:
+// > for i from 0 to 16 do {
+//     a = round(atan(i/16), D, RN);
+//     b = round(atan(i/16) - a, D, RN);
+//     print("{", b, ",", a, "},");
+//   };
+constexpr fputil::DoubleDouble ATAN_I[17] = {
+    {0.0, 0.0},
+    {-0x1.c934d86d23f1dp-60, 0x1.ff55bb72cfdeap-5},
+    {-0x1.cd37686760c17p-59, 0x1.fd5ba9aac2f6ep-4},
+    {0x1.347b0b4f881cap-58, 0x1.7b97b4bce5b02p-3},
+    {0x1.8ab6e3cf7afbdp-57, 0x1.f5b75f92c80ddp-3},
+    {-0x1.963a544b672d8p-57, 0x1.362773707ebccp-2},
+    {-0x1.c63aae6f6e918p-56, 0x1.6f61941e4def1p-2},
+    {-0x1.24dec1b50b7ffp-56, 0x1.a64eec3cc23fdp-2},
+    {0x1.a2b7f222f65e2p-56, 0x1.dac670561bb4fp-2},
+    {-0x1.d5b495f6349e6p-56, 0x1.0657e94db30dp-1},
+    {-0x1.928df287a668fp-58, 0x1.1e00babdefeb4p-1},
+    {0x1.1021137c71102p-55, 0x1.345f01cce37bbp-1},
+    {0x1.2419a87f2a458p-56, 0x1.4978fa3269ee1p-1},
+    {0x1.0028e4bc5e7cap-57, 0x1.5d58987169b18p-1},
+    {-0x1.8c34d25aadef6p-56, 0x1.700a7c5784634p-1},
+    {-0x1.bf76229d3b917p-56, 0x1.819d0b7158a4dp-1},
+    {0x1.1a62633145c07p-55, 0x1.921fb54442d18p-1},
+};
+
+// Taylor polynomial, generated by Sollya with:
+// > for i from 0 to 8 do {
+//     j = (-1)^(i + 1)/(2*i + 1);
+//     a = round(j, D, RN);
+//     b = round(j - a, D, RN);
+//     print("{", b, ",", a, "},");
+//   };
+constexpr fputil::DoubleDouble COEFFS[9] = {
+    {0.0, 1.0},                                      // 1
+    {-0x1.5555555555555p-56, -0x1.5555555555555p-2}, // -1/3
+    {-0x1.999999999999ap-57, 0x1.999999999999ap-3},  // 1/5
+    {-0x1.2492492492492p-57, -0x1.2492492492492p-3}, // -1/7
+    {0x1.c71c71c71c71cp-58, 0x1.c71c71c71c71cp-4},   // 1/9
+    {0x1.745d1745d1746p-59, -0x1.745d1745d1746p-4},  // -1/11
+    {-0x1.3b13b13b13b14p-58, 0x1.3b13b13b13b14p-4},  // 1/13
+    {-0x1.1111111111111p-60, -0x1.1111111111111p-4}, // -1/15
+    {0x1.e1e1e1e1e1e1ep-61, 0x1.e1e1e1e1e1e1ep-5},   // 1/17
+};
+
+// Veltkamp's splitting of a double precision into hi + lo, where the hi part is
+// slightly smaller than an even split, so that the product of
+//   hi * s * k is exact,
+// where:
+//   s is single precsion,
+//   0 < k < 16 is an integer.
+// This is used when FMA instruction is not available.
+[[maybe_unused]] LIBC_INLINE constexpr fputil::DoubleDouble split_d(double a) {
+  fputil::DoubleDouble r{0.0, 0.0};
+  constexpr double C = 0x1.0p33 + 1.0;
+  double t1 = C * a;
+  double t2 = a - t1;
+  r.hi = t1 + t2;
+  r.lo = a - r.hi;
+  return r;
+}
+
+} // anonymous namespace
+
+// There are several range reduction steps we can take for atan2(y, x) as
+// follow:
+
+// * Range reduction 1: signness
+// atan2(y, x) will return a number between -PI and PI representing the angle
+// forming by the 0x axis and the vector (x, y) on the 0xy-plane.
+// In particular, we have that:
+//   atan2(y, x) = atan( y/x )         if x >= 0 and y >= 0 (I-quadrant)
+//               = pi + atan( y/x )    if x < 0 and y >= 0  (II-quadrant)
+//               = -pi + atan( y/x )   if x < 0 and y < 0   (III-quadrant)
+//               = atan( y/x )         if x >= 0 and y < 0  (IV-quadrant)
+// Since atan function is odd, we can use the formula:
+//   atan(-u) = -atan(u)
+// to adjust the above conditions a bit further:
+//   atan2(y, x) = atan( |y|/|x| )         if x >= 0 and y >= 0 (I-quadrant)
+//               = pi - atan( |y|/|x| )    if x < 0 and y >= 0  (II-quadrant)
+//               = -pi + atan( |y|/|x| )   if x < 0 and y < 0   (III-quadrant)
+//               = -atan( |y|/|x| )        if x >= 0 and y < 0  (IV-quadrant)
+// Which can be simplified to:
+//   atan2(y, x) = sign(y) * atan( |y|/|x| )             if x >= 0
+//               = sign(y) * (pi - atan( |y|/|x| ))      if x < 0
+
+// * Range reduction 2: reciprocal
+// Now that the argument inside atan is positive, we can use the formula:
+//   atan(1/x) = pi/2 - atan(x)
+// to make the argument inside atan <= 1 as follow:
+//   atan2(y, x) = sign(y) * atan( |y|/|x|)            if 0 <= |y| <= x
+//               = sign(y) * (pi/2 - atan( |x|/|y| )   if 0 <= x < |y|
+//               = sign(y) * (pi - atan( |y|/|x| ))    if 0 <= |y| <= -x
+//               = sign(y) * (pi/2 + atan( |x|/|y| ))  if 0 <= -x < |y|
+
+// * Range reduction 3: look up table.
+// After the previous two range reduction steps, we reduce the problem to
+// compute atan(u) with 0 <= u <= 1, or to be precise:
+//   atan( n / d ) where n = min(|x|, |y|) and d = max(|x|, |y|).
+// An accurate polynomial approximation for the whole [0, 1] input range will
+// require a very large degree.  To make it more efficient, we reduce the input
+// range further by finding an integer idx such that:
+//   | n/d - idx/16 | <= 1/32.
+// In particular,
+//   idx := 2^-4 * round(2^4 * n/d)
+// Then for the fast pass, we find a polynomial approximation for:
+//   atan( n/d ) ~ atan( idx/16 ) + (n/d - idx/16) * Q(n/d - idx/16)
+// For the accurate pass, we use the addition formula:
+//   atan( n/d ) - atan( idx/16 ) = atan( (n/d - idx/16)/(1 + (n*idx)/(16*d)) )
+//                                = atan( (n - d * idx/16)/(d + n * idx/16) )
+// And finally we use Taylor polynomial to compute the RHS in the accurate pass:
+//   atan(u) ~ P(u) = u - u^3/3 + u^5/5 - u^7/7 + u^9/9 - u^11/11 + u^13/13 -
+//                      - u^15/15 + u^17/17
+// It's error in double-double precision is estimated in Sollya to be:
+// > P = x - x^3/3 + x^5/5 -x^7/7 + x^9/9 - x^11/11 + x^13/13 - x^15/15
+//       + x^17/17;
+// > dirtyinfnorm(atan(x) - P, [-2^-5, 2^-5]);
+// 0x1.aec6f...p-100
+// which is about rounding errors of double-double (2^-104).
+
+LLVM_LIBC_FUNCTION(float, atan2f, (float y, float x)) {
+  using FPBits = typename fputil::FPBits<float>;
+  constexpr double IS_NEG[2] = {1.0, -1.0};
+  constexpr double PI = 0x1.921fb54442d18p1;
+  constexpr double PI_LO = 0x1.1a62633145c07p-53;
+  constexpr double PI_OVER_4 = 0x1.921fb54442d18p-1;
+  constexpr double PI_OVER_2 = 0x1.921fb54442d18p0;
+  constexpr double THREE_PI_OVER_4 = 0x1.2d97c7f3321d2p+1;
+  // Adjustment for constant term:
+  //   CONST_ADJ[x_sign][y_sign][recip]
+  constexpr fputil::DoubleDouble CONST_ADJ[2][2][2] = {
+      {{{0.0, 0.0}, {-PI_LO / 2, -PI_OVER_2}},
+       {{-0.0, -0.0}, {-PI_LO / 2, -PI_OVER_2}}},
+      {{{-PI_LO, -PI}, {PI_LO / 2, PI_OVER_2}},
+       {{-PI_LO, -PI}, {PI_LO / 2, PI_OVER_2}}}};
+
+  FPBits x_bits(x), y_bits(y);
+  bool x_sign = x_bits.sign().is_neg();
+  bool y_sign = y_bits.sign().is_neg();
+  x_bits.set_sign(Sign::POS);
+  y_bits.set_sign(Sign::POS);
+  uint32_t x_abs = x_bits.uintval();
+  uint32_t y_abs = y_bits.uintval();
+  uint32_t max_abs = x_abs > y_abs ? x_abs : y_abs;
+  uint32_t min_abs = x_abs <= y_abs ? x_abs : y_abs;
+  bool recip = x_abs < y_abs;
+
+  if (LIBC_UNLIKELY(max_abs >= 0x7f80'0000U || min_abs == 0U)) {
+    if (x_bits.is_nan() || y_bits.is_nan())
+      return FPBits::quiet_nan().get_val();
+    size_t x_except = x_abs == 0 ? 0 : (x_abs == 0x7f80'0000 ? 2 : 1);
+    size_t y_except = y_abs == 0 ? 0 : (y_abs == 0x7f80'0000 ? 2 : 1);
+
+    // Exceptional cases:
+    //   EXCEPT[y_except][x_except][x_is_neg]
+    // with x_except & y_except:
+    //   0: zero
+    //   1: finite, non-zero
+    //   2: infinity
+    constexpr double EXCEPTS[3][3][2] = {
+        {{0.0, PI}, {0.0, PI}, {0.0, PI}},
+        {{PI_OVER_2, PI_OVER_2}, {0.0, 0.0}, {0.0, PI}},
+        {{PI_OVER_2, PI_OVER_2},
+         {PI_OVER_2, PI_OVER_2},
+         {PI_OVER_4, THREE_PI_OVER_4}},
+    };
+
+    double r = IS_NEG[y_sign] * EXCEPTS[y_except][x_except][x_sign];
+
+    return static_cast<float>(r);
+  }
+
+  double final_sign = IS_NEG[(x_sign != y_sign) != recip];
+  fputil::DoubleDouble const_term = CONST_ADJ[x_sign][y_sign][recip];
+  double num_d = static_cast<double>(FPBits(min_abs).get_val());
+  double den_d = static_cast<double>(FPBits(max_abs).get_val());
+  double q_d = num_d / den_d;
+  int idx;
+
+  double k_d = fputil::nearest_integer(q_d * 0x1.0p4f);
+  idx = static_cast<int>(k_d);
+  q_d = fputil::multiply_add(k_d, -0x1.0p-4, q_d);
+
+  double p, r;
+
+  if (LIBC_UNLIKELY(idx == 0)) {
+    // For |x| < 1/16, we use Taylor polynomial:
+    //   atan(x) ~ x - x^3/3 + x^5/5 - x^7/7 + x^9/9 - x^11/11 + x^13/13
+    double q2 = q_d * q_d;
+    double c0 = fputil::multiply_add(q2, ATAN_COEFFS[0][2], ATAN_COEFFS[0][1]);
+    double c1 = fputil::multiply_add(q2, ATAN_COEFFS[0][4], ATAN_COEFFS[0][3]);
+    double c2 = fputil::multiply_add(q2, ATAN_COEFFS[0][6], ATAN_COEFFS[0][5]);
+    double q4 = q2 * q2;
+    double q3 = q_d * q2;
+    double d0 = fputil::polyeval(q4, c0, c1, c2);
+    p = fputil::multiply_add(q3, d0, q_d);
+    r = final_sign * (p + const_term.hi);
+  } else {
+    p = atan_eval(q_d, idx);
+    r = final_sign *
+        fputil::multiply_add(q_d, p, const_term.hi + ATAN_COEFFS[idx][0]);
+  }
+
+  constexpr uint32_t LOWER_ERR = 4;
+  // Mask sticky bits in double precision before rounding to single precision.
+  constexpr uint32_t MASK =
+      mask_trailing_ones<uint32_t, fputil::FPBits<double>::SIG_LEN -
+                                       FPBits::SIG_LEN - 1>();
+  constexpr uint32_t UPPER_ERR = MASK - LOWER_ERR;
+
+  uint32_t r_bits = static_cast<uint32_t>(cpp::bit_cast<uint64_t>(r)) & MASK;
+
+  // Ziv's rounding test.
+  if (LIBC_LIKELY(r_bits > LOWER_ERR && r_bits < UPPER_ERR))
+    return static_cast<float>(r);
+
+  // Use double-double.
+  fputil::DoubleDouble q;
+  double num_r, den_r;
+
+  if (idx != 0) {
+    // The following range reduction is accurate even without fma for
+    //   1/16 <= n/d <= 1.
+    // atan(n/d) - atan(idx/16) = atan((n/d - idx/16) / (1 + (n/d) * (idx/16)))
+    //                          = atan((n - d*(idx/16)) / (d + n*idx/16))
+    k_d *= 0x1.0p-4;
+    num_r = fputil::multiply_add(k_d, -den_d, num_d); // Exact
+    den_r = fputil::multiply_add(k_d, num_d, den_d);  // Exact
+    q.hi = num_r / den_r;
+  } else {
+    // For 0 < n/d < 1/16, we just need to calculate the lower part of their
+    // quotient.
+    q.hi = q_d;
+    num_r = num_d;
+    den_r = den_d;
+  }
+#ifdef LIBC_TARGET_CPU_HAS_FMA
+  q.lo = fputil::multiply_add(q.hi, -den_r, num_r) / den_r;
+#else
+  fputil::DoubleDouble q_hi_dd = split_d(q.hi);
+  double t1 = fputil::multiply_add(q_hi_dd.hi, -den_r, num_r); // Exact
+  double t2 = fputil::multiply_add(q_hi_dd.lo, -den_r, t1);
+  q.lo = t2 / den_r;
+#endif // LIBC_TARGET_CPU_HAS_FMA
+
+  fputil::DoubleDouble q2 = fputil::quick_mult(q, q);
+  fputil::DoubleDouble p_dd =
+      fputil::polyeval(q2, COEFFS[0], COEFFS[1], COEFFS[2], COEFFS[3],
+                       COEFFS[4], COEFFS[5], COEFFS[6], COEFFS[7], COEFFS[8]);
+  fputil::DoubleDouble r_dd =
+      fputil::add(const_term, fputil::multiply_add(q, p_dd, ATAN_I[idx]));
+  r_dd.hi *= final_sign;
+  r_dd.lo *= final_sign;
+
+  // Make sure the sum is normalized:
+  fputil::DoubleDouble rr = fputil::exact_add(r_dd.hi, r_dd.lo);
+  // Round to odd.
+  uint64_t rr_bits = cpp::bit_cast<uint64_t>(rr.hi);
+  if (LIBC_UNLIKELY(((rr_bits & 0xfff'ffff) == 0) && (rr.lo != 0.0))) {
+    Sign hi_sign = fputil::FPBits<double>(rr.hi).sign();
+    Sign lo_sign = fputil::FPBits<double>(rr.lo).sign();
+    if (hi_sign == lo_sign) {
+      ++rr_bits;
+    } else if ((rr_bits & fputil::FPBits<double>::FRACTION_MASK) > 0) {
+      --rr_bits;
+    }
+  }
+
+  return static_cast<float>(cpp::bit_cast<double>(rr_bits));
+}
+
+} // namespace LIBC_NAMESPACE
diff --git a/libc/src/math/generic/atanf.cpp b/libc/src/math/generic/atanf.cpp
index 4adda429cc041c..a58bbdd5033f87 100644
--- a/libc/src/math/generic/atanf.cpp
+++ b/libc/src/math/generic/atanf.cpp
@@ -81,11 +81,6 @@ LLVM_LIBC_FUNCTION(float, atanf, (float x)) {
   int idx;
 
   if (x_abs > 0x3f80'0000U) {
-    // Exceptional value:
-    if (LIBC_UNLIKELY(x_abs == 0x3ffe'2ec1U)) { // |x| = 0x1.fc5d82p+0
-      return sign.is_pos() ? fputil::round_result_slightly_up(0x1.1ab2fp0f)
-                           : fputil::round_result_slightly_down(-0x1.1ab2fp0f);
-    }
     // |x| > 1, we need to invert x, so we will perform range reduction in
     // double precision.
     x_d = 1.0 / static_cast<double>(x_bits.get_val());
@@ -98,10 +93,9 @@ LLVM_LIBC_FUNCTION(float, atanf, (float x)) {
                                       SIGNED_PI_OVER_2[sign.is_neg()]);
   } else {
     // Exceptional value:
-    if...
[truncated]

Copy link
Member

@nickdesaulniers nickdesaulniers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once a function starts pushing 100+ lines, consider if it makes sense to start breaking up portions into smaller static functions. Usually having another block of local variables declared in the middle of the function is a sign that a function can be made more modular. This has the added benefit of reducing the lifetime of variables which results in reduced stack usage due to improvements in stack slot reuse.

Comment on lines 203 to 206
int idx;

double k_d = fputil::nearest_integer(q_d * 0x1.0p4f);
idx = static_cast<int>(k_d);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combine declaration + assignment of idx.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 70 to 72
uint64_t cc = 0;
float mx, my, mr = 0.0;
double tol = 0.5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really sure what cc or tol is shorthand for. Consider giving these more descriptive identifiers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

uint32_t y_abs = y_bits.uintval();
uint32_t max_abs = x_abs > y_abs ? x_abs : y_abs;
uint32_t min_abs = x_abs <= y_abs ? x_abs : y_abs;
bool recip = x_abs < y_abs;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recip is not used until later, after the (unlikely) conditional that may return early without needing it. Consider sinking this closer to the use. Perhaps some of the expressions that the value of recip depends on can be sunk too, if they're also unused on the if statement starting on L173.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@lntue
Copy link
Contributor Author

lntue commented Mar 27, 2024

Once a function starts pushing 100+ lines, consider if it makes sense to start breaking up portions into smaller static functions. Usually having another block of local variables declared in the middle of the function is a sign that a function can be made more modular. This has the added benefit of reducing the lifetime of variables which results in reduced stack usage due to improvements in stack slot reuse.

I split the accurate pass into a separate function.

@zimmermann6
Copy link

on the performance side, on a Intel(R) Xeon(R) Silver 4214, I get:

zimmerma@croissant:~/svn/core-math$ LIBM=$L CORE_MATH_PERF_MODE=rdtsc ./perf.sh atan2f
GNU libc version: 2.37
GNU libc release: stable
[####################] 100 %
Ntrial = 20 ; Min = 21.403 + 0.730 clc/call; Median-Min = 0.526 clc/call; Max = 23.978 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 72.942 + 1.122 clc/call; Median-Min = 1.153 clc/call; Max = 75.873 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 25.242 + 0.662 clc/call; Median-Min = 0.613 clc/call; Max = 26.245 clc/call;
zimmerma@croissant:~/svn/core-math$ PERF_ARGS=--latency LIBM=$L CORE_MATH_PERF_MODE=rdtsc ./perf.sh atan2f
GNU libc version: 2.37
GNU libc release: stable
[####################] 100 %
Ntrial = 20 ; Min = 59.164 + 1.289 clc/call; Median-Min = 1.221 clc/call; Max = 61.282 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 99.246 + 2.177 clc/call; Median-Min = 2.411 clc/call; Max = 103.292 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 63.599 + 0.823 clc/call; Median-Min = 0.804 clc/call; Max = 65.413 clc/call;

which means a reciprocal throughput of 25.2 for LLVM (against 21.4 for CORE-MATH and 72.9 for the GNU libc), and a latency of 63.6 (against 59.2 and 99.2)

@lntue
Copy link
Contributor Author

lntue commented Mar 28, 2024

on the performance side, on a Intel(R) Xeon(R) Silver 4214, I get:

zimmerma@croissant:~/svn/core-math$ LIBM=$L CORE_MATH_PERF_MODE=rdtsc ./perf.sh atan2f
GNU libc version: 2.37
GNU libc release: stable
[####################] 100 %
Ntrial = 20 ; Min = 21.403 + 0.730 clc/call; Median-Min = 0.526 clc/call; Max = 23.978 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 72.942 + 1.122 clc/call; Median-Min = 1.153 clc/call; Max = 75.873 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 25.242 + 0.662 clc/call; Median-Min = 0.613 clc/call; Max = 26.245 clc/call;
zimmerma@croissant:~/svn/core-math$ PERF_ARGS=--latency LIBM=$L CORE_MATH_PERF_MODE=rdtsc ./perf.sh atan2f
GNU libc version: 2.37
GNU libc release: stable
[####################] 100 %
Ntrial = 20 ; Min = 59.164 + 1.289 clc/call; Median-Min = 1.221 clc/call; Max = 61.282 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 99.246 + 2.177 clc/call; Median-Min = 2.411 clc/call; Max = 103.292 clc/call;
[####################] 100 %
Ntrial = 20 ; Min = 63.599 + 0.823 clc/call; Median-Min = 0.804 clc/call; Max = 65.413 clc/call;

which means a reciprocal throughput of 25.2 for LLVM (against 21.4 for CORE-MATH and 72.9 for the GNU libc), and a latency of 63.6 (against 59.2 and 99.2)

Thanks Paul for checking it! I've updated the patch to use a different polynomial for small quotients, so that a separate branch is not needed. It helps slightly improve the performance.
@zimmermann6

// s is single precsion,
// 0 < k < 16 is an integer.
// This is used when FMA instruction is not available.
[[maybe_unused]] LIBC_INLINE constexpr fputil::DoubleDouble split_d(double a) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should LIBC_INLINE only be used for function definitions that are in headers? (Here and below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want them to be inline, but using the same macro so that it's less confusing and mix-matched between inline and LIBC_INLINE in our code base.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not the semantics of this keyword. https://gist.github.com/nico/93df46db7f5e1d0bd941d63663db95b1

The inline keyword is much more watered down than its name implies.

Simply by putting it in an anonymous namespace outside of a header (so that it's no longer externally visible) and having only one call site (and enabling any optimization level), the inline cost model (in LLVM) will provide an near-infinite discount to the "cost" of inlining.

https://godbolt.org/z/c53vb4KEr

Once inlined, the definition no longer has any callers and is dead code eliminated.

You can verify this via:

diff --git a/libc/src/math/generic/atan2f.cpp b/libc/src/math/generic/atan2f.cpp
index d50ac6608836..c8e943d13466 100644
--- a/libc/src/math/generic/atan2f.cpp
+++ b/libc/src/math/generic/atan2f.cpp
@@ -81,7 +81,7 @@ constexpr fputil::DoubleDouble COEFFS[9] = {
 //                     = 33.
 // Thus, the Veltkamp splitting constant is C = 2^33 + 1.
 // This is used when FMA instruction is not available.
-[[maybe_unused]] LIBC_INLINE constexpr fputil::DoubleDouble split_d(double a) {
+[[maybe_unused]] constexpr fputil::DoubleDouble split_d(double a) {
   fputil::DoubleDouble r{0.0, 0.0};
   constexpr double C = 0x1.0p33 + 1.0;
   double t1 = C * a;
@@ -98,7 +98,7 @@ constexpr fputil::DoubleDouble COEFFS[9] = {
 //   idx, k_d   = round( 2^4 * num_d / den_d )
 //   final_sign = sign of the final result
 //   const_term = the constant term in the final expression.
-LIBC_INLINE float atan2f_double_double(double num_d, double den_d, double q_d,
+float atan2f_double_double(double num_d, double den_d, double q_d,
                                        int idx, double k_d, double final_sign,
                                        const fputil::DoubleDouble &const_term) {
   fputil::DoubleDouble q;
$ ninja libc
$ llvm-nm ./libc/src/math/generic/CMakeFiles/libc.src.math.generic.atan2f.dir/atan2f.cpp.o | llvm-cxxfilt | grep -e split_d -e atan2f_double_double
$ llvm-objdump -dr ./libc/src/math/generic/CMakeFiles/libc.src.math.generic.atan2f.dir/atan2f.cpp.o | llvm-cxxfilt | grep call -A1
     2a6: e8 00 00 00 00               	callq	0x2ab <atan2f+0x2ab>
< call to polyeval >
$

If you want stronger guarantees wrt. inline substitution, you must use the function attributes always_inline or noinline. Those are heavy handed, and I don't recommend using them until one has benchmarks or disassembly showing they're not getting inlining and need it. (I've seen codebases put always_inline on everything, at the extreme cost to compile time and resulting code size). They should not be necessary here.

inline has interesting semantics when used with static or extern on definitions in a header. There's also a function attribute gnu_inline, changes to extern inline in C99 (and newer) modes, and a -f flag for gnu_inline/C90 extern inline semantics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the link! I removed the LIBC_INLINE from these 2 functions.

@zimmermann6
Copy link

zimmermann6 commented Mar 29, 2024

about split_d, I wonder why you use C=0x1p33+1, which makes hi fit into 53-33=20 bits, thus hi * s * k fits into 20+24+4=48 bits.
You could make hi fit into 5 more bits with C=0x1p28+1. Also note that Veltkamp's algorithm also works for directed roundings, see https://inria.hal.science/hal-04480440

@lntue
Copy link
Contributor Author

lntue commented Mar 29, 2024

about split_d, I wonder why you use C=0x1p33+1, which makes hi fit into 53-33=20 bits, thus hi * s * k fits into 20+24+4=48 bits. You could make hi fit into 5 more bits with C=0x1p28+1. Also note that Veltkamp's algorithm also works for directed roundings, see https://inria.hal.science/hal-04480440

Hi Paul, sorry the comment explaining split_d cosntant was not entirely correct and needs to be updated. The requirement for splitting is that the following needs to be exact:

   hi * den_r = hi * (k_d * num_d + den_d)

where

  num_d, den_d are single precision
  1/16 <= k_d <= 1

So the precision of den_r is:

  prec(den_r) = (1 + log2(msb(den_d)) - log2(lsb(k_d * num_d))) + 1                    (the last +1 is for overflow from addition)
              = (1 + log2(msb(num_d)) + 4 - log2(lsb(k_d)) - log2(lsb(num_d))) + 1     (num_d / den_d >= 1/16)
              = (1 + 23 + log2(lsb(num_d)) + 4 - (-4) - log2(lsb(num_d))) + 1
              = 33

The following example will demonstrate that 33 is sharp:
For atan2f( 0x1.781fcp+28f, 0x1.dcb3cap+23f ), if we use C = 0x1.0p32 + 1.0 in the Veltkamp's algorithm

  num_d = 0x1.dcb3cap+23
  den_d = 0x1.781fcp+28
  num_r = -0x1.138bb6p+23
  den_r = 0x1.790e19e5p+28
   q = -0x1.76294b4d835fap-6
  hi = -0x1.76295p-6
                hi * den_r = -0x1.138bb9758de248p23
  round(hi * den_r, D, RN) = -0x1.138bb9758de24p23

@zimmermann6

@lntue lntue merged commit 2be7225 into llvm:main Apr 1, 2024
@lntue lntue deleted the atan2f branch April 1, 2024 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants