[locale][num_get] Improve Stage 2 of string to float conversion #65168

ldionne · 2023-09-01T13:13:40Z

https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.2

"Stage 2" of num_get::do_get() depends on "a check ... to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1". Previously this was a very simple check whether the next character was in a set of allowed characters. This could lead to Stage 2 accumulating character sequences such as "1.2f" and passing them to strtold (Stage 3).
https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.3

Stage 3 can fail, however, if the entire character sequence from Stage 2 is not used in the conversion. For example, the "f" in "1.2f" is not used. https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.4

As a result, parsing a sequence like "1.2f" would return value 0.0 with failbit set.

This change improves the checks made in Stage 2, determining what is passed to Stage 3.

Hex digits are only considered valid if "0x" has been seen
INFINITY value is recognised
Characters in INFINITY and NAN are only valid in sequence. This is done by checking one character backwards, which has obvious limitations.
New tests are added. The old ones are preserved but refactored.

Differential Revision: https://reviews.llvm.org/D99091

https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.2 "Stage 2" of num_get::do_get() depends on "a check ... to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1". Previously this was a very simple check whether the next character was in a set of allowed characters. This could lead to Stage 2 accumulating character sequences such as "1.2f" and passing them to strtold (Stage 3). https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.3 Stage 3 can fail, however, if the entire character sequence from Stage 2 is not used in the conversion. For example, the "f" in "1.2f" is not used. https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.4 As a result, parsing a sequence like "1.2f" would return value 0.0 with failbit set. This change improves the checks made in Stage 2, determining what is passed to Stage 3. * Hex digits are only considered valid if "0x" has been seen * INFINITY value is recognised * Characters in INFINITY and NAN are only valid in sequence. This is done by checking one character backwards, which has obvious limitations. * New tests are added. The old ones are preserved but refactored. Differential Revision: https://reviews.llvm.org/D99091

EricWF · 2023-09-06T19:11:42Z

It seems possible to ship this without breaking the ABI. And assuming the new behavior doesn't start to reject meaningful previously accepted code then I think we should ship it retroactively.

frederick-vs-ja · 2023-09-20T01:40:50Z

LWG2381 (not implemented in libc++ yet) intendedly rejects Inf and NaN. Should we restrict the handling of Inf and NaN to some extension mode?

mordante · 2023-12-24T14:47:29Z

It seems possible to ship this without breaking the ABI. And assuming the new behavior doesn't start to reject meaningful previously accepted code then I think we should ship it retroactively.

I also don't see why this has to be an ABI break. What is the reason?

LWG2381 (not implemented in libc++ yet) intendedly rejects Inf and NaN. Should we restrict the handling of Inf and NaN to some extension mode?

IMO we should just implement the LWG issue and not have an extension mode.

ldionne · 2024-01-12T19:56:47Z

I'm more than happy to drop this PR if it's not valid anymore. I was just trying to salvage it from https://reviews.llvm.org/D99091, but honestly I'm not very fluent in this part of the library so progress has been almost non-existent, and this is a bit low on my priority list. I'll drop this.

…7948) This PR implements [LWG2381](https://cplusplus.github.io/LWG/issue2381) by rejecting `'i'`, `'I'`, `'n'`, `'N'` in FP parsing, as inf and NaN are intendedly rejected by that LWG issue. The source character array used for parsing is `"0123456789abcdefABCDEFxX+-pPiInN"`, whose first 26 or 28 characters are used for parsing integers or floating-point values respectively. Previously, libc++ used 32 characters, including `'i'`, `'I'`, `'n'`, `'N'`, for FP parsing, which was inconsistent with LWG2381. This PR also replaces magic numbers 26 and 28 (formerly 32) with named constants. Drive-by change: when the first character (possibly after the leading `'+'` or `'-'`) is not a decimal digit but an acceptable character (e.g., `'p'` or `'e'`), the character is not accumulated now (per Stage 2 in [facet.num.get.virtuals]/3). #65168 may be rendered invalid, see #65168 (comment). Apple back-deployment targets remain broken, likely due to dylib. XFAIL is marked in related tests. --------- Co-authored-by: Mark de Wever <[email protected]>

…7948) This PR implements [LWG2381](https://cplusplus.github.io/LWG/issue2381) by rejecting `'i'`, `'I'`, `'n'`, `'N'` in FP parsing, as inf and NaN are intendedly rejected by that LWG issue. The source character array used for parsing is `"0123456789abcdefABCDEFxX+-pPiInN"`, whose first 26 or 28 characters are used for parsing integers or floating-point values respectively. Previously, libc++ used 32 characters, including `'i'`, `'I'`, `'n'`, `'N'`, for FP parsing, which was inconsistent with LWG2381. This PR also replaces magic numbers 26 and 28 (formerly 32) with named constants. Drive-by change: when the first character (possibly after the leading `'+'` or `'-'`) is not a decimal digit but an acceptable character (e.g., `'p'` or `'e'`), the character is not accumulated now (per Stage 2 in [facet.num.get.virtuals]/3). #65168 may be rendered invalid, see llvm/llvm-project#65168 (comment). Apple back-deployment targets remain broken, likely due to dylib. XFAIL is marked in related tests. --------- Co-authored-by: Mark de Wever <[email protected]> NOKEYCHECK=True GitOrigin-RevId: 3e15c97fa3812993bdc319827a5c6d867b765ae8

ldionne added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Sep 1, 2023

ldionne self-assigned this Sep 1, 2023

ldionne force-pushed the wip/num_get-string-to-float-conversion branch from a69e54d to 18800a8 Compare September 1, 2023 13:24

frederick-vs-ja mentioned this pull request Jan 12, 2024

[libc++] LWG2381: Inconsistency in parsing floating point numbers #77948

Merged

ldionne closed this Jan 12, 2024

ldionne deleted the wip/num_get-string-to-float-conversion branch January 12, 2024 19:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[locale][num_get] Improve Stage 2 of string to float conversion #65168

[locale][num_get] Improve Stage 2 of string to float conversion #65168

Uh oh!

ldionne commented Sep 1, 2023

Uh oh!

EricWF commented Sep 6, 2023

Uh oh!

frederick-vs-ja commented Sep 20, 2023

Uh oh!

mordante commented Dec 24, 2023

Uh oh!

ldionne commented Jan 12, 2024

Uh oh!

Uh oh!

[locale][num_get] Improve Stage 2 of string to float conversion #65168

[locale][num_get] Improve Stage 2 of string to float conversion #65168

Uh oh!

Conversation

ldionne commented Sep 1, 2023

Uh oh!

EricWF commented Sep 6, 2023

Uh oh!

frederick-vs-ja commented Sep 20, 2023

Uh oh!

mordante commented Dec 24, 2023

Uh oh!

ldionne commented Jan 12, 2024

Uh oh!

Uh oh!