Skip to content

[locale][num_get] Improve Stage 2 of string to float conversion #65168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

ldionne
Copy link
Member

@ldionne ldionne commented Sep 1, 2023

https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.2

"Stage 2" of num_get::do_get() depends on "a check ... to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1". Previously this was a very simple check whether the next character was in a set of allowed characters. This could lead to Stage 2 accumulating character sequences such as "1.2f" and passing them to strtold (Stage 3).
https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.3

Stage 3 can fail, however, if the entire character sequence from Stage 2 is not used in the conversion. For example, the "f" in "1.2f" is not used. https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.4

As a result, parsing a sequence like "1.2f" would return value 0.0 with failbit set.

This change improves the checks made in Stage 2, determining what is passed to Stage 3.

  • Hex digits are only considered valid if "0x" has been seen

  • INFINITY value is recognised

  • Characters in INFINITY and NAN are only valid in sequence. This is done by checking one character backwards, which has obvious limitations.

  • New tests are added. The old ones are preserved but refactored.

Differential Revision: https://reviews.llvm.org/D99091

@ldionne ldionne added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Sep 1, 2023
@ldionne ldionne self-assigned this Sep 1, 2023
https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.2

"Stage 2" of num_get::do_get() depends on "a check ... to determine if c is
allowed as the next character of an input field of the conversion specifier
returned by Stage 1". Previously this was a very simple check whether the next
character was in a set of allowed characters.  This could lead to Stage 2
accumulating character sequences such as "1.2f" and passing them to strtold
(Stage 3).
https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.3

Stage 3 can fail, however, if the entire character sequence from Stage 2 is not
used in the conversion. For example, the "f" in "1.2f" is not used.
https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.4

As a result, parsing a sequence like "1.2f" would return value 0.0 with failbit
set.

This change improves the checks made in Stage 2, determining what is passed to
Stage 3.

* Hex digits are only considered valid if "0x" has been seen

* INFINITY value is recognised

* Characters in INFINITY and NAN are only valid in sequence. This is done by
  checking one character backwards, which has obvious limitations.

* New tests are added. The old ones are preserved but refactored.

Differential Revision: https://reviews.llvm.org/D99091
@ldionne ldionne force-pushed the wip/num_get-string-to-float-conversion branch from a69e54d to 18800a8 Compare September 1, 2023 13:24
@EricWF
Copy link
Member

EricWF commented Sep 6, 2023

It seems possible to ship this without breaking the ABI. And assuming the new behavior doesn't start to reject meaningful previously accepted code then I think we should ship it retroactively.

@frederick-vs-ja
Copy link
Contributor

LWG2381 (not implemented in libc++ yet) intendedly rejects Inf and NaN. Should we restrict the handling of Inf and NaN to some extension mode?

@mordante
Copy link
Member

It seems possible to ship this without breaking the ABI. And assuming the new behavior doesn't start to reject meaningful previously accepted code then I think we should ship it retroactively.

I also don't see why this has to be an ABI break. What is the reason?

LWG2381 (not implemented in libc++ yet) intendedly rejects Inf and NaN. Should we restrict the handling of Inf and NaN to some extension mode?

IMO we should just implement the LWG issue and not have an extension mode.

@ldionne
Copy link
Member Author

ldionne commented Jan 12, 2024

I'm more than happy to drop this PR if it's not valid anymore. I was just trying to salvage it from https://reviews.llvm.org/D99091, but honestly I'm not very fluent in this part of the library so progress has been almost non-existent, and this is a bit low on my priority list. I'll drop this.

@ldionne ldionne closed this Jan 12, 2024
@ldionne ldionne deleted the wip/num_get-string-to-float-conversion branch January 12, 2024 19:56
mordante added a commit that referenced this pull request May 21, 2024
…7948)

This PR implements [LWG2381](https://cplusplus.github.io/LWG/issue2381)
by rejecting `'i'`, `'I'`, `'n'`, `'N'` in FP parsing, as inf and NaN
are intendedly rejected by that LWG issue.

The source character array used for parsing is
`"0123456789abcdefABCDEFxX+-pPiInN"`, whose first 26 or 28 characters
are used for parsing integers or floating-point values respectively.
Previously, libc++ used 32 characters, including `'i'`, `'I'`, `'n'`,
`'N'`, for FP parsing, which was inconsistent with LWG2381. This PR also
replaces magic numbers 26 and 28 (formerly 32) with named constants.

Drive-by change: when the first character (possibly after the leading
`'+'` or `'-'`) is not a decimal digit but an acceptable character
(e.g., `'p'` or `'e'`), the character is not accumulated now (per Stage
2 in [facet.num.get.virtuals]/3).

#65168 may be rendered invalid, see
#65168 (comment).

Apple back-deployment targets remain broken, likely due to dylib. XFAIL
is marked in related tests.

---------

Co-authored-by: Mark de Wever <[email protected]>
blueboxd pushed a commit to blueboxd/libcxx that referenced this pull request Apr 19, 2025
…7948)

This PR implements [LWG2381](https://cplusplus.github.io/LWG/issue2381)
by rejecting `'i'`, `'I'`, `'n'`, `'N'` in FP parsing, as inf and NaN
are intendedly rejected by that LWG issue.

The source character array used for parsing is
`"0123456789abcdefABCDEFxX+-pPiInN"`, whose first 26 or 28 characters
are used for parsing integers or floating-point values respectively.
Previously, libc++ used 32 characters, including `'i'`, `'I'`, `'n'`,
`'N'`, for FP parsing, which was inconsistent with LWG2381. This PR also
replaces magic numbers 26 and 28 (formerly 32) with named constants.

Drive-by change: when the first character (possibly after the leading
`'+'` or `'-'`) is not a decimal digit but an acceptable character
(e.g., `'p'` or `'e'`), the character is not accumulated now (per Stage
2 in [facet.num.get.virtuals]/3).

#65168 may be rendered invalid, see
llvm/llvm-project#65168 (comment).

Apple back-deployment targets remain broken, likely due to dylib. XFAIL
is marked in related tests.

---------

Co-authored-by: Mark de Wever <[email protected]>
NOKEYCHECK=True
GitOrigin-RevId: 3e15c97fa3812993bdc319827a5c6d867b765ae8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants