clang: Relax LangOpts checks when lexing quoted numbers during preprocessing #95798

nishithshah2211 · 2024-06-17T14:51:35Z

When trying to lex numeric constants that use single quotes as separators, if the lexer is parsing a preprocessor directive, we can relax the language options check. These checks will be enforced in a later phase after preprocessing/scanning.

This commit is a second attempt to fix
#88896. The first attempt to fix this is here: #93753, which had to be reverted because of the discussions in that same PR.

…cessing When trying to lex numeric constants that use single quotes as separators, if the lexer is parsing a preprocessor directive, we can relax the language options check. These checks will be enforced in a later phase after preprocessing/scanning. This commit is a second attempt to fix llvm#88896. The first attempt to fix this is here: llvm#93753, which had to be reverted because of the discussions in that same PR.

llvmbot · 2024-06-17T14:52:03Z

@llvm/pr-subscribers-clang

Author: Nishith Kumar M Shah (nishithshah2211)

Changes

When trying to lex numeric constants that use single quotes as separators, if the lexer is parsing a preprocessor directive, we can relax the language options check. These checks will be enforced in a later phase after preprocessing/scanning.

This commit is a second attempt to fix
#88896. The first attempt to fix this is here: #93753, which had to be reverted because of the discussions in that same PR.

Full diff: https://github.com/llvm/llvm-project/pull/95798.diff

2 Files Affected:

(modified) clang/lib/Lex/Lexer.cpp (+2-1)
(modified) clang/unittests/Lex/DependencyDirectivesScannerTest.cpp (+29)

diff --git a/clang/lib/Lex/Lexer.cpp b/clang/lib/Lex/Lexer.cpp
index e59c7805b3862..60c101871be75 100644
--- a/clang/lib/Lex/Lexer.cpp
+++ b/clang/lib/Lex/Lexer.cpp
@@ -2068,7 +2068,8 @@ bool Lexer::LexNumericConstant(Token &Result, const char *CurPtr) {
   }
 
   // If we have a digit separator, continue.
-  if (C == '\'' && (LangOpts.CPlusPlus14 || LangOpts.C23)) {
+  if (C == '\'' &&
+      (LangOpts.CPlusPlus14 || LangOpts.C23 || ParsingPreprocessorDirective)) {
     auto [Next, NextSize] = getCharAndSizeNoWarn(CurPtr + Size, LangOpts);
     if (isAsciiIdentifierContinue(Next)) {
       if (!isLexingRawMode())
diff --git a/clang/unittests/Lex/DependencyDirectivesScannerTest.cpp b/clang/unittests/Lex/DependencyDirectivesScannerTest.cpp
index 59fef9ecbb9c9..0d2d98b16f350 100644
--- a/clang/unittests/Lex/DependencyDirectivesScannerTest.cpp
+++ b/clang/unittests/Lex/DependencyDirectivesScannerTest.cpp
@@ -7,6 +7,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "clang/Lex/DependencyDirectivesScanner.h"
+#include "clang/Basic/TokenKinds.h"
 #include "llvm/ADT/SmallString.h"
 #include "gtest/gtest.h"
 
@@ -1012,4 +1013,32 @@ TEST(MinimizeSourceToDependencyDirectivesTest, TokensBeforeEOF) {
   EXPECT_STREQ("#ifndef A\n#define A\n#endif\n<TokBeforeEOF>\n", Out.data());
 }
 
+TEST(MinimizeSourceToDependencyDirectivesTest, QuoteSeparatedNumber) {
+  SmallVector<char, 128> Out;
+  SmallVector<dependency_directives_scan::Token, 4> Tokens;
+  SmallVector<Directive, 4> Directives;
+
+  StringRef Source = R"(
+#if 123'124
+#endif
+)";
+
+  ASSERT_FALSE(
+      minimizeSourceToDependencyDirectives(Source, Out, Tokens, Directives));
+  EXPECT_STREQ("#if 123'124\n#endif\n", Out.data());
+  ASSERT_EQ(Directives.size(), 3u);
+  EXPECT_EQ(Directives[0].Kind, dependency_directives_scan::pp_if);
+  EXPECT_EQ(Directives[1].Kind, dependency_directives_scan::pp_endif);
+  EXPECT_EQ(Directives[2].Kind, dependency_directives_scan::pp_eof);
+  ASSERT_EQ(Tokens.size(), 7u);
+
+  ASSERT_TRUE(Tokens[0].is(tok::hash));
+  ASSERT_TRUE(Tokens[1].is(tok::raw_identifier));   // "if"
+  ASSERT_TRUE(Tokens[2].is(tok::numeric_constant)); // 123'124
+  ASSERT_TRUE(Tokens[3].is(tok::eod));
+  ASSERT_TRUE(Tokens[4].is(tok::hash));
+  ASSERT_TRUE(Tokens[5].is(tok::raw_identifier)); // #endif
+  ASSERT_TRUE(Tokens[6].is(tok::eod));
+}
+
 } // end anonymous namespace

nishithshah2211 · 2024-06-17T14:53:07Z

@jansvoboda11 - I put up this PR to get your thoughts on addressing #88896, since the previous commit was reverted. Is this more along the lines of what you were thinking?

I will push more changes to include a test for the Lexer too.

AaronBallman

The changes should come with a release note in clang/docs/ReleaseNotes.rst so users know about the fix.

The changes otherwise look correct, but I'd like to hear from @jansvoboda11 as well (especially in light of the question I raised on the original PR).

nishithshah2211 · 2024-06-17T15:31:05Z

The changes should come with a release note in clang/docs/ReleaseNotes.rst so users know about the fix.

Noted. Will include a release note, a lexer test and address any other comments for the next commit.

jansvoboda11 · 2024-06-21T23:33:18Z

clang/lib/Lex/Lexer.cpp

@@ -2068,7 +2068,8 @@ bool Lexer::LexNumericConstant(Token &Result, const char *CurPtr) {
  }

  // If we have a digit separator, continue.
-  if (C == '\'' && (LangOpts.CPlusPlus14 || LangOpts.C23)) {
+  if (C == '\'' &&
+      (LangOpts.CPlusPlus14 || LangOpts.C23 || ParsingPreprocessorDirective)) {


Why is ParsingPreprocessorDirective used here?

I think I would prefer introducing new LangOptions::AllowLiteralDigitSeparator member that would be set to true whenever CPlusPlus14 || C23 evaluates to true and that could be set to true explicitly from the scanner. WDYT?

I'm not certain that's a pattern we should get into -- we don't want a LangOption per language feature shared between languages unless it's something the user can enable or disable themselves via the command line.

I'd argue that LangOptions are options internal to the compiler and not necessarily exposed via -cc1 (let alone the driver). The tooling use case is relevant too, IMO.

There's not a one-to-one mapping of LangOptions to frontend of driver options, yes. But I think it's a stretch to argue that the standard mode tracked by LangOptions is not exposed via flags.

I don't think per-feature LangOptions is viable. There's digit separators, hex float support, binary literal support, different literal suffixes, various keywords, etc and that doesn't seem likely to scale well. Also, it makes it seem like we support a la carte language features when we don't -- we intentionally don't want to let users disable standard features (in general; exceptions exist) and so that approach gives the impression contributors need to check both language mode and feature availability which is a maintenance concern.

I'm still wondering about the question I asked here: #93753 (comment)

But I think it's a stretch to argue that the standard mode tracked by LangOptions is not exposed via flags.

Not sure I understand. I'm suggesting that in the future, parts of the lexer behavior could be controlled in a way that's hidden from users. And I'm saying that we are already hiding existing LangOptions fields from users (e.g. LangOptions::CompilingPCH).

I don't think per-feature LangOptions is viable. There's digit separators, hex float support, binary literal support, different literal suffixes, various keywords, etc and that doesn't seem likely to scale well.

This doesn't need to scale. We'd only do this for options that control lexing. FWIW, code that says "literal digit separators are enabled in C++14+, C23+, or when explicitly allowed" seems more logical to me than checking ParsingPreprocessorDirective, which doesn't really explain why we're suddenly allowing digit separators.

Also, it makes it seem like we support a la carte language features when we don't -- we intentionally don't want to let users disable standard features (in general; exceptions exist) and so that approach gives the impression contributors need to check both language mode and feature availability which is a maintenance concern.

I think documenting the cases where we introduce this extra knob should pretty much clear up any confusion and prevent contributors cargo culting that approach to options that don't need it.

I'm still wondering about the question I asked here: #93753 (comment)

Are there any comments that weren't addressed by my reply?

@jansvoboda11, @cor3ntin -- has your opinion changed at all? I'm still pretty strongly opposed to adding feature-specific flags to LangOptions for supporting this in dependency scanning. CC @erichkeane @Sirraide @Fznamznon for more opinions as well.

No, I still see value in adding orthogonal flags for small features. We don't have to expose them to users via command line flags, they can be purely an implementation detail for implementing the dependency scanner. My argument is still that most of the Clang codebase doesn't need to know about dependency scanning and that having one flag for the dependency scanner makes it way too easy fracture the codebase just for this one Clang-based tool.

Case in point, I recently made a series of changes to the PCM file format in order to remove some needless deserialization overhead the dependency scanner uncovered. If we had flag that signifies a "scanner mode", I would've special-cased my changes for that use case, since it'd be easier to implement and test. But by not having that flag (and by not wanting to introduce it), I was forced to figure out how to make the changes generic, and now Clang itself and all Clang-based tools benefit.

I feel like another issue is that just accepting separators uncoditionally in preprocessor directives (i.e. even outside of dependency scanning) would technically be a language extension... doesn’t that mean that we’d technically have to add a compatibility warning? Because we’re usually pretty strict about doing that for things that are effectively a compiler extension.

Sure, but we don't have to expose that language extension to users, it can remain internal to the compiler.

Also, I’m not really that familiar w/ the dependency scanner, but I’m still confused as to why passing down the LangOpts isn’t an option: in #93753, it was pointed out that that would impair caching the results of the scan—but isn’t it already the case that e.g. what files something depends on is dependent on the language standard? E.g. what if I write:

#if __cplusplus >= WhateverTheValueForC++17IsAgain # include "foo.hh" #else # include "bar.hh" #endif

Does it just note that the file might depend on both foo.hh and bar.hh? Because if so, can’t we just set the language standard in the lang opts for the dependency scanner to the latest e.g. C++ standard? Because that would also fix this issue.

The scanner works in two steps:

The virtual file system "minimizes" source files into sequence of tokens that are relevant for dependency discovery. Your code fragment would be transformed into something like [if, ident, gte, ident, then, include, string, else, include, string, endif].

The scanner instance takes those tokens and the TU LanguageOptions and runs full preprocessor on those tokens. This means that the condition should evaluate correctly.

Having (1) be context-free is important for performance.

On high level, the goal of #93753 is to respect LanguageOptions in the scanner. My argument is that we should be able to do so by representing tokens in the stream produced by (1) such that we're still able to emit context-aware, LanguageOptions-driven diagnostics in (2).

That solves the issue outlined by #93753 without impacting the performance.

I mean, we don’t need a language option for every feature either, but adding them on an as-needed basis doesn’t seem completely untenable.

+1

Sure, but we don't have to expose that language extension to users, it can remain internal to the compiler.

Yeah, that’s fine. What I meant was if we were to just use ParsingPreprocessorDirective, we would be exposing it to users, so that’s probably not the way to go imo.

The scanner works in two steps:

The virtual file system "minimizes" source files into sequence of tokens that are relevant for dependency discovery. Your code fragment would be transformed into something like [if, ident, gte, ident, then, include, string, else, include, string, endif].

The scanner instance takes those tokens and the TU LanguageOptions and runs full preprocessor on those tokens. This means that the condition should evaluate correctly.

I see; in that case being more permissive in what we accept during step 1 makes perfect sense imo. I don’t think there are that many language features that affect this step either since that’s basically just preprocessor directives from what I gather, so adding language options for whatever features we need to enable for this seems reasonable.

Sure, but we don't have to expose that language extension to users, it can remain internal to the compiler.

This still doesn't address my concerns from earlier about how it impacts maintenance of Clang itself:

I don't think per-feature LangOptions is viable. There's digit separators, hex float support, binary literal support, different literal suffixes, various keywords, etc and that doesn't seem likely to scale well. Also, it makes it seem like we support a la carte language features when we don't -- we intentionally don't want to let users disable standard features (in general; exceptions exist) and so that approach gives the impression contributors need to check both language mode and feature availability which is a maintenance concern.

How do we avoid this situation, as that's my primary concern.

Yeah, I still favor a "DependencyScanning" option over feature specific flags, it seems much simpler to maintain and document in the long run, for exactly the same reason as @AaronBallman

llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Jun 17, 2024

AaronBallman reviewed Jun 17, 2024

View reviewed changes

jansvoboda11 reviewed Jun 21, 2024

View reviewed changes

jansvoboda11 mentioned this pull request Nov 21, 2024

Pass LangOpts from CompilerInstance to DependencyScanningWorker #93753

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

clang: Relax LangOpts checks when lexing quoted numbers during preprocessing #95798

clang: Relax LangOpts checks when lexing quoted numbers during preprocessing #95798

Uh oh!

nishithshah2211 commented Jun 17, 2024

Uh oh!

llvmbot commented Jun 17, 2024

Uh oh!

nishithshah2211 commented Jun 17, 2024

Uh oh!

AaronBallman left a comment

Uh oh!

nishithshah2211 commented Jun 17, 2024

Uh oh!

jansvoboda11 Jun 21, 2024

Uh oh!

AaronBallman Jun 24, 2024

Uh oh!

jansvoboda11 Jul 2, 2024

Uh oh!

AaronBallman Jul 3, 2024

Uh oh!

jansvoboda11 Jul 3, 2024

Uh oh!

jansvoboda11 Nov 21, 2024

Uh oh!

jansvoboda11 Nov 21, 2024

Uh oh!

Sirraide Nov 21, 2024

Uh oh!

AaronBallman Nov 22, 2024

Uh oh!

cor3ntin Nov 26, 2024

Uh oh!

Uh oh!

clang: Relax LangOpts checks when lexing quoted numbers during preprocessing #95798

Are you sure you want to change the base?

clang: Relax LangOpts checks when lexing quoted numbers during preprocessing #95798

Uh oh!

Conversation

nishithshah2211 commented Jun 17, 2024

Uh oh!

llvmbot commented Jun 17, 2024

Uh oh!

nishithshah2211 commented Jun 17, 2024

Uh oh!

AaronBallman left a comment

Choose a reason for hiding this comment

Uh oh!

nishithshah2211 commented Jun 17, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!