Skip to content

clang: Relax LangOpts checks when lexing quoted numbers during preprocessing #95798

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

nishithshah2211
Copy link
Contributor

When trying to lex numeric constants that use single quotes as separators, if the lexer is parsing a preprocessor directive, we can relax the language options check. These checks will be enforced in a later phase after preprocessing/scanning.

This commit is a second attempt to fix
#88896. The first attempt to fix this is here: #93753, which had to be reverted because of the discussions in that same PR.

…cessing

When trying to lex numeric constants that use single quotes as
separators, if the lexer is parsing a preprocessor directive, we can
relax the language options check. These checks will be enforced in a
later phase after preprocessing/scanning.

This commit is a second attempt to fix
llvm#88896. The first attempt to
fix this is here: llvm#93753, which
had to be reverted because of the discussions in that same PR.
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels Jun 17, 2024
@llvmbot
Copy link
Member

llvmbot commented Jun 17, 2024

@llvm/pr-subscribers-clang

Author: Nishith Kumar M Shah (nishithshah2211)

Changes

When trying to lex numeric constants that use single quotes as separators, if the lexer is parsing a preprocessor directive, we can relax the language options check. These checks will be enforced in a later phase after preprocessing/scanning.

This commit is a second attempt to fix
#88896. The first attempt to fix this is here: #93753, which had to be reverted because of the discussions in that same PR.


Full diff: https://github.com/llvm/llvm-project/pull/95798.diff

2 Files Affected:

  • (modified) clang/lib/Lex/Lexer.cpp (+2-1)
  • (modified) clang/unittests/Lex/DependencyDirectivesScannerTest.cpp (+29)
diff --git a/clang/lib/Lex/Lexer.cpp b/clang/lib/Lex/Lexer.cpp
index e59c7805b3862..60c101871be75 100644
--- a/clang/lib/Lex/Lexer.cpp
+++ b/clang/lib/Lex/Lexer.cpp
@@ -2068,7 +2068,8 @@ bool Lexer::LexNumericConstant(Token &Result, const char *CurPtr) {
   }
 
   // If we have a digit separator, continue.
-  if (C == '\'' && (LangOpts.CPlusPlus14 || LangOpts.C23)) {
+  if (C == '\'' &&
+      (LangOpts.CPlusPlus14 || LangOpts.C23 || ParsingPreprocessorDirective)) {
     auto [Next, NextSize] = getCharAndSizeNoWarn(CurPtr + Size, LangOpts);
     if (isAsciiIdentifierContinue(Next)) {
       if (!isLexingRawMode())
diff --git a/clang/unittests/Lex/DependencyDirectivesScannerTest.cpp b/clang/unittests/Lex/DependencyDirectivesScannerTest.cpp
index 59fef9ecbb9c9..0d2d98b16f350 100644
--- a/clang/unittests/Lex/DependencyDirectivesScannerTest.cpp
+++ b/clang/unittests/Lex/DependencyDirectivesScannerTest.cpp
@@ -7,6 +7,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "clang/Lex/DependencyDirectivesScanner.h"
+#include "clang/Basic/TokenKinds.h"
 #include "llvm/ADT/SmallString.h"
 #include "gtest/gtest.h"
 
@@ -1012,4 +1013,32 @@ TEST(MinimizeSourceToDependencyDirectivesTest, TokensBeforeEOF) {
   EXPECT_STREQ("#ifndef A\n#define A\n#endif\n<TokBeforeEOF>\n", Out.data());
 }
 
+TEST(MinimizeSourceToDependencyDirectivesTest, QuoteSeparatedNumber) {
+  SmallVector<char, 128> Out;
+  SmallVector<dependency_directives_scan::Token, 4> Tokens;
+  SmallVector<Directive, 4> Directives;
+
+  StringRef Source = R"(
+#if 123'124
+#endif
+)";
+
+  ASSERT_FALSE(
+      minimizeSourceToDependencyDirectives(Source, Out, Tokens, Directives));
+  EXPECT_STREQ("#if 123'124\n#endif\n", Out.data());
+  ASSERT_EQ(Directives.size(), 3u);
+  EXPECT_EQ(Directives[0].Kind, dependency_directives_scan::pp_if);
+  EXPECT_EQ(Directives[1].Kind, dependency_directives_scan::pp_endif);
+  EXPECT_EQ(Directives[2].Kind, dependency_directives_scan::pp_eof);
+  ASSERT_EQ(Tokens.size(), 7u);
+
+  ASSERT_TRUE(Tokens[0].is(tok::hash));
+  ASSERT_TRUE(Tokens[1].is(tok::raw_identifier));   // "if"
+  ASSERT_TRUE(Tokens[2].is(tok::numeric_constant)); // 123'124
+  ASSERT_TRUE(Tokens[3].is(tok::eod));
+  ASSERT_TRUE(Tokens[4].is(tok::hash));
+  ASSERT_TRUE(Tokens[5].is(tok::raw_identifier)); // #endif
+  ASSERT_TRUE(Tokens[6].is(tok::eod));
+}
+
 } // end anonymous namespace

@nishithshah2211
Copy link
Contributor Author

@jansvoboda11 - I put up this PR to get your thoughts on addressing #88896, since the previous commit was reverted. Is this more along the lines of what you were thinking?

I will push more changes to include a test for the Lexer too.

Copy link
Collaborator

@AaronBallman AaronBallman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes should come with a release note in clang/docs/ReleaseNotes.rst so users know about the fix.

The changes otherwise look correct, but I'd like to hear from @jansvoboda11 as well (especially in light of the question I raised on the original PR).

@nishithshah2211
Copy link
Contributor Author

The changes should come with a release note in clang/docs/ReleaseNotes.rst so users know about the fix.

Noted. Will include a release note, a lexer test and address any other comments for the next commit.

@@ -2068,7 +2068,8 @@ bool Lexer::LexNumericConstant(Token &Result, const char *CurPtr) {
}

// If we have a digit separator, continue.
if (C == '\'' && (LangOpts.CPlusPlus14 || LangOpts.C23)) {
if (C == '\'' &&
(LangOpts.CPlusPlus14 || LangOpts.C23 || ParsingPreprocessorDirective)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is ParsingPreprocessorDirective used here?

I think I would prefer introducing new LangOptions::AllowLiteralDigitSeparator member that would be set to true whenever CPlusPlus14 || C23 evaluates to true and that could be set to true explicitly from the scanner. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain that's a pattern we should get into -- we don't want a LangOption per language feature shared between languages unless it's something the user can enable or disable themselves via the command line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue that LangOptions are options internal to the compiler and not necessarily exposed via -cc1 (let alone the driver). The tooling use case is relevant too, IMO.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's not a one-to-one mapping of LangOptions to frontend of driver options, yes. But I think it's a stretch to argue that the standard mode tracked by LangOptions is not exposed via flags.

I don't think per-feature LangOptions is viable. There's digit separators, hex float support, binary literal support, different literal suffixes, various keywords, etc and that doesn't seem likely to scale well. Also, it makes it seem like we support a la carte language features when we don't -- we intentionally don't want to let users disable standard features (in general; exceptions exist) and so that approach gives the impression contributors need to check both language mode and feature availability which is a maintenance concern.

I'm still wondering about the question I asked here: #93753 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think it's a stretch to argue that the standard mode tracked by LangOptions is not exposed via flags.

Not sure I understand. I'm suggesting that in the future, parts of the lexer behavior could be controlled in a way that's hidden from users. And I'm saying that we are already hiding existing LangOptions fields from users (e.g. LangOptions::CompilingPCH).

I don't think per-feature LangOptions is viable. There's digit separators, hex float support, binary literal support, different literal suffixes, various keywords, etc and that doesn't seem likely to scale well.

This doesn't need to scale. We'd only do this for options that control lexing. FWIW, code that says "literal digit separators are enabled in C++14+, C23+, or when explicitly allowed" seems more logical to me than checking ParsingPreprocessorDirective, which doesn't really explain why we're suddenly allowing digit separators.

Also, it makes it seem like we support a la carte language features when we don't -- we intentionally don't want to let users disable standard features (in general; exceptions exist) and so that approach gives the impression contributors need to check both language mode and feature availability which is a maintenance concern.

I think documenting the cases where we introduce this extra knob should pretty much clear up any confusion and prevent contributors cargo culting that approach to options that don't need it.

I'm still wondering about the question I asked here: #93753 (comment)

Are there any comments that weren't addressed by my reply?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jansvoboda11, @cor3ntin -- has your opinion changed at all? I'm still pretty strongly opposed to adding feature-specific flags to LangOptions for supporting this in dependency scanning. CC @erichkeane @Sirraide @Fznamznon for more opinions as well.

No, I still see value in adding orthogonal flags for small features. We don't have to expose them to users via command line flags, they can be purely an implementation detail for implementing the dependency scanner. My argument is still that most of the Clang codebase doesn't need to know about dependency scanning and that having one flag for the dependency scanner makes it way too easy fracture the codebase just for this one Clang-based tool.

Case in point, I recently made a series of changes to the PCM file format in order to remove some needless deserialization overhead the dependency scanner uncovered. If we had flag that signifies a "scanner mode", I would've special-cased my changes for that use case, since it'd be easier to implement and test. But by not having that flag (and by not wanting to introduce it), I was forced to figure out how to make the changes generic, and now Clang itself and all Clang-based tools benefit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like another issue is that just accepting separators uncoditionally in preprocessor directives (i.e. even outside of dependency scanning) would technically be a language extension... doesn’t that mean that we’d technically have to add a compatibility warning? Because we’re usually pretty strict about doing that for things that are effectively a compiler extension.

Sure, but we don't have to expose that language extension to users, it can remain internal to the compiler.

Also, I’m not really that familiar w/ the dependency scanner, but I’m still confused as to why passing down the LangOpts isn’t an option: in #93753, it was pointed out that that would impair caching the results of the scan—but isn’t it already the case that e.g. what files something depends on is dependent on the language standard? E.g. what if I write:

#if __cplusplus >= WhateverTheValueForC++17IsAgain
#    include "foo.hh"
#else
#    include "bar.hh"
#endif

Does it just note that the file might depend on both foo.hh and bar.hh? Because if so, can’t we just set the language standard in the lang opts for the dependency scanner to the latest e.g. C++ standard? Because that would also fix this issue.

The scanner works in two steps:

  1. The virtual file system "minimizes" source files into sequence of tokens that are relevant for dependency discovery. Your code fragment would be transformed into something like [if, ident, gte, ident, then, include, string, else, include, string, endif].
  2. The scanner instance takes those tokens and the TU LanguageOptions and runs full preprocessor on those tokens. This means that the condition should evaluate correctly.

Having (1) be context-free is important for performance.

On high level, the goal of #93753 is to respect LanguageOptions in the scanner. My argument is that we should be able to do so by representing tokens in the stream produced by (1) such that we're still able to emit context-aware, LanguageOptions-driven diagnostics in (2).

That solves the issue outlined by #93753 without impacting the performance.

I mean, we don’t need a language option for every feature either, but adding them on an as-needed basis doesn’t seem completely untenable.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but we don't have to expose that language extension to users, it can remain internal to the compiler.

Yeah, that’s fine. What I meant was if we were to just use ParsingPreprocessorDirective, we would be exposing it to users, so that’s probably not the way to go imo.

The scanner works in two steps:

  1. The virtual file system "minimizes" source files into sequence of tokens that are relevant for dependency discovery. Your code fragment would be transformed into something like [if, ident, gte, ident, then, include, string, else, include, string, endif].
  2. The scanner instance takes those tokens and the TU LanguageOptions and runs full preprocessor on those tokens. This means that the condition should evaluate correctly.

I see; in that case being more permissive in what we accept during step 1 makes perfect sense imo. I don’t think there are that many language features that affect this step either since that’s basically just preprocessor directives from what I gather, so adding language options for whatever features we need to enable for this seems reasonable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but we don't have to expose that language extension to users, it can remain internal to the compiler.

This still doesn't address my concerns from earlier about how it impacts maintenance of Clang itself:

I don't think per-feature LangOptions is viable. There's digit separators, hex float support, binary literal support, different literal suffixes, various keywords, etc and that doesn't seem likely to scale well. Also, it makes it seem like we support a la carte language features when we don't -- we intentionally don't want to let users disable standard features (in general; exceptions exist) and so that approach gives the impression contributors need to check both language mode and feature availability which is a maintenance concern.

How do we avoid this situation, as that's my primary concern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I still favor a "DependencyScanning" option over feature specific flags, it seems much simpler to maintain and document in the long run, for exactly the same reason as @AaronBallman

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants