Skip to content

clang: Relax LangOpts checks when lexing quoted numbers during preprocessing #95798

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion clang/lib/Lex/Lexer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2068,7 +2068,8 @@ bool Lexer::LexNumericConstant(Token &Result, const char *CurPtr) {
}

// If we have a digit separator, continue.
if (C == '\'' && (LangOpts.CPlusPlus14 || LangOpts.C23)) {
if (C == '\'' &&
(LangOpts.CPlusPlus14 || LangOpts.C23 || ParsingPreprocessorDirective)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is ParsingPreprocessorDirective used here?

I think I would prefer introducing new LangOptions::AllowLiteralDigitSeparator member that would be set to true whenever CPlusPlus14 || C23 evaluates to true and that could be set to true explicitly from the scanner. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain that's a pattern we should get into -- we don't want a LangOption per language feature shared between languages unless it's something the user can enable or disable themselves via the command line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue that LangOptions are options internal to the compiler and not necessarily exposed via -cc1 (let alone the driver). The tooling use case is relevant too, IMO.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's not a one-to-one mapping of LangOptions to frontend of driver options, yes. But I think it's a stretch to argue that the standard mode tracked by LangOptions is not exposed via flags.

I don't think per-feature LangOptions is viable. There's digit separators, hex float support, binary literal support, different literal suffixes, various keywords, etc and that doesn't seem likely to scale well. Also, it makes it seem like we support a la carte language features when we don't -- we intentionally don't want to let users disable standard features (in general; exceptions exist) and so that approach gives the impression contributors need to check both language mode and feature availability which is a maintenance concern.

I'm still wondering about the question I asked here: #93753 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think it's a stretch to argue that the standard mode tracked by LangOptions is not exposed via flags.

Not sure I understand. I'm suggesting that in the future, parts of the lexer behavior could be controlled in a way that's hidden from users. And I'm saying that we are already hiding existing LangOptions fields from users (e.g. LangOptions::CompilingPCH).

I don't think per-feature LangOptions is viable. There's digit separators, hex float support, binary literal support, different literal suffixes, various keywords, etc and that doesn't seem likely to scale well.

This doesn't need to scale. We'd only do this for options that control lexing. FWIW, code that says "literal digit separators are enabled in C++14+, C23+, or when explicitly allowed" seems more logical to me than checking ParsingPreprocessorDirective, which doesn't really explain why we're suddenly allowing digit separators.

Also, it makes it seem like we support a la carte language features when we don't -- we intentionally don't want to let users disable standard features (in general; exceptions exist) and so that approach gives the impression contributors need to check both language mode and feature availability which is a maintenance concern.

I think documenting the cases where we introduce this extra knob should pretty much clear up any confusion and prevent contributors cargo culting that approach to options that don't need it.

I'm still wondering about the question I asked here: #93753 (comment)

Are there any comments that weren't addressed by my reply?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jansvoboda11, @cor3ntin -- has your opinion changed at all? I'm still pretty strongly opposed to adding feature-specific flags to LangOptions for supporting this in dependency scanning. CC @erichkeane @Sirraide @Fznamznon for more opinions as well.

No, I still see value in adding orthogonal flags for small features. We don't have to expose them to users via command line flags, they can be purely an implementation detail for implementing the dependency scanner. My argument is still that most of the Clang codebase doesn't need to know about dependency scanning and that having one flag for the dependency scanner makes it way too easy fracture the codebase just for this one Clang-based tool.

Case in point, I recently made a series of changes to the PCM file format in order to remove some needless deserialization overhead the dependency scanner uncovered. If we had flag that signifies a "scanner mode", I would've special-cased my changes for that use case, since it'd be easier to implement and test. But by not having that flag (and by not wanting to introduce it), I was forced to figure out how to make the changes generic, and now Clang itself and all Clang-based tools benefit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like another issue is that just accepting separators uncoditionally in preprocessor directives (i.e. even outside of dependency scanning) would technically be a language extension... doesn’t that mean that we’d technically have to add a compatibility warning? Because we’re usually pretty strict about doing that for things that are effectively a compiler extension.

Sure, but we don't have to expose that language extension to users, it can remain internal to the compiler.

Also, I’m not really that familiar w/ the dependency scanner, but I’m still confused as to why passing down the LangOpts isn’t an option: in #93753, it was pointed out that that would impair caching the results of the scan—but isn’t it already the case that e.g. what files something depends on is dependent on the language standard? E.g. what if I write:

#if __cplusplus >= WhateverTheValueForC++17IsAgain
#    include "foo.hh"
#else
#    include "bar.hh"
#endif

Does it just note that the file might depend on both foo.hh and bar.hh? Because if so, can’t we just set the language standard in the lang opts for the dependency scanner to the latest e.g. C++ standard? Because that would also fix this issue.

The scanner works in two steps:

  1. The virtual file system "minimizes" source files into sequence of tokens that are relevant for dependency discovery. Your code fragment would be transformed into something like [if, ident, gte, ident, then, include, string, else, include, string, endif].
  2. The scanner instance takes those tokens and the TU LanguageOptions and runs full preprocessor on those tokens. This means that the condition should evaluate correctly.

Having (1) be context-free is important for performance.

On high level, the goal of #93753 is to respect LanguageOptions in the scanner. My argument is that we should be able to do so by representing tokens in the stream produced by (1) such that we're still able to emit context-aware, LanguageOptions-driven diagnostics in (2).

That solves the issue outlined by #93753 without impacting the performance.

I mean, we don’t need a language option for every feature either, but adding them on an as-needed basis doesn’t seem completely untenable.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but we don't have to expose that language extension to users, it can remain internal to the compiler.

Yeah, that’s fine. What I meant was if we were to just use ParsingPreprocessorDirective, we would be exposing it to users, so that’s probably not the way to go imo.

The scanner works in two steps:

  1. The virtual file system "minimizes" source files into sequence of tokens that are relevant for dependency discovery. Your code fragment would be transformed into something like [if, ident, gte, ident, then, include, string, else, include, string, endif].
  2. The scanner instance takes those tokens and the TU LanguageOptions and runs full preprocessor on those tokens. This means that the condition should evaluate correctly.

I see; in that case being more permissive in what we accept during step 1 makes perfect sense imo. I don’t think there are that many language features that affect this step either since that’s basically just preprocessor directives from what I gather, so adding language options for whatever features we need to enable for this seems reasonable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but we don't have to expose that language extension to users, it can remain internal to the compiler.

This still doesn't address my concerns from earlier about how it impacts maintenance of Clang itself:

I don't think per-feature LangOptions is viable. There's digit separators, hex float support, binary literal support, different literal suffixes, various keywords, etc and that doesn't seem likely to scale well. Also, it makes it seem like we support a la carte language features when we don't -- we intentionally don't want to let users disable standard features (in general; exceptions exist) and so that approach gives the impression contributors need to check both language mode and feature availability which is a maintenance concern.

How do we avoid this situation, as that's my primary concern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I still favor a "DependencyScanning" option over feature specific flags, it seems much simpler to maintain and document in the long run, for exactly the same reason as @AaronBallman

auto [Next, NextSize] = getCharAndSizeNoWarn(CurPtr + Size, LangOpts);
if (isAsciiIdentifierContinue(Next)) {
if (!isLexingRawMode())
Expand Down
29 changes: 29 additions & 0 deletions clang/unittests/Lex/DependencyDirectivesScannerTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
//===----------------------------------------------------------------------===//

#include "clang/Lex/DependencyDirectivesScanner.h"
#include "clang/Basic/TokenKinds.h"
#include "llvm/ADT/SmallString.h"
#include "gtest/gtest.h"

Expand Down Expand Up @@ -1012,4 +1013,32 @@ TEST(MinimizeSourceToDependencyDirectivesTest, TokensBeforeEOF) {
EXPECT_STREQ("#ifndef A\n#define A\n#endif\n<TokBeforeEOF>\n", Out.data());
}

TEST(MinimizeSourceToDependencyDirectivesTest, QuoteSeparatedNumber) {
SmallVector<char, 128> Out;
SmallVector<dependency_directives_scan::Token, 4> Tokens;
SmallVector<Directive, 4> Directives;

StringRef Source = R"(
#if 123'124
#endif
)";

ASSERT_FALSE(
minimizeSourceToDependencyDirectives(Source, Out, Tokens, Directives));
EXPECT_STREQ("#if 123'124\n#endif\n", Out.data());
ASSERT_EQ(Directives.size(), 3u);
EXPECT_EQ(Directives[0].Kind, dependency_directives_scan::pp_if);
EXPECT_EQ(Directives[1].Kind, dependency_directives_scan::pp_endif);
EXPECT_EQ(Directives[2].Kind, dependency_directives_scan::pp_eof);
ASSERT_EQ(Tokens.size(), 7u);

ASSERT_TRUE(Tokens[0].is(tok::hash));
ASSERT_TRUE(Tokens[1].is(tok::raw_identifier)); // "if"
ASSERT_TRUE(Tokens[2].is(tok::numeric_constant)); // 123'124
ASSERT_TRUE(Tokens[3].is(tok::eod));
ASSERT_TRUE(Tokens[4].is(tok::hash));
ASSERT_TRUE(Tokens[5].is(tok::raw_identifier)); // #endif
ASSERT_TRUE(Tokens[6].is(tok::eod));
}

} // end anonymous namespace
Loading