Skip to content

Pass LangOpts from CompilerInstance to DependencyScanningWorker #93753

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 3, 2024

Conversation

nishithshah2211
Copy link
Contributor

This commit fixes #88896 by passing LangOpts from the CompilerInstance to
DependencyScanningWorker so that the original LangOpts are preserved/respected.
This makes for more accurate parsing/lexing when certain language versions or features specific to versions are to be used.

Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels May 30, 2024
@llvmbot
Copy link
Member

llvmbot commented May 30, 2024

@llvm/pr-subscribers-clang

Author: Nishith Kumar M Shah (nishithshah2211)

Changes

This commit fixes #88896 by passing LangOpts from the CompilerInstance to
DependencyScanningWorker so that the original LangOpts are preserved/respected.
This makes for more accurate parsing/lexing when certain language versions or features specific to versions are to be used.


Full diff: https://github.com/llvm/llvm-project/pull/93753.diff

6 Files Affected:

  • (modified) clang/include/clang/Lex/DependencyDirectivesScanner.h (+2-1)
  • (modified) clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h (+2-1)
  • (modified) clang/lib/Frontend/FrontendActions.cpp (+2-2)
  • (modified) clang/lib/Lex/DependencyDirectivesScanner.cpp (+12-8)
  • (modified) clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp (+2-2)
  • (modified) clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp (+3-2)
diff --git a/clang/include/clang/Lex/DependencyDirectivesScanner.h b/clang/include/clang/Lex/DependencyDirectivesScanner.h
index 0e115906fbfe5..2f8354dec939f 100644
--- a/clang/include/clang/Lex/DependencyDirectivesScanner.h
+++ b/clang/include/clang/Lex/DependencyDirectivesScanner.h
@@ -17,6 +17,7 @@
 #ifndef LLVM_CLANG_LEX_DEPENDENCYDIRECTIVESSCANNER_H
 #define LLVM_CLANG_LEX_DEPENDENCYDIRECTIVESSCANNER_H
 
+#include "clang/Basic/LangOptions.h"
 #include "clang/Basic/SourceLocation.h"
 #include "llvm/ADT/ArrayRef.h"
 
@@ -117,7 +118,7 @@ struct Directive {
 bool scanSourceForDependencyDirectives(
     StringRef Input, SmallVectorImpl<dependency_directives_scan::Token> &Tokens,
     SmallVectorImpl<dependency_directives_scan::Directive> &Directives,
-    DiagnosticsEngine *Diags = nullptr,
+    const LangOptions &LangOpts, DiagnosticsEngine *Diags = nullptr,
     SourceLocation InputSourceLoc = SourceLocation());
 
 /// Print the previously scanned dependency directives as minimized source text.
diff --git a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
index f7b4510d7f7be..9dc20065a09a3 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
+++ b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
@@ -363,7 +363,8 @@ class DependencyScanningWorkerFilesystem
   ///
   /// Returns true if the directive tokens are populated for this file entry,
   /// false if not (i.e. this entry is not a file or its scan fails).
-  bool ensureDirectiveTokensArePopulated(EntryRef Entry);
+  bool ensureDirectiveTokensArePopulated(EntryRef Entry,
+                                         const LangOptions &LangOpts);
 
   /// Check whether \p Path exists. By default checks cached result of \c
   /// status(), and falls back on FS if unable to do so.
diff --git a/clang/lib/Frontend/FrontendActions.cpp b/clang/lib/Frontend/FrontendActions.cpp
index 454653a31534c..eddb2ac0c0834 100644
--- a/clang/lib/Frontend/FrontendActions.cpp
+++ b/clang/lib/Frontend/FrontendActions.cpp
@@ -1168,8 +1168,8 @@ void PrintDependencyDirectivesSourceMinimizerAction::ExecuteAction() {
   llvm::SmallVector<dependency_directives_scan::Token, 16> Tokens;
   llvm::SmallVector<dependency_directives_scan::Directive, 32> Directives;
   if (scanSourceForDependencyDirectives(
-          FromFile.getBuffer(), Tokens, Directives, &CI.getDiagnostics(),
-          SM.getLocForStartOfFile(SM.getMainFileID()))) {
+          FromFile.getBuffer(), Tokens, Directives, CI.getLangOpts(),
+          &CI.getDiagnostics(), SM.getLocForStartOfFile(SM.getMainFileID()))) {
     assert(CI.getDiagnostics().hasErrorOccurred() &&
            "no errors reported for failure");
 
diff --git a/clang/lib/Lex/DependencyDirectivesScanner.cpp b/clang/lib/Lex/DependencyDirectivesScanner.cpp
index 0971daa1f3666..f7462aefa4f87 100644
--- a/clang/lib/Lex/DependencyDirectivesScanner.cpp
+++ b/clang/lib/Lex/DependencyDirectivesScanner.cpp
@@ -62,14 +62,17 @@ struct DirectiveWithTokens {
 struct Scanner {
   Scanner(StringRef Input,
           SmallVectorImpl<dependency_directives_scan::Token> &Tokens,
-          DiagnosticsEngine *Diags, SourceLocation InputSourceLoc)
+          DiagnosticsEngine *Diags, SourceLocation InputSourceLoc,
+          const LangOptions &LangOpts)
       : Input(Input), Tokens(Tokens), Diags(Diags),
-        InputSourceLoc(InputSourceLoc), LangOpts(getLangOptsForDepScanning()),
+        InputSourceLoc(InputSourceLoc),
+        LangOpts(getLangOptsForDepScanning(LangOpts)),
         TheLexer(InputSourceLoc, LangOpts, Input.begin(), Input.begin(),
                  Input.end()) {}
 
-  static LangOptions getLangOptsForDepScanning() {
-    LangOptions LangOpts;
+  static LangOptions
+  getLangOptsForDepScanning(const LangOptions &invocationLangOpts) {
+    LangOptions LangOpts(invocationLangOpts);
     // Set the lexer to use 'tok::at' for '@', instead of 'tok::unknown'.
     LangOpts.ObjC = true;
     LangOpts.LineComment = true;
@@ -700,7 +703,7 @@ bool Scanner::lex_Pragma(const char *&First, const char *const End) {
   SmallVector<dependency_directives_scan::Token> DiscardTokens;
   const char *Begin = Buffer.c_str();
   Scanner PragmaScanner{StringRef(Begin, Buffer.size()), DiscardTokens, Diags,
-                        InputSourceLoc};
+                        InputSourceLoc, LangOptions()};
 
   PragmaScanner.TheLexer.setParsingPreprocessorDirective(true);
   if (PragmaScanner.lexPragma(Begin, Buffer.end()))
@@ -950,9 +953,10 @@ bool Scanner::scan(SmallVectorImpl<Directive> &Directives) {
 
 bool clang::scanSourceForDependencyDirectives(
     StringRef Input, SmallVectorImpl<dependency_directives_scan::Token> &Tokens,
-    SmallVectorImpl<Directive> &Directives, DiagnosticsEngine *Diags,
-    SourceLocation InputSourceLoc) {
-  return Scanner(Input, Tokens, Diags, InputSourceLoc).scan(Directives);
+    SmallVectorImpl<Directive> &Directives, const LangOptions &LangOpts,
+    DiagnosticsEngine *Diags, SourceLocation InputSourceLoc) {
+  return Scanner(Input, Tokens, Diags, InputSourceLoc, LangOpts)
+      .scan(Directives);
 }
 
 void clang::printDependencyDirectivesAsSource(
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp b/clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
index 0cab17a342440..66a2f6e0acb63 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
+++ b/clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
@@ -42,7 +42,7 @@ DependencyScanningWorkerFilesystem::readFile(StringRef Filename) {
 }
 
 bool DependencyScanningWorkerFilesystem::ensureDirectiveTokensArePopulated(
-    EntryRef Ref) {
+    EntryRef Ref, const LangOptions &LangOpts) {
   auto &Entry = Ref.Entry;
 
   if (Entry.isError() || Entry.isDirectory())
@@ -66,7 +66,7 @@ bool DependencyScanningWorkerFilesystem::ensureDirectiveTokensArePopulated(
   // dependencies.
   if (scanSourceForDependencyDirectives(Contents->Original->getBuffer(),
                                         Contents->DepDirectiveTokens,
-                                        Directives)) {
+                                        Directives, LangOpts)) {
     Contents->DepDirectiveTokens.clear();
     // FIXME: Propagate the diagnostic if desired by the client.
     Contents->DepDirectives.store(new std::optional<DependencyDirectivesTy>());
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
index 0c047b6c5da2f..c3d63c3f890e8 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
+++ b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
@@ -366,11 +366,12 @@ class DependencyScanningAction : public tooling::ToolAction {
     // Use the dependency scanning optimized file system if requested to do so.
     if (DepFS)
       ScanInstance.getPreprocessorOpts().DependencyDirectivesForFile =
-          [LocalDepFS = DepFS](FileEntryRef File)
+          [LocalDepFS = DepFS,
+           &LangOpts = ScanInstance.getLangOpts()](FileEntryRef File)
           -> std::optional<ArrayRef<dependency_directives_scan::Directive>> {
         if (llvm::ErrorOr<EntryRef> Entry =
                 LocalDepFS->getOrCreateFileSystemEntry(File.getName()))
-          if (LocalDepFS->ensureDirectiveTokensArePopulated(*Entry))
+          if (LocalDepFS->ensureDirectiveTokensArePopulated(*Entry, LangOpts))
             return Entry->getDirectiveTokens();
         return std::nullopt;
       };

@nishithshah2211
Copy link
Contributor Author

nishithshah2211 commented May 30, 2024

@cor3ntin @Sirraide I put up a draft PR to get some feedback and see if the changes here make sense or not.

Also tagging @pogo59 since I received good feedback/pointers here: https://discourse.llvm.org/t/looking-for-help-with-accessing-langopts-from-the-actual-compiler-invocation/79228.

My main concern is mostly around introducing LangOpts within ensureDirectiveTokensArePopulated API.

I'll add new tests and fix existing tests in the next commit on this PR if the source changes seem reasonable. Build will fail until then.

@cor3ntin
Copy link
Contributor

Thanks for working on that. Your change is pretty much what i expected yes, it's great!

@nishithshah2211
Copy link
Contributor Author

@cor3ntin Thanks for reviewing. I updated the PR to get the tests/build passing. Please take a look when you get a chance.

@cor3ntin
Copy link
Contributor

Can you add a test for #88896 ?

This commit fixes llvm#88896
by passing LangOpts from the CompilerInstance to
DependencyScanningWorker so that the original LangOpts are
preserved/respected. This makes for more accurate parsing/lexing when
certain language versions or features specific to versions are to be used.
@nishithshah2211
Copy link
Contributor Author

Yes, sorry I missed committing that change. Let me know if there is a better way to test out the changes or add any additional changes.

Copy link
Contributor

@cor3ntin cor3ntin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense.
Thanks a lot for the fix

@nishithshah2211
Copy link
Contributor Author

Thanks for the pointers. What would be the process to merge this in?

@cor3ntin cor3ntin merged commit 9862080 into llvm:main Jun 3, 2024
7 checks passed
Copy link

github-actions bot commented Jun 3, 2024

@nishithshah2211 Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested
by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as
the builds can include changes from many authors. It is not uncommon for your
change to be included in a build that fails due to someone else's changes, or
infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself.
This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

@nishithshah2211 nishithshah2211 deleted the gh-88896 branch June 3, 2024 15:27
@jansvoboda11
Copy link
Contributor

I don't think this is correct. If you concurrently scan the same file under two language standards with the same scanning service, it becomes non-deterministic which one gets cached in the filesystem cache. For subsequent FS queries the cache might return wrong results, ignoring language options. You need to make the language standard a part of the cache key.

An alternative solution (that I prefer) is to set up the scanner in a way that always accepts ' in integer constants. This works around the caching issue and the build can still fail at compile-time where the language standard kicks in as usual.

@nishithshah2211
Copy link
Contributor Author

Thanks for the comments @jansvoboda11 . I am new to all these different moving parts and want to understand better. I have a few questions.

If you concurrently scan the same file under two language standards with the same scanning service, it becomes non-deterministic which one gets cached in the filesystem cache.

That is true. But until your comment, I did not even know that it is possible (and supported) to be able to invoke the same scanning service on the same file under two language options (say c++14 and c++11). How would someone do that? Asking so I can test this out locally and try to come up with a better solution. (Also, why would someone do that?)

You need to make the language standard a part of the cache key.

This was kind of one of my concerns that I had called out here: https://discourse.llvm.org/t/looking-for-help-with-accessing-langopts-from-the-actual-compiler-invocation/79228/3. Specifically:

would it look a bit off to someone if they were to look at the header for DependencyScanningWorkerFilesystem and see that the ensureDirectiveTokensArePopulated API took a LangOpts specifically

Given that LangOpts is kind of becoming a feature within DependencyScanningWorkerFilesystem's APIs, I am kind of inclined towards having LangOpts as part of the cache key for disambiguation - but again, I am very very new to this.

An alternative solution (that I prefer) is to set up the scanner in a way that always accepts ' in integer constants.

Would this be considered "hacky"? Because if we go this way, the Scanner would technically be operating in a different language mode for integers, potentially overriding the language mode arg that was passed in during invocation. I am not opposed to it - just trying to understand the implications better. We do turn on specific LangOpts (like Objc) for the lexer during the Scanning phase as can be seen here -

static LangOptions getLangOptsForDepScanning() {
LangOptions LangOpts;
// Set the lexer to use 'tok::at' for '@', instead of 'tok::unknown'.
LangOpts.ObjC = true;
LangOpts.LineComment = true;
// FIXME: we do not enable C11 or C++11, so we are missing u/u8/U"" and
// R"()" literals.
return LangOpts;
}
.

I guess the general question is - is it acceptable to have the Scanner operating in a language standard different than the passed in language mode and different than the compiler language standard?

@jansvoboda11
Copy link
Contributor

Thanks for the comments @jansvoboda11 . I am new to all these different moving parts and want to understand better. I have a few questions.

If you concurrently scan the same file under two language standards with the same scanning service, it becomes non-deterministic which one gets cached in the filesystem cache.

That is true. But until your comment, I did not even know that it is possible (and supported) to be able to invoke the same scanning service on the same file under two language options (say c++14 and c++11). How would someone do that? Asking so I can test this out locally and try to come up with a better solution. (Also, why would someone do that?)

You can have a project that has both C and C++ implementation files that end up including the same header files from the C standard library. One can be compiled under C11 (without separator support), the other under C++14 (with separator support).

You need to make the language standard a part of the cache key.

This was kind of one of my concerns that I had called out here: https://discourse.llvm.org/t/looking-for-help-with-accessing-langopts-from-the-actual-compiler-invocation/79228/3. Specifically:

would it look a bit off to someone if they were to look at the header for DependencyScanningWorkerFilesystem and see that the ensureDirectiveTokensArePopulated API took a LangOpts specifically

Given that LangOpts is kind of becoming a feature within DependencyScanningWorkerFilesystem's APIs, I am kind of inclined towards having LangOpts as part of the cache key for disambiguation - but again, I am very very new to this.

Scanning is often the choke point in builds, so any change that slows it down needs to make up for it in correctness. Adding language options (or just the standard) to the cache key could trigger multiple scans for a single file, which would change the performance quite a bit. Since we have an alternative solution that's still correct, I'd prefer that.

An alternative solution (that I prefer) is to set up the scanner in a way that always accepts ' in integer constants.

Would this be considered "hacky"? Because if we go this way, the Scanner would technically be operating in a different language mode for integers, potentially overriding the language mode arg that was passed in during invocation. I am not opposed to it - just trying to understand the implications better. We do turn on specific LangOpts (like Objc) for the lexer during the Scanning phase as can be seen here -

static LangOptions getLangOptsForDepScanning() {
LangOptions LangOpts;
// Set the lexer to use 'tok::at' for '@', instead of 'tok::unknown'.
LangOpts.ObjC = true;
LangOpts.LineComment = true;
// FIXME: we do not enable C11 or C++11, so we are missing u/u8/U"" and
// R"()" literals.
return LangOpts;
}

.
I guess the general question is - is it acceptable to have the Scanner operating in a language standard different than the passed in language mode and different than the compiler language standard?

I think that is acceptable. It is kinda hacky, but the lexer and preprocessor are largely independent of the language and the standard. When they do depend on those settings, taking the union of the features and letting the compiler trim it down is still a perfectly sound thing to do.

@nishithshah2211
Copy link
Contributor Author

You can have a project that has both C and C++ implementation files that end up including the same header files from the C standard library. One can be compiled under C11 (without separator support), the other under C++14 (with separator support).

Thanks. I had considered this very lightly, and in my mind, I thought that this would result in two separate passes of scanning - and so, two separate scanning services. Maybe I am wrong?

Thank you for the additional context and bits of knowledge, super helpful. I'll put up a separate PR that does the following:

  1. reverts the changes within this PR
  2. checks if the lexer has ParsingPreprocessorDirective property true or false and allow for the numeric lexing to happen in that case.

Let me know if it is preferable for history etc. to have two separate PRs - one for the revert and one for the lexer to relax the CPP14/C23 constraint during the preprocessing phase.

@jansvoboda11
Copy link
Contributor

You can have a project that has both C and C++ implementation files that end up including the same header files from the C standard library. One can be compiled under C11 (without separator support), the other under C++14 (with separator support).

Thanks. I had considered this very lightly, and in my mind, I thought that this would result in two separate passes of scanning - and so, two separate scanning services. Maybe I am wrong?

Single scanning service will be used across the entire build, across build targets, regardless of language and the standard of the compilation.

Thank you for the additional context and bits of knowledge, super helpful. I'll put up a separate PR that does the following:

  1. reverts the changes within this PR
  2. checks if the lexer has ParsingPreprocessorDirective property true or false and allow for the numeric lexing to happen in that case.

Let me know if it is preferable for history etc. to have two separate PRs - one for the revert and one for the lexer to relax the CPP14/C23 constraint during the preprocessing phase.

Yes, splitting this into two PRs would be great.

@nishithshah2211
Copy link
Contributor Author

Here is the revert PR: #94488

@zygoloid
Copy link
Collaborator

zygoloid commented Jun 5, 2024

I guess the general question is - is it acceptable to have the Scanner operating in a language standard different than the passed in language mode and different than the compiler language standard?

I think that is acceptable. It is kinda hacky, but the lexer and preprocessor are largely independent of the language and the standard. When they do depend on those settings, taking the union of the features and letting the compiler trim it down is still a perfectly sound thing to do.

You can certainly construct cases where the different lexing rules in different language modes allow you to detect which language you're in from within the preprocessor (1 2 3) or where enabling more language mode flags may reject valid code. It may be good enough for what the scanner is trying to do, but I think it's a stretch to say that it's sound.

nishithshah2211 added a commit to nishithshah2211/llvm-project that referenced this pull request Jun 17, 2024
…cessing

When trying to lex numeric constants that use single quotes as
separators, if the lexer is parsing a preprocessor directive, we can
relax the language options check. These checks will be enforced in a
later phase after preprocessing/scanning.

This commit is a second attempt to fix
llvm#88896. The first attempt to
fix this is here: llvm#93753, which
had to be reverted because of the discussions in that same PR.
@AaronBallman
Copy link
Collaborator

I guess the general question is - is it acceptable to have the Scanner operating in a language standard different than the passed in language mode and different than the compiler language standard?

I think that is acceptable. It is kinda hacky, but the lexer and preprocessor are largely independent of the language and the standard. When they do depend on those settings, taking the union of the features and letting the compiler trim it down is still a perfectly sound thing to do.

You can certainly construct cases where the different lexing rules in different language modes allow you to detect which language you're in from within the preprocessor (1 2 3) or where enabling more language mode flags may reject valid code. It may be good enough for what the scanner is trying to do, but I think it's a stretch to say that it's sound.

+1; it's definitely not sound to go this approach, though it may work well enough in practice to be worth the tradeoff. That said, I think the behavior of the dependency scanner ignoring the language options used is actually kind of problematic, unless I'm misunderstanding something. This means command line options that impact dependency scanning may fail (for example, the user might enable SYCL or OpenMP on the command line and that may have different dependencies, trigraph support may be important for things like ??=include, freestanding vs not impacts which preprocessor macros are predefined, etc. How do we handle that sort of stuff, or do we just ignore it?

@AaronBallman
Copy link
Collaborator

I guess the general question is - is it acceptable to have the Scanner operating in a language standard different than the passed in language mode and different than the compiler language standard?

I think that is acceptable. It is kinda hacky, but the lexer and preprocessor are largely independent of the language and the standard. When they do depend on those settings, taking the union of the features and letting the compiler trim it down is still a perfectly sound thing to do.

You can certainly construct cases where the different lexing rules in different language modes allow you to detect which language you're in from within the preprocessor (1 2 3) or where enabling more language mode flags may reject valid code. It may be good enough for what the scanner is trying to do, but I think it's a stretch to say that it's sound.

+1; it's definitely not sound to go this approach, though it may work well enough in practice to be worth the tradeoff. That said, I think the behavior of the dependency scanner ignoring the language options used is actually kind of problematic, unless I'm misunderstanding something. This means command line options that impact dependency scanning may fail (for example, the user might enable SYCL or OpenMP on the command line and that may have different dependencies, trigraph support may be important for things like ??=include, freestanding vs not impacts which preprocessor macros are predefined, etc. How do we handle that sort of stuff, or do we just ignore it?

Ping @jansvoboda11

@jansvoboda11
Copy link
Contributor

You can certainly construct cases where the different lexing rules in different language modes allow you to detect which language you're in from within the preprocessor (1 2 3) or where enabling more language mode flags may reject valid code. It may be good enough for what the scanner is trying to do, but I think it's a stretch to say that it's sound.

I'll defer this to @Bigcheese, I don't think we've hit any of these yet.

+1; it's definitely not sound to go this approach, though it may work well enough in practice to be worth the tradeoff. That said, I think the behavior of the dependency scanner ignoring the language options used is actually kind of problematic, unless I'm misunderstanding something. This means command line options that impact dependency scanning may fail (for example, the user might enable SYCL or OpenMP on the command line and that may have different dependencies, trigraph support may be important for things like ??=include, freestanding vs not impacts which preprocessor macros are predefined, etc. How do we handle that sort of stuff, or do we just ignore it?

This layer of the dependency scanner only lexes the source into a list of tokens. Preprocessing and dependency discovery happen later and do respect the command-line options.

Good point on trigraphs. I think that's similar to the digit separator issue: we can unconditionally enable this feature on this layer of the dependency scanner and let the build fail later during compilation itself.

@AaronBallman
Copy link
Collaborator

You can certainly construct cases where the different lexing rules in different language modes allow you to detect which language you're in from within the preprocessor (1 2 3) or where enabling more language mode flags may reject valid code. It may be good enough for what the scanner is trying to do, but I think it's a stretch to say that it's sound.

I'll defer this to @Bigcheese, I don't think we've hit any of these yet.

+1; it's definitely not sound to go this approach, though it may work well enough in practice to be worth the tradeoff. That said, I think the behavior of the dependency scanner ignoring the language options used is actually kind of problematic, unless I'm misunderstanding something. This means command line options that impact dependency scanning may fail (for example, the user might enable SYCL or OpenMP on the command line and that may have different dependencies, trigraph support may be important for things like ??=include, freestanding vs not impacts which preprocessor macros are predefined, etc. How do we handle that sort of stuff, or do we just ignore it?

This layer of the dependency scanner only lexes the source into a list of tokens. Preprocessing and dependency discovery happen later and do respect the command-line options.

But the list of tokens depends on things like what features are enabled, right? e.g., -fchar8_t introduces a new keyword, while _Atomic isn't a keyword in OpenCL, and bool is only a keyword in C23 and later but is always a keyword in C++, etc. Ah, but in terms of dependency scanning, whether those are keywords or not probably doesn't matter? And because we expose things like embed as a keyword in all language modes, not just C23, the language standard doesn't matter either?

Good point on trigraphs. I think that's similar to the digit separator issue: we can unconditionally enable this feature on this layer of the dependency scanner and let the build fail later during compilation itself.

Seems reasonable to me.

@jansvoboda11
Copy link
Contributor

But the list of tokens depends on things like what features are enabled, right? e.g., -fchar8_t introduces a new keyword, while _Atomic isn't a keyword in OpenCL, and bool is only a keyword in C23 and later but is always a keyword in C++, etc. Ah, but in terms of dependency scanning, whether those are keywords or not probably doesn't matter? And because we expose things like embed as a keyword in all language modes, not just C23, the language standard doesn't matter either?

I touched on this in #95798 (comment). We could represent these cases in the token list such that when the scanner feeds it to the LanguageOptions-aware preprocessor, it can still emit corresponding diagnostics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[clang-scan-deps] Error if integer literal contains a single quote in #if directive
6 participants