Pass LangOpts from CompilerInstance to DependencyScanningWorker #93753

nishithshah2211 · 2024-05-30T00:02:36Z

This commit fixes #88896 by passing LangOpts from the CompilerInstance to
DependencyScanningWorker so that the original LangOpts are preserved/respected.
This makes for more accurate parsing/lexing when certain language versions or features specific to versions are to be used.

github-actions · 2024-05-30T00:02:56Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2024-05-30T00:03:31Z

@llvm/pr-subscribers-clang

Author: Nishith Kumar M Shah (nishithshah2211)

Changes

This commit fixes #88896 by passing LangOpts from the CompilerInstance to
DependencyScanningWorker so that the original LangOpts are preserved/respected.
This makes for more accurate parsing/lexing when certain language versions or features specific to versions are to be used.

Full diff: https://github.com/llvm/llvm-project/pull/93753.diff

6 Files Affected:

(modified) clang/include/clang/Lex/DependencyDirectivesScanner.h (+2-1)
(modified) clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h (+2-1)
(modified) clang/lib/Frontend/FrontendActions.cpp (+2-2)
(modified) clang/lib/Lex/DependencyDirectivesScanner.cpp (+12-8)
(modified) clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp (+2-2)
(modified) clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp (+3-2)

diff --git a/clang/include/clang/Lex/DependencyDirectivesScanner.h b/clang/include/clang/Lex/DependencyDirectivesScanner.h
index 0e115906fbfe5..2f8354dec939f 100644
--- a/clang/include/clang/Lex/DependencyDirectivesScanner.h
+++ b/clang/include/clang/Lex/DependencyDirectivesScanner.h
@@ -17,6 +17,7 @@
 #ifndef LLVM_CLANG_LEX_DEPENDENCYDIRECTIVESSCANNER_H
 #define LLVM_CLANG_LEX_DEPENDENCYDIRECTIVESSCANNER_H
 
+#include "clang/Basic/LangOptions.h"
 #include "clang/Basic/SourceLocation.h"
 #include "llvm/ADT/ArrayRef.h"
 
@@ -117,7 +118,7 @@ struct Directive {
 bool scanSourceForDependencyDirectives(
     StringRef Input, SmallVectorImpl<dependency_directives_scan::Token> &Tokens,
     SmallVectorImpl<dependency_directives_scan::Directive> &Directives,
-    DiagnosticsEngine *Diags = nullptr,
+    const LangOptions &LangOpts, DiagnosticsEngine *Diags = nullptr,
     SourceLocation InputSourceLoc = SourceLocation());
 
 /// Print the previously scanned dependency directives as minimized source text.
diff --git a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
index f7b4510d7f7be..9dc20065a09a3 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
+++ b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
@@ -363,7 +363,8 @@ class DependencyScanningWorkerFilesystem
   ///
   /// Returns true if the directive tokens are populated for this file entry,
   /// false if not (i.e. this entry is not a file or its scan fails).
-  bool ensureDirectiveTokensArePopulated(EntryRef Entry);
+  bool ensureDirectiveTokensArePopulated(EntryRef Entry,
+                                         const LangOptions &LangOpts);
 
   /// Check whether \p Path exists. By default checks cached result of \c
   /// status(), and falls back on FS if unable to do so.
diff --git a/clang/lib/Frontend/FrontendActions.cpp b/clang/lib/Frontend/FrontendActions.cpp
index 454653a31534c..eddb2ac0c0834 100644
--- a/clang/lib/Frontend/FrontendActions.cpp
+++ b/clang/lib/Frontend/FrontendActions.cpp
@@ -1168,8 +1168,8 @@ void PrintDependencyDirectivesSourceMinimizerAction::ExecuteAction() {
   llvm::SmallVector<dependency_directives_scan::Token, 16> Tokens;
   llvm::SmallVector<dependency_directives_scan::Directive, 32> Directives;
   if (scanSourceForDependencyDirectives(
-          FromFile.getBuffer(), Tokens, Directives, &CI.getDiagnostics(),
-          SM.getLocForStartOfFile(SM.getMainFileID()))) {
+          FromFile.getBuffer(), Tokens, Directives, CI.getLangOpts(),
+          &CI.getDiagnostics(), SM.getLocForStartOfFile(SM.getMainFileID()))) {
     assert(CI.getDiagnostics().hasErrorOccurred() &&
            "no errors reported for failure");
 
diff --git a/clang/lib/Lex/DependencyDirectivesScanner.cpp b/clang/lib/Lex/DependencyDirectivesScanner.cpp
index 0971daa1f3666..f7462aefa4f87 100644
--- a/clang/lib/Lex/DependencyDirectivesScanner.cpp
+++ b/clang/lib/Lex/DependencyDirectivesScanner.cpp
@@ -62,14 +62,17 @@ struct DirectiveWithTokens {
 struct Scanner {
   Scanner(StringRef Input,
           SmallVectorImpl<dependency_directives_scan::Token> &Tokens,
-          DiagnosticsEngine *Diags, SourceLocation InputSourceLoc)
+          DiagnosticsEngine *Diags, SourceLocation InputSourceLoc,
+          const LangOptions &LangOpts)
       : Input(Input), Tokens(Tokens), Diags(Diags),
-        InputSourceLoc(InputSourceLoc), LangOpts(getLangOptsForDepScanning()),
+        InputSourceLoc(InputSourceLoc),
+        LangOpts(getLangOptsForDepScanning(LangOpts)),
         TheLexer(InputSourceLoc, LangOpts, Input.begin(), Input.begin(),
                  Input.end()) {}
 
-  static LangOptions getLangOptsForDepScanning() {
-    LangOptions LangOpts;
+  static LangOptions
+  getLangOptsForDepScanning(const LangOptions &invocationLangOpts) {
+    LangOptions LangOpts(invocationLangOpts);
     // Set the lexer to use 'tok::at' for '@', instead of 'tok::unknown'.
     LangOpts.ObjC = true;
     LangOpts.LineComment = true;
@@ -700,7 +703,7 @@ bool Scanner::lex_Pragma(const char *&First, const char *const End) {
   SmallVector<dependency_directives_scan::Token> DiscardTokens;
   const char *Begin = Buffer.c_str();
   Scanner PragmaScanner{StringRef(Begin, Buffer.size()), DiscardTokens, Diags,
-                        InputSourceLoc};
+                        InputSourceLoc, LangOptions()};
 
   PragmaScanner.TheLexer.setParsingPreprocessorDirective(true);
   if (PragmaScanner.lexPragma(Begin, Buffer.end()))
@@ -950,9 +953,10 @@ bool Scanner::scan(SmallVectorImpl<Directive> &Directives) {
 
 bool clang::scanSourceForDependencyDirectives(
     StringRef Input, SmallVectorImpl<dependency_directives_scan::Token> &Tokens,
-    SmallVectorImpl<Directive> &Directives, DiagnosticsEngine *Diags,
-    SourceLocation InputSourceLoc) {
-  return Scanner(Input, Tokens, Diags, InputSourceLoc).scan(Directives);
+    SmallVectorImpl<Directive> &Directives, const LangOptions &LangOpts,
+    DiagnosticsEngine *Diags, SourceLocation InputSourceLoc) {
+  return Scanner(Input, Tokens, Diags, InputSourceLoc, LangOpts)
+      .scan(Directives);
 }
 
 void clang::printDependencyDirectivesAsSource(
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp b/clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
index 0cab17a342440..66a2f6e0acb63 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
+++ b/clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
@@ -42,7 +42,7 @@ DependencyScanningWorkerFilesystem::readFile(StringRef Filename) {
 }
 
 bool DependencyScanningWorkerFilesystem::ensureDirectiveTokensArePopulated(
-    EntryRef Ref) {
+    EntryRef Ref, const LangOptions &LangOpts) {
   auto &Entry = Ref.Entry;
 
   if (Entry.isError() || Entry.isDirectory())
@@ -66,7 +66,7 @@ bool DependencyScanningWorkerFilesystem::ensureDirectiveTokensArePopulated(
   // dependencies.
   if (scanSourceForDependencyDirectives(Contents->Original->getBuffer(),
                                         Contents->DepDirectiveTokens,
-                                        Directives)) {
+                                        Directives, LangOpts)) {
     Contents->DepDirectiveTokens.clear();
     // FIXME: Propagate the diagnostic if desired by the client.
     Contents->DepDirectives.store(new std::optional<DependencyDirectivesTy>());
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
index 0c047b6c5da2f..c3d63c3f890e8 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
+++ b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
@@ -366,11 +366,12 @@ class DependencyScanningAction : public tooling::ToolAction {
     // Use the dependency scanning optimized file system if requested to do so.
     if (DepFS)
       ScanInstance.getPreprocessorOpts().DependencyDirectivesForFile =
-          [LocalDepFS = DepFS](FileEntryRef File)
+          [LocalDepFS = DepFS,
+           &LangOpts = ScanInstance.getLangOpts()](FileEntryRef File)
           -> std::optional<ArrayRef<dependency_directives_scan::Directive>> {
         if (llvm::ErrorOr<EntryRef> Entry =
                 LocalDepFS->getOrCreateFileSystemEntry(File.getName()))
-          if (LocalDepFS->ensureDirectiveTokensArePopulated(*Entry))
+          if (LocalDepFS->ensureDirectiveTokensArePopulated(*Entry, LangOpts))
             return Entry->getDirectiveTokens();
         return std::nullopt;
       };

nishithshah2211 · 2024-05-30T00:07:16Z

@cor3ntin @Sirraide I put up a draft PR to get some feedback and see if the changes here make sense or not.

Also tagging @pogo59 since I received good feedback/pointers here: https://discourse.llvm.org/t/looking-for-help-with-accessing-langopts-from-the-actual-compiler-invocation/79228.

My main concern is mostly around introducing LangOpts within ensureDirectiveTokensArePopulated API.

I'll add new tests and fix existing tests in the next commit on this PR if the source changes seem reasonable. Build will fail until then.

cor3ntin · 2024-05-30T05:40:17Z

Thanks for working on that. Your change is pretty much what i expected yes, it's great!

nishithshah2211 · 2024-05-31T06:23:39Z

@cor3ntin Thanks for reviewing. I updated the PR to get the tests/build passing. Please take a look when you get a chance.

cor3ntin · 2024-05-31T06:48:41Z

Can you add a test for #88896 ?

This commit fixes llvm#88896 by passing LangOpts from the CompilerInstance to DependencyScanningWorker so that the original LangOpts are preserved/respected. This makes for more accurate parsing/lexing when certain language versions or features specific to versions are to be used.

nishithshah2211 · 2024-05-31T18:27:52Z

Yes, sorry I missed committing that change. Let me know if there is a better way to test out the changes or add any additional changes.

cor3ntin

I think this makes sense.
Thanks a lot for the fix

nishithshah2211 · 2024-06-03T15:15:41Z

Thanks for the pointers. What would be the process to merge this in?

github-actions · 2024-06-03T15:21:02Z

@nishithshah2211 Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested
by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as
the builds can include changes from many authors. It is not uncommon for your
change to be included in a build that fails due to someone else's changes, or
infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself.
This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

jansvoboda11 · 2024-06-03T17:08:51Z

I don't think this is correct. If you concurrently scan the same file under two language standards with the same scanning service, it becomes non-deterministic which one gets cached in the filesystem cache. For subsequent FS queries the cache might return wrong results, ignoring language options. You need to make the language standard a part of the cache key.

An alternative solution (that I prefer) is to set up the scanner in a way that always accepts ' in integer constants. This works around the caching issue and the build can still fail at compile-time where the language standard kicks in as usual.

nishithshah2211 · 2024-06-03T17:44:27Z

Thanks for the comments @jansvoboda11 . I am new to all these different moving parts and want to understand better. I have a few questions.

If you concurrently scan the same file under two language standards with the same scanning service, it becomes non-deterministic which one gets cached in the filesystem cache.

That is true. But until your comment, I did not even know that it is possible (and supported) to be able to invoke the same scanning service on the same file under two language options (say c++14 and c++11). How would someone do that? Asking so I can test this out locally and try to come up with a better solution. (Also, why would someone do that?)

You need to make the language standard a part of the cache key.

This was kind of one of my concerns that I had called out here: https://discourse.llvm.org/t/looking-for-help-with-accessing-langopts-from-the-actual-compiler-invocation/79228/3. Specifically:

would it look a bit off to someone if they were to look at the header for DependencyScanningWorkerFilesystem and see that the ensureDirectiveTokensArePopulated API took a LangOpts specifically

Given that LangOpts is kind of becoming a feature within DependencyScanningWorkerFilesystem's APIs, I am kind of inclined towards having LangOpts as part of the cache key for disambiguation - but again, I am very very new to this.

An alternative solution (that I prefer) is to set up the scanner in a way that always accepts ' in integer constants.

Would this be considered "hacky"? Because if we go this way, the Scanner would technically be operating in a different language mode for integers, potentially overriding the language mode arg that was passed in during invocation. I am not opposed to it - just trying to understand the implications better. We do turn on specific LangOpts (like Objc) for the lexer during the Scanning phase as can be seen here -

llvm-project/clang/lib/Lex/DependencyDirectivesScanner.cpp

Lines 71 to 79 in 8364659

    
           static LangOptions getLangOptsForDepScanning() { 
        
             LangOptions LangOpts; 
        
             // Set the lexer to use 'tok::at' for '@', instead of 'tok::unknown'. 
        
             LangOpts.ObjC = true; 
        
             LangOpts.LineComment = true; 
        
             // FIXME: we do not enable C11 or C++11, so we are missing u/u8/U"" and 
        
             // R"()" literals. 
        
             return LangOpts; 
        
           }

.

I guess the general question is - is it acceptable to have the Scanner operating in a language standard different than the passed in language mode and different than the compiler language standard?

jansvoboda11 · 2024-06-03T18:13:24Z

Thanks for the comments @jansvoboda11 . I am new to all these different moving parts and want to understand better. I have a few questions.

If you concurrently scan the same file under two language standards with the same scanning service, it becomes non-deterministic which one gets cached in the filesystem cache.

That is true. But until your comment, I did not even know that it is possible (and supported) to be able to invoke the same scanning service on the same file under two language options (say c++14 and c++11). How would someone do that? Asking so I can test this out locally and try to come up with a better solution. (Also, why would someone do that?)

You can have a project that has both C and C++ implementation files that end up including the same header files from the C standard library. One can be compiled under C11 (without separator support), the other under C++14 (with separator support).

You need to make the language standard a part of the cache key.

This was kind of one of my concerns that I had called out here: https://discourse.llvm.org/t/looking-for-help-with-accessing-langopts-from-the-actual-compiler-invocation/79228/3. Specifically:

would it look a bit off to someone if they were to look at the header for DependencyScanningWorkerFilesystem and see that the ensureDirectiveTokensArePopulated API took a LangOpts specifically

Given that LangOpts is kind of becoming a feature within DependencyScanningWorkerFilesystem's APIs, I am kind of inclined towards having LangOpts as part of the cache key for disambiguation - but again, I am very very new to this.

Scanning is often the choke point in builds, so any change that slows it down needs to make up for it in correctness. Adding language options (or just the standard) to the cache key could trigger multiple scans for a single file, which would change the performance quite a bit. Since we have an alternative solution that's still correct, I'd prefer that.

An alternative solution (that I prefer) is to set up the scanner in a way that always accepts ' in integer constants.

Would this be considered "hacky"? Because if we go this way, the Scanner would technically be operating in a different language mode for integers, potentially overriding the language mode arg that was passed in during invocation. I am not opposed to it - just trying to understand the implications better. We do turn on specific LangOpts (like Objc) for the lexer during the Scanning phase as can be seen here -

llvm-project/clang/lib/Lex/DependencyDirectivesScanner.cpp

Lines 71 to 79 in 8364659

static LangOptions getLangOptsForDepScanning() {

LangOptions LangOpts;

// Set the lexer to use 'tok::at' for '@', instead of 'tok::unknown'.

LangOpts.ObjC = true;

LangOpts.LineComment = true;

// FIXME: we do not enable C11 or C++11, so we are missing u/u8/U"" and

// R"()" literals.

return LangOpts;

}

.
I guess the general question is - is it acceptable to have the Scanner operating in a language standard different than the passed in language mode and different than the compiler language standard?

I think that is acceptable. It is kinda hacky, but the lexer and preprocessor are largely independent of the language and the standard. When they do depend on those settings, taking the union of the features and letting the compiler trim it down is still a perfectly sound thing to do.

nishithshah2211 · 2024-06-03T18:44:07Z

You can have a project that has both C and C++ implementation files that end up including the same header files from the C standard library. One can be compiled under C11 (without separator support), the other under C++14 (with separator support).

Thanks. I had considered this very lightly, and in my mind, I thought that this would result in two separate passes of scanning - and so, two separate scanning services. Maybe I am wrong?

Thank you for the additional context and bits of knowledge, super helpful. I'll put up a separate PR that does the following:

reverts the changes within this PR
checks if the lexer has ParsingPreprocessorDirective property true or false and allow for the numeric lexing to happen in that case.

Let me know if it is preferable for history etc. to have two separate PRs - one for the revert and one for the lexer to relax the CPP14/C23 constraint during the preprocessing phase.

jansvoboda11 · 2024-06-05T15:24:09Z

You can have a project that has both C and C++ implementation files that end up including the same header files from the C standard library. One can be compiled under C11 (without separator support), the other under C++14 (with separator support).

Thanks. I had considered this very lightly, and in my mind, I thought that this would result in two separate passes of scanning - and so, two separate scanning services. Maybe I am wrong?

Single scanning service will be used across the entire build, across build targets, regardless of language and the standard of the compilation.

Thank you for the additional context and bits of knowledge, super helpful. I'll put up a separate PR that does the following:

reverts the changes within this PR

checks if the lexer has ParsingPreprocessorDirective property true or false and allow for the numeric lexing to happen in that case.

Let me know if it is preferable for history etc. to have two separate PRs - one for the revert and one for the lexer to relax the CPP14/C23 constraint during the preprocessing phase.

Yes, splitting this into two PRs would be great.

…er (llvm#93753)" This reverts commit 9862080.

nishithshah2211 · 2024-06-05T15:32:54Z

Here is the revert PR: #94488

…er (#93753)" (#94488) This reverts commit 9862080.

zygoloid · 2024-06-05T22:43:07Z

I guess the general question is - is it acceptable to have the Scanner operating in a language standard different than the passed in language mode and different than the compiler language standard?

I think that is acceptable. It is kinda hacky, but the lexer and preprocessor are largely independent of the language and the standard. When they do depend on those settings, taking the union of the features and letting the compiler trim it down is still a perfectly sound thing to do.

You can certainly construct cases where the different lexing rules in different language modes allow you to detect which language you're in from within the preprocessor (1 2 3) or where enabling more language mode flags may reject valid code. It may be good enough for what the scanner is trying to do, but I think it's a stretch to say that it's sound.

…cessing When trying to lex numeric constants that use single quotes as separators, if the lexer is parsing a preprocessor directive, we can relax the language options check. These checks will be enforced in a later phase after preprocessing/scanning. This commit is a second attempt to fix llvm#88896. The first attempt to fix this is here: llvm#93753, which had to be reverted because of the discussions in that same PR.

AaronBallman · 2024-06-17T15:07:18Z

I guess the general question is - is it acceptable to have the Scanner operating in a language standard different than the passed in language mode and different than the compiler language standard?

I think that is acceptable. It is kinda hacky, but the lexer and preprocessor are largely independent of the language and the standard. When they do depend on those settings, taking the union of the features and letting the compiler trim it down is still a perfectly sound thing to do.

You can certainly construct cases where the different lexing rules in different language modes allow you to detect which language you're in from within the preprocessor (1 2 3) or where enabling more language mode flags may reject valid code. It may be good enough for what the scanner is trying to do, but I think it's a stretch to say that it's sound.

+1; it's definitely not sound to go this approach, though it may work well enough in practice to be worth the tradeoff. That said, I think the behavior of the dependency scanner ignoring the language options used is actually kind of problematic, unless I'm misunderstanding something. This means command line options that impact dependency scanning may fail (for example, the user might enable SYCL or OpenMP on the command line and that may have different dependencies, trigraph support may be important for things like ??=include, freestanding vs not impacts which preprocessor macros are predefined, etc. How do we handle that sort of stuff, or do we just ignore it?

AaronBallman · 2024-06-24T12:29:48Z

I guess the general question is - is it acceptable to have the Scanner operating in a language standard different than the passed in language mode and different than the compiler language standard?

I think that is acceptable. It is kinda hacky, but the lexer and preprocessor are largely independent of the language and the standard. When they do depend on those settings, taking the union of the features and letting the compiler trim it down is still a perfectly sound thing to do.

You can certainly construct cases where the different lexing rules in different language modes allow you to detect which language you're in from within the preprocessor (1 2 3) or where enabling more language mode flags may reject valid code. It may be good enough for what the scanner is trying to do, but I think it's a stretch to say that it's sound.

+1; it's definitely not sound to go this approach, though it may work well enough in practice to be worth the tradeoff. That said, I think the behavior of the dependency scanner ignoring the language options used is actually kind of problematic, unless I'm misunderstanding something. This means command line options that impact dependency scanning may fail (for example, the user might enable SYCL or OpenMP on the command line and that may have different dependencies, trigraph support may be important for things like ??=include, freestanding vs not impacts which preprocessor macros are predefined, etc. How do we handle that sort of stuff, or do we just ignore it?

Ping @jansvoboda11

jansvoboda11 · 2024-07-02T22:36:44Z

You can certainly construct cases where the different lexing rules in different language modes allow you to detect which language you're in from within the preprocessor (1 2 3) or where enabling more language mode flags may reject valid code. It may be good enough for what the scanner is trying to do, but I think it's a stretch to say that it's sound.

I'll defer this to @Bigcheese, I don't think we've hit any of these yet.

+1; it's definitely not sound to go this approach, though it may work well enough in practice to be worth the tradeoff. That said, I think the behavior of the dependency scanner ignoring the language options used is actually kind of problematic, unless I'm misunderstanding something. This means command line options that impact dependency scanning may fail (for example, the user might enable SYCL or OpenMP on the command line and that may have different dependencies, trigraph support may be important for things like ??=include, freestanding vs not impacts which preprocessor macros are predefined, etc. How do we handle that sort of stuff, or do we just ignore it?

This layer of the dependency scanner only lexes the source into a list of tokens. Preprocessing and dependency discovery happen later and do respect the command-line options.

Good point on trigraphs. I think that's similar to the digit separator issue: we can unconditionally enable this feature on this layer of the dependency scanner and let the build fail later during compilation itself.

AaronBallman · 2024-07-03T17:22:21Z

You can certainly construct cases where the different lexing rules in different language modes allow you to detect which language you're in from within the preprocessor (1 2 3) or where enabling more language mode flags may reject valid code. It may be good enough for what the scanner is trying to do, but I think it's a stretch to say that it's sound.

I'll defer this to @Bigcheese, I don't think we've hit any of these yet.

+1; it's definitely not sound to go this approach, though it may work well enough in practice to be worth the tradeoff. That said, I think the behavior of the dependency scanner ignoring the language options used is actually kind of problematic, unless I'm misunderstanding something. This means command line options that impact dependency scanning may fail (for example, the user might enable SYCL or OpenMP on the command line and that may have different dependencies, trigraph support may be important for things like ??=include, freestanding vs not impacts which preprocessor macros are predefined, etc. How do we handle that sort of stuff, or do we just ignore it?

This layer of the dependency scanner only lexes the source into a list of tokens. Preprocessing and dependency discovery happen later and do respect the command-line options.

But the list of tokens depends on things like what features are enabled, right? e.g., -fchar8_t introduces a new keyword, while _Atomic isn't a keyword in OpenCL, and bool is only a keyword in C23 and later but is always a keyword in C++, etc. Ah, but in terms of dependency scanning, whether those are keywords or not probably doesn't matter? And because we expose things like embed as a keyword in all language modes, not just C23, the language standard doesn't matter either?

Good point on trigraphs. I think that's similar to the digit separator issue: we can unconditionally enable this feature on this layer of the dependency scanner and let the build fail later during compilation itself.

Seems reasonable to me.

jansvoboda11 · 2024-11-21T21:06:11Z

But the list of tokens depends on things like what features are enabled, right? e.g., -fchar8_t introduces a new keyword, while _Atomic isn't a keyword in OpenCL, and bool is only a keyword in C23 and later but is always a keyword in C++, etc. Ah, but in terms of dependency scanning, whether those are keywords or not probably doesn't matter? And because we expose things like embed as a keyword in all language modes, not just C23, the language standard doesn't matter either?

I touched on this in #95798 (comment). We could represent these cases in the token list such that when the scanner feeds it to the LanguageOptions-aware preprocessor, it can still emit corresponding diagnostics.

llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" labels May 30, 2024

nishithshah2211 force-pushed the gh-88896 branch from 01323e5 to ae79ebe Compare May 30, 2024 23:12

nishithshah2211 force-pushed the gh-88896 branch from ae79ebe to 46a2503 Compare May 31, 2024 18:27

cor3ntin approved these changes Jun 3, 2024

View reviewed changes

cor3ntin merged commit 9862080 into llvm:main Jun 3, 2024
7 checks passed

nishithshah2211 deleted the gh-88896 branch June 3, 2024 15:27

nishithshah2211 added a commit to nishithshah2211/llvm-project that referenced this pull request Jun 5, 2024

Revert "Pass LangOpts from CompilerInstance to DependencyScanningWork…

3bc4567

…er (llvm#93753)" This reverts commit 9862080.

nishithshah2211 mentioned this pull request Jun 5, 2024

Revert "Pass LangOpts from CompilerInstance to DependencyScanningWork… #94488

Merged

nishithshah2211 mentioned this pull request Jun 5, 2024

[clang-scan-deps] Error if integer literal contains a single quote in #if directive #88896

Open

jansvoboda11 pushed a commit that referenced this pull request Jun 5, 2024

Revert "Pass LangOpts from CompilerInstance to DependencyScanningWork…

0559eaf

…er (#93753)" (#94488) This reverts commit 9862080.

nishithshah2211 mentioned this pull request Jun 17, 2024

clang: Relax LangOpts checks when lexing quoted numbers during preprocessing #95798

Open

AaronBallman mentioned this pull request Jun 20, 2024

[Clang] Allow raw string literals in C as an extension #88265

Merged

Pass LangOpts from CompilerInstance to DependencyScanningWorker #93753

Pass LangOpts from CompilerInstance to DependencyScanningWorker #93753

Uh oh!

Conversation

nishithshah2211 commented May 30, 2024

Uh oh!

github-actions bot commented May 30, 2024

Uh oh!

llvmbot commented May 30, 2024

Uh oh!

nishithshah2211 commented May 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cor3ntin commented May 30, 2024

Uh oh!

nishithshah2211 commented May 31, 2024

Uh oh!

cor3ntin commented May 31, 2024

Uh oh!

nishithshah2211 commented May 31, 2024

Uh oh!

cor3ntin left a comment

Choose a reason for hiding this comment

Uh oh!

nishithshah2211 commented Jun 3, 2024

Uh oh!

Uh oh!

github-actions bot commented Jun 3, 2024

Uh oh!

jansvoboda11 commented Jun 3, 2024

Uh oh!

nishithshah2211 commented Jun 3, 2024

Uh oh!

jansvoboda11 commented Jun 3, 2024

Uh oh!

nishithshah2211 commented Jun 3, 2024

Uh oh!

jansvoboda11 commented Jun 5, 2024

Uh oh!

nishithshah2211 commented Jun 5, 2024

Uh oh!

zygoloid commented Jun 5, 2024

Uh oh!

AaronBallman commented Jun 17, 2024

Uh oh!

AaronBallman commented Jun 24, 2024

Uh oh!

jansvoboda11 commented Jul 2, 2024

Uh oh!

AaronBallman commented Jul 3, 2024

Uh oh!

jansvoboda11 commented Nov 21, 2024

Uh oh!

Uh oh!

nishithshah2211 commented May 30, 2024 •

edited

Loading