[ASTWriter] Do not allocate source location space for module maps used only for textual headers #116374

ilya-biryukov · 2024-11-15T11:27:49Z

This is a follow up to #112015 and it reduces the unnecessary duplication of source locations further.

We do not need to allocate source location space in the serialized PCMs for module maps used only to find textual headers. Those module maps are never referenced from anywhere in the serialized ASTs and are re-read in other compilations.
This change should not affect correctness of Clang compilations or clang-scan-deps in any way.

We do need the InputFile entry in the serialized AST because clang-scan-deps relies on it. The previous patch introduced a mechanism to do exactly that.

We have found that to finally remove any duplication of module maps we use internally in our build system.

…d only for textual headers This is a follow up to llvm#112015 and it reduces the duplication of source locations further. We do not need to allocate source location space in the serialized PCMs for module maps used only to find textual headers. Those module maps are never referenced from anywhere in the serialized ASTs and are re-read in other compilations. We do need the InputFile entry for the in the serialized AST because clang-scan-deps relies on it. The previous patch introduced a mechanism to do exactly that.

github-actions · 2024-11-15T11:31:11Z

✅ With the latest revision this PR passed the C/C++ code formatter.

…aps used only for textual headers Format the code

llvmbot · 2024-11-16T06:31:31Z

@llvm/pr-subscribers-clang-modules

@llvm/pr-subscribers-clang

Author: Ilya Biryukov (ilya-biryukov)

Changes

This is a follow up to #112015 and it reduces the unnecessary duplication of source locations further.

We do not need to allocate source location space in the serialized PCMs for module maps used only to find textual headers. Those module maps are never referenced from anywhere in the serialized ASTs and are re-read in other compilations.
This change should not affect correctness of Clang compilations or clang-scan-deps in any way.

We do need the InputFile entry in the serialized AST because clang-scan-deps relies on it. The previous patch introduced a mechanism to do exactly that.

We have found that to finally remove any duplication of module maps we use internally in our build system.

Full diff: https://github.com/llvm/llvm-project/pull/116374.diff

2 Files Affected:

(modified) clang/lib/Serialization/ASTWriter.cpp (+33-14)
(added) clang/test/Modules/prune-non-affecting-module-map-repeated-textual.cpp (+104)

diff --git a/clang/lib/Serialization/ASTWriter.cpp b/clang/lib/Serialization/ASTWriter.cpp
index 88b3e649a5d466..9b81d7c53fc170 100644
--- a/clang/lib/Serialization/ASTWriter.cpp
+++ b/clang/lib/Serialization/ASTWriter.cpp
@@ -184,14 +184,30 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
   const SourceManager &SM = PP.getSourceManager();
   const ModuleMap &MM = HS.getModuleMap();
 
-  llvm::DenseSet<FileID> ModuleMaps;
-
-  llvm::DenseSet<const Module *> ProcessedModules;
-  auto CollectModuleMapsForHierarchy = [&](const Module *M) {
+  // Module maps used only by textual headers are special. Their FileID is
+  // non-affecting, but their FileEntry is (i.e. must be written as InputFile).
+  enum AffectedReason : bool {
+    ARTextualHeader = 0,
+    ARImportOrTextualHeader = 1,
+  };
+  auto AssignMostImportant = [](AffectedReason &L, AffectedReason R) {
+    L = std::max(L, R);
+  };
+  llvm::DenseMap<FileID, AffectedReason> ModuleMaps;
+  llvm::DenseMap<const Module *, AffectedReason> ProcessedModules;
+  auto CollectModuleMapsForHierarchy = [&](const Module *M,
+                                           AffectedReason Reason) {
     M = M->getTopLevelModule();
 
-    if (!ProcessedModules.insert(M).second)
+    // We need to process the header either when it was not present of when we
+    // previously flagged module map as textual headers and now we found a
+    // proper import.
+    if (auto [It, Inserted] = ProcessedModules.insert({M, Reason});
+        !Inserted && Reason <= It->second) {
       return;
+    } else {
+      It->second = Reason;
+    }
 
     std::queue<const Module *> Q;
     Q.push(M);
@@ -202,12 +218,12 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
       // The containing module map is affecting, because it's being pointed
       // into by Module::DefinitionLoc.
       if (auto F = MM.getContainingModuleMapFileID(Mod); F.isValid())
-        ModuleMaps.insert(F);
+        AssignMostImportant(ModuleMaps[F], Reason);
       // For inferred modules, the module map that allowed inferring is not
       // related to the virtual containing module map file. It did affect the
       // compilation, though.
       if (auto UniqF = MM.getModuleMapFileIDForUniquing(Mod); UniqF.isValid())
-        ModuleMaps.insert(UniqF);
+        AssignMostImportant(ModuleMaps[UniqF], Reason);
 
       for (auto *SubM : Mod->submodules())
         Q.push(SubM);
@@ -216,7 +232,7 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
 
   // Handle all the affecting modules referenced from the root module.
 
-  CollectModuleMapsForHierarchy(RootModule);
+  CollectModuleMapsForHierarchy(RootModule, ARImportOrTextualHeader);
 
   std::queue<const Module *> Q;
   Q.push(RootModule);
@@ -225,9 +241,9 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
     Q.pop();
 
     for (const Module *ImportedModule : CurrentModule->Imports)
-      CollectModuleMapsForHierarchy(ImportedModule);
+      CollectModuleMapsForHierarchy(ImportedModule, ARImportOrTextualHeader);
     for (const Module *UndeclaredModule : CurrentModule->UndeclaredUses)
-      CollectModuleMapsForHierarchy(UndeclaredModule);
+      CollectModuleMapsForHierarchy(UndeclaredModule, ARImportOrTextualHeader);
 
     for (auto *M : CurrentModule->submodules())
       Q.push(M);
@@ -256,7 +272,7 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
 
     for (const auto &KH : HS.findResolvedModulesForHeader(*File))
       if (const Module *M = KH.getModule())
-        CollectModuleMapsForHierarchy(M);
+        CollectModuleMapsForHierarchy(M, ARTextualHeader);
   }
 
   // FIXME: This algorithm is not correct for module map hierarchies where
@@ -278,13 +294,16 @@ GetAffectingModuleMaps(const Preprocessor &PP, Module *RootModule) {
   // includes a module map defining a module that's not a submodule of X.
 
   llvm::DenseSet<const FileEntry *> ModuleFileEntries;
-  for (FileID MM : ModuleMaps) {
-    if (auto *FE = SM.getFileEntryForID(MM))
+  llvm::DenseSet<FileID> ModuleFileIDs;
+  for (auto [FID, Reason] : ModuleMaps) {
+    if (Reason == ARImportOrTextualHeader)
+      ModuleFileIDs.insert(FID);
+    if (auto *FE = SM.getFileEntryForID(FID))
       ModuleFileEntries.insert(FE);
   }
 
   AffectingModuleMaps R;
-  R.DefinitionFileIDs = std::move(ModuleMaps);
+  R.DefinitionFileIDs = std::move(ModuleFileIDs);
   R.DefinitionFiles = std::move(ModuleFileEntries);
   return std::move(R);
 }
diff --git a/clang/test/Modules/prune-non-affecting-module-map-repeated-textual.cpp b/clang/test/Modules/prune-non-affecting-module-map-repeated-textual.cpp
new file mode 100644
index 00000000000000..30914056c55fa2
--- /dev/null
+++ b/clang/test/Modules/prune-non-affecting-module-map-repeated-textual.cpp
@@ -0,0 +1,104 @@
+// Same as prune-non-affecting-module-map-repeated.cpp, but check that textual-only
+// inclusions do not cause duplication of the module map files they are defined in.
+
+// RUN: rm -rf %t && mkdir %t
+// RUN: split-file %s %t
+
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mixed.map\
+// RUN:   -fmodule-map-file=%t/mod1.map \
+// RUN:   -fmodule-name=mod1 -emit-module %t/mod1.map -o %t/mod1.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/mixed.map\
+// RUN:   -fmodule-name=mixed -emit-module %t/mixed.map -o %t/mixed.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod1.map -fmodule-map-file=%t/mod2.map \
+// RUN:   -fmodule-file=%t/mod1.pcm -fmodule-file=%t/mixed.pcm \
+// RUN:   -fmodule-name=mod2 -emit-module %t/mod2.map -o %t/mod2.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod2.map -fmodule-map-file=%t/mod3.map \
+// RUN:   -fmodule-file=%t/mod2.pcm -fmodule-file=%t/mixed.pcm \
+// RUN:   -fmodule-name=mod3 -emit-module %t/mod3.map -o %t/mod3.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules \
+// RUN:   -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod3.map -fmodule-map-file=%t/mod4.map \
+// RUN:   -fmodule-file=%t/mod3.pcm \
+// RUN:   -fmodule-name=mod4 -emit-module %t/mod4.map -o %t/mod4.pcm
+// RUN: %clang_cc1 -xc++ -fmodules -fno-implicit-modules -fmodule-map-file=%t/base.map -fmodule-map-file=%t/mod4.map -fmodule-file=%t/mod4.pcm -fsyntax-only -verify %t/check_slocs.cc
+
+//--- base.map
+module base { textual header "vector.h" }
+//--- mixed.map
+module mixed { textual header "mixed_text.h" header "mixed_mod.h"}
+//--- mod1.map
+module mod1 { header "mod1.h" }
+//--- mod2.map
+module mod2 { header "mod2.h" }
+//--- mod3.map
+module mod3 { header "mod3.h" }
+//--- mod4.map
+module mod4 { header "mod4.h" }
+//--- check_slocs.cc
+#include "mod4.h"
+#include "vector.h"
+#pragma clang __debug sloc_usage // expected-remark {{source manager location address space usage}}
+// expected-note@* {{% of available space}}
+
+// base.map must only be present once, despite being used in each module.
+// Because its location in every module compile should be non-affecting.
+
+// [email protected]:1 {{file entered 1 time}}
+
+// different modules use either only textual header from mixed.map or both textual and modular
+// headers. Either combination must lead to only 1 use at the end, because the module is ultimately
+// in the import chain and any textual uses should not change that.
+
+// [email protected]:1 {{file entered 1 time}}
+
+// expected-note@* + {{file entered}}
+
+
+//--- vector.h
+#ifndef VECTOR_H
+#define VECTOR_H
+#endif
+
+//--- mixed_text.h
+#ifndef MIXED_TEXT_H
+#define MIXED_TEXT_H
+#endif
+//--- mixed_mod.h
+#ifndef MIXED_MOD_H
+#define MIXED_MOD_H
+#endif
+
+//--- mod1.h
+#ifndef MOD1
+#define MOD1
+#include "vector.h"
+#include "mixed_text.h"
+int mod1();
+#endif
+//--- mod2.h
+#ifndef MOD2
+#define MOD2
+#include "vector.h"
+#include "mod1.h"
+#include "mixed_mod.h"
+int mod2();
+#endif
+//--- mod3.h
+#ifndef MOD3
+#define MOD3
+#include "vector.h"
+#include "mod2.h"
+#include "mixed_text.h"
+#include "mixed_mod.h"
+int mod3();
+#endif
+//--- mod4.h
+#ifndef MOD4
+#define MOD4
+#include "vector.h"
+#include "mod3.h"
+int mod4();
+#endif

ilya-biryukov · 2024-11-19T16:22:46Z

Added @bricknerb to help with reviewing this (esp. the code style aspects)

bricknerb · 2024-11-20T14:02:29Z

Added @bricknerb to help with reviewing this (esp. the code style aspects)

Focused on code style, please let me know if you want me to dive in more to correctness.

bricknerb · 2024-11-20T13:51:39Z

clang/lib/Serialization/ASTWriter.cpp

+    } else {
+      It->second = Reason;
+    }


Avoid else after return.

Suggested change

} else {

It->second = Reason;

}

}

It->second = Reason;

I cannot in that case, because It is only valid for the scope of the if.
Increasing the visibility scope of It seems like a worse trade-off. But let me know if you have other suggestions how to rewrite the code better, I might be missing some ideas.

clang/lib/Serialization/ASTWriter.cpp

bricknerb · 2024-11-20T14:01:26Z

clang/lib/Serialization/ASTWriter.cpp

+  auto AssignMostImportant = [](AffectedReason &L, AffectedReason R) {
+    L = std::max(L, R);
+  };


Seems like using LHS and RHS is more common (and more readable IMO) in Clang than L and R.

Suggested change

auto AssignMostImportant = [](AffectedReason &L, AffectedReason R) {

L = std::max(L, R);

};

auto AssignMostImportant = [](AffectedReason &LHS, AffectedReason RHS) {

LHS = std::max(LHS, RHS);

};

I've updated to use LHS and RHS. However, I actually wanted to argue that L and R are better choices for small functions. I have found that because of the HS suffix, it is actually easier to confuse the two and misread the code.
It usually does not matter, but if there's an error, it becomes harder to spot, e.g.

LHS.a < RHS.a && LHS.b < RHS.b && LHS.c < LHS.c && LHS.d < RHS.d

vs

L.a < R.a && L.b < R.b && L.c < L.c && L.d < R.d

At least for me, the second variant stands out much more clearly.
That's the reason why I started using this pattern instead of the more common LHS and RHS.

jansvoboda11 · 2024-11-21T02:24:31Z

Does preprocessing from AST files (ab75597) with decluse checking (f3f8461) still work with this patch? I'm surprised that removing of module maps just because they only provided a textual header doesn't have any consequences in that area.

ilya-biryukov · 2024-11-25T17:17:09Z

Does preprocessing from AST files (ab75597) with decluse checking (f3f8461) still work with this patch? I'm surprised that removing of module maps just because they only provided a textual header doesn't have any consequences in that area.

Sorry for the late reply, I had to dive into the code to understand what this does a little better. I believe the answer is yes, it still works.

The reason behind it is that we still keep those module maps in the InputFile structures that we write to the PCMs, we only avoid taking the source location space (i.e. do not writeFileID) for them.
However, I think we should add another module with a textual header to the test that checks the preprocessing logic as a sanity check and to avoid breaking this in the future. I will do just that in this PR (likely tomorrow)

…aps used only for textual headers fix a typo in the comment

…aps used only for textual headers Add _ to the enumerator names

…aps used only for textual headers Change to LHS and RHS

ilya-biryukov · 2024-11-29T16:12:28Z

Finally got to it, there is already a test that checks this, see:

llvm-project/clang/test/Modules/preprocess-decluse.cpp

Line 1 in ef86a31

// RUN: rm -rf %t

and it passes successfully.

I've confirmed why it works too. The module map file gets reread here, this function still correctly reports those module map files.

@jansvoboda11 please let me know if you have any other concerns.

ilya-biryukov · 2024-12-03T09:41:45Z

@jansvoboda11 friendly ping. We need this internally to relieve the pressure from the source location exhaustions happening on our side and it'd be great to land it.
I am certain your concerns are addressed, but maybe you had other comments.

ilya-biryukov · 2024-12-04T16:10:57Z

@jansvoboda11 I plan to land this tomorrow/slightly later as I believe all concerns are addressed. And I would be happy to revert if it causes problems or follow up with fixes if you'll have more comments.

However, feel free to say something until that happens if you feel like I'm missing something.

jansvoboda11 · 2025-02-11T21:18:13Z

Hi @ilya-biryukov, I wanted to apologize for the late reply (I was on a leave) and thank you for going through the test case and taking time to understand why it works. Sounds good!

ilya-biryukov · 2025-02-12T17:35:54Z

@jansvoboda11 No worries at all, thanks for getting back to me.

ilya-biryukov requested a review from jansvoboda11 November 15, 2024 11:27

fixup! [ASTWriter] Do not allocate source location space for module m…

e0a1835

…aps used only for textual headers Format the code

llvmbot added clang Clang issues not falling into any other category clang:modules C++20 modules and Clang Header Modules labels Nov 16, 2024

ilya-biryukov requested a review from bricknerb November 19, 2024 16:22

bricknerb approved these changes Nov 20, 2024

View reviewed changes

ilya-biryukov added 3 commits November 29, 2024 17:04

fixup! [ASTWriter] Do not allocate source location space for module m…

5611185

…aps used only for textual headers fix a typo in the comment

fixup! [ASTWriter] Do not allocate source location space for module m…

9cad088

…aps used only for textual headers Add _ to the enumerator names

fixup! [ASTWriter] Do not allocate source location space for module m…

441efe6

…aps used only for textual headers Change to LHS and RHS

Merge branch 'main' into dedup3

8899db3

ilya-biryukov merged commit f1d81db into llvm:main Dec 5, 2024
8 checks passed

[ASTWriter] Do not allocate source location space for module maps used only for textual headers #116374

[ASTWriter] Do not allocate source location space for module maps used only for textual headers #116374

Uh oh!

Conversation

ilya-biryukov commented Nov 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Nov 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Nov 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ilya-biryukov commented Nov 19, 2024

Uh oh!

bricknerb commented Nov 20, 2024

Uh oh!

bricknerb Nov 20, 2024

Choose a reason for hiding this comment

Uh oh!

ilya-biryukov Nov 29, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bricknerb Nov 20, 2024

Choose a reason for hiding this comment

Uh oh!

ilya-biryukov Nov 29, 2024

Choose a reason for hiding this comment

Uh oh!

jansvoboda11 commented Nov 21, 2024

Uh oh!

ilya-biryukov commented Nov 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ilya-biryukov commented Nov 29, 2024

Uh oh!

ilya-biryukov commented Dec 3, 2024

Uh oh!

ilya-biryukov commented Dec 4, 2024

Uh oh!

Uh oh!

jansvoboda11 commented Feb 11, 2025

Uh oh!

ilya-biryukov commented Feb 12, 2025

Uh oh!

Uh oh!

ilya-biryukov commented Nov 15, 2024 •

edited

Loading

github-actions bot commented Nov 15, 2024 •

edited

Loading

llvmbot commented Nov 16, 2024 •

edited

Loading

ilya-biryukov commented Nov 25, 2024 •

edited

Loading