Skip to content

[Safe Buffers] Serialize unsafe_buffer_usage pragmas #92031

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 14, 2024

Conversation

ziqingluo-90
Copy link
Contributor

A pair of #pragma clang unsafe_buffer_usage begin/end pragmas marks a warning-opt-out region. The begin and end locations (opt-out regions) are stored in preprocessor instances (PP) and will be queried by the -Wunsafe-buffer-usage analyzer.

The commit adds serialization and de-serialization implementations for the stored regions. Basically, the serialized representation of the regions of a PP is a (ordered) sequence of source location encodings. During serialization, it only serializes regions of the current translation unit. Regions from loaded files are not serialized. During de-serialization, regions from loaded files are stored by their source files. When later one queries if a loaded location l is in an opt-out region, PP looks up regions by the source file of l.

The reported issue at upstream: #90501
rdar://124035402

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:modules C++20 modules and Clang Header Modules labels May 13, 2024
@llvmbot
Copy link
Member

llvmbot commented May 13, 2024

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-modules

Author: Ziqing Luo (ziqingluo-90)

Changes

A pair of #pragma clang unsafe_buffer_usage begin/end pragmas marks a warning-opt-out region. The begin and end locations (opt-out regions) are stored in preprocessor instances (PP) and will be queried by the -Wunsafe-buffer-usage analyzer.

The commit adds serialization and de-serialization implementations for the stored regions. Basically, the serialized representation of the regions of a PP is a (ordered) sequence of source location encodings. During serialization, it only serializes regions of the current translation unit. Regions from loaded files are not serialized. During de-serialization, regions from loaded files are stored by their source files. When later one queries if a loaded location l is in an opt-out region, PP looks up regions by the source file of l.

The reported issue at upstream: #90501
rdar://124035402


Full diff: https://github.com/llvm/llvm-project/pull/92031.diff

12 Files Affected:

  • (modified) clang/include/clang/Lex/Preprocessor.h (+18-4)
  • (modified) clang/include/clang/Serialization/ASTBitCodes.h (+3)
  • (modified) clang/lib/Lex/Preprocessor.cpp (+87-19)
  • (modified) clang/lib/Serialization/ASTReader.cpp (+11)
  • (modified) clang/lib/Serialization/ASTWriter.cpp (+7)
  • (added) clang/test/Modules/Inputs/SafeBuffers/base.h (+9)
  • (added) clang/test/Modules/Inputs/SafeBuffers/safe_buffers_test.modulemap (+10)
  • (added) clang/test/Modules/Inputs/SafeBuffers/test_sub1.h (+20)
  • (added) clang/test/Modules/Inputs/SafeBuffers/test_sub2.h (+11)
  • (added) clang/test/Modules/safe_buffers_optout.cpp (+39)
  • (added) clang/test/SemaCXX/warn-unsafe-buffer-usage-pragma-pch-complex.cpp (+72)
  • (added) clang/test/SemaCXX/warn-unsafe-buffer-usage-pragma-pch.cpp (+27)
diff --git a/clang/include/clang/Lex/Preprocessor.h b/clang/include/clang/Lex/Preprocessor.h
index e89b4a2c5230e..8d6884ebe7597 100644
--- a/clang/include/clang/Lex/Preprocessor.h
+++ b/clang/include/clang/Lex/Preprocessor.h
@@ -2883,11 +2883,15 @@ class Preprocessor {
   /// otherwise.
   SourceLocation CurrentSafeBufferOptOutStart; // It is used to report the start location of an never-closed region.
 
-  // An ordered sequence of "-Wunsafe-buffer-usage" opt-out regions in one
-  // translation unit. Each region is represented by a pair of start and end
-  // locations.  A region is "open" if its' start and end locations are
+  using SafeBufferOptOutMapTy =
+      SmallVector<std::pair<SourceLocation, SourceLocation>, 16>;
+  // An ordered sequence of "-Wunsafe-buffer-usage" opt-out regions in this
+  // translation unit. Each region is represented by a pair of start and
+  // end locations.  A region is "open" if its' start and end locations are
   // identical.
-  SmallVector<std::pair<SourceLocation, SourceLocation>, 8> SafeBufferOptOutMap;
+  SafeBufferOptOutMapTy SafeBufferOptOutMap;
+  // `SafeBufferOptOutMap`s of loaded files:
+  llvm::DenseMap<FileID, SafeBufferOptOutMapTy> LoadedSafeBufferOptOutMap;
 
 public:
   /// \return true iff the given `Loc` is in a "-Wunsafe-buffer-usage" opt-out
@@ -2918,6 +2922,16 @@ class Preprocessor {
   ///          opt-out region
   bool isPPInSafeBufferOptOutRegion(SourceLocation &StartLoc);
 
+  /// \return a sequence of SourceLocations representing ordered opt-out regions
+  /// specified by
+  /// `\#pragma clang unsafe_buffer_usage begin/end`s of this translation unit.
+  SmallVector<SourceLocation, 64> serializeSafeBufferOptOutMap() const;
+
+  /// \param SrcLocSeqs a sequence of SourceLocations deserialized from a
+  /// record of code `PP_UNSAFE_BUFFER_USAGE`.
+  void setDeserializedSafeBufferOptOutMap(
+      const SmallVectorImpl<SourceLocation> &SrcLocSeqs);
+
 private:
   /// Helper functions to forward lexing to the actual lexer. They all share the
   /// same signature.
diff --git a/clang/include/clang/Serialization/ASTBitCodes.h b/clang/include/clang/Serialization/ASTBitCodes.h
index dcfa4ac0c1967..d1a0eba943039 100644
--- a/clang/include/clang/Serialization/ASTBitCodes.h
+++ b/clang/include/clang/Serialization/ASTBitCodes.h
@@ -775,6 +775,9 @@ enum ASTRecordTypes {
   /// Record code for lexical and visible block for delayed namespace in
   /// reduced BMI.
   DELAYED_NAMESPACE_LEXICAL_VISIBLE_RECORD = 68,
+
+  /// Record code for \#pragma clang unsafe_buffer_usage begin/end
+  PP_UNSAFE_BUFFER_USAGE = 69,
 };
 
 /// Record types used within a source manager block.
diff --git a/clang/lib/Lex/Preprocessor.cpp b/clang/lib/Lex/Preprocessor.cpp
index 0b70192743a39..6a41e3d4138aa 100644
--- a/clang/lib/Lex/Preprocessor.cpp
+++ b/clang/lib/Lex/Preprocessor.cpp
@@ -58,6 +58,7 @@
 #include "llvm/ADT/SmallString.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/StringRef.h"
+#include "llvm/ADT/iterator_range.h"
 #include "llvm/Support/Capacity.h"
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/MemoryBuffer.h"
@@ -1483,26 +1484,41 @@ void Preprocessor::emitFinalMacroWarning(const Token &Identifier,
 }
 
 bool Preprocessor::isSafeBufferOptOut(const SourceManager &SourceMgr,
-                                           const SourceLocation &Loc) const {
-  // Try to find a region in `SafeBufferOptOutMap` where `Loc` is in:
-  auto FirstRegionEndingAfterLoc = llvm::partition_point(
-      SafeBufferOptOutMap,
-      [&SourceMgr,
-       &Loc](const std::pair<SourceLocation, SourceLocation> &Region) {
-        return SourceMgr.isBeforeInTranslationUnit(Region.second, Loc);
-      });
+                                      const SourceLocation &Loc) const {
+  // The lambda that tests if a `Loc` is in an opt-out region given one opt-out
+  // region map:
+  auto TestInMap = [&SourceMgr](const SafeBufferOptOutMapTy &Map,
+                                const SourceLocation &Loc) -> bool {
+    // Try to find a region in `SafeBufferOptOutMap` where `Loc` is in:
+    auto FirstRegionEndingAfterLoc = llvm::partition_point(
+        Map, [&SourceMgr,
+              &Loc](const std::pair<SourceLocation, SourceLocation> &Region) {
+          return SourceMgr.isBeforeInTranslationUnit(Region.second, Loc);
+        });
+
+    if (FirstRegionEndingAfterLoc != Map.end()) {
+      // To test if the start location of the found region precedes `Loc`:
+      return SourceMgr.isBeforeInTranslationUnit(
+          FirstRegionEndingAfterLoc->first, Loc);
+    }
+    // If we do not find a region whose end location passes `Loc`, we want to
+    // check if the current region is still open:
+    if (!Map.empty() && Map.back().first == Map.back().second)
+      return SourceMgr.isBeforeInTranslationUnit(Map.back().first, Loc);
+    return false;
+  };
 
-  if (FirstRegionEndingAfterLoc != SafeBufferOptOutMap.end()) {
-    // To test if the start location of the found region precedes `Loc`:
-    return SourceMgr.isBeforeInTranslationUnit(FirstRegionEndingAfterLoc->first,
-                                               Loc);
-  }
-  // If we do not find a region whose end location passes `Loc`, we want to
-  // check if the current region is still open:
-  if (!SafeBufferOptOutMap.empty() &&
-      SafeBufferOptOutMap.back().first == SafeBufferOptOutMap.back().second)
-    return SourceMgr.isBeforeInTranslationUnit(SafeBufferOptOutMap.back().first,
-                                               Loc);
+  if (SourceMgr.isLocalSourceLocation(Loc))
+    return TestInMap(SafeBufferOptOutMap, Loc);
+
+  // Expand macros (if any) until it gets to a file location:
+  SourceLocation FLoc = SourceMgr.getFileLoc(Loc);
+  FileID FlocFID = SourceMgr.getDecomposedLoc(FLoc).first;
+  auto LoadedMap = LoadedSafeBufferOptOutMap.find(FlocFID);
+
+  if (LoadedMap != LoadedSafeBufferOptOutMap.end())
+    return TestInMap(LoadedMap->getSecond(), Loc);
+  // FIXME: think out if this is unreachable?
   return false;
 }
 
@@ -1551,6 +1567,58 @@ bool Preprocessor::isPPInSafeBufferOptOutRegion(SourceLocation &StartLoc) {
   return InSafeBufferOptOutRegion;
 }
 
+SmallVector<SourceLocation, 64>
+Preprocessor::serializeSafeBufferOptOutMap() const {
+  assert(!InSafeBufferOptOutRegion &&
+         "Attempt to serialize safe buffer opt-out regions before file being "
+         "completely preprocessed");
+
+  SmallVector<SourceLocation, 64> SrcSeq;
+
+  for (const auto &[begin, end] : SafeBufferOptOutMap) {
+    SrcSeq.push_back(begin);
+    SrcSeq.push_back(end);
+  }
+  // Only `SafeBufferOptOutMap` gets serialized. No need to serialize
+  // `LoadedSafeBufferOptOutMap` because if this TU loads a pch/module, every
+  // pch/module in the pch-chain/module-DAG will be loaded one by one in order.
+  // It means that for each loading pch/module m, it just needs to load m's own
+  // `SafeBufferOptOutMap`.
+  return SrcSeq;
+}
+
+void Preprocessor::setDeserializedSafeBufferOptOutMap(
+    const SmallVectorImpl<SourceLocation> &SourceLocations) {
+  auto It = SourceLocations.begin();
+
+  assert(SourceLocations.size() % 2 == 0 &&
+         "ill-formed SourceLocation sequence");
+  while (It != SourceLocations.end()) {
+    SourceLocation begin = *It++;
+    SourceLocation end = *It++;
+    SourceLocation FileLoc = SourceMgr.getFileLoc(begin);
+    FileID FID = SourceMgr.getDecomposedLoc(FileLoc).first;
+
+    if (FID.isInvalid()) {
+      // I suppose this should not happen:
+      assert(false && "Attempted to read a safe buffer opt-out region whose "
+                      "begin location is associated to an invalid File ID.");
+      break;
+    }
+    assert(!SourceMgr.isLocalFileID(FID) && "Expected a pch/module file");
+    // Here we assume that
+    // `SourceMgr.getFileLoc(begin) == SourceMgr.getFileLoc(end)`.
+    // Though it may not hold in very rare and strange cases, i.e., a pair of
+    // opt-out pragams written in different files but are always used in a way
+    // that they match in a TU (otherwise PP will complaint).
+    // FIXME: PP should complaint about cross file safe buffer opt-out pragmas.
+    assert(FID == SourceMgr.getDecomposedLoc(SourceMgr.getFileLoc(end)).first &&
+           "Maybe a safe buffer opt-out region starts and ends at different "
+           "files.");
+    LoadedSafeBufferOptOutMap[FID].emplace_back(begin, end);
+  }
+}
+
 ModuleLoader::~ModuleLoader() = default;
 
 CommentHandler::~CommentHandler() = default;
diff --git a/clang/lib/Serialization/ASTReader.cpp b/clang/lib/Serialization/ASTReader.cpp
index 43b69045bb054..4e22b3e1187c0 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -3586,6 +3586,17 @@ llvm::Error ASTReader::ReadASTBlock(ModuleFile &F,
       break;
     }
 
+    case PP_UNSAFE_BUFFER_USAGE: {
+      if (!Record.empty()) {
+        SmallVector<SourceLocation, 64> SrcLocs;
+        unsigned Idx = 0;
+        while (Idx < Record.size())
+          SrcLocs.push_back(ReadSourceLocation(F, Record, Idx));
+        PP.setDeserializedSafeBufferOptOutMap(SrcLocs);
+      }
+      break;
+    }
+
     case PP_CONDITIONAL_STACK:
       if (!Record.empty()) {
         unsigned Idx = 0, End = Record.size() - 1;
diff --git a/clang/lib/Serialization/ASTWriter.cpp b/clang/lib/Serialization/ASTWriter.cpp
index 30195868ca996..3203173640d18 100644
--- a/clang/lib/Serialization/ASTWriter.cpp
+++ b/clang/lib/Serialization/ASTWriter.cpp
@@ -911,6 +911,7 @@ void ASTWriter::WriteBlockInfoBlock() {
   RECORD(PP_CONDITIONAL_STACK);
   RECORD(DECLS_TO_CHECK_FOR_DEFERRED_DIAGS);
   RECORD(PP_ASSUME_NONNULL_LOC);
+  RECORD(PP_UNSAFE_BUFFER_USAGE);
 
   // SourceManager Block.
   BLOCK(SOURCE_MANAGER_BLOCK);
@@ -2439,6 +2440,12 @@ void ASTWriter::WritePreprocessor(const Preprocessor &PP, bool IsModule) {
     Record.clear();
   }
 
+  // Write the safe buffer opt-out region map in PP
+  for (SourceLocation &S : PP.serializeSafeBufferOptOutMap())
+    AddSourceLocation(std::move(S), Record);
+  Stream.EmitRecord(PP_UNSAFE_BUFFER_USAGE, Record);
+  Record.clear();
+
   // Enter the preprocessor block.
   Stream.EnterSubblock(PREPROCESSOR_BLOCK_ID, 3);
 
diff --git a/clang/test/Modules/Inputs/SafeBuffers/base.h b/clang/test/Modules/Inputs/SafeBuffers/base.h
new file mode 100644
index 0000000000000..660df17998a48
--- /dev/null
+++ b/clang/test/Modules/Inputs/SafeBuffers/base.h
@@ -0,0 +1,9 @@
+#ifdef __cplusplus
+int base(int *p) {
+  int x = p[5];
+#pragma clang unsafe_buffer_usage begin
+  int y = p[5];
+#pragma clang unsafe_buffer_usage end
+  return x + y;
+}
+#endif
diff --git a/clang/test/Modules/Inputs/SafeBuffers/safe_buffers_test.modulemap b/clang/test/Modules/Inputs/SafeBuffers/safe_buffers_test.modulemap
new file mode 100644
index 0000000000000..a4826c6514b98
--- /dev/null
+++ b/clang/test/Modules/Inputs/SafeBuffers/safe_buffers_test.modulemap
@@ -0,0 +1,10 @@
+module safe_buffers_test_base {
+  header "base.h"
+}
+
+module safe_buffers_test_optout {
+  explicit module test_sub1 {  header "test_sub1.h" }
+  explicit module test_sub2 {  textual header "test_sub2.h" }
+  use safe_buffers_test_base
+}
+
diff --git a/clang/test/Modules/Inputs/SafeBuffers/test_sub1.h b/clang/test/Modules/Inputs/SafeBuffers/test_sub1.h
new file mode 100644
index 0000000000000..c2b105c6e748a
--- /dev/null
+++ b/clang/test/Modules/Inputs/SafeBuffers/test_sub1.h
@@ -0,0 +1,20 @@
+#include "base.h"
+
+#ifdef __cplusplus
+int sub1(int *p) {
+  int x = p[5];
+#pragma clang unsafe_buffer_usage begin  
+  int y = p[5];
+#pragma clang unsafe_buffer_usage end
+  return x + y;
+}
+
+template <typename T>
+T sub1_T(T *p) {
+  T x = p[5];
+#pragma clang unsafe_buffer_usage begin  
+  T y = p[5];
+#pragma clang unsafe_buffer_usage end
+  return x + y;
+}
+#endif
diff --git a/clang/test/Modules/Inputs/SafeBuffers/test_sub2.h b/clang/test/Modules/Inputs/SafeBuffers/test_sub2.h
new file mode 100644
index 0000000000000..11eb9591ab600
--- /dev/null
+++ b/clang/test/Modules/Inputs/SafeBuffers/test_sub2.h
@@ -0,0 +1,11 @@
+#include "base.h"
+
+#ifdef __cplusplus
+int sub2(int *p) {  
+  int x = p[5];
+#pragma clang unsafe_buffer_usage begin
+  int y = p[5];
+#pragma clang unsafe_buffer_usage end
+  return x + y;
+}
+#endif
diff --git a/clang/test/Modules/safe_buffers_optout.cpp b/clang/test/Modules/safe_buffers_optout.cpp
new file mode 100644
index 0000000000000..dd8b1cca5ec89
--- /dev/null
+++ b/clang/test/Modules/safe_buffers_optout.cpp
@@ -0,0 +1,39 @@
+// RUN: rm -rf %t
+// RUN: %clang_cc1 -fmodules -fimplicit-module-maps -emit-module -fmodule-name=safe_buffers_test_base -x c++ %S/Inputs/SafeBuffers/safe_buffers_test.modulemap -std=c++20\
+// RUN:     -o %t/safe_buffers_test_base.pcm -Wunsafe-buffer-usage
+// RUN: %clang_cc1 -fmodules -fimplicit-module-maps -emit-module -fmodule-name=safe_buffers_test_optout -x c++ %S/Inputs/SafeBuffers/safe_buffers_test.modulemap -std=c++20\
+// RUN:     -o %t/safe_buffers_test_optout.pcm -fmodule-file=%t/safe_buffers_test_base.pcm -Wunsafe-buffer-usage
+// RUN: %clang_cc1 -fmodules -fimplicit-module-maps -verify -fmodule-file=%t/safe_buffers_test_optout.pcm -I %S/Inputs/SafeBuffers %s\
+// RUN:     -std=c++20 -Wunsafe-buffer-usage
+
+// RUN: %clang_cc1 -fmodules -fimplicit-module-maps -emit-module -fmodules-cache-path=%t -fmodule-name=safe_buffers_test_base -x c++\
+// RUN:     %S/Inputs/SafeBuffers/safe_buffers_test.modulemap -std=c++20 -Wunsafe-buffer-usage
+// RUN: %clang_cc1 -fmodules -fimplicit-module-maps -emit-module -fmodules-cache-path=%t -fmodule-name=safe_buffers_test_optout -x c++\
+// RUN:     %S/Inputs/SafeBuffers/safe_buffers_test.modulemap -std=c++20 -Wunsafe-buffer-usage
+// RUN: %clang_cc1 -fmodules -fimplicit-module-maps -verify -fmodules-cache-path=%t -I %S/Inputs/SafeBuffers %s -x c++\
+// RUN:     -std=c++20 -Wunsafe-buffer-usage
+
+#include "test_sub1.h"
+#include "test_sub2.h"
+
+// Testing safe buffers opt-out region serialization with modules: this
+// file loads 2 submodules from top-level module
+// `safe_buffers_test_optout`, which uses another top-level module
+// `safe_buffers_test_base`. (So the module dependencies form a DAG.)
+
+// [email protected]:3{{unsafe buffer access}}
+// [email protected]:3{{pass -fsafe-buffer-usage-suggestions to receive code hardening suggestions}}
+// expected-warning@test_sub1.h:5{{unsafe buffer access}}
+// expected-note@test_sub1.h:5{{pass -fsafe-buffer-usage-suggestions to receive code hardening suggestions}}
+// expected-warning@test_sub1.h:14{{unsafe buffer access}}
+// expected-note@test_sub1.h:14{{pass -fsafe-buffer-usage-suggestions to receive code hardening suggestions}}
+// expected-warning@test_sub2.h:5{{unsafe buffer access}}
+// expected-note@test_sub2.h:5{{pass -fsafe-buffer-usage-suggestions to receive code hardening suggestions}}
+int foo(int * p) {
+  int x = p[5]; // expected-warning{{unsafe buffer access}} expected-note{{pass -fsafe-buffer-usage-suggestions to receive code hardening suggestions}}
+#pragma clang unsafe_buffer_usage begin
+  int y = p[5];
+#pragma clang unsafe_buffer_usage end
+  sub1_T(p); // instantiate template
+  return base(p) + sub1(p) + sub2(p);
+}
diff --git a/clang/test/SemaCXX/warn-unsafe-buffer-usage-pragma-pch-complex.cpp b/clang/test/SemaCXX/warn-unsafe-buffer-usage-pragma-pch-complex.cpp
new file mode 100644
index 0000000000000..04ac77b3f342e
--- /dev/null
+++ b/clang/test/SemaCXX/warn-unsafe-buffer-usage-pragma-pch-complex.cpp
@@ -0,0 +1,72 @@
+// A more complex example than warn-unsafe-buffer-usage-pragma-pch.cpp:
+//   MAIN - includes INC_H_1 
+//        \ loads    PCH_H_1 - includes INC_H_2
+//                           \ loads    PCH_H_2   
+
+// Test with PCH
+// RUN: %clang_cc1 -Wno-unused-value -std=c++20 -emit-pch -o %t-1 -DPCH_H_2 %s
+// RUN: %clang_cc1 -Wno-unused-value -std=c++20 -include-pch %t-1 -emit-pch -o %t-2 -DPCH_H_1 %s
+// RUN: %clang_cc1 -Wno-unused-value -Wunsafe-buffer-usage -std=c++20 -include-pch %t-2 -DMAIN -verify %s
+
+#define UNSAFE_BEGIN _Pragma("clang unsafe_buffer_usage begin")
+#define UNSAFE_END   _Pragma("clang unsafe_buffer_usage end")
+
+
+#ifdef INC_H_1
+#undef INC_H_1
+int a(int *s) {
+  s[2];  // <- expected warning here
+#pragma clang unsafe_buffer_usage begin
+  return s[1];
+#pragma clang unsafe_buffer_usage end
+}
+#endif
+
+#ifdef INC_H_2
+#undef INC_H_2
+int b(int *s) {
+  s[2];  // <- expected warning here  
+#pragma clang unsafe_buffer_usage begin
+  return s[1];
+#pragma clang unsafe_buffer_usage end
+}
+#endif
+
+#ifdef PCH_H_1
+#undef PCH_H_1
+#define INC_H_2
+#include "warn-unsafe-buffer-usage-pragma-pch-complex.cpp"
+
+int c(int *s) {
+  s[2];  // <- expected warning here  
+UNSAFE_BEGIN
+  return s[1];
+UNSAFE_END
+}
+#endif
+
+#ifdef PCH_H_2
+#undef PCH_H_2
+int d(int *s) {
+  s[2];  // <- expected warning here  
+#pragma clang unsafe_buffer_usage begin
+  return s[1];
+#pragma clang unsafe_buffer_usage end
+}
+#endif
+
+#ifdef MAIN
+#undef MAIN
+#define INC_H_1
+#include "warn-unsafe-buffer-usage-pragma-pch-complex.cpp"
+
+// expected-warning@-45{{unsafe buffer access}} expected-note@-45{{pass -fsafe-buffer-usage-suggestions to receive code hardening suggestions}}
+// expected-warning@-36{{unsafe buffer access}} expected-note@-36{{pass -fsafe-buffer-usage-suggestions to receive code hardening suggestions}}
+// expected-warning@-24{{unsafe buffer access}} expected-note@-24{{pass -fsafe-buffer-usage-suggestions to receive code hardening suggestions}}
+// expected-warning@-15{{unsafe buffer access}} expected-note@-15{{pass -fsafe-buffer-usage-suggestions to receive code hardening suggestions}}
+int main() {
+  int s[] = {1, 2, 3};
+  return a(s) + b(s) + c(s) + d(s);
+}
+#undef MAIN
+#endif
diff --git a/clang/test/SemaCXX/warn-unsafe-buffer-usage-pragma-pch.cpp b/clang/test/SemaCXX/warn-unsafe-buffer-usage-pragma-pch.cpp
new file mode 100644
index 0000000000000..abd3f0ffe9565
--- /dev/null
+++ b/clang/test/SemaCXX/warn-unsafe-buffer-usage-pragma-pch.cpp
@@ -0,0 +1,27 @@
+// The original example from https://github.com/llvm/llvm-project/issues/90501
+
+// Test without PCH
+// RUN: %clang_cc1 -Wno-unused-value -Wunsafe-buffer-usage -std=c++20 -include %s -verify %s
+// Test with PCH
+// RUN: %clang_cc1 -Wno-unused-value -std=c++20 -emit-pch -o %t %s
+// RUN: %clang_cc1 -Wno-unused-value -Wunsafe-buffer-usage -std=c++20 -include-pch %t -verify %s
+
+#ifndef A_H
+#define A_H
+
+int a(int *s) {
+  s[2];  // <- expected warning here
+#pragma clang unsafe_buffer_usage begin
+  return s[1];
+#pragma clang unsafe_buffer_usage end
+}
+
+#else
+// expected-warning@-7{{unsafe buffer access}}
+// expected-note@-8{{pass -fsafe-buffer-usage-suggestions to receive code hardening suggestions}}
+int main() {
+  int s[] = {1, 2, 3};
+  return a(s);
+}
+
+#endif

Copy link
Collaborator

@haoNoQ haoNoQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very happy that this is going somewhere! Everything makes sense to me but I also don't know a lot about this stuff.

During serialization, it only serializes regions of the current translation unit. Regions from loaded files are not serialized.

Hmm how does this work? When a precompiled header includes another precompiled header, do they both get loaded independently and then you get to combine the maps?

@@ -1551,6 +1567,58 @@ bool Preprocessor::isPPInSafeBufferOptOutRegion(SourceLocation &StartLoc) {
return InSafeBufferOptOutRegion;
}

SmallVector<SourceLocation, 64>
Preprocessor::serializeSafeBufferOptOutMap() const {
assert(!InSafeBufferOptOutRegion &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We require the pragma to be closed before the end of TU right? And that's a hard error, not a warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a warning.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah in this case we need to support the situation when the warning was ignored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'm sorry, they are not warnings, they are errors:

PP.Diag(Loc, diag::err_pp_double_begin_pragma_unsafe_buffer_usage);

@ziqingluo-90
Copy link
Contributor Author

I'm very happy that this is going somewhere! Everything makes sense to me but I also don't know a lot about this stuff.

During serialization, it only serializes regions of the current translation unit. Regions from loaded files are not serialized.

Hmm how does this work? When a precompiled header includes another precompiled header, do they both get loaded independently and then you get to combine the maps?

For loading multiple PCHs, it is a chain. Suppose TU includes pch B and B includes pch C. So one would compile and emit PCHs starting from C, then do it for B with C being loaded. When compile A, it includes pch B from the command line and clang actually loads C and B separately.

It is the same story for modules, except that modules form a DAG instead of a chain. Clang will load each module file in an order with respect to the DAG. So for each module/PCH emitted by clang, only their own opt-out regions need to be serialized.

@danakj
Copy link
Contributor

danakj commented May 15, 2024

I am curious so I will ask: why did this take an approach of serializing the output of the pragmas (the maps) instead of serializing the pragmas themselves as AST nodes, so that the maps would be constructed in the same way as from parsing the text directly? that's how other pragmas were serialized that I have seen.

@haoNoQ
Copy link
Collaborator

haoNoQ commented May 15, 2024

serializing the pragmas themselves as AST nodes

These pragmas don't really translate very well into the AST. Similarly to #pragma clang diagnostic, they can be placed weirdly "across" other AST nodes, like:

#pragma clang unsafe_buffer_usage begin
void foo() {
#pragma clang unsafe_buffer_usage end
}

We could support it in the same way that clang supports Duff's Device but I don't think this is really a good idea. We ultimately want them to be textual, a preprocessor-level construct, because that's what security audits are looking for: an extremely clear sign that this is where unchecked pointer operations live. So that they could tell it without looking at anything else in the code.

@ziqingluo-90 ziqingluo-90 changed the title [WIP][Safe Buffers] Serialize unsafe_buffer_usage pragmas [Safe Buffers] Serialize unsafe_buffer_usage pragmas May 15, 2024
@ziqingluo-90
Copy link
Contributor Author

serializing the pragmas themselves as AST nodes

These pragmas don't really translate very well into the AST. Similarly to #pragma clang diagnostic, ....

yeah, #pragma clang unsafe_buffer_usage begin/end is an extremely simpler implementation of #pragma clang diagnostic for -Wunsafe-buffer-usage that can be queried at the early stage of the analysis to avoid unnecessary analysis. That's why we didn't implement it into the existing diagnostic map mechanism.

@ziqingluo-90 ziqingluo-90 self-assigned this May 15, 2024
@danakj
Copy link
Contributor

danakj commented May 15, 2024

Thanks for the explanation :)

Throw an error if clang deserializes a safe buffer opt-out region that
begins and ends at different files.
@ziqingluo-90 ziqingluo-90 changed the title [Safe Buffers] Serialize unsafe_buffer_usage pragmas [WIP][Safe Buffers] Serialize unsafe_buffer_usage pragmas May 31, 2024
@ziqingluo-90
Copy link
Contributor Author

Just realized how stupid I am (O_o)---the current implementation only compares source locations with opt-out regions that are in the same source file, for all source locations from loaded files. This is not going to work for cases like below.

// in a pch header:
#pragma clang unsafe_buffer_usage begin
#include "other_header.h"  // -Wunsafe-buffer-usage in "other_header.h" is NOT suppressed !
#pragma clang unsafe_buffer_usage end

Add [WIP] back to this PR. More work is needed.

Figured out how to identify a loaded AST from a source location.  With
that, we can eliminate the restriction that a de-serialized safe
buffer opt-out region cannot span over multiple files. (Of course
those files need to be in the same TU.)
@ziqingluo-90
Copy link
Contributor Author

Learned more about how loaded files & ASTs are managed in clang. Now opt-out regions in a loaded AST can span over multiple files (within one TU). They now behave in a consistent way when they are from loaded files or local files.

@ziqingluo-90 ziqingluo-90 changed the title [WIP][Safe Buffers] Serialize unsafe_buffer_usage pragmas [Safe Buffers] Serialize unsafe_buffer_usage pragmas Jun 4, 2024
@ziqingluo-90 ziqingluo-90 requested a review from ChuanqiXu9 June 4, 2024 23:02
Copy link
Member

@ChuanqiXu9 ChuanqiXu9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Serialization changes look good to me. I'll leave the formal approve chance to others.

// RUN: %clang_cc1 -fmodules -fimplicit-module-maps -verify -fmodules-cache-path=%t -fmodule-map-file=%S/Inputs/SafeBuffers/safe_buffers_test.modulemap -I %S/Inputs/SafeBuffers %s\
// RUN: -x c++ -std=c++20 -Wunsafe-buffer-usage

#include "test_sub1.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we prefer using split-file in tests if there are multiple inputs. You can find examples in the same folder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comment! split-file is really neat!!

@@ -2439,6 +2440,12 @@ void ASTWriter::WritePreprocessor(const Preprocessor &PP, bool IsModule) {
Record.clear();
}

// Write the safe buffer opt-out region map in PP
for (SourceLocation &S : PP.serializeSafeBufferOptOutMap())
AddSourceLocation(std::move(S), Record);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we won't move source location.

if (!Map.empty() && Map.back().first == Map.back().second)
return SourceMgr.isBeforeInTranslationUnit(Map.back().first, Loc);
return false;
};

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok @ziqingluo-90 has just walked me through this offline and I think I understand what's happening now.

What was unobvious to me: shouldn't we TestInMap over the entire stack of includes? And the answer is:

  • Regular textual #includes will just show up in the current map normally; there's no need to go up the stack.
  • PCHes are going to be included with the -include-pch flag, not directly. There's not really much of a stack going on at this point and there's no way to "put pragmas around the compiler flag".
  • In case of modules, we don't care what this function returns deep inside a module, because we probably shouldn't be analyzing that code in the first place. Instead, the warning should be naturally enabled in the clang invocation that builds the module itself, and that's where the warning would be emitted, taking only the current module into account.
    • As a matter of fact, as of today we're analyzing that nested code. But we shouldn't be, that's not how modules are intended to work. We consider it a bug and have a separate plan to fix this. Right now this means that the warning will be emitted multiple times for the same line of code (both by module compilation and by .cpp file compilation, and possibly by everything in between). And the worst that could happen if this function misbehaves, is that some of these duplicates will disappear. Which is annoying but harmless.

Let's add a comment explaining this!


auto [FID, Ignore] = getDecomposedLoc(Loc);
const FileID *FirstFID =
llvm::lower_bound(LoadedSLocEntryAllocBegin, FID, std::greater<FileID>{});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite magical to me because I don't understand how FileIDs are assigned and in what order they show up in the LoadedSLocEntryAllocBegin list. But that's probably ok, this looks like completely normal SourceManager code to me.

@haoNoQ
Copy link
Collaborator

haoNoQ commented Jun 13, 2024

Ok I think I'm completely happy with the patch now. It makes perfect sense to me and appears to be doing the right thing. Thank you @ziqingluo-90 for figuring this out! Let's add that comment and land.

const FileID *FirstFID =
llvm::lower_bound(LoadedSLocEntryAllocBegin, FID, std::greater<FileID>{});

assert(FirstFID &&
"The failure to find the first FileID of a "
"loaded AST from a loaded source location was unexpected.");
return *FirstFID;
return FirstFID->getHashValue();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm can we go back to returning FileID here? We really shouldn't rely on the fact that the hash value is equal to the actual thing we're looking for.

Copy link
Collaborator

@haoNoQ haoNoQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok LGTM!! Thanks again for figuring this all out!

@ziqingluo-90 ziqingluo-90 merged commit 2e7b95e into llvm:main Jun 14, 2024
4 of 7 checks passed
ziqingluo-90 added a commit to ziqingluo-90/apple-llvm-project that referenced this pull request Jun 14, 2024
The commit adds serialization and de-serialization implementations for
the stored regions. Basically, the serialized representation of the
regions of a PP is a (ordered) sequence of source location encodings.
For de-serialization, regions from loaded files are stored by their ASTs.
When later one queries if a loaded location L is in an opt-out
region, PP looks up the regions of the loaded AST where L is at.

(Background if helps: a pair of `#pragma clang unsafe_buffer_usage begin/end` pragmas marks a
warning-opt-out region. The begin and end locations (opt-out regions)
are stored in preprocessor instances (PP) and will be queried by the
`-Wunsafe-buffer-usage` analyzer.)

The reported issue at upstream: llvm#90501
rdar://124035402

(cherry picked from commit 2e7b95e)
tbkka added a commit to swiftlang/llvm-project that referenced this pull request Jun 20, 2024
…lease/6.0

🍒[Safe Buffers] Serialize unsafe_buffer_usage pragmas  (llvm#92031)
EthanLuisMcDonough pushed a commit to EthanLuisMcDonough/llvm-project that referenced this pull request Aug 13, 2024
The commit adds serialization and de-serialization implementations for
the stored regions. Basically, the serialized representation of the
regions of a PP is a (ordered) sequence of source location encodings.
For de-serialization, regions from loaded files are stored by their ASTs.
When later one queries if a loaded location L is in an opt-out
region, PP looks up the regions of the loaded AST where L is at.

(Background if helps: a pair of `#pragma clang unsafe_buffer_usage begin/end` pragmas marks a
warning-opt-out region. The begin and end locations (opt-out regions)
are stored in preprocessor instances (PP) and will be queried by the
`-Wunsafe-buffer-usage` analyzer.)

The reported issue at upstream: llvm#90501
rdar://124035402
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:modules C++20 modules and Clang Header Modules clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants