Skip to content

[TySan] Add initial Type Sanitizer (LLVM) #76259

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Dec 17, 2024

Conversation

fhahn
Copy link
Contributor

@fhahn fhahn commented Dec 22, 2023

This patch introduces the LLVM components of a type sanitizer: a sanitizer for type-based aliasing violations.

It is based on Hal Finkel's https://reviews.llvm.org/D32198.

C/C++ have type-based aliasing rules, and LLVM's optimizer can exploit these given TBAA metadata added by Clang. Roughly, a pointer of given type cannot be used to access an object of a different type (with, of course, certain exceptions). Unfortunately, there's a lot of code in the wild that violates these rules (e.g. for type punning), and such code often must be built with -fno-strict-aliasing. Performance is often sacrificed as a result. Part of the problem is the difficulty of finding TBAA violations. Hopefully, this sanitizer will help.

For each TBAA type-access descriptor, encoded in LLVM's IR using metadata, the corresponding instrumentation pass generates descriptor tables. Thus, for each type (and access descriptor), we have a unique pointer representation. Excepting anonymous-namespace types, these tables are comdat, so the pointer values should be unique across the program. The descriptors refer to other descriptors to form a type aliasing tree (just like LLVM's TBAA metadata does). The instrumentation handles the "fast path" (where the types match exactly and no partial-overlaps are detected), and defers to the runtime to handle all of the more-complicated cases. The runtime, of course, is also responsible for reporting errors when those are detected.

The runtime uses essentially the same shadow memory region as tsan, and we use 8 bytes of shadow memory, the size of the pointer to the type descriptor, for every byte of accessed data in the program. The value 0 is used to represent an unknown type. The value -1 is used to represent an interior byte (a byte that is part of a type, but not the first byte). The instrumentation first checks for an exact match between the type of the current access and the type for that address recorded in the shadow memory. If it matches, it then checks the shadow for the remainder of the bytes in the type to make sure that they're all -1. If not, we call the runtime. If the exact match fails, we next check if the value is 0 (i.e. unknown). If it is, then we check the shadow for the remainder of the byes in the type (to make sure they're all 0). If they're not, we call the runtime. We then set the shadow for the access address and set the shadow for the remaining bytes in the type to -1 (i.e. marking them as interior bytes). If the type indicated by the shadow memory for the access address is neither an exact match nor 0, we call the runtime.

The instrumentation pass inserts calls to the memset intrinsic to set the memory updated by memset, memcpy, and memmove, as well as allocas/byval (and for lifetime.start/end) to reset the shadow memory to reflect that the type is now unknown. The runtime intercepts memset, memcpy, etc. to perform the same function for the library calls.

The runtime essentially repeats these checks, but uses the full TBAA algorithm, just as the compiler does, to determine when two types are permitted to alias. In a situation where access overlap has occurred and aliasing is not permitted, an error is generated.

Clang's TBAA representation currently has a problem representing unions, as demonstrated by the one XFAIL'd test in the runtime patch. We'll update the TBAA representation to fix this, and at the same time, update the sanitizer.

When the sanitizer is active, we disable actually using the TBAA metadata for AA. This way we're less likely to use TBAA to remove memory accesses that we'd like to verify.

As a note, this implementation does not use the compressed shadow-memory scheme discussed previously (http://lists.llvm.org/pipermail/llvm-dev/2017-April/111766.html). That scheme would not handle the struct-path (i.e. structure offset) information that our TBAA represents. I expect we'll want to further work on compressing the shadow-memory representation, but I think it makes sense to do that as follow-up work.

It goes together with the corresponding clang changes (#76260) and compiler-rt changes (#76261)

@llvmbot llvmbot added compiler-rt:sanitizer llvm:ir llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms labels Dec 22, 2023
@llvmbot
Copy link
Member

llvmbot commented Dec 22, 2023

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-compiler-rt-sanitizer

Author: Florian Hahn (fhahn)

Changes

This patch introduces the LLVM components of a type sanitizer: a sanitizer for type-based aliasing violations.

C/C++ have type-based aliasing rules, and LLVM's optimizer can exploit these given TBAA metadata added by Clang. Roughly, a pointer of given type cannot be used to access an object of a different type (with, of course, certain exceptions). Unfortunately, there's a lot of code in the wild that violates these rules (e.g. for type punning), and such code often must be built with -fno-strict-aliasing. Performance is often sacrificed as a result. Part of the problem is the difficulty of finding TBAA violations. Hopefully, this sanitizer will help.

For each TBAA type-access descriptor, encoded in LLVM's IR using metadata, the corresponding instrumentation pass generates descriptor tables. Thus, for each type (and access descriptor), we have a unique pointer representation. Excepting anonymous-namespace types, these tables are comdat, so the pointer values should be unique across the program. The descriptors refer to other descriptors to form a type aliasing tree (just like LLVM's TBAA metadata does). The instrumentation handles the "fast path" (where the types match exactly and no partial-overlaps are detected), and defers to the runtime to handle all of the more-complicated cases. The runtime, of course, is also responsible for reporting errors when those are detected.

The runtime uses essentially the same shadow memory region as tsan, and we use 8 bytes of shadow memory, the size of the pointer to the type descriptor, for every byte of accessed data in the program. The value 0 is used to represent an unknown type. The value -1 is used to represent an interior byte (a byte that is part of a type, but not the first byte). The instrumentation first checks for an exact match between the type of the current access and the type for that address recorded in the shadow memory. If it matches, it then checks the shadow for the remainder of the bytes in the type to make sure that they're all -1. If not, we call the runtime. If the exact match fails, we next check if the value is 0 (i.e. unknown). If it is, then we check the shadow for the remainder of the byes in the type (to make sure they're all 0). If they're not, we call the runtime. We then set the shadow for the access address and set the shadow for the remaining bytes in the type to -1 (i.e. marking them as interior bytes). If the type indicated by the shadow memory for the access address is neither an exact match nor 0, we call the runtime.

The instrumentation pass inserts calls to the memset intrinsic to set the memory updated by memset, memcpy, and memmove, as well as allocas/byval (and for lifetime.start/end) to reset the shadow memory to reflect that the type is now unknown. The runtime intercepts memset, memcpy, etc. to perform the same function for the library calls.

The runtime essentially repeats these checks, but uses the full TBAA algorithm, just as the compiler does, to determine when two types are permitted to alias. In a situation where access overlap has occurred and aliasing is not permitted, an error is generated.

Clang's TBAA representation currently has a problem representing unions, as demonstrated by the one XFAIL'd test in the runtime patch. We'll update the TBAA representation to fix this, and at the same time, update the sanitizer.

When the sanitizer is active, we disable actually using the TBAA metadata for AA. This way we're less likely to use TBAA to remove memory accesses that we'd like to verify.

As a note, this implementation does not use the compressed shadow-memory scheme discussed previously (http://lists.llvm.org/pipermail/llvm-dev/2017-April/111766.html). That scheme would not handle the struct-path (i.e. structure offset) information that our TBAA represents. I expect we'll want to further work on compressing the shadow-memory representation, but I think it makes sense to do that as follow-up work.
[TySan] LLVM part.

Move from https://reviews.llvm.org/D32198


Patch is 115.26 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/76259.diff

24 Files Affected:

  • (modified) llvm/include/llvm/Bitcode/LLVMBitCodes.h (+1)
  • (modified) llvm/include/llvm/IR/Attributes.td (+4)
  • (added) llvm/include/llvm/Transforms/Instrumentation/TypeSanitizer.h (+38)
  • (modified) llvm/lib/Analysis/TypeBasedAliasAnalysis.cpp (+22-6)
  • (modified) llvm/lib/Bitcode/Reader/BitcodeReader.cpp (+2)
  • (modified) llvm/lib/Bitcode/Writer/BitcodeWriter.cpp (+2)
  • (modified) llvm/lib/CodeGen/ShrinkWrap.cpp (+1)
  • (modified) llvm/lib/Passes/PassBuilder.cpp (+1)
  • (modified) llvm/lib/Passes/PassRegistry.def (+2)
  • (modified) llvm/lib/Transforms/Instrumentation/CMakeLists.txt (+1)
  • (added) llvm/lib/Transforms/Instrumentation/TypeSanitizer.cpp (+873)
  • (modified) llvm/lib/Transforms/Utils/CodeExtractor.cpp (+1)
  • (added) llvm/test/Instrumentation/TypeSanitizer/access-with-offfset.ll (+71)
  • (added) llvm/test/Instrumentation/TypeSanitizer/alloca.ll (+29)
  • (added) llvm/test/Instrumentation/TypeSanitizer/anon.ll (+283)
  • (added) llvm/test/Instrumentation/TypeSanitizer/basic-nosan.ll (+93)
  • (added) llvm/test/Instrumentation/TypeSanitizer/basic.ll (+214)
  • (added) llvm/test/Instrumentation/TypeSanitizer/byval.ll (+88)
  • (added) llvm/test/Instrumentation/TypeSanitizer/globals.ll (+66)
  • (added) llvm/test/Instrumentation/TypeSanitizer/invalid-metadata.ll (+25)
  • (added) llvm/test/Instrumentation/TypeSanitizer/memintrinsics.ll (+77)
  • (added) llvm/test/Instrumentation/TypeSanitizer/nosanitize.ll (+39)
  • (added) llvm/test/Instrumentation/TypeSanitizer/sanitize-no-tbaa.ll (+180)
  • (added) llvm/test/Instrumentation/TypeSanitizer/swifterror.ll (+24)
diff --git a/llvm/include/llvm/Bitcode/LLVMBitCodes.h b/llvm/include/llvm/Bitcode/LLVMBitCodes.h
index c6f0ddf29a6da8..8d3259336abba9 100644
--- a/llvm/include/llvm/Bitcode/LLVMBitCodes.h
+++ b/llvm/include/llvm/Bitcode/LLVMBitCodes.h
@@ -724,6 +724,7 @@ enum AttributeKindCodes {
   ATTR_KIND_WRITABLE = 89,
   ATTR_KIND_CORO_ONLY_DESTROY_WHEN_COMPLETE = 90,
   ATTR_KIND_DEAD_ON_UNWIND = 91,
+  ATTR_KIND_SANITIZE_TYPE = 92,
 };
 
 enum ComdatSelectionKindCodes {
diff --git a/llvm/include/llvm/IR/Attributes.td b/llvm/include/llvm/IR/Attributes.td
index 864f87f3383891..beffa63bd81298 100644
--- a/llvm/include/llvm/IR/Attributes.td
+++ b/llvm/include/llvm/IR/Attributes.td
@@ -270,6 +270,9 @@ def SanitizeAddress : EnumAttr<"sanitize_address", [FnAttr]>;
 /// ThreadSanitizer is on.
 def SanitizeThread : EnumAttr<"sanitize_thread", [FnAttr]>;
 
+/// TypeSanitizer is on.
+def SanitizeType : EnumAttr<"sanitize_type", [FnAttr]>;
+
 /// MemorySanitizer is on.
 def SanitizeMemory : EnumAttr<"sanitize_memory", [FnAttr]>;
 
@@ -354,6 +357,7 @@ def : CompatRule<"isEqual<SanitizeThreadAttr>">;
 def : CompatRule<"isEqual<SanitizeMemoryAttr>">;
 def : CompatRule<"isEqual<SanitizeHWAddressAttr>">;
 def : CompatRule<"isEqual<SanitizeMemTagAttr>">;
+def : CompatRule<"isEqual<SanitizeTypeAttr>">;
 def : CompatRule<"isEqual<SafeStackAttr>">;
 def : CompatRule<"isEqual<ShadowCallStackAttr>">;
 def : CompatRule<"isEqual<UseSampleProfileAttr>">;
diff --git a/llvm/include/llvm/Transforms/Instrumentation/TypeSanitizer.h b/llvm/include/llvm/Transforms/Instrumentation/TypeSanitizer.h
new file mode 100644
index 00000000000000..a6cc56df35f14d
--- /dev/null
+++ b/llvm/include/llvm/Transforms/Instrumentation/TypeSanitizer.h
@@ -0,0 +1,38 @@
+//===- Transforms/Instrumentation/TypeSanitizer.h - TySan Pass -----------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines the type sanitizer pass.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_TRANSFORMS_INSTRUMENTATION_TYPESANITIZER_H
+#define LLVM_TRANSFORMS_INSTRUMENTATION_TYPESANITIZER_H
+
+#include "llvm/IR/PassManager.h"
+
+namespace llvm {
+class Function;
+class FunctionPass;
+class Module;
+
+/// A function pass for tysan instrumentation.
+struct TypeSanitizerPass : public PassInfoMixin<TypeSanitizerPass> {
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM);
+  static bool isRequired() { return true; }
+};
+
+/// A module pass for tysan instrumentation.
+///
+/// Create ctor and init functions.
+struct ModuleTypeSanitizerPass : public PassInfoMixin<ModuleTypeSanitizerPass> {
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+  static bool isRequired() { return true; }
+};
+
+} // namespace llvm
+#endif /* LLVM_TRANSFORMS_INSTRUMENTATION_TYPESANITIZER_H */
diff --git a/llvm/lib/Analysis/TypeBasedAliasAnalysis.cpp b/llvm/lib/Analysis/TypeBasedAliasAnalysis.cpp
index e4dc1a867f6f0c..bfa8a80518ccda 100644
--- a/llvm/lib/Analysis/TypeBasedAliasAnalysis.cpp
+++ b/llvm/lib/Analysis/TypeBasedAliasAnalysis.cpp
@@ -371,11 +371,27 @@ static bool isStructPathTBAA(const MDNode *MD) {
   return isa<MDNode>(MD->getOperand(0)) && MD->getNumOperands() >= 3;
 }
 
+// When using the TypeSanitizer, don't use TBAA information for alias analysis.
+// This might cause us to remove memory accesses that we need to verify at
+// runtime.
+static bool usingSanitizeType(const Value *V) {
+  const Function *F;
+
+  if (auto *I = dyn_cast<Instruction>(V))
+    F = I->getParent()->getParent();
+  else if (auto *A = dyn_cast<Argument>(V))
+    F = A->getParent();
+  else
+    return false;
+
+  return F->hasFnAttribute(Attribute::SanitizeType);
+}
+
 AliasResult TypeBasedAAResult::alias(const MemoryLocation &LocA,
                                      const MemoryLocation &LocB,
                                      AAQueryInfo &AAQI, const Instruction *) {
-  if (!EnableTBAA)
-    return AliasResult::MayAlias;
+  if (!EnableTBAA || usingSanitizeType(LocA.Ptr) || usingSanitizeType(LocB.Ptr))
+    return AAResultBase::alias(LocA, LocB, AAQI, nullptr);
 
   if (Aliases(LocA.AATags.TBAA, LocB.AATags.TBAA))
     return AliasResult::MayAlias;
@@ -425,8 +441,8 @@ MemoryEffects TypeBasedAAResult::getMemoryEffects(const Function *F) {
 ModRefInfo TypeBasedAAResult::getModRefInfo(const CallBase *Call,
                                             const MemoryLocation &Loc,
                                             AAQueryInfo &AAQI) {
-  if (!EnableTBAA)
-    return ModRefInfo::ModRef;
+  if (!EnableTBAA || usingSanitizeType(Call))
+    return AAResultBase::getModRefInfo(Call, Loc, AAQI);
 
   if (const MDNode *L = Loc.AATags.TBAA)
     if (const MDNode *M = Call->getMetadata(LLVMContext::MD_tbaa))
@@ -439,8 +455,8 @@ ModRefInfo TypeBasedAAResult::getModRefInfo(const CallBase *Call,
 ModRefInfo TypeBasedAAResult::getModRefInfo(const CallBase *Call1,
                                             const CallBase *Call2,
                                             AAQueryInfo &AAQI) {
-  if (!EnableTBAA)
-    return ModRefInfo::ModRef;
+  if (!EnableTBAA || usingSanitizeType(Call1))
+    return AAResultBase::getModRefInfo(Call1, Call2, AAQI);
 
   if (const MDNode *M1 = Call1->getMetadata(LLVMContext::MD_tbaa))
     if (const MDNode *M2 = Call2->getMetadata(LLVMContext::MD_tbaa))
diff --git a/llvm/lib/Bitcode/Reader/BitcodeReader.cpp b/llvm/lib/Bitcode/Reader/BitcodeReader.cpp
index 8907f6fa4ff3fd..31e2d74fe570e8 100644
--- a/llvm/lib/Bitcode/Reader/BitcodeReader.cpp
+++ b/llvm/lib/Bitcode/Reader/BitcodeReader.cpp
@@ -2058,6 +2058,8 @@ static Attribute::AttrKind getAttrFromCode(uint64_t Code) {
     return Attribute::SanitizeHWAddress;
   case bitc::ATTR_KIND_SANITIZE_THREAD:
     return Attribute::SanitizeThread;
+  case bitc::ATTR_KIND_SANITIZE_TYPE:
+    return Attribute::SanitizeType;
   case bitc::ATTR_KIND_SANITIZE_MEMORY:
     return Attribute::SanitizeMemory;
   case bitc::ATTR_KIND_SPECULATIVE_LOAD_HARDENING:
diff --git a/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp b/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
index 8fca569a391baf..c925c367d17aaf 100644
--- a/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
+++ b/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
@@ -789,6 +789,8 @@ static uint64_t getAttrKindEncoding(Attribute::AttrKind Kind) {
     return bitc::ATTR_KIND_SANITIZE_HWADDRESS;
   case Attribute::SanitizeThread:
     return bitc::ATTR_KIND_SANITIZE_THREAD;
+  case Attribute::SanitizeType:
+    return bitc::ATTR_KIND_SANITIZE_TYPE;
   case Attribute::SanitizeMemory:
     return bitc::ATTR_KIND_SANITIZE_MEMORY;
   case Attribute::SpeculativeLoadHardening:
diff --git a/llvm/lib/CodeGen/ShrinkWrap.cpp b/llvm/lib/CodeGen/ShrinkWrap.cpp
index ab57d08e527e4d..f4e6f8001debf2 100644
--- a/llvm/lib/CodeGen/ShrinkWrap.cpp
+++ b/llvm/lib/CodeGen/ShrinkWrap.cpp
@@ -984,6 +984,7 @@ bool ShrinkWrap::isShrinkWrapEnabled(const MachineFunction &MF) {
            !(MF.getFunction().hasFnAttribute(Attribute::SanitizeAddress) ||
              MF.getFunction().hasFnAttribute(Attribute::SanitizeThread) ||
              MF.getFunction().hasFnAttribute(Attribute::SanitizeMemory) ||
+             MF.getFunction().hasFnAttribute(Attribute::SanitizeType) ||
              MF.getFunction().hasFnAttribute(Attribute::SanitizeHWAddress));
   // If EnableShrinkWrap is set, it takes precedence on whatever the
   // target sets. The rational is that we assume we want to test
diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index f94bd422c6b592..fb56a65e41ec99 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -167,6 +167,7 @@
 #include "llvm/Transforms/Instrumentation/SanitizerBinaryMetadata.h"
 #include "llvm/Transforms/Instrumentation/SanitizerCoverage.h"
 #include "llvm/Transforms/Instrumentation/ThreadSanitizer.h"
+#include "llvm/Transforms/Instrumentation/TypeSanitizer.h"
 #include "llvm/Transforms/ObjCARC.h"
 #include "llvm/Transforms/Scalar/ADCE.h"
 #include "llvm/Transforms/Scalar/AlignmentFromAssumptions.h"
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index 82ce040c649626..5c9e16435f0ae9 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -138,6 +138,7 @@ MODULE_PASS("synthetic-counts-propagation", SyntheticCountsPropagation())
 MODULE_PASS("trigger-crash", TriggerCrashPass())
 MODULE_PASS("trigger-verifier-error", TriggerVerifierErrorPass())
 MODULE_PASS("tsan-module", ModuleThreadSanitizerPass())
+MODULE_PASS("tysan-module", ModuleTypeSanitizerPass())
 MODULE_PASS("verify", VerifierPass())
 MODULE_PASS("view-callgraph", CallGraphViewerPass())
 MODULE_PASS("wholeprogramdevirt", WholeProgramDevirtPass())
@@ -417,6 +418,7 @@ FUNCTION_PASS("tlshoist", TLSVariableHoistPass())
 FUNCTION_PASS("transform-warning", WarnMissedTransformationsPass())
 FUNCTION_PASS("trigger-verifier-error", TriggerVerifierErrorPass())  
 FUNCTION_PASS("tsan", ThreadSanitizerPass())
+FUNCTION_PASS("tysan", TypeSanitizerPass())
 FUNCTION_PASS("typepromotion", TypePromotionPass(TM))
 FUNCTION_PASS("unify-loop-exits", UnifyLoopExitsPass())
 FUNCTION_PASS("vector-combine", VectorCombinePass())
diff --git a/llvm/lib/Transforms/Instrumentation/CMakeLists.txt b/llvm/lib/Transforms/Instrumentation/CMakeLists.txt
index 424f1d43360677..54211236e18643 100644
--- a/llvm/lib/Transforms/Instrumentation/CMakeLists.txt
+++ b/llvm/lib/Transforms/Instrumentation/CMakeLists.txt
@@ -20,6 +20,7 @@ add_llvm_component_library(LLVMInstrumentation
   SanitizerBinaryMetadata.cpp
   ValueProfileCollector.cpp
   ThreadSanitizer.cpp
+  TypeSanitizer.cpp
   HWAddressSanitizer.cpp
 
   ADDITIONAL_HEADER_DIRS
diff --git a/llvm/lib/Transforms/Instrumentation/TypeSanitizer.cpp b/llvm/lib/Transforms/Instrumentation/TypeSanitizer.cpp
new file mode 100644
index 00000000000000..ed4aba4ad612d9
--- /dev/null
+++ b/llvm/lib/Transforms/Instrumentation/TypeSanitizer.cpp
@@ -0,0 +1,873 @@
+//===----- TypeSanitizer.cpp - type-based-aliasing-violation detector -----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file is a part of TypeSanitizer, a type-based-aliasing-violation
+// detector.
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Transforms/Instrumentation/TypeSanitizer.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallSet.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/Statistic.h"
+#include "llvm/ADT/StringExtras.h"
+#include "llvm/Analysis/MemoryLocation.h"
+#include "llvm/Analysis/TargetLibraryInfo.h"
+#include "llvm/IR/DataLayout.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/LLVMContext.h"
+#include "llvm/IR/MDBuilder.h"
+#include "llvm/IR/Metadata.h"
+#include "llvm/IR/Module.h"
+#include "llvm/IR/Type.h"
+#include "llvm/ProfileData/InstrProf.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/MD5.h"
+#include "llvm/Support/MathExtras.h"
+#include "llvm/Support/Regex.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Transforms/Instrumentation.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+#include "llvm/Transforms/Utils/Local.h"
+#include "llvm/Transforms/Utils/ModuleUtils.h"
+
+#include <cctype>
+
+using namespace llvm;
+
+#define DEBUG_TYPE "tysan"
+
+static const char *const kTysanModuleCtorName = "tysan.module_ctor";
+static const char *const kTysanInitName = "__tysan_init";
+static const char *const kTysanCheckName = "__tysan_check";
+static const char *const kTysanGVNamePrefix = "__tysan_v1_";
+
+static const char *const kTysanShadowMemoryAddress =
+    "__tysan_shadow_memory_address";
+static const char *const kTysanAppMemMask = "__tysan_app_memory_mask";
+
+static cl::opt<bool>
+    ClWritesAlwaysSetType("tysan-writes-always-set-type",
+                          cl::desc("Writes always set the type"), cl::Hidden,
+                          cl::init(false));
+
+STATISTIC(NumInstrumentedAccesses, "Number of instrumented accesses");
+
+static Regex AnonNameRegex("^_ZTS.*N[1-9][0-9]*_GLOBAL__N");
+
+namespace {
+
+/// TypeSanitizer: instrument the code in module to find  type-based aliasing
+/// violations.
+struct TypeSanitizer {
+  TypeSanitizer(Module &M);
+  bool run(Function &F, const TargetLibraryInfo &TLI);
+  void instrumentGlobals();
+
+private:
+  typedef SmallDenseMap<const MDNode *, GlobalVariable *, 8>
+      TypeDescriptorsMapTy;
+  typedef SmallDenseMap<const MDNode *, std::string, 8> TypeNameMapTy;
+
+  void initializeCallbacks(Module &M);
+
+  Value *getShadowBase(Function &F);
+  Value *getAppMemMask(Function &F);
+
+  bool instrumentWithShadowUpdate(IRBuilder<> &IRB, const MDNode *TBAAMD,
+                                  Value *Ptr, uint64_t AccessSize, bool IsRead,
+                                  bool IsWrite, Value *&ShadowBase,
+                                  Value *&AppMemMask, bool ForceSetType,
+                                  bool SanitizeFunction,
+                                  TypeDescriptorsMapTy &TypeDescriptors,
+                                  const DataLayout &DL);
+  bool instrumentMemoryAccess(Instruction *I, MemoryLocation &MLoc,
+                              Value *&ShadowBase, Value *&AppMemMask,
+                              bool SanitizeFunction,
+                              TypeDescriptorsMapTy &TypeDescriptors,
+                              const DataLayout &DL);
+  bool instrumentMemInst(Value *I, Value *&ShadowBase, Value *&AppMemMask,
+                         const DataLayout &DL);
+
+  std::string getAnonymousStructIdentifier(const MDNode *MD,
+                                           TypeNameMapTy &TypeNames);
+  bool generateTypeDescriptor(const MDNode *MD,
+                              TypeDescriptorsMapTy &TypeDescriptors,
+                              TypeNameMapTy &TypeNames, Module &M);
+  bool generateBaseTypeDescriptor(const MDNode *MD,
+                                  TypeDescriptorsMapTy &TypeDescriptors,
+                                  TypeNameMapTy &TypeNames, Module &M);
+
+  const Triple TargetTriple;
+  Type *IntptrTy;
+  uint64_t PtrShift;
+  IntegerType *OrdTy;
+
+  // Callbacks to run-time library are computed in doInitialization.
+  Function *TysanCheck;
+  Function *TysanCtorFunction;
+  Function *TysanGlobalsSetTypeFunction;
+};
+} // namespace
+
+TypeSanitizer::TypeSanitizer(Module &M)
+    : TargetTriple(Triple(M.getTargetTriple())) {
+  const DataLayout &DL = M.getDataLayout();
+  IntptrTy = DL.getIntPtrType(M.getContext());
+  PtrShift = countr_zero(IntptrTy->getPrimitiveSizeInBits() / 8);
+
+  TysanGlobalsSetTypeFunction = M.getFunction("__tysan_set_globals_types");
+  initializeCallbacks(M);
+}
+
+void TypeSanitizer::initializeCallbacks(Module &M) {
+  IRBuilder<> IRB(M.getContext());
+  OrdTy = IRB.getInt32Ty();
+
+  AttributeList Attr;
+  Attr = Attr.addFnAttribute(M.getContext(), Attribute::NoUnwind);
+  // Initialize the callbacks.
+  TysanCheck = cast<Function>(
+      M.getOrInsertFunction(kTysanCheckName, Attr, IRB.getVoidTy(),
+                            IRB.getPtrTy(), // Pointer to data to be read.
+                            OrdTy,              // Size of the data in bytes.
+                            IRB.getPtrTy(), // Pointer to type descriptor.
+                            OrdTy               // Flags.
+                            )
+          .getCallee());
+
+  TysanCtorFunction = cast<Function>(
+      M.getOrInsertFunction(kTysanModuleCtorName, Attr, IRB.getVoidTy())
+          .getCallee());
+}
+
+void TypeSanitizer::instrumentGlobals() {
+  Module &M = *TysanCtorFunction->getParent();
+  initializeCallbacks(M);
+  TysanGlobalsSetTypeFunction = nullptr;
+
+  NamedMDNode *Globals = M.getNamedMetadata("llvm.tysan.globals");
+  if (!Globals)
+    return;
+
+  const DataLayout &DL = M.getDataLayout();
+  Value *ShadowBase = nullptr, *AppMemMask = nullptr;
+  TypeDescriptorsMapTy TypeDescriptors;
+  TypeNameMapTy TypeNames;
+
+  for (const auto &GMD : Globals->operands()) {
+    auto *GV = mdconst::dyn_extract_or_null<GlobalVariable>(GMD->getOperand(0));
+    if (!GV)
+      continue;
+    const MDNode *TBAAMD = cast<MDNode>(GMD->getOperand(1));
+    if (!generateBaseTypeDescriptor(TBAAMD, TypeDescriptors, TypeNames, M))
+      continue;
+
+    if (!TysanGlobalsSetTypeFunction) {
+      TysanGlobalsSetTypeFunction = Function::Create(
+          FunctionType::get(Type::getVoidTy(M.getContext()), false),
+          GlobalValue::InternalLinkage, "__tysan_set_globals_types", &M);
+      BasicBlock *BB =
+          BasicBlock::Create(M.getContext(), "", TysanGlobalsSetTypeFunction);
+      ReturnInst::Create(M.getContext(), BB);
+    }
+
+    IRBuilder<> IRB(
+        TysanGlobalsSetTypeFunction->getEntryBlock().getTerminator());
+    Type *AccessTy = GV->getValueType();
+    assert(AccessTy->isSized());
+    uint64_t AccessSize = DL.getTypeStoreSize(AccessTy);
+    instrumentWithShadowUpdate(IRB, TBAAMD, GV, AccessSize, false, false,
+                               ShadowBase, AppMemMask, true, false,
+                               TypeDescriptors, DL);
+  }
+
+  if (TysanGlobalsSetTypeFunction) {
+    IRBuilder<> IRB(TysanCtorFunction->getEntryBlock().getTerminator());
+    IRB.CreateCall(TysanGlobalsSetTypeFunction, {});
+  }
+}
+
+static void insertModuleCtor(Module &M) {
+  Function *TysanCtorFunction;
+  std::tie(TysanCtorFunction, std::ignore) =
+      createSanitizerCtorAndInitFunctions(M, kTysanModuleCtorName,
+                                          kTysanInitName, /*InitArgTypes=*/{},
+                                          /*InitArgs=*/{});
+
+  TypeSanitizer TySan(M);
+  TySan.instrumentGlobals();
+  appendToGlobalCtors(M, TysanCtorFunction, 0);
+}
+
+static const char LUT[] = "0123456789abcdef";
+
+static std::string encodeName(StringRef Name) {
+  size_t Length = Name.size();
+  std::string Output = kTysanGVNamePrefix;
+  Output.reserve(Output.size() + 3 * Length);
+  for (size_t i = 0; i < Length; ++i) {
+    const unsigned char c = Name[i];
+    if (isalnum((int)c)) {
+      Output.push_back(c);
+      continue;
+    }
+
+    if (c == '_') {
+      Output.append("__");
+      continue;
+    }
+
+    Output.push_back('_');
+    Output.push_back(LUT[c >> 4]);
+    Output.push_back(LUT[c & 15]);
+  }
+
+  return Output;
+}
+
+static bool isAnonymousNamespaceName(StringRef Name) {
+  // Types that are in an anonymous namespace are local to this module.
+  // FIXME: This should really be marked by the frontend in the metadata
+  // instead of having us guess this from the mangled name. Moreover, the regex
+  // here can pick up (unlikely) names in the non-reserved namespace (because
+  // it needs to search into the type to pick up cases where the type in the
+  // anonymous namespace is a template parameter, etc.).
+  return AnonNameRegex.match(Name);
+}
+
+std::string
+TypeSanitizer::getAnonymousStructIdentifier(const MDNode *MD,
+                                            TypeNameMapTy &TypeNames) {
+  MD5 Hash;
+
+  for (int i = 1, e = MD->getNumOperands(); i < e; i += 2) {
+    const MDNode *MemberNode = dyn_cast<MDNode>(MD->getOperand(i));
+    if (!MemberNode)
+      return "";
+
+    auto TNI = TypeNames.find(MemberNode);
+    std::string MemberName;
+    if (TNI != TypeNames.end()) {
+      MemberName = TNI->second;
+    } else {
+      if (MemberNode->getNumOperands() < 1)
+        return "";
+      MDString *MemberNameNode = dyn_cast<MDString>(MemberNode->getOperand(0));
+      if (!MemberNameNode)
+        return "";
+      MemberName = MemberNameNode->getString().str();
+      if (MemberName.empty())
+        MemberName = getAnonymousStructIdentifier(MemberNode, TypeNames);
+      if (MemberName.empty())
+        r...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented Dec 22, 2023

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

This patch introduces the LLVM components of a type sanitizer: a sanitizer for type-based aliasing violations.

C/C++ have type-based aliasing rules, and LLVM's optimizer can exploit these given TBAA metadata added by Clang. Roughly, a pointer of given type cannot be used to access an object of a different type (with, of course, certain exceptions). Unfortunately, there's a lot of code in the wild that violates these rules (e.g. for type punning), and such code often must be built with -fno-strict-aliasing. Performance is often sacrificed as a result. Part of the problem is the difficulty of finding TBAA violations. Hopefully, this sanitizer will help.

For each TBAA type-access descriptor, encoded in LLVM's IR using metadata, the corresponding instrumentation pass generates descriptor tables. Thus, for each type (and access descriptor), we have a unique pointer representation. Excepting anonymous-namespace types, these tables are comdat, so the pointer values should be unique across the program. The descriptors refer to other descriptors to form a type aliasing tree (just like LLVM's TBAA metadata does). The instrumentation handles the "fast path" (where the types match exactly and no partial-overlaps are detected), and defers to the runtime to handle all of the more-complicated cases. The runtime, of course, is also responsible for reporting errors when those are detected.

The runtime uses essentially the same shadow memory region as tsan, and we use 8 bytes of shadow memory, the size of the pointer to the type descriptor, for every byte of accessed data in the program. The value 0 is used to represent an unknown type. The value -1 is used to represent an interior byte (a byte that is part of a type, but not the first byte). The instrumentation first checks for an exact match between the type of the current access and the type for that address recorded in the shadow memory. If it matches, it then checks the shadow for the remainder of the bytes in the type to make sure that they're all -1. If not, we call the runtime. If the exact match fails, we next check if the value is 0 (i.e. unknown). If it is, then we check the shadow for the remainder of the byes in the type (to make sure they're all 0). If they're not, we call the runtime. We then set the shadow for the access address and set the shadow for the remaining bytes in the type to -1 (i.e. marking them as interior bytes). If the type indicated by the shadow memory for the access address is neither an exact match nor 0, we call the runtime.

The instrumentation pass inserts calls to the memset intrinsic to set the memory updated by memset, memcpy, and memmove, as well as allocas/byval (and for lifetime.start/end) to reset the shadow memory to reflect that the type is now unknown. The runtime intercepts memset, memcpy, etc. to perform the same function for the library calls.

The runtime essentially repeats these checks, but uses the full TBAA algorithm, just as the compiler does, to determine when two types are permitted to alias. In a situation where access overlap has occurred and aliasing is not permitted, an error is generated.

Clang's TBAA representation currently has a problem representing unions, as demonstrated by the one XFAIL'd test in the runtime patch. We'll update the TBAA representation to fix this, and at the same time, update the sanitizer.

When the sanitizer is active, we disable actually using the TBAA metadata for AA. This way we're less likely to use TBAA to remove memory accesses that we'd like to verify.

As a note, this implementation does not use the compressed shadow-memory scheme discussed previously (http://lists.llvm.org/pipermail/llvm-dev/2017-April/111766.html). That scheme would not handle the struct-path (i.e. structure offset) information that our TBAA represents. I expect we'll want to further work on compressing the shadow-memory representation, but I think it makes sense to do that as follow-up work.
[TySan] LLVM part.

Move from https://reviews.llvm.org/D32198


Patch is 115.26 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/76259.diff

24 Files Affected:

  • (modified) llvm/include/llvm/Bitcode/LLVMBitCodes.h (+1)
  • (modified) llvm/include/llvm/IR/Attributes.td (+4)
  • (added) llvm/include/llvm/Transforms/Instrumentation/TypeSanitizer.h (+38)
  • (modified) llvm/lib/Analysis/TypeBasedAliasAnalysis.cpp (+22-6)
  • (modified) llvm/lib/Bitcode/Reader/BitcodeReader.cpp (+2)
  • (modified) llvm/lib/Bitcode/Writer/BitcodeWriter.cpp (+2)
  • (modified) llvm/lib/CodeGen/ShrinkWrap.cpp (+1)
  • (modified) llvm/lib/Passes/PassBuilder.cpp (+1)
  • (modified) llvm/lib/Passes/PassRegistry.def (+2)
  • (modified) llvm/lib/Transforms/Instrumentation/CMakeLists.txt (+1)
  • (added) llvm/lib/Transforms/Instrumentation/TypeSanitizer.cpp (+873)
  • (modified) llvm/lib/Transforms/Utils/CodeExtractor.cpp (+1)
  • (added) llvm/test/Instrumentation/TypeSanitizer/access-with-offfset.ll (+71)
  • (added) llvm/test/Instrumentation/TypeSanitizer/alloca.ll (+29)
  • (added) llvm/test/Instrumentation/TypeSanitizer/anon.ll (+283)
  • (added) llvm/test/Instrumentation/TypeSanitizer/basic-nosan.ll (+93)
  • (added) llvm/test/Instrumentation/TypeSanitizer/basic.ll (+214)
  • (added) llvm/test/Instrumentation/TypeSanitizer/byval.ll (+88)
  • (added) llvm/test/Instrumentation/TypeSanitizer/globals.ll (+66)
  • (added) llvm/test/Instrumentation/TypeSanitizer/invalid-metadata.ll (+25)
  • (added) llvm/test/Instrumentation/TypeSanitizer/memintrinsics.ll (+77)
  • (added) llvm/test/Instrumentation/TypeSanitizer/nosanitize.ll (+39)
  • (added) llvm/test/Instrumentation/TypeSanitizer/sanitize-no-tbaa.ll (+180)
  • (added) llvm/test/Instrumentation/TypeSanitizer/swifterror.ll (+24)
diff --git a/llvm/include/llvm/Bitcode/LLVMBitCodes.h b/llvm/include/llvm/Bitcode/LLVMBitCodes.h
index c6f0ddf29a6da8..8d3259336abba9 100644
--- a/llvm/include/llvm/Bitcode/LLVMBitCodes.h
+++ b/llvm/include/llvm/Bitcode/LLVMBitCodes.h
@@ -724,6 +724,7 @@ enum AttributeKindCodes {
   ATTR_KIND_WRITABLE = 89,
   ATTR_KIND_CORO_ONLY_DESTROY_WHEN_COMPLETE = 90,
   ATTR_KIND_DEAD_ON_UNWIND = 91,
+  ATTR_KIND_SANITIZE_TYPE = 92,
 };
 
 enum ComdatSelectionKindCodes {
diff --git a/llvm/include/llvm/IR/Attributes.td b/llvm/include/llvm/IR/Attributes.td
index 864f87f3383891..beffa63bd81298 100644
--- a/llvm/include/llvm/IR/Attributes.td
+++ b/llvm/include/llvm/IR/Attributes.td
@@ -270,6 +270,9 @@ def SanitizeAddress : EnumAttr<"sanitize_address", [FnAttr]>;
 /// ThreadSanitizer is on.
 def SanitizeThread : EnumAttr<"sanitize_thread", [FnAttr]>;
 
+/// TypeSanitizer is on.
+def SanitizeType : EnumAttr<"sanitize_type", [FnAttr]>;
+
 /// MemorySanitizer is on.
 def SanitizeMemory : EnumAttr<"sanitize_memory", [FnAttr]>;
 
@@ -354,6 +357,7 @@ def : CompatRule<"isEqual<SanitizeThreadAttr>">;
 def : CompatRule<"isEqual<SanitizeMemoryAttr>">;
 def : CompatRule<"isEqual<SanitizeHWAddressAttr>">;
 def : CompatRule<"isEqual<SanitizeMemTagAttr>">;
+def : CompatRule<"isEqual<SanitizeTypeAttr>">;
 def : CompatRule<"isEqual<SafeStackAttr>">;
 def : CompatRule<"isEqual<ShadowCallStackAttr>">;
 def : CompatRule<"isEqual<UseSampleProfileAttr>">;
diff --git a/llvm/include/llvm/Transforms/Instrumentation/TypeSanitizer.h b/llvm/include/llvm/Transforms/Instrumentation/TypeSanitizer.h
new file mode 100644
index 00000000000000..a6cc56df35f14d
--- /dev/null
+++ b/llvm/include/llvm/Transforms/Instrumentation/TypeSanitizer.h
@@ -0,0 +1,38 @@
+//===- Transforms/Instrumentation/TypeSanitizer.h - TySan Pass -----------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines the type sanitizer pass.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_TRANSFORMS_INSTRUMENTATION_TYPESANITIZER_H
+#define LLVM_TRANSFORMS_INSTRUMENTATION_TYPESANITIZER_H
+
+#include "llvm/IR/PassManager.h"
+
+namespace llvm {
+class Function;
+class FunctionPass;
+class Module;
+
+/// A function pass for tysan instrumentation.
+struct TypeSanitizerPass : public PassInfoMixin<TypeSanitizerPass> {
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM);
+  static bool isRequired() { return true; }
+};
+
+/// A module pass for tysan instrumentation.
+///
+/// Create ctor and init functions.
+struct ModuleTypeSanitizerPass : public PassInfoMixin<ModuleTypeSanitizerPass> {
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+  static bool isRequired() { return true; }
+};
+
+} // namespace llvm
+#endif /* LLVM_TRANSFORMS_INSTRUMENTATION_TYPESANITIZER_H */
diff --git a/llvm/lib/Analysis/TypeBasedAliasAnalysis.cpp b/llvm/lib/Analysis/TypeBasedAliasAnalysis.cpp
index e4dc1a867f6f0c..bfa8a80518ccda 100644
--- a/llvm/lib/Analysis/TypeBasedAliasAnalysis.cpp
+++ b/llvm/lib/Analysis/TypeBasedAliasAnalysis.cpp
@@ -371,11 +371,27 @@ static bool isStructPathTBAA(const MDNode *MD) {
   return isa<MDNode>(MD->getOperand(0)) && MD->getNumOperands() >= 3;
 }
 
+// When using the TypeSanitizer, don't use TBAA information for alias analysis.
+// This might cause us to remove memory accesses that we need to verify at
+// runtime.
+static bool usingSanitizeType(const Value *V) {
+  const Function *F;
+
+  if (auto *I = dyn_cast<Instruction>(V))
+    F = I->getParent()->getParent();
+  else if (auto *A = dyn_cast<Argument>(V))
+    F = A->getParent();
+  else
+    return false;
+
+  return F->hasFnAttribute(Attribute::SanitizeType);
+}
+
 AliasResult TypeBasedAAResult::alias(const MemoryLocation &LocA,
                                      const MemoryLocation &LocB,
                                      AAQueryInfo &AAQI, const Instruction *) {
-  if (!EnableTBAA)
-    return AliasResult::MayAlias;
+  if (!EnableTBAA || usingSanitizeType(LocA.Ptr) || usingSanitizeType(LocB.Ptr))
+    return AAResultBase::alias(LocA, LocB, AAQI, nullptr);
 
   if (Aliases(LocA.AATags.TBAA, LocB.AATags.TBAA))
     return AliasResult::MayAlias;
@@ -425,8 +441,8 @@ MemoryEffects TypeBasedAAResult::getMemoryEffects(const Function *F) {
 ModRefInfo TypeBasedAAResult::getModRefInfo(const CallBase *Call,
                                             const MemoryLocation &Loc,
                                             AAQueryInfo &AAQI) {
-  if (!EnableTBAA)
-    return ModRefInfo::ModRef;
+  if (!EnableTBAA || usingSanitizeType(Call))
+    return AAResultBase::getModRefInfo(Call, Loc, AAQI);
 
   if (const MDNode *L = Loc.AATags.TBAA)
     if (const MDNode *M = Call->getMetadata(LLVMContext::MD_tbaa))
@@ -439,8 +455,8 @@ ModRefInfo TypeBasedAAResult::getModRefInfo(const CallBase *Call,
 ModRefInfo TypeBasedAAResult::getModRefInfo(const CallBase *Call1,
                                             const CallBase *Call2,
                                             AAQueryInfo &AAQI) {
-  if (!EnableTBAA)
-    return ModRefInfo::ModRef;
+  if (!EnableTBAA || usingSanitizeType(Call1))
+    return AAResultBase::getModRefInfo(Call1, Call2, AAQI);
 
   if (const MDNode *M1 = Call1->getMetadata(LLVMContext::MD_tbaa))
     if (const MDNode *M2 = Call2->getMetadata(LLVMContext::MD_tbaa))
diff --git a/llvm/lib/Bitcode/Reader/BitcodeReader.cpp b/llvm/lib/Bitcode/Reader/BitcodeReader.cpp
index 8907f6fa4ff3fd..31e2d74fe570e8 100644
--- a/llvm/lib/Bitcode/Reader/BitcodeReader.cpp
+++ b/llvm/lib/Bitcode/Reader/BitcodeReader.cpp
@@ -2058,6 +2058,8 @@ static Attribute::AttrKind getAttrFromCode(uint64_t Code) {
     return Attribute::SanitizeHWAddress;
   case bitc::ATTR_KIND_SANITIZE_THREAD:
     return Attribute::SanitizeThread;
+  case bitc::ATTR_KIND_SANITIZE_TYPE:
+    return Attribute::SanitizeType;
   case bitc::ATTR_KIND_SANITIZE_MEMORY:
     return Attribute::SanitizeMemory;
   case bitc::ATTR_KIND_SPECULATIVE_LOAD_HARDENING:
diff --git a/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp b/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
index 8fca569a391baf..c925c367d17aaf 100644
--- a/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
+++ b/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
@@ -789,6 +789,8 @@ static uint64_t getAttrKindEncoding(Attribute::AttrKind Kind) {
     return bitc::ATTR_KIND_SANITIZE_HWADDRESS;
   case Attribute::SanitizeThread:
     return bitc::ATTR_KIND_SANITIZE_THREAD;
+  case Attribute::SanitizeType:
+    return bitc::ATTR_KIND_SANITIZE_TYPE;
   case Attribute::SanitizeMemory:
     return bitc::ATTR_KIND_SANITIZE_MEMORY;
   case Attribute::SpeculativeLoadHardening:
diff --git a/llvm/lib/CodeGen/ShrinkWrap.cpp b/llvm/lib/CodeGen/ShrinkWrap.cpp
index ab57d08e527e4d..f4e6f8001debf2 100644
--- a/llvm/lib/CodeGen/ShrinkWrap.cpp
+++ b/llvm/lib/CodeGen/ShrinkWrap.cpp
@@ -984,6 +984,7 @@ bool ShrinkWrap::isShrinkWrapEnabled(const MachineFunction &MF) {
            !(MF.getFunction().hasFnAttribute(Attribute::SanitizeAddress) ||
              MF.getFunction().hasFnAttribute(Attribute::SanitizeThread) ||
              MF.getFunction().hasFnAttribute(Attribute::SanitizeMemory) ||
+             MF.getFunction().hasFnAttribute(Attribute::SanitizeType) ||
              MF.getFunction().hasFnAttribute(Attribute::SanitizeHWAddress));
   // If EnableShrinkWrap is set, it takes precedence on whatever the
   // target sets. The rational is that we assume we want to test
diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index f94bd422c6b592..fb56a65e41ec99 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -167,6 +167,7 @@
 #include "llvm/Transforms/Instrumentation/SanitizerBinaryMetadata.h"
 #include "llvm/Transforms/Instrumentation/SanitizerCoverage.h"
 #include "llvm/Transforms/Instrumentation/ThreadSanitizer.h"
+#include "llvm/Transforms/Instrumentation/TypeSanitizer.h"
 #include "llvm/Transforms/ObjCARC.h"
 #include "llvm/Transforms/Scalar/ADCE.h"
 #include "llvm/Transforms/Scalar/AlignmentFromAssumptions.h"
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index 82ce040c649626..5c9e16435f0ae9 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -138,6 +138,7 @@ MODULE_PASS("synthetic-counts-propagation", SyntheticCountsPropagation())
 MODULE_PASS("trigger-crash", TriggerCrashPass())
 MODULE_PASS("trigger-verifier-error", TriggerVerifierErrorPass())
 MODULE_PASS("tsan-module", ModuleThreadSanitizerPass())
+MODULE_PASS("tysan-module", ModuleTypeSanitizerPass())
 MODULE_PASS("verify", VerifierPass())
 MODULE_PASS("view-callgraph", CallGraphViewerPass())
 MODULE_PASS("wholeprogramdevirt", WholeProgramDevirtPass())
@@ -417,6 +418,7 @@ FUNCTION_PASS("tlshoist", TLSVariableHoistPass())
 FUNCTION_PASS("transform-warning", WarnMissedTransformationsPass())
 FUNCTION_PASS("trigger-verifier-error", TriggerVerifierErrorPass())  
 FUNCTION_PASS("tsan", ThreadSanitizerPass())
+FUNCTION_PASS("tysan", TypeSanitizerPass())
 FUNCTION_PASS("typepromotion", TypePromotionPass(TM))
 FUNCTION_PASS("unify-loop-exits", UnifyLoopExitsPass())
 FUNCTION_PASS("vector-combine", VectorCombinePass())
diff --git a/llvm/lib/Transforms/Instrumentation/CMakeLists.txt b/llvm/lib/Transforms/Instrumentation/CMakeLists.txt
index 424f1d43360677..54211236e18643 100644
--- a/llvm/lib/Transforms/Instrumentation/CMakeLists.txt
+++ b/llvm/lib/Transforms/Instrumentation/CMakeLists.txt
@@ -20,6 +20,7 @@ add_llvm_component_library(LLVMInstrumentation
   SanitizerBinaryMetadata.cpp
   ValueProfileCollector.cpp
   ThreadSanitizer.cpp
+  TypeSanitizer.cpp
   HWAddressSanitizer.cpp
 
   ADDITIONAL_HEADER_DIRS
diff --git a/llvm/lib/Transforms/Instrumentation/TypeSanitizer.cpp b/llvm/lib/Transforms/Instrumentation/TypeSanitizer.cpp
new file mode 100644
index 00000000000000..ed4aba4ad612d9
--- /dev/null
+++ b/llvm/lib/Transforms/Instrumentation/TypeSanitizer.cpp
@@ -0,0 +1,873 @@
+//===----- TypeSanitizer.cpp - type-based-aliasing-violation detector -----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file is a part of TypeSanitizer, a type-based-aliasing-violation
+// detector.
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Transforms/Instrumentation/TypeSanitizer.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallSet.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/Statistic.h"
+#include "llvm/ADT/StringExtras.h"
+#include "llvm/Analysis/MemoryLocation.h"
+#include "llvm/Analysis/TargetLibraryInfo.h"
+#include "llvm/IR/DataLayout.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/IRBuilder.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/LLVMContext.h"
+#include "llvm/IR/MDBuilder.h"
+#include "llvm/IR/Metadata.h"
+#include "llvm/IR/Module.h"
+#include "llvm/IR/Type.h"
+#include "llvm/ProfileData/InstrProf.h"
+#include "llvm/Support/CommandLine.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/MD5.h"
+#include "llvm/Support/MathExtras.h"
+#include "llvm/Support/Regex.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Transforms/Instrumentation.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+#include "llvm/Transforms/Utils/Local.h"
+#include "llvm/Transforms/Utils/ModuleUtils.h"
+
+#include <cctype>
+
+using namespace llvm;
+
+#define DEBUG_TYPE "tysan"
+
+static const char *const kTysanModuleCtorName = "tysan.module_ctor";
+static const char *const kTysanInitName = "__tysan_init";
+static const char *const kTysanCheckName = "__tysan_check";
+static const char *const kTysanGVNamePrefix = "__tysan_v1_";
+
+static const char *const kTysanShadowMemoryAddress =
+    "__tysan_shadow_memory_address";
+static const char *const kTysanAppMemMask = "__tysan_app_memory_mask";
+
+static cl::opt<bool>
+    ClWritesAlwaysSetType("tysan-writes-always-set-type",
+                          cl::desc("Writes always set the type"), cl::Hidden,
+                          cl::init(false));
+
+STATISTIC(NumInstrumentedAccesses, "Number of instrumented accesses");
+
+static Regex AnonNameRegex("^_ZTS.*N[1-9][0-9]*_GLOBAL__N");
+
+namespace {
+
+/// TypeSanitizer: instrument the code in module to find  type-based aliasing
+/// violations.
+struct TypeSanitizer {
+  TypeSanitizer(Module &M);
+  bool run(Function &F, const TargetLibraryInfo &TLI);
+  void instrumentGlobals();
+
+private:
+  typedef SmallDenseMap<const MDNode *, GlobalVariable *, 8>
+      TypeDescriptorsMapTy;
+  typedef SmallDenseMap<const MDNode *, std::string, 8> TypeNameMapTy;
+
+  void initializeCallbacks(Module &M);
+
+  Value *getShadowBase(Function &F);
+  Value *getAppMemMask(Function &F);
+
+  bool instrumentWithShadowUpdate(IRBuilder<> &IRB, const MDNode *TBAAMD,
+                                  Value *Ptr, uint64_t AccessSize, bool IsRead,
+                                  bool IsWrite, Value *&ShadowBase,
+                                  Value *&AppMemMask, bool ForceSetType,
+                                  bool SanitizeFunction,
+                                  TypeDescriptorsMapTy &TypeDescriptors,
+                                  const DataLayout &DL);
+  bool instrumentMemoryAccess(Instruction *I, MemoryLocation &MLoc,
+                              Value *&ShadowBase, Value *&AppMemMask,
+                              bool SanitizeFunction,
+                              TypeDescriptorsMapTy &TypeDescriptors,
+                              const DataLayout &DL);
+  bool instrumentMemInst(Value *I, Value *&ShadowBase, Value *&AppMemMask,
+                         const DataLayout &DL);
+
+  std::string getAnonymousStructIdentifier(const MDNode *MD,
+                                           TypeNameMapTy &TypeNames);
+  bool generateTypeDescriptor(const MDNode *MD,
+                              TypeDescriptorsMapTy &TypeDescriptors,
+                              TypeNameMapTy &TypeNames, Module &M);
+  bool generateBaseTypeDescriptor(const MDNode *MD,
+                                  TypeDescriptorsMapTy &TypeDescriptors,
+                                  TypeNameMapTy &TypeNames, Module &M);
+
+  const Triple TargetTriple;
+  Type *IntptrTy;
+  uint64_t PtrShift;
+  IntegerType *OrdTy;
+
+  // Callbacks to run-time library are computed in doInitialization.
+  Function *TysanCheck;
+  Function *TysanCtorFunction;
+  Function *TysanGlobalsSetTypeFunction;
+};
+} // namespace
+
+TypeSanitizer::TypeSanitizer(Module &M)
+    : TargetTriple(Triple(M.getTargetTriple())) {
+  const DataLayout &DL = M.getDataLayout();
+  IntptrTy = DL.getIntPtrType(M.getContext());
+  PtrShift = countr_zero(IntptrTy->getPrimitiveSizeInBits() / 8);
+
+  TysanGlobalsSetTypeFunction = M.getFunction("__tysan_set_globals_types");
+  initializeCallbacks(M);
+}
+
+void TypeSanitizer::initializeCallbacks(Module &M) {
+  IRBuilder<> IRB(M.getContext());
+  OrdTy = IRB.getInt32Ty();
+
+  AttributeList Attr;
+  Attr = Attr.addFnAttribute(M.getContext(), Attribute::NoUnwind);
+  // Initialize the callbacks.
+  TysanCheck = cast<Function>(
+      M.getOrInsertFunction(kTysanCheckName, Attr, IRB.getVoidTy(),
+                            IRB.getPtrTy(), // Pointer to data to be read.
+                            OrdTy,              // Size of the data in bytes.
+                            IRB.getPtrTy(), // Pointer to type descriptor.
+                            OrdTy               // Flags.
+                            )
+          .getCallee());
+
+  TysanCtorFunction = cast<Function>(
+      M.getOrInsertFunction(kTysanModuleCtorName, Attr, IRB.getVoidTy())
+          .getCallee());
+}
+
+void TypeSanitizer::instrumentGlobals() {
+  Module &M = *TysanCtorFunction->getParent();
+  initializeCallbacks(M);
+  TysanGlobalsSetTypeFunction = nullptr;
+
+  NamedMDNode *Globals = M.getNamedMetadata("llvm.tysan.globals");
+  if (!Globals)
+    return;
+
+  const DataLayout &DL = M.getDataLayout();
+  Value *ShadowBase = nullptr, *AppMemMask = nullptr;
+  TypeDescriptorsMapTy TypeDescriptors;
+  TypeNameMapTy TypeNames;
+
+  for (const auto &GMD : Globals->operands()) {
+    auto *GV = mdconst::dyn_extract_or_null<GlobalVariable>(GMD->getOperand(0));
+    if (!GV)
+      continue;
+    const MDNode *TBAAMD = cast<MDNode>(GMD->getOperand(1));
+    if (!generateBaseTypeDescriptor(TBAAMD, TypeDescriptors, TypeNames, M))
+      continue;
+
+    if (!TysanGlobalsSetTypeFunction) {
+      TysanGlobalsSetTypeFunction = Function::Create(
+          FunctionType::get(Type::getVoidTy(M.getContext()), false),
+          GlobalValue::InternalLinkage, "__tysan_set_globals_types", &M);
+      BasicBlock *BB =
+          BasicBlock::Create(M.getContext(), "", TysanGlobalsSetTypeFunction);
+      ReturnInst::Create(M.getContext(), BB);
+    }
+
+    IRBuilder<> IRB(
+        TysanGlobalsSetTypeFunction->getEntryBlock().getTerminator());
+    Type *AccessTy = GV->getValueType();
+    assert(AccessTy->isSized());
+    uint64_t AccessSize = DL.getTypeStoreSize(AccessTy);
+    instrumentWithShadowUpdate(IRB, TBAAMD, GV, AccessSize, false, false,
+                               ShadowBase, AppMemMask, true, false,
+                               TypeDescriptors, DL);
+  }
+
+  if (TysanGlobalsSetTypeFunction) {
+    IRBuilder<> IRB(TysanCtorFunction->getEntryBlock().getTerminator());
+    IRB.CreateCall(TysanGlobalsSetTypeFunction, {});
+  }
+}
+
+static void insertModuleCtor(Module &M) {
+  Function *TysanCtorFunction;
+  std::tie(TysanCtorFunction, std::ignore) =
+      createSanitizerCtorAndInitFunctions(M, kTysanModuleCtorName,
+                                          kTysanInitName, /*InitArgTypes=*/{},
+                                          /*InitArgs=*/{});
+
+  TypeSanitizer TySan(M);
+  TySan.instrumentGlobals();
+  appendToGlobalCtors(M, TysanCtorFunction, 0);
+}
+
+static const char LUT[] = "0123456789abcdef";
+
+static std::string encodeName(StringRef Name) {
+  size_t Length = Name.size();
+  std::string Output = kTysanGVNamePrefix;
+  Output.reserve(Output.size() + 3 * Length);
+  for (size_t i = 0; i < Length; ++i) {
+    const unsigned char c = Name[i];
+    if (isalnum((int)c)) {
+      Output.push_back(c);
+      continue;
+    }
+
+    if (c == '_') {
+      Output.append("__");
+      continue;
+    }
+
+    Output.push_back('_');
+    Output.push_back(LUT[c >> 4]);
+    Output.push_back(LUT[c & 15]);
+  }
+
+  return Output;
+}
+
+static bool isAnonymousNamespaceName(StringRef Name) {
+  // Types that are in an anonymous namespace are local to this module.
+  // FIXME: This should really be marked by the frontend in the metadata
+  // instead of having us guess this from the mangled name. Moreover, the regex
+  // here can pick up (unlikely) names in the non-reserved namespace (because
+  // it needs to search into the type to pick up cases where the type in the
+  // anonymous namespace is a template parameter, etc.).
+  return AnonNameRegex.match(Name);
+}
+
+std::string
+TypeSanitizer::getAnonymousStructIdentifier(const MDNode *MD,
+                                            TypeNameMapTy &TypeNames) {
+  MD5 Hash;
+
+  for (int i = 1, e = MD->getNumOperands(); i < e; i += 2) {
+    const MDNode *MemberNode = dyn_cast<MDNode>(MD->getOperand(i));
+    if (!MemberNode)
+      return "";
+
+    auto TNI = TypeNames.find(MemberNode);
+    std::string MemberName;
+    if (TNI != TypeNames.end()) {
+      MemberName = TNI->second;
+    } else {
+      if (MemberNode->getNumOperands() < 1)
+        return "";
+      MDString *MemberNameNode = dyn_cast<MDString>(MemberNode->getOperand(0));
+      if (!MemberNameNode)
+        return "";
+      MemberName = MemberNameNode->getString().str();
+      if (MemberName.empty())
+        MemberName = getAnonymousStructIdentifier(MemberNode, TypeNames);
+      if (MemberName.empty())
+        r...
[truncated]

Copy link

github-actions bot commented Dec 22, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

// When using the TypeSanitizer, don't use TBAA information for alias analysis.
// This might cause us to remove memory accesses that we need to verify at
// runtime.
static bool usingSanitizeType(const Value *V) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeBasedAAResult is constructed on a per-function basis, so you should cache this in run(). Has the nice side benefit that Function is available at that point and you don't have to guess it from the accessed value, which will be unreliable (e.g. for accesses to globals).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to TypeBasedAAResult, thanks. There's still a legacy PM wrapper pass that constructs on a per-module basis, but that's not that relevant probably

@thesamesam thesamesam added the TBAA Type-Based Alias Analysis / Strict Aliasing label Dec 26, 2023
@tru
Copy link
Collaborator

tru commented Jan 18, 2024

@fhahn What's the state of this currently? I was just looking at using something like this since we want to convert a older codebase to enable strict-aliasing and it would be helpful to have the tysan for this.

I am happy to test this out on Windows if that hasn't been done yet.

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(an unsubmitted comment I had lying around for some reason)

@fhahn fhahn force-pushed the users/fhahn/tysan-a-type-sanitizer-llvm branch from 201c9e3 to eb97b07 Compare April 18, 2024 21:53
Copy link
Contributor Author

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fhahn What's the state of this currently? I was just looking at using something like this since we want to convert a older codebase to enable strict-aliasing and it would be helpful to have the tysan for this.

I am happy to test this out on Windows if that hasn't been done yet.

Just rebased this again (and will for the follow on patches in a minute) and address the comments.

The current version catches 6/8 strict-aliasing violations I could find with a quick search in the bug tracker. Most of the SingleSource tests from the llvm-test-suite pass with ~20 failing (still need to dig through those). That is both on ARM64 macOS.

At EuroLLVM I had a chat to people from Sony who were interested. Will try to also post an update on the discourse thread tomorrow.

Any help on Windows would be great, I think at the moment the clang part and runtime is only enabled on Linux/macOS & X86/AArch64.

// When using the TypeSanitizer, don't use TBAA information for alias analysis.
// This might cause us to remove memory accesses that we need to verify at
// runtime.
static bool usingSanitizeType(const Value *V) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to TypeBasedAAResult, thanks. There's still a legacy PM wrapper pass that constructs on a per-module basis, but that's not that relevant probably

@gbMattN
Copy link
Contributor

gbMattN commented May 1, 2024

Any help on Windows would be great, I think at the moment the clang part and runtime is only enabled on Linux/macOS & X86/AArch64.

I ran the clang and llvm tests on Windows. All the llvm tests passed. Clang's CodeGen/sanitize-type-attr.cpp test passed, but Driver/sanitizer-ld.c had a compile issue and failed. I'll look into that and see if I can find a fix.

In order to run the tests I added SanitizerKind::Type to the Windows MSVC ToolChain and nothing else, so that failure may not be an actual issue.

Edit: That test isn't passing for me on windows without the TySan changes as well, so TySan's probably not the issue

@@ -0,0 +1,71 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible spelling mistake, offset spelt with 3 fs?

@fhahn fhahn force-pushed the users/fhahn/tysan-a-type-sanitizer-llvm branch from eb97b07 to b1465a7 Compare June 27, 2024 16:09
@fhahn fhahn force-pushed the users/fhahn/tysan-a-type-sanitizer-llvm branch from b1465a7 to a6d44dd Compare November 22, 2024 18:56
@fhahn
Copy link
Contributor Author

fhahn commented Nov 22, 2024

Just rebased again on top of current main.

Would be great if we could get this in, as multiple people including @gbMattN and I think @tavianator would be interested in making improvements. As is, it is already useful (it catches a number of type-based alias violations in issues submitted by users https://discourse.llvm.org/t/reviving-typesanitizer-a-sanitizer-to-catch-type-based-aliasing-violations/66092/11) and having it in main would allow more people to improve the sanitizer more quickly

@fhahn fhahn requested review from yln and kubamracek November 22, 2024 19:33
@xry111
Copy link
Contributor

xry111 commented Nov 30, 2024

Just rebased again on top of current main.

Would be great if we could get this in, as multiple people including @gbMattN and I think @tavianator would be interested in making improvements. As is, it is already useful (it catches a number of type-based alias violations in issues submitted by users https://discourse.llvm.org/t/reviving-typesanitizer-a-sanitizer-to-catch-type-based-aliasing-violations/66092/11) and having it in main would allow more people to improve the sanitizer more quickly

I guess you need to fix the build failures highlighted by the CI first:

FAILED: lib/Transforms/Instrumentation/CMakeFiles/LLVMInstrumentation.dir/TypeSanitizer.cpp.o 
/var/lib/buildkite-agent/builds/linux-56-59b8f5d88-5wwkv-1/llvm-project/github-pull-requests/llvm/lib/Transforms/Instrumentation/TypeSanitizer.cpp:40:10: fatal error: 'llvm/Transforms/Instrumentation.h' file not found
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It seems moved to llvm/Transforms/Utils/Instrumentation.h at 2ae968a.

@fhahn fhahn requested review from fmayer and efriedma-quic December 6, 2024 10:42
@fhahn fhahn force-pushed the users/fhahn/tysan-a-type-sanitizer-llvm branch from a6d44dd to eff5b7e Compare December 6, 2024 10:42
@fhahn
Copy link
Contributor Author

fhahn commented Dec 6, 2024

Ping :)

I guess you need to fix the build failures highlighted by the CI first:

FAILED: lib/Transforms/Instrumentation/CMakeFiles/LLVMInstrumentation.dir/TypeSanitizer.cpp.o 
/var/lib/buildkite-agent/builds/linux-56-59b8f5d88-5wwkv-1/llvm-project/github-pull-requests/llvm/lib/Transforms/Instrumentation/TypeSanitizer.cpp:40:10: fatal error: 'llvm/Transforms/Instrumentation.h' file not found
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It seems moved to llvm/Transforms/Utils/Instrumentation.h at 2ae968a.

Thanks for the heads-up, removed the include, as it shouldn't be needed in the latest version.

Copy link
Contributor

@fmayer fmayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first passs

Copy link
Contributor

@fmayer fmayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the many nits

@@ -62,7 +62,7 @@ TEST(AliasSetTracker, AliasUnknownInst) {
TargetLibraryInfoImpl TLII(Trip);
TargetLibraryInfo TLI(TLII);
AAResults AA(TLI);
TypeBasedAAResult TBAAR;
TypeBasedAAResult TBAAR(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: /* ...= */

@fhahn fhahn force-pushed the users/fhahn/tysan-a-type-sanitizer-llvm branch 2 times, most recently from 2235299 to 694cb2c Compare December 10, 2024 11:43
SetType();
return true;
}
// We need to check the type here. If the type is unknown, then the read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! random thing for down the line: would it be worth considering outlining this, as we do with hwasan checks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a fork with outlines TySan which I talk about a little on the TySan RFC. I'm planning on waiting for this to be merged before making a pull request to keep things simpler 😅

Copy link
Contributor

@fmayer fmayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for addressing my comments

@fhahn
Copy link
Contributor Author

fhahn commented Dec 13, 2024

Thanks for taking a look! Now the LLVM and Clang parts are good to go, but I'd like to ideally merge this once the compiler-rt parts (#76261) is also good to go.

I guess the LLVM part could land mostly independently, but the Clang part should have complementing compiler-rt support to start with I think

@thesamesam
Copy link
Member

FWIW, having the 2 parts in would make it easier for us to test it before compiler-rt is merged (rather than having to rebase and apply the 3 parts individually to each component, as we build LLVM/Clang/compiler-rt separately), but I don't feel strongly at all either.

@fhahn fhahn force-pushed the users/fhahn/tysan-a-type-sanitizer-llvm branch from 694cb2c to 9af0ba5 Compare December 17, 2024 12:14
@fhahn
Copy link
Contributor Author

fhahn commented Dec 17, 2024

All 3 parts have now been approved, just rebased the PR for some final checks, then I am planning to merge this and the remaining parts later today. Many thanks everyone involved, including @fmayer for all the recent reviews!

Copy link

github-actions bot commented Dec 17, 2024

✅ With the latest revision this PR passed the undef deprecator.

@fhahn fhahn changed the title [TySan] A Type Sanitizer (LLVM) [TySan] Add initial Type Sanitizer (LLVM) Dec 17, 2024
@fhahn fhahn merged commit a487b79 into main Dec 17, 2024
8 checks passed
@fhahn fhahn deleted the users/fhahn/tysan-a-type-sanitizer-llvm branch December 17, 2024 13:57
fhahn added a commit that referenced this pull request Dec 17, 2024
This patch introduces the Clang components of type sanitizer: a
sanitizer for type-based aliasing violations.

It is based on Hal Finkel's https://reviews.llvm.org/D32198.

The Clang changes are mostly formulaic, the one specific change being
that when the TBAA sanitizer is enabled, TBAA is always generated, even
at -O0.

It goes together with the corresponding LLVM changes
(#76259) and compiler-rt
changes (#76261)

PR: #76260
@hoyhoy
Copy link

hoyhoy commented Dec 17, 2024

Does -fsanitize=address,undefined,type work? Or does this only run standalone?

fhahn added a commit that referenced this pull request Dec 17, 2024
This patch introduces the runtime components for type sanitizer: a
sanitizer for type-based aliasing violations.

It is based on Hal Finkel's https://reviews.llvm.org/D32197.

C/C++ have type-based aliasing rules, and LLVM's optimizer can exploit
these given TBAA metadata added by Clang. Roughly, a pointer of given
type cannot be used to access an object of a different type (with, of
course, certain exceptions). Unfortunately, there's a lot of code in the
wild that violates these rules (e.g. for type punning), and such code
often must be built with -fno-strict-aliasing. Performance is often
sacrificed as a result. Part of the problem is the difficulty of finding
TBAA violations. Hopefully, this sanitizer will help.

For each TBAA type-access descriptor, encoded in LLVM's IR using
metadata, the corresponding instrumentation pass generates descriptor
tables. Thus, for each type (and access descriptor), we have a unique
pointer representation. Excepting anonymous-namespace types, these
tables are comdat, so the pointer values should be unique across the
program. The descriptors refer to other descriptors to form a type
aliasing tree (just like LLVM's TBAA metadata does). The instrumentation
handles the "fast path" (where the types match exactly and no
partial-overlaps are detected), and defers to the runtime to handle all
of the more-complicated cases. The runtime, of course, is also
responsible for reporting errors when those are detected.

The runtime uses essentially the same shadow memory region as tsan, and
we use 8 bytes of shadow memory, the size of the pointer to the type
descriptor, for every byte of accessed data in the program. The value 0
is used to represent an unknown type. The value -1 is used to represent
an interior byte (a byte that is part of a type, but not the first
byte). The instrumentation first checks for an exact match between the
type of the current access and the type for that address recorded in the
shadow memory. If it matches, it then checks the shadow for the
remainder of the bytes in the type to make sure that they're all -1. If
not, we call the runtime. If the exact match fails, we next check if the
value is 0 (i.e. unknown). If it is, then we check the shadow for the
remainder of the byes in the type (to make sure they're all 0). If
they're not, we call the runtime. We then set the shadow for the access
address and set the shadow for the remaining bytes in the type to -1
(i.e. marking them as interior bytes). If the type indicated by the
shadow memory for the access address is neither an exact match nor 0, we
call the runtime.

The instrumentation pass inserts calls to the memset intrinsic to set
the memory updated by memset, memcpy, and memmove, as well as
allocas/byval (and for lifetime.start/end) to reset the shadow memory to
reflect that the type is now unknown. The runtime intercepts memset,
memcpy, etc. to perform the same function for the library calls.

The runtime essentially repeats these checks, but uses the full TBAA
algorithm, just as the compiler does, to determine when two types are
permitted to alias. In a situation where access overlap has occurred and
aliasing is not permitted, an error is generated.

As a note, this implementation does not use the compressed shadow-memory
scheme discussed previously
(http://lists.llvm.org/pipermail/llvm-dev/2017-April/111766.html). That
scheme would not handle the struct-path (i.e. structure offset)
information that our TBAA represents. I expect we'll want to further
work on compressing the shadow-memory representation, but I think it
makes sense to do that as follow-up work.

This includes build fixes for Linux from Mingjie Xu.

Depends on #76260 (Clang support), #76259 (LLVM support)


PR: #76261
@llvm-ci
Copy link
Collaborator

llvm-ci commented Dec 18, 2024

LLVM Buildbot has detected a new failure on builder lld-x86_64-win running on as-worker-93 while building llvm at step 7 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/146/builds/1862

Here is the relevant piece of the build log for the reference
Step 7 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM-Unit :: Support/./SupportTests.exe/38/87' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe-LLVM-Unit-24408-38-87.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=87 GTEST_SHARD_INDEX=38 C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe
--

Script:
--
C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe --gtest_filter=ProgramEnvTest.CreateProcessLongPath
--
C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(160): error: Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(163): error: fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied



C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:160
Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:163
fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied




********************


@aeubanks
Copy link
Contributor

aeubanks commented Jan 8, 2025

could we get a quick writeup of how TySan works in llvm/docs?

also, is there a reason to have both a module and a function pass instead of just a module pass?

@vitalybuka
Copy link
Collaborator

also, is there a reason to have both a module and a function pass instead of just a module pass?

This part is addressed here #120667

@carlosgalvezp
Copy link
Contributor

Great stuff to have! I think it would be really good to have documentation similar to the existing sanitizers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler-rt:sanitizer llvm:analysis Includes value tracking, cost tables and constant folding llvm:ir llvm:transforms TBAA Type-Based Alias Analysis / Strict Aliasing
Projects
None yet
Development

Successfully merging this pull request may close these issues.