Skip to content

[RemoveDIs][DebugInfo][IR] Add parsing for non-intrinsic debug values #79818

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Mar 7, 2024

Conversation

SLTozer
Copy link
Contributor

@SLTozer SLTozer commented Jan 29, 2024

This patch adds support for parsing the proposed non-instruction debug info ("RemoveDIs") from textual IR. This patch does not cover bitcode parsing or documentation, and printing and verifier support are added in prior PRs (#79281, #79810); this patch is presented as a standalone patch for the sake of review, but it is intended to land after the aforementioned patches and will not compile without being based on them[0]. This patch also contains a basic affirmative test ("can we parse the new format correctly"), but not the negative tests (testing for parser errors) - these will be added soon, but shouldn't hold up review of the implementation details of the patch.

An important detail of this patch is the fact that although we can now parse in the RemoveDIs (new) and Intrinsic (old) debug info formats, we will always convert back to the old format at the end of parsing - this is done for two reasons: firstly to ensure that every tool is able to process IR printed in the new format, regardless of whether that tool has had RemoveDIs support added, and secondly to maintain the effect of the existing flags: for the tools where support for the new format has been added, we will run LLVM passes in the new format iff --try-experimental-debuginfo-iterators=true, and we will print in the new format iff --write-experimental-debuginfo-iterators=true; the format of the textual IR should have no effect on either of these features.

For reference, the format of the RemoveDIs debug info is as follows:

Old: call void @llvm.dbg.value(metadata i32 %a, metadata !12, metadata !DIExpression()), !dbg !14
New:   #dbg_value { i32 %a, !12, !DIExpression(), !14 }

Old: call void @llvm.dbg.declare(metadata ptr %b, metadata !13, metadata !DIExpression()), !dbg !16
New:   #dbg_declare { ptr %b, !13, !DIExpression(), !16 }

Old: call void @llvm.dbg.assign(metadata i32 %add, metadata !13, metadata !DIExpression(), metadata !15, metadata ptr %b, metadata !DIExpression()), !dbg !18
New:   #dbg_assign { i32 %add, !13, !DIExpression(), !15, ptr %b, !DIExpression(), !18 }

For the sake of review, a few notes on the design and implementation of this patch (some of these will be added to the docs in a subsequent patch):

  • In order to support parsing this format, this patch adds # as a token to the lexer, and keywords for each debug record variant.
  • It is an error for a module to have a mixed format, so we cannot contain calls or declarations for the existing debug intrinsics in the same module as a debug record.
  • Although we print debug records with distinct indentation, it is not an error to parse a module with different indentation, just as with ordinary instructions.
  • Because the set of debug records are much more limited than the set of instructions and there is a large overlap between their arguments, the parser handles all types of debug records in a single function with hardcoded parameter expectations, rather than parsing a generic list of parameters and relying on the verifier to catch incorrect parameter lists. This will likely continue even when we add other types of records, e.g. llvm.dbg.label.

[0] If you want to test the parser patch locally, there is a branch with all the prerequisite patches applied at: https://github.com/SLTozer/llvm-project/tree/ddd-ir-parse

@llvmbot
Copy link
Member

llvmbot commented Jan 29, 2024

@llvm/pr-subscribers-debuginfo

@llvm/pr-subscribers-llvm-ir

Author: Stephen Tozer (SLTozer)

Changes

This patch adds support for parsing the proposed non-instruction debug info ("RemoveDIs") from textual IR. This patch does not cover bitcode parsing or documentation, and printing and verifier support are added in prior PRs (#79281, #79810); this patch is presented as a standalone patch for the sake of review, but it is intended to land after the aforementioned patches and will not compile without being based on them[0]. This patch also contains a basic affirmative test ("can we parse the new format correctly"), but not the negative tests (testing for parser errors) - these will be added soon, but shouldn't hold up review of the implementation details of the patch.

An important detail of this patch is the fact that although we can now parse in the RemoveDIs (new) and Intrinsic (old) debug info formats, we will always convert back to the old format at the end of parsing - this is done for two reasons: firstly to ensure that every tool is able to process IR printed in the new format, regardless of whether that tool has had RemoveDIs support added, and secondly to maintain the effect of the existing flags: for the tools where support for the new format has been added, we will run LLVM passes in the new format iff --try-experimental-debuginfo-iterators=true, and we will print in the new format iff --write-experimental-debuginfo-iterators=true; the format of the textual IR should have no effect on either of these features.

For reference, the format of the RemoveDIs debug info is as follows:

Old: call void @<!-- -->llvm.dbg.value(metadata i32 %a, metadata !12, metadata !DIExpression()), !dbg !14
New:   #dbg_value { i32 %a, !12, !DIExpression(), !14 }

Old: call void @<!-- -->llvm.dbg.declare(metadata ptr %b, metadata !13, metadata !DIExpression()), !dbg !16
New:   #dbg_declare { ptr %b, !13, !DIExpression(), !16 }

Old: call void @<!-- -->llvm.dbg.assign(metadata i32 %add, metadata !13, metadata !DIExpression(), metadata !15, metadata ptr %b, metadata !DIExpression()), !dbg !18
New:   #dbg_assign { i32 %add, !13, !DIExpression(), !15, ptr %b, !DIExpression(), !18 }

For the sake of review, a few notes on the design and implementation of this patch (some of these will be added to the docs in a subsequent patch):

  • In order to support parsing this format, this patch adds # as a token to the lexer, and keywords for each debug record variant.
  • It is an error for a module to have a mixed format, so we cannot contain calls or declarations for the existing debug intrinsics in the same module as a debug record.
  • Although we print debug records with distinct indentation, it is not an error to parse a module with different indentation, just as with ordinary instructions.
  • Because the set of debug records are much more limited than the set of instructions and there is a large overlap between their arguments, the parser handles all types of debug records in a single function with hardcoded parameter expectations, rather than parsing a generic list of parameters and relying on the verifier to catch incorrect parameter lists. This will likely continue even when we add other types of records, e.g. llvm.dbg.label.

[0] If you want to test the parser patch locally, there is a branch with all the prerequisite patches applied at: https://github.com/SLTozer/llvm-project/tree/ddd-ir-parse


Full diff: https://github.com/llvm/llvm-project/pull/79818.diff

7 Files Affected:

  • (modified) llvm/include/llvm/AsmParser/LLParser.h (+4)
  • (modified) llvm/include/llvm/AsmParser/LLToken.h (+3)
  • (modified) llvm/include/llvm/IR/DebugProgramInstruction.h (+24)
  • (modified) llvm/lib/AsmParser/LLLexer.cpp (+17-1)
  • (modified) llvm/lib/AsmParser/LLParser.cpp (+143-1)
  • (modified) llvm/lib/IR/DebugProgramInstruction.cpp (+17)
  • (added) llvm/test/DebugInfo/roundtrip-non-instruction-debug-info.ll (+87)
diff --git a/llvm/include/llvm/AsmParser/LLParser.h b/llvm/include/llvm/AsmParser/LLParser.h
index cf358c384f5203..6e3facbfcea573 100644
--- a/llvm/include/llvm/AsmParser/LLParser.h
+++ b/llvm/include/llvm/AsmParser/LLParser.h
@@ -177,6 +177,9 @@ namespace llvm {
     /// UpgradeDebuginfo so it can generate broken bitcode.
     bool UpgradeDebugInfo;
 
+    bool SeenNewDbgInfoFormat = false;
+    bool SeenOldDbgInfoFormat = false;
+
     std::string SourceFileName;
 
   public:
@@ -585,6 +588,7 @@ namespace llvm {
     bool parseMDNodeTail(MDNode *&N);
     bool parseMDNodeVector(SmallVectorImpl<Metadata *> &Elts);
     bool parseMetadataAttachment(unsigned &Kind, MDNode *&MD);
+    bool parseDebugProgramValue(DPValue *&DPV, PerFunctionState &PFS);
     bool parseInstructionMetadata(Instruction &Inst);
     bool parseGlobalObjectMetadataAttachment(GlobalObject &GO);
     bool parseOptionalFunctionMetadata(Function &F);
diff --git a/llvm/include/llvm/AsmParser/LLToken.h b/llvm/include/llvm/AsmParser/LLToken.h
index 147cf56c821aa1..4ce97041ca309e 100644
--- a/llvm/include/llvm/AsmParser/LLToken.h
+++ b/llvm/include/llvm/AsmParser/LLToken.h
@@ -36,6 +36,7 @@ enum Kind {
   exclaim, // !
   bar,     // |
   colon,   // :
+  hash,    // #
 
   kw_vscale,
   kw_x,
@@ -482,6 +483,8 @@ enum Kind {
   // Type valued tokens (TyVal).
   Type,
 
+  DbgRecordType,
+
   APFloat, // APFloatVal
   APSInt   // APSInt
 };
diff --git a/llvm/include/llvm/IR/DebugProgramInstruction.h b/llvm/include/llvm/IR/DebugProgramInstruction.h
index 737417fb9b9a54..284bfdd18db94f 100644
--- a/llvm/include/llvm/IR/DebugProgramInstruction.h
+++ b/llvm/include/llvm/IR/DebugProgramInstruction.h
@@ -129,6 +129,30 @@ class DPValue : public ilist_node<DPValue>, private DebugValueUser {
           DIAssignID *AssignID, Metadata *Address,
           DIExpression *AddressExpression, const DILocation *DI);
 
+private:
+  /// Private constructor for creating new instances during parsing only. Only
+  /// called through `createUnresolvedDPValue` below, which makes clear that
+  /// this is used for parsing only, and will later return a subclass depending
+  /// on which Type is passed.
+  DPValue(LocationType Type, Metadata *Val, MDNode *Variable,
+          DIExpression *Expression, MDNode *AssignID, Metadata *Address,
+          DIExpression *AddressExpression, MDNode *DI);
+
+public:
+  /// Used to create DPValues during parsing, where some metadata references may
+  /// still be unresolved. Although for some fields a generic `Metadata*`
+  /// argument is accepted for forward type-references, the verifier and
+  /// accessors will reject incorrect types later on. The function is used for
+  /// all types of DPValues for simplicity while parsing, but asserts if any
+  /// necessary fields are empty or unused fields are not empty, i.e. if the
+  /// #dbg_assign fields are used for a non-dbg-assign type.
+  static DPValue *createUnresolvedDPValue(LocationType Type, Metadata *Val,
+                                          MDNode *Variable,
+                                          DIExpression *Expression,
+                                          MDNode *AssignID, Metadata *Address,
+                                          DIExpression *AddressExpression,
+                                          MDNode *DI);
+
   static DPValue *createDPVAssign(Value *Val, DILocalVariable *Variable,
                                   DIExpression *Expression,
                                   DIAssignID *AssignID, Value *Address,
diff --git a/llvm/lib/AsmParser/LLLexer.cpp b/llvm/lib/AsmParser/LLLexer.cpp
index c8da3efbb68aff..4daf8a484da63e 100644
--- a/llvm/lib/AsmParser/LLLexer.cpp
+++ b/llvm/lib/AsmParser/LLLexer.cpp
@@ -440,7 +440,9 @@ lltok::Kind LLLexer::LexCaret() {
 ///    AttrGrpID ::= #[0-9]+
 lltok::Kind LLLexer::LexHash() {
   // Handle AttrGrpID: #[0-9]+
-  return LexUIntID(lltok::AttrGrpID);
+  if (isdigit(static_cast<unsigned char>(CurPtr[0])))
+    return LexUIntID(lltok::AttrGrpID);
+  return lltok::hash;
 }
 
 /// Lex a label, integer type, keyword, or hexadecimal integer constant.
@@ -922,6 +924,20 @@ lltok::Kind LLLexer::LexIdentifier() {
 
 #undef DWKEYWORD
 
+// Keywords for debug record types.
+#define DBGRECORDTYPEKEYWORD(STR)                                              \
+  do {                                                                         \
+    if (Keyword == "dbg_" #STR) {                                              \
+      StrVal = #STR;                                                           \
+      return lltok::DbgRecordType;                                             \
+    }                                                                          \
+  } while (false)
+
+  DBGRECORDTYPEKEYWORD(value);
+  DBGRECORDTYPEKEYWORD(declare);
+  DBGRECORDTYPEKEYWORD(assign);
+#undef DBGRECORDTYPEKEYWORD
+
   if (Keyword.starts_with("DIFlag")) {
     StrVal.assign(Keyword.begin(), Keyword.end());
     return lltok::DIFlag;
diff --git a/llvm/lib/AsmParser/LLParser.cpp b/llvm/lib/AsmParser/LLParser.cpp
index d6c5993797de11..b22d21f94a26ab 100644
--- a/llvm/lib/AsmParser/LLParser.cpp
+++ b/llvm/lib/AsmParser/LLParser.cpp
@@ -62,6 +62,8 @@ static cl::opt<bool> AllowIncompleteIR(
         "Allow incomplete IR on a best effort basis (references to unknown "
         "metadata will be dropped)"));
 
+extern llvm::cl::opt<bool> UseNewDbgInfoFormat;
+
 static std::string getTypeString(Type *T) {
   std::string Result;
   raw_string_ostream Tmp(Result);
@@ -6012,6 +6014,17 @@ bool LLParser::parseTypeAndBasicBlock(BasicBlock *&BB, LocTy &Loc,
   return false;
 }
 
+bool isOldDbgFormatIntrinsic(StringRef Name) {
+  // Exit early for the common (non-debug-intrinsic) case.
+  // We can make this the only check when we begin supporting all "llvm.dbg"
+  // intrinsics in the new debug info format.
+  if (!Name.starts_with("llvm.dbg."))
+    return false;
+  Intrinsic::ID FnID = Function::lookupIntrinsicID(Name);
+  return FnID == Intrinsic::dbg_declare || FnID == Intrinsic::dbg_value ||
+         FnID == Intrinsic::dbg_assign;
+}
+
 /// FunctionHeader
 ///   ::= OptionalLinkage OptionalPreemptionSpecifier OptionalVisibility
 ///       OptionalCallingConv OptRetAttrs OptUnnamedAddr Type GlobalName
@@ -6194,6 +6207,13 @@ bool LLParser::parseFunctionHeader(Function *&Fn, bool IsDefine,
     }
   }
 
+  if (isOldDbgFormatIntrinsic(FunctionName)) {
+    if (SeenNewDbgInfoFormat)
+      return error(NameLoc, "llvm.dbg intrinsic should not appear in a module "
+                            "using non-intrinsic debug info");
+    SeenOldDbgInfoFormat = true;
+  }
+
   Fn = Function::Create(FT, GlobalValue::ExternalLinkage, AddrSpace,
                         FunctionName, M);
 
@@ -6359,9 +6379,29 @@ bool LLParser::parseBasicBlock(PerFunctionState &PFS) {
 
   std::string NameStr;
 
-  // parse the instructions in this block until we get a terminator.
+  // parse the instructions and debug values in this block until we get a
+  // terminator.
   Instruction *Inst;
+  DPValue *DPV;
+  SmallVector<std::unique_ptr<DPValue>> TrailingDPValues;
   do {
+    // Handle debug records first - there should always be an instruction
+    // following the debug records, i.e. they cannot appear after the block
+    // terminator.
+    while (Lex.getKind() == lltok::hash) {
+      if (SeenOldDbgInfoFormat)
+        return error(Lex.getLoc(), "debug record should not appear in a module "
+                                   "containing debug info intrinsics");
+      SeenNewDbgInfoFormat = true;
+      Lex.Lex();
+      if (!BB->getModule()->IsNewDbgInfoFormat)
+        BB->getModule()->convertToNewDbgValues();
+
+      if (parseDebugProgramValue(DPV, PFS))
+        return true;
+      TrailingDPValues.emplace_back(DPV);
+    }
+
     // This instruction may have three possibilities for a name: a) none
     // specified, b) name specified "%foo =", c) number specified: "%4 =".
     LocTy NameLoc = Lex.getLoc();
@@ -6406,11 +6446,103 @@ bool LLParser::parseBasicBlock(PerFunctionState &PFS) {
     // Set the name on the instruction.
     if (PFS.setInstName(NameID, NameStr, NameLoc, Inst))
       return true;
+
+    // Attach any preceding debug values to this instruction.
+    for (std::unique_ptr<DPValue> &DPV : TrailingDPValues) {
+      BB->insertDPValueBefore(DPV.release(), Inst->getIterator());
+    }
+    TrailingDPValues.clear();
   } while (!Inst->isTerminator());
 
+  assert(TrailingDPValues.empty() &&
+         "All debug values should have been attached to an instruction.");
+
   return false;
 }
 
+/// parseDebugProgramValue
+///   ::= #dbg_Type { (ValueAsMetadata|DIArgList|MDNode), MetadataID,
+///   DIExpression, DILocation }
+bool LLParser::parseDebugProgramValue(DPValue *&DPV, PerFunctionState &PFS) {
+  using LocType = DPValue::LocationType;
+  LocTy DPVLoc = Lex.getLoc();
+  if (Lex.getKind() != lltok::DbgRecordType) {
+    return error(DPVLoc, "expected debug record type here");
+  }
+  auto Type = StringSwitch<LocType>(Lex.getStrVal())
+                  .Case("declare", LocType::Declare)
+                  .Case("value", LocType::Value)
+                  .Case("assign", LocType::Assign)
+                  .Default(LocType::End);
+  if (Type == LocType::End)
+    return error(DPVLoc, "expected valid #dbg record here");
+  Lex.Lex();
+  if (parseToken(lltok::lbrace, "Expected '{' here"))
+    return true;
+
+  // Parse Value field...
+  Metadata *ValLocMD;
+  if (parseMetadata(ValLocMD, &PFS))
+    return true;
+  if (parseToken(lltok::comma, "Expected ',' here"))
+    return true;
+
+  // Parse Variable field...
+  MDNode *Variable;
+  if (parseMDNode(Variable))
+    return true;
+  if (parseToken(lltok::comma, "Expected ',' here"))
+    return true;
+
+  // Parse Expression field...
+  LocTy ExprLoc = Lex.getLoc();
+  Metadata *Expression;
+  if (parseMetadata(Expression, &PFS))
+    return true;
+  if (!isa<DIExpression>(Expression))
+    return error(ExprLoc, "expected valid DIExpression here");
+  if (parseToken(lltok::comma, "Expected ',' here"))
+    return true;
+
+  // Parse additional fields for #dbg_assign.
+  MDNode *AssignID = nullptr;
+  Metadata *AddressLocation = nullptr;
+  Metadata *AddressExpression = nullptr;
+  if (Type == LocType::Assign) {
+    // Parse DIAssignID...
+    if (parseMDNode(AssignID))
+      return true;
+    if (parseToken(lltok::comma, "Expected ',' here"))
+      return true;
+
+    // Parse address ValueAsMetadata...
+    if (parseMetadata(AddressLocation, &PFS))
+      return true;
+    if (parseToken(lltok::comma, "Expected ',' here"))
+      return true;
+
+    // Parse address DIExpression...
+    LocTy AddressExprLoc = Lex.getLoc();
+    if (parseMetadata(AddressExpression, &PFS))
+      return true;
+    if (!isa<DIExpression>(Expression))
+      return error(AddressExprLoc, "expected valid DIExpression here");
+    if (parseToken(lltok::comma, "Expected ',' here"))
+      return true;
+  }
+
+  /// Parse DILocation...
+  MDNode *DebugLoc;
+  if (parseMDNode(DebugLoc))
+    return true;
+
+  if (parseToken(lltok::rbrace, "Expected '}' here"))
+    return true;
+  DPV = DPValue::createUnresolvedDPValue(
+      Type, ValLocMD, Variable, cast<DIExpression>(Expression), AssignID,
+      AddressLocation, cast_or_null<DIExpression>(AddressExpression), DebugLoc);
+  return false;
+}
 //===----------------------------------------------------------------------===//
 // Instruction Parsing.
 //===----------------------------------------------------------------------===//
@@ -7638,6 +7770,16 @@ bool LLParser::parseCall(Instruction *&Inst, PerFunctionState &PFS,
     }
     CI->setFastMathFlags(FMF);
   }
+
+  if (CalleeID.Kind == ValID::t_GlobalName &&
+      isOldDbgFormatIntrinsic(CalleeID.StrVal)) {
+    if (SeenNewDbgInfoFormat) {
+      CI->deleteValue();
+      return error(CallLoc, "llvm.dbg intrinsic should not appear in a module "
+                            "using non-intrinsic debug info");
+    }
+    SeenOldDbgInfoFormat = true;
+  }
   CI->setAttributes(PAL);
   ForwardRefAttrGroups[CI] = FwdRefAttrGrps;
   Inst = CI;
diff --git a/llvm/lib/IR/DebugProgramInstruction.cpp b/llvm/lib/IR/DebugProgramInstruction.cpp
index fd234685d5fd4b..560ace9a84d70c 100644
--- a/llvm/lib/IR/DebugProgramInstruction.cpp
+++ b/llvm/lib/IR/DebugProgramInstruction.cpp
@@ -59,6 +59,23 @@ DPValue::DPValue(Metadata *Value, DILocalVariable *Variable,
 
 void DPValue::deleteInstr() { delete this; }
 
+DPValue::DPValue(DPValue::LocationType Type, Metadata *Val, MDNode *Variable,
+                 DIExpression *Expression, MDNode *AssignID, Metadata *Address,
+                 DIExpression *AddressExpression, MDNode *DI)
+    : DebugValueUser({Val, Address, AssignID}), Variable(Variable),
+      Expression(Expression), DbgLoc(DI), AddressExpression(AddressExpression),
+      Type(Type) {}
+
+DPValue *DPValue::createUnresolvedDPValue(DPValue::LocationType Type,
+                                          Metadata *Val, MDNode *Variable,
+                                          DIExpression *Expression,
+                                          MDNode *AssignID, Metadata *Address,
+                                          DIExpression *AddressExpression,
+                                          MDNode *DI) {
+  return new DPValue(Type, Val, Variable, Expression, AssignID, Address,
+                     AddressExpression, DI);
+}
+
 DPValue *DPValue::createDPValue(Value *Location, DILocalVariable *DV,
                                 DIExpression *Expr, const DILocation *DI) {
   return new DPValue(ValueAsMetadata::get(Location), DV, Expr, DI,
diff --git a/llvm/test/DebugInfo/roundtrip-non-instruction-debug-info.ll b/llvm/test/DebugInfo/roundtrip-non-instruction-debug-info.ll
new file mode 100644
index 00000000000000..f8c0024977bc22
--- /dev/null
+++ b/llvm/test/DebugInfo/roundtrip-non-instruction-debug-info.ll
@@ -0,0 +1,87 @@
+;; Test that we can write in the old debug info format.
+; RUN: opt --passes=verify -S --write-experimental-debuginfo=false < %s \
+; RUN:   | FileCheck %s --check-prefixes=CHECK,OLDDBG --implicit-check-not=llvm.dbg
+
+;; Test that we can write in the new debug info format...
+; RUN: opt --passes=verify -S --write-experimental-debuginfo=true < %s \
+; RUN:   | FileCheck %s --check-prefixes=CHECK,NEWDBG --implicit-check-not=llvm.dbg
+
+;; ...and then read the new format and write the old format.
+; RUN: opt --passes=verify -S --write-experimental-debuginfo=true < %s \
+; RUN:   | opt --passes=verify -S --write-experimental-debuginfo=false \
+; RUN:   | FileCheck %s --check-prefixes=CHECK,OLDDBG  --implicit-check-not=llvm.dbg
+
+;; Test also that the new flag is independent of the flag that enables use of
+;; these non-instruction debug info during LLVM passes.
+; RUN: opt --passes=verify -S --try-experimental-debuginfo-iterators --write-experimental-debuginfo=false < %s \
+; RUN:   | FileCheck %s --check-prefixes=CHECK,OLDDBG --implicit-check-not=llvm.dbg
+; RUN: opt --passes=verify -S --try-experimental-debuginfo-iterators --write-experimental-debuginfo=true < %s \
+; RUN:   | FileCheck %s --check-prefixes=CHECK,NEWDBG --implicit-check-not=llvm.dbg
+
+; CHECK: @f(i32 %[[VAL_A:[0-9a-zA-Z]+]])
+; CHECK-NEXT: entry:
+; OLDDBG-NEXT: call void @llvm.dbg.value(metadata i32 %[[VAL_A]], metadata ![[VAR_A:[0-9]+]], metadata !DIExpression()), !dbg ![[LOC_1:[0-9]+]]
+; NEWDBG-NEXT: {{^}}    #dbg_value { i32 %[[VAL_A]], ![[VAR_A:[0-9]+]], !DIExpression(), ![[LOC_1:[0-9]+]] }
+; CHECK-NEXT: {{^}}  %[[VAL_B:[0-9a-zA-Z]+]] = alloca
+; OLDDBG-NEXT: call void @llvm.dbg.declare(metadata ptr %[[VAL_B]], metadata ![[VAR_B:[0-9]+]], metadata !DIExpression()), !dbg ![[LOC_2:[0-9]+]]
+; NEWDBG-NEXT: {{^}}    #dbg_declare { ptr %[[VAL_B]], ![[VAR_B:[0-9]+]], !DIExpression(), ![[LOC_2:[0-9]+]] }
+; CHECK-NEXT: {{^}}  %[[VAL_ADD:[0-9a-zA-Z]+]] = add i32 %[[VAL_A]], 5
+; OLDDBG-NEXT: call void @llvm.dbg.value(metadata !DIArgList(i32 %[[VAL_A]], i32 %[[VAL_ADD]]), metadata ![[VAR_A]], metadata !DIExpression(DW_OP_LLVM_arg, 0, DW_OP_LLVM_arg, 1, DW_OP_plus)), !dbg ![[LOC_3:[0-9]+]]
+; NEWDBG-NEXT: {{^}}    #dbg_value { !DIArgList(i32 %[[VAL_A]], i32 %[[VAL_ADD]]), ![[VAR_A]], !DIExpression(DW_OP_LLVM_arg, 0, DW_OP_LLVM_arg, 1, DW_OP_plus), ![[LOC_3:[0-9]+]] }
+; CHECK-NEXT: {{^}}  store i32 %[[VAL_ADD]]{{.+}}, !DIAssignID ![[ASSIGNID:[0-9]+]]
+; OLDDBG-NEXT: call void @llvm.dbg.assign(metadata i32 %[[VAL_ADD]], metadata ![[VAR_B]], metadata !DIExpression(), metadata ![[ASSIGNID]], metadata ptr %[[VAL_B]], metadata !DIExpression()), !dbg ![[LOC_4:[0-9]+]]
+; NEWDBG-NEXT: {{^}}    #dbg_assign { i32 %[[VAL_ADD]], ![[VAR_B]], !DIExpression(), ![[ASSIGNID]], ptr %[[VAL_B]], !DIExpression(), ![[LOC_4:[0-9]+]] }
+; CHECK-NEXT: {{^}}  ret i32
+
+; OLDDBG-DAG: declare void @llvm.dbg.value
+; OLDDBG-DAG: declare void @llvm.dbg.declare
+; OLDDBG-DAG: declare void @llvm.dbg.assign
+
+; CHECK-DAG: llvm.dbg.cu
+; CHECK-DAG: ![[VAR_A]] = !DILocalVariable(name: "a"
+; CHECK-DAG: ![[VAR_B]] = !DILocalVariable(name: "b"
+; CHECK-DAG: ![[LOC_1]] = !DILocation(line: 3, column: 15
+; CHECK-DAG: ![[LOC_2]] = !DILocation(line: 3, column: 20
+; CHECK-DAG: ![[LOC_3]] = !DILocation(line: 3, column: 25
+; CHECK-DAG: ![[LOC_4]] = !DILocation(line: 3, column: 30
+
+define dso_local i32 @f(i32 %a) !dbg !7 {
+entry:
+  call void @llvm.dbg.value(metadata i32 %a, metadata !20, metadata !DIExpression()), !dbg !30
+  %b = alloca i32, !dbg !30, !DIAssignID !40
+  call void @llvm.dbg.declare(metadata ptr %b, metadata !21, metadata !DIExpression()), !dbg !31
+  %add = add i32 %a, 5, !dbg !31
+  call void @llvm.dbg.value(metadata !DIArgList(i32 %a, i32 %add), metadata !20, metadata !DIExpression(DW_OP_LLVM_arg, 0, DW_OP_LLVM_arg, 1, DW_OP_plus)), !dbg !32
+  store i32 %add, ptr %b, !dbg !32, !DIAssignID !40
+  call void @llvm.dbg.assign(metadata i32 %add, metadata !21, metadata !DIExpression(), metadata !40, metadata ptr %b, metadata !DIExpression()), !dbg !33
+  ret i32 %add, !dbg !33
+
+}
+
+declare void @llvm.dbg.value(metadata, metadata, metadata)
+declare void @llvm.dbg.declare(metadata, metadata, metadata)
+declare void @llvm.dbg.assign(metadata, metadata, metadata, metadata, metadata, metadata)
+
+!llvm.dbg.cu = !{!0}
+!llvm.module.flags = !{!3, !4, !5}
+!llvm.ident = !{!6}
+
+!0 = distinct !DICompileUnit(language: DW_LANG_C99, file: !1, producer: "clang version 18.0.0", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug, enums: !2, nameTableKind: None)
+!1 = !DIFile(filename: "print.c", directory: "/tmp")
+!2 = !{}
+!3 = !{i32 2, !"Dwarf Version", i32 5}
+!4 = !{i32 2, !"Debug Info Version", i32 3}
+!5 = !{i32 1, !"wchar_size", i32 4}
+!6 = !{!"clang version 18.0.0"}
+!7 = distinct !DISubprogram(name: "f", scope: !1, file: !1, line: 3, type: !8, isLocal: false, isDefinition: true, scopeLine: 3, flags: DIFlagPrototyped, isOptimized: true, unit: !0, retainedNodes: !13)
+!8 = !DISubroutineType(types: !9)
+!9 = !{!12, !12}
+!12 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
+!13 = !{!20, !21}
+!20 = !DILocalVariable(name: "a", arg: 1, scope: !7, file: !1, line: 3, type: !12)
+!21 = !DILocalVariable(name: "b", scope: !7, file: !1, line: 3, type: !12)
+!30 = !DILocation(line: 3, column: 15, scope: !7)
+!31 = !DILocation(line: 3, column: 20, scope: !7)
+!32 = !DILocation(line: 3, column: 25, scope: !7)
+!33 = !DILocation(line: 3, column: 30, scope: !7)
+!40 = distinct !DIAssignID()
\ No newline at end of file

Copy link
Contributor

@OCHyams OCHyams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code and test seem reasonable (with a few small nits scattered inline). Probably worth adding some negative (bad-input) tests, and another pair of eyes on this wouldn't hurt.

@@ -6359,9 +6379,29 @@ bool LLParser::parseBasicBlock(PerFunctionState &PFS) {

std::string NameStr;

// parse the instructions in this block until we get a terminator.
// parse the instructions and debug values in this block until we get a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since you're touching this comment anyway can you capitalise parse?

@@ -6406,11 +6446,103 @@ bool LLParser::parseBasicBlock(PerFunctionState &PFS) {
// Set the name on the instruction.
if (PFS.setInstName(NameID, NameStr, NameLoc, Inst))
return true;

// Attach any preceding debug values to this instruction.
for (std::unique_ptr<DPValue> &DPV : TrailingDPValues) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unneeded braces

Comment on lines 6464 to 6465
/// ::= #dbg_Type { (ValueAsMetadata|DIArgList|MDNode), MetadataID,
/// DIExpression, DILocation }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment needs updating for dbg.assigns

if (parseToken(lltok::lbrace, "Expected '{' here"))
return true;

// Parse Value field...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: full stop rather than ellipsis please, unless this is a common style in this file

LocTy DPVLoc = Lex.getLoc();
if (Lex.getKind() != lltok::DbgRecordType) {
return error(DPVLoc, "expected debug record type here");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: braces

Instruction *Inst;
DPValue *DPV;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you sink the DPV variable down to the use in the loop?

@adrian-prantl
Copy link
Collaborator

Let me know if this has been discussed elsewhere, but why is necessary to add a new # token to the grammar?

@adrian-prantl
Copy link
Collaborator

Whoops, sorry! No Idea why the "Close with comment button" is larger than the "Comment" button!

@SLTozer
Copy link
Contributor Author

SLTozer commented Jan 30, 2024

Let me know if this has been discussed elsewhere, but why is necessary to add a new # token to the grammar?

The purpose of the leading # is to create a syntactic distinction from instructions. Without using a leading symbol to specify that a non-instruction line is beginning, the only discriminating feature would be whitespace - which is not clear enough by itself, and can't be used by the parser.
The reason why # specifically is used here is simply that it's free real estate - most other symbols that LLVM uses followed by letters are already classified as specific tokens (i.e. !foo, @foo, $foo, etc), and we'd ideally avoid adding new symbols (such as ~ or &) or overloading existing symbols if alternatives are available.

@jmorse
Copy link
Member

jmorse commented Feb 5, 2024

On top of what Stephen said, using a new token to make debug-info separate encourages everyone to consider debug-info as a first class element of LLVM. The new representation isn't either Metadata or Value based, it's the glue between the two.

@SLTozer
Copy link
Contributor Author

SLTozer commented Feb 16, 2024

I've had to put aside testing for this patch for a short while, but I'll soon be uploading about a dozen small parsing tests for the various parsing errors that can occur in this patch; meanwhile, are there are any more questions or requested changes for the functional component for this patch, or is it good to land once the tests are available and approved?

@SLTozer
Copy link
Contributor Author

SLTozer commented Feb 19, 2024

Updates added: besides adding some more tests for the negative cases as promised, and some slight fixes in the code itself, I've also updated this patch to use parentheses for debug records following the discussion on the printing patch.

// here.
bool finalizeDebugInfoFormat(Module *M) {
if (M)
M->setIsNewDbgInfoFormat(false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't it use UseNewDbgInfoFormat value (and why isn't it used, if it's declared)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just due to an awkward combination of factors at the moment: not all LLVM features internally support the new debug info format yet, so we use UseNewDbgInfoFormat immediately before running optimization passes (the only part that fully supports the new format right now) and convert back afterwards. Right now UseNewDbgInfoFormat defaults to true, so if we use it here to conditionally use the new format then we'll break the tools that use the parser but don't support the new format. We might be able to change that so that UseNewDbgInfoFormat is only true when we use a tool that fully supports it, but that probably involves a separate patch.

In short, the declaration is the mistake here - I'll remove it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok!

Copy link
Member

@jmorse jmorse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good, I've got some moderate nits inline.

I've got a slight twitching feeling about the addition of the debug-records as keywords (well, Identifiers) as it seems like quite a large commitment. However: we can always change the manner in which it's parsed, I think the format is good.

I'm not completely certain about making the mixing of debug-info formats a parsing error: wouldn't it be more natural to defer it until the Verifier? I get that that's impossible right now due to the conversions, but we could check for the existence of the three intrinsics by querying the Module in finalizeDebugInfoFormat and make that an error. That'd be less invasive than filtering the names of all call instruction targets. (Downside: no localised errors, but this doesn't strike me as being something hard to pin down).

Having an error on even the declaration of intrinsics, not just the use of them, seems like quite a firm requirement. I'm not sure if that'll bite us, but we can downgrade the check if that proves to be a problem.

Some sadly menial nits: IMO the test names shouldn't contain "removedi" as that's referring to a transitional project name, in several years time no-one will recognise that name, IMO they should be named something like "dbg-record-invalid-$num" or something? IMHO it's also worth generating a positive-input that successfully parses so that we can point people at it and indicate the range of different things that can be parsed.

(It feels like this is all pretty close!)

@@ -482,6 +483,8 @@ enum Kind {
// Type valued tokens (TyVal).
Type,

DbgRecordType,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a few words here indicating these are (presumably) entire-line records rather than something that can appear inline, or otherwise indicate it's significance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think actually these need to be moved to the "String valued tokens" above, since they follow the same model (a DbgRecordType token is just the text "dbg_foo").

@@ -440,7 +440,9 @@ lltok::Kind LLLexer::LexCaret() {
/// AttrGrpID ::= #[0-9]+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be updating these BNF descriptions too? (probably just a case of adding '#'?

}

/// parseDebugProgramValue
/// ::= #dbg_Type '{' (ValueAsMetadata|DIArgList|MDNode) ',' MetadataID ','
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're now using parens rather than brackets.

For consistency, we should avoid using DIArgList etc here as that delves into the class hierarchy. The parsing is described in terms of BNF types, so I think we just say it's a Metadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not entirely described in terms of actual token types - we refer to "Instructions" in other descriptions, for example, which aren't tokens but parseable objects, i.e. anything that we can call parseFoo for - but we can follow that description more closely than I've done here, since we do for example call parseMetadata for the "value" arguments.

@@ -0,0 +1,87 @@
;; Test that we can write in the old debug info format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implicit check-not's for "#dbg" too?

@SLTozer
Copy link
Contributor Author

SLTozer commented Feb 22, 2024

I've got a slight twitching feeling about the addition of the debug-records as keywords (well, Identifiers) as it seems like quite a large commitment. However: we can always change the manner in which it's parsed, I think the format is good.

I get similar twitches, but at the same time it makes sense - the model here is the same, imo, as how we use a bunch of keywords - the names dbg_foo shouldn't clash with anything that currently exists, and they represent a small group of strings that, while it might be added to over compiler releases, is fixed for any given version of the compiler.

I'm not completely certain about making the mixing of debug-info formats a parsing error: wouldn't it be more natural to defer it until the Verifier?

There's been a lot of uncertainty for me over where to spread the checks between the parser and the verifier; one of the advantages of placing the checks in the parser is that we can give the user the actual location of a violating instance, while in the verifier not only can we not give a line+col position, but the original intrinsic may have been morphed by an upgrade (old IR files may contain dbg.values with extra arguments, for example), making it non-trivial to figure out where the offending line is, or what the problem is at all if you're unfamiliar with the new/old formats. On the other hand, it is aggressive to put these checks in the parser, so I'm open to changing it.

@jmorse
Copy link
Member

jmorse commented Feb 26, 2024

For the mixing-old-and-new issue, AFAIUI the population of people hand-writing debug-info IR tests is small, and we usually copy existing tests to write them. It's also not a syntactic problem to report as parsing will succeed, but the Module will be ill formed, it's more akin to a Verifier error but it makes the most sense to detect it in the parser (to avoid answering questions about what mixture of old/new debug-info should be permitted).

IMO: it'd be better to print out an informative error at the end of parsing, something like "#dbg_* records were encountered during parsing, but llvm.dbg.* intrinsics were seen in functions [list first ten users of the relevant declaration?]. Cannot load a module with both flavours of debug-info". That avoids the invasive code changes but gives a good diagnosis of what's gone wrong.

@SLTozer
Copy link
Contributor Author

SLTozer commented Feb 26, 2024

It's also not a syntactic problem to report as parsing will succeed, but the Module will be ill formed, it's more akin to a Verifier error but it makes the most sense to detect it in the parser

SGTM, (this was the first way I implemented it anyway!)

@SLTozer SLTozer force-pushed the ddd-ir-parse-standalone branch from 4b2e84d to b56190d Compare March 4, 2024 15:37
Copy link

github-actions bot commented Mar 4, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Copy link
Contributor

@OCHyams OCHyams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest code + test changes LGTM

@jmorse might want to check he's happy with the overall approach now though

@@ -0,0 +1,18 @@
; RUN: llvm-as -disable-output <%s 2>&1| FileCheck %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't need <? and while you're there can you add a space between 1 and |?

EDIT: this is in a few of the other tests too. Were these copied and modified from existing tests? Feel free to ignore this nit

!DILocalVariable(scope: !1),
!DIExpression(),
!{})
; CHECK-LABEL: invalid #dbg record location
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's invalid about the "location" and which of the 3 locations (position, src-loc, var-loc) is this referring to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The location overload strikes again... I'll update the verifier check to state more clearly that it's a DILocation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM thanks

@jmorse
Copy link
Member

jmorse commented Mar 7, 2024

@jmorse might want to check he's happy with the overall approach now though

I'm fine with it, although we should be ready to adjust it if there's turbulence.

@SLTozer SLTozer merged commit 464d9d9 into llvm:main Mar 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants