Skip to content

[MC][NFC] Allow MCInstrAnalysis to store state #65479

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 20, 2023

Conversation

mtvec
Copy link
Contributor

@mtvec mtvec commented Sep 6, 2023

Currently, all the analysis functions provided by MCInstrAnalysis work on a single instruction. On some targets, this limits the kind of instructions that can be successfully analyzed as common constructs may need multiple instructions.

For example, a typical call sequence on RISC-V uses a auipc+jalr pair. In order to analyse the jalr inside evaluateBranch, information about the corresponding auipc is needed. Similarly, AArch64 uses adrp+ldr pairs to access globals.

This patch proposes to add state to MCInstrAnalysis to support these use cases. Two new virtual methods are added:

  • updateState: takes an instruction and its address. This methods should be called by clients on every instruction and allows targets to store whatever information they need to analyse future instructions.
  • resetState: clears the state whenever it becomes irrelevant. Clients could call this, for example, when starting to disassemble a new function.

Note that the default implementations do nothing so this patch is NFC. No actual state is stored inside MCInstrAnalysis; deciding the structure of the state is left to the targets.

This patch also modifies llvm-objdump to use the new interface.

This patch is an alternative to D116677 and the idea of storing state in MCInstrAnalysis was first discussed there.

@mtvec mtvec requested review from asb, MaskRay and jh7370 September 6, 2023 13:41
@github-actions github-actions bot added the mc Machine (object) code label Sep 6, 2023
Currently, all the analysis functions provided by `MCInstrAnalysis` work
on a single instruction. On some targets, this limits the kind of
instructions that can be successfully analyzed as common constructs may
need multiple instructions.

For example, a typical call sequence on RISC-V uses a auipc+jalr pair.
In order to analyse the jalr inside `evaluateBranch`, information about
the corresponding auipc is needed. Similarly, AArch64 uses adrp+ldr
pairs to access globals.

This patch proposes to add state to `MCInstrAnalysis` to support these
use cases. Two new virtual methods are added:
- `updateState`: takes an instruction and its address. This methods
  should be called by clients on every instruction and allows targets to
  store whatever information they need to analyse future instructions.
- `resetState`: clears the state whenever it becomes irrelevant. Clients
  could call this, for example, when starting to disassemble a new
  function.

Note that the default implementations do nothing so this patch is NFC.
No actual state is stored inside `MCInstrAnalysis`; deciding the
structure of the state is left to the targets.

This patch also modifies llvm-objdump to use the new interface.

This patch is an alternative to D116677 and the idea of storing state in
`MCInstrAnalysis` was first discussed there.
@mtvec mtvec force-pushed the mcinstranalysis-state branch from 524fed0 to 7228670 Compare September 6, 2023 15:11
@mtvec
Copy link
Contributor Author

mtvec commented Sep 20, 2023

Ping.

@llvmbot
Copy link
Member

llvmbot commented Oct 2, 2023

@llvm/pr-subscribers-llvm-binary-utilities

Changes

Currently, all the analysis functions provided by MCInstrAnalysis work on a single instruction. On some targets, this limits the kind of instructions that can be successfully analyzed as common constructs may need multiple instructions.

For example, a typical call sequence on RISC-V uses a auipc+jalr pair. In order to analyse the jalr inside evaluateBranch, information about the corresponding auipc is needed. Similarly, AArch64 uses adrp+ldr pairs to access globals.

This patch proposes to add state to MCInstrAnalysis to support these use cases. Two new virtual methods are added:

  • updateState: takes an instruction and its address. This methods should be called by clients on every instruction and allows targets to store whatever information they need to analyse future instructions.
  • resetState: clears the state whenever it becomes irrelevant. Clients could call this, for example, when starting to disassemble a new function.

Note that the default implementations do nothing so this patch is NFC. No actual state is stored inside MCInstrAnalysis; deciding the structure of the state is left to the targets.

This patch also modifies llvm-objdump to use the new interface.

This patch is an alternative to D116677 and the idea of storing state in MCInstrAnalysis was first discussed there.


Full diff: https://github.com/llvm/llvm-project/pull/65479.diff

2 Files Affected:

  • (modified) llvm/include/llvm/MC/MCInstrAnalysis.h (+15)
  • (modified) llvm/tools/llvm-objdump/llvm-objdump.cpp (+16-5)
diff --git a/llvm/include/llvm/MC/MCInstrAnalysis.h b/llvm/include/llvm/MC/MCInstrAnalysis.h
index c3c675c39c5590c..e3ddf0b8b8939c9 100644
--- a/llvm/include/llvm/MC/MCInstrAnalysis.h
+++ b/llvm/include/llvm/MC/MCInstrAnalysis.h
@@ -37,6 +37,21 @@ class MCInstrAnalysis {
   MCInstrAnalysis(const MCInstrInfo *Info) : Info(Info) {}
   virtual ~MCInstrAnalysis() = default;
 
+  /// Clear the internal state. See updateState for more information.
+  virtual void resetState() {}
+
+  /// Update internal state with \p Inst at \p Addr.
+  ///
+  /// For some types of analyses, inspecting a single instruction is not
+  /// sufficient. Some examples are auipc/jalr pairs on RISC-V or adrp/ldr pairs
+  /// on AArch64. To support inspecting multiple instructions, targets may keep
+  /// track of an internal state while analysing instructions. Clients should
+  /// call updateState for every instruction which allows later calls to one of
+  /// the analysis functions to take previous instructions into account.
+  /// Whenever state becomes irrelevant (e.g., when starting to disassemble a
+  /// new function), clients should call resetState to clear it.
+  virtual void updateState(const MCInst &Inst, uint64_t Addr) {}
+
   virtual bool isBranch(const MCInst &Inst) const {
     return Info->get(Inst.getOpcode()).isBranch();
   }
diff --git a/llvm/tools/llvm-objdump/llvm-objdump.cpp b/llvm/tools/llvm-objdump/llvm-objdump.cpp
index 96d74d6e2d5e865..8f6479d3c6e31e4 100644
--- a/llvm/tools/llvm-objdump/llvm-objdump.cpp
+++ b/llvm/tools/llvm-objdump/llvm-objdump.cpp
@@ -842,7 +842,7 @@ class DisassemblerTarget {
   std::unique_ptr<const MCSubtargetInfo> SubtargetInfo;
   std::shared_ptr<MCContext> Context;
   std::unique_ptr<MCDisassembler> DisAsm;
-  std::shared_ptr<const MCInstrAnalysis> InstrAnalysis;
+  std::shared_ptr<MCInstrAnalysis> InstrAnalysis;
   std::shared_ptr<MCInstPrinter> InstPrinter;
   PrettyPrinter *Printer;
 
@@ -1265,14 +1265,19 @@ collectBBAddrMapLabels(const std::unordered_map<uint64_t, BBAddrMap> &AddrToBBAd
   }
 }
 
-static void collectLocalBranchTargets(
-    ArrayRef<uint8_t> Bytes, const MCInstrAnalysis *MIA, MCDisassembler *DisAsm,
-    MCInstPrinter *IP, const MCSubtargetInfo *STI, uint64_t SectionAddr,
-    uint64_t Start, uint64_t End, std::unordered_map<uint64_t, std::string> &Labels) {
+static void
+collectLocalBranchTargets(ArrayRef<uint8_t> Bytes, MCInstrAnalysis *MIA,
+                          MCDisassembler *DisAsm, MCInstPrinter *IP,
+                          const MCSubtargetInfo *STI, uint64_t SectionAddr,
+                          uint64_t Start, uint64_t End,
+                          std::unordered_map<uint64_t, std::string> &Labels) {
   // So far only supports PowerPC and X86.
   if (!STI->getTargetTriple().isPPC() && !STI->getTargetTriple().isX86())
     return;
 
+  if (MIA)
+    MIA->resetState();
+
   Labels.clear();
   unsigned LabelCount = 0;
   Start += SectionAddr;
@@ -1298,6 +1303,7 @@ static void collectLocalBranchTargets(
           !Labels.count(Target) &&
           !(STI->getTargetTriple().isPPC() && Target == Index))
         Labels[Target] = ("L" + Twine(LabelCount++)).str();
+      MIA->updateState(Inst, Index);
     }
     Index += Size;
   }
@@ -1939,6 +1945,9 @@ disassembleObject(ObjectFile &Obj, const ObjectFile &DbgObj,
                                BBAddrMapLabels);
       }
 
+      if (DT->InstrAnalysis)
+        DT->InstrAnalysis->resetState();
+
       while (Index < End) {
         // ARM and AArch64 ELF binaries can interleave data and text in the
         // same section. We rely on the markers introduced to understand what
@@ -2155,6 +2164,8 @@ disassembleObject(ObjectFile &Obj, const ObjectFile &DbgObj,
               if (TargetOS == &CommentStream)
                 *TargetOS << "\n";
             }
+
+            DT->InstrAnalysis->updateState(Inst, SectionAddr + Index);
           }
         }
 

@asb
Copy link
Contributor

asb commented Oct 2, 2023

This change LGTM, but I'd appreciate an OK from someone who has a broader view of the MC infrastructure.

@asb
Copy link
Contributor

asb commented Oct 9, 2023

This change LGTM, but I'd appreciate an OK from someone who has a broader view of the MC infrastructure.

@MaskRay @jh7370 might you be able to take a quick look at this? Thanks in advance.

Copy link
Collaborator

@jh7370 jh7370 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks okay to me, but I probably need to see the usage in practice.

@mtvec
Copy link
Contributor Author

mtvec commented Oct 17, 2023

This looks okay to me, but I probably need to see the usage in practice.

Please see #65480 for an application.

Copy link
Collaborator

@jh7370 jh7370 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code formatter check seems to have failed. Might need to run clang-format?

No objections from me - the solution seems reasonably neat. Please give it a few more days to let @MaskRay chime in as he has a better understanding of the MC disassembly stuff than I do and he was away recently, so might still be catching up.

@mtvec
Copy link
Contributor Author

mtvec commented Oct 19, 2023

The code formatter check seems to have failed. Might need to run clang-format?

I'm not entirely sure why they're failing but it seems to be some intermittent error, unrelated to formatting, that I see happening every now and then on GH:

Error: Unable to determine a difference between 53179129edbff13644b7c46336773fb8740899c9..2eac9f59aca1c59d47a615084b2bfb839e9f1b0c

No objections from me - the solution seems reasonably neat. Please give it a few more days to let @MaskRay chime in as he has a better understanding of the MC disassembly stuff than I do and he was away recently, so might still be catching up.

Sounds good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
llvm:binary-utilities mc Machine (object) code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants