Skip to content

[lldb][Mach-O] Handle shared cache binaries correctly #117832

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jasonmolenda
Copy link
Collaborator

The Mach-O load commands have an LC_SYMTAB / struct symtab_command which represents the offset of the symbol table (nlist records) and string table for this binary. In a mach-o binary on disk, these are file offsets. If a mach-o binary is loaded in memory with all segments consecutive, the symoff and stroff are the offsets from the TEXT segment (aka the mach-o header) virtual address to the virtual address of the start of these tables.

However, if a Mach-O binary is a part of the shared cache, then the segments will be separated -- they will have different slide values. And it is possible for the LINKEDIT segment to be greater than 4GB away from the TEXT segment in the virtual address space, so these 32-bit offsets cannot express the offset from TEXT segment to these tables.

Create separate uint64_t variables to track the offset to the symbol table and string table, instead of reusing the 32-bit ones in the symtab_command structure.

rdar://140432279

The Mach-O load commands have an LC_SYMTAB / struct symtab_command
which represents the offset of the symbol table (nlist records) and
string table for this binary.  In a mach-o binary on disk, these are
file offsets.  If a mach-o binary is loaded in memory with all
segments consecutive, the `symoff` and `stroff` are the offsets from
the TEXT segment (aka the mach-o header) virtual address to the
virtual address of the start of these tables.

However, if a Mach-O binary is a part of the shared cache, then the
segments will be separated -- they will have different slide values.
And it is possible for the LINKEDIT segment to be greater than 4GB
away from the TEXT segment in the virtual address space, so these
32-bit offsets cannot express the offset from TEXT segment to these
tables.

Create separate uint64_t variables to track the offset to the
symbol table and string table, instead of reusing the 32-bit ones
in the symtab_command structure.

rdar://140432279
@llvmbot
Copy link
Member

llvmbot commented Nov 27, 2024

@llvm/pr-subscribers-lldb

Author: Jason Molenda (jasonmolenda)

Changes

The Mach-O load commands have an LC_SYMTAB / struct symtab_command which represents the offset of the symbol table (nlist records) and string table for this binary. In a mach-o binary on disk, these are file offsets. If a mach-o binary is loaded in memory with all segments consecutive, the symoff and stroff are the offsets from the TEXT segment (aka the mach-o header) virtual address to the virtual address of the start of these tables.

However, if a Mach-O binary is a part of the shared cache, then the segments will be separated -- they will have different slide values. And it is possible for the LINKEDIT segment to be greater than 4GB away from the TEXT segment in the virtual address space, so these 32-bit offsets cannot express the offset from TEXT segment to these tables.

Create separate uint64_t variables to track the offset to the symbol table and string table, instead of reusing the 32-bit ones in the symtab_command structure.

rdar://140432279


Full diff: https://github.com/llvm/llvm-project/pull/117832.diff

1 Files Affected:

  • (modified) lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp (+20-6)
diff --git a/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp b/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
index 079fd905037d45..5f047d84d53e73 100644
--- a/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
+++ b/lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
@@ -2244,6 +2244,18 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
   // code.
   typedef AddressDataArray<lldb::addr_t, bool, 100> FunctionStarts;
 
+  // The virtual address offset from TEXT to the symbol/string tables
+  // in the LINKEDIT section.  The LC_SYMTAB symtab_command `symoff` and
+  // `stroff` are uint32_t's that give the file offset in the binary.
+  // If the binary is laid down in memory with all segments consecutive,
+  // then these are the offsets from the mach-o header aka TEXT segment
+  // to the tables' virtual addresses.
+  // But if the binary is loaded in virtual address space with different
+  // slides for the segments (e.g. a shared cache), the LINKEDIT may be
+  // more than 4GB away from TEXT, and a 32-bit offset is not sufficient.
+  offset_t symbol_table_offset_from_TEXT = 0;
+  offset_t string_table_offset_from_TEXT = 0;
+
   // Record the address of every function/data that we add to the symtab.
   // We add symbols to the table in the order of most information (nlist
   // records) to least (function starts), and avoid duplicating symbols
@@ -2282,6 +2294,8 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
       if (m_data.GetU32(&offset, &symtab_load_command.symoff, 4) ==
           nullptr) // fill in symoff, nsyms, stroff, strsize fields
         return;
+      string_table_offset_from_TEXT = symtab_load_command.stroff;
+      symbol_table_offset_from_TEXT = symtab_load_command.symoff;
       break;
 
     case LC_DYLD_INFO:
@@ -2403,9 +2417,9 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
 
       const addr_t linkedit_file_offset = linkedit_section_sp->GetFileOffset();
       const addr_t symoff_addr = linkedit_load_addr +
-                                 symtab_load_command.symoff -
+                                 symbol_table_offset_from_TEXT -
                                  linkedit_file_offset;
-      strtab_addr = linkedit_load_addr + symtab_load_command.stroff -
+      strtab_addr = linkedit_load_addr + string_table_offset_from_TEXT -
                     linkedit_file_offset;
 
       // Always load dyld - the dynamic linker - from memory if we didn't
@@ -2473,17 +2487,17 @@ void ObjectFileMachO::ParseSymtab(Symtab &symtab) {
       lldb::addr_t linkedit_offset = linkedit_section_sp->GetFileOffset();
       lldb::offset_t linkedit_slide =
           linkedit_offset - m_linkedit_original_offset;
-      symtab_load_command.symoff += linkedit_slide;
-      symtab_load_command.stroff += linkedit_slide;
+      symbol_table_offset_from_TEXT += linkedit_slide;
+      string_table_offset_from_TEXT += linkedit_slide;
       dyld_info.export_off += linkedit_slide;
       dysymtab.indirectsymoff += linkedit_slide;
       function_starts_load_command.dataoff += linkedit_slide;
       exports_trie_load_command.dataoff += linkedit_slide;
     }
 
-    nlist_data.SetData(m_data, symtab_load_command.symoff,
+    nlist_data.SetData(m_data, symbol_table_offset_from_TEXT,
                        nlist_data_byte_size);
-    strtab_data.SetData(m_data, symtab_load_command.stroff,
+    strtab_data.SetData(m_data, string_table_offset_from_TEXT,
                         strtab_data_byte_size);
 
     // We shouldn't have exports data from both the LC_DYLD_INFO command

@jasonmolenda
Copy link
Collaborator Author

I have two criticisms of the patch as-is. We read the LC_SYMTAB load command into symtab_load_command which has six fields, the first two being the usual cmd and cmdsize. We use all of the next four: ncmds, strsize, symoff, stroff. I'm taking the uses of two of these fields out of symtab_load_command, and putting them in 64-bit locals. But I think the smarter move is probably to read the load command into a temporary object, and then create four local variables with the resized fields as needed, or have a locally defined struct with the resized fields.

The second criticism is that I don't have a good way to test it. I have a request to have ProcessMachCore treat an LC_SEGMENT that has a virtual address & size, but no file size as an all-zeroes segment. In which case it would be possible to create a mach-o corefile that is larger than 4GB in size, but actually only uses a couple hundred kb on disk (and doesn't fill the CI filesystems), and then we'd have to hand-write a mach-o file with a LINKEDIT 4GB away from the TEXT segment. There's a couple pieces that don't exist to do all of this right now, though.

@Michael137
Copy link
Member

Is changing struct symtab_command to have the right-sized fields a no-go?

@jasonmolenda
Copy link
Collaborator Author

Is changing struct symtab_command to have the right-sized fields a no-go?

Yes, this is one possibility, and maybe the best. The current structure reflects the in-binary layout, so it can be read (with endian fixing) on one go.

@jasonmolenda
Copy link
Collaborator Author

Updated patch to have a local symtab_command structure in ObjectFileMachO with larger offset fields, and read the load command fields into this structure. Simplifies the patch; the exiting slide calculations can remain unmodified now that 64-bit offsets are being used.

Copy link
Member

@JDevlieghere JDevlieghere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like @Michael137's suggestion and the name makes it obvious what's going on. LGTM.

@jasonmolenda jasonmolenda merged commit 448ac7d into llvm:main Nov 28, 2024
7 checks passed
@jasonmolenda jasonmolenda deleted the use-offset_t-for-symbol-table-and-string-table-offsets branch November 28, 2024 18:32
jasonmolenda added a commit to jasonmolenda/llvm-project that referenced this pull request Nov 28, 2024
The Mach-O load commands have an LC_SYMTAB / struct symtab_command which
represents the offset of the symbol table (nlist records) and string
table for this binary. In a mach-o binary on disk, these are file
offsets. If a mach-o binary is loaded in memory with all segments
consecutive, the `symoff` and `stroff` are the offsets from the TEXT
segment (aka the mach-o header) virtual address to the virtual address
of the start of these tables.

However, if a Mach-O binary is a part of the shared cache, then the
segments will be separated -- they will have different slide values. And
it is possible for the LINKEDIT segment to be greater than 4GB away from
the TEXT segment in the virtual address space, so these 32-bit offsets
cannot express the offset from TEXT segment to these tables.

Create separate uint64_t variables to track the offset to the symbol
table and string table, instead of reusing the 32-bit ones in the
symtab_command structure.

rdar://140432279
(cherry picked from commit 448ac7d)
jasonmolenda added a commit to jasonmolenda/llvm-project that referenced this pull request Nov 28, 2024
The Mach-O load commands have an LC_SYMTAB / struct symtab_command which
represents the offset of the symbol table (nlist records) and string
table for this binary. In a mach-o binary on disk, these are file
offsets. If a mach-o binary is loaded in memory with all segments
consecutive, the `symoff` and `stroff` are the offsets from the TEXT
segment (aka the mach-o header) virtual address to the virtual address
of the start of these tables.

However, if a Mach-O binary is a part of the shared cache, then the
segments will be separated -- they will have different slide values. And
it is possible for the LINKEDIT segment to be greater than 4GB away from
the TEXT segment in the virtual address space, so these 32-bit offsets
cannot express the offset from TEXT segment to these tables.

Create separate uint64_t variables to track the offset to the symbol
table and string table, instead of reusing the 32-bit ones in the
symtab_command structure.

rdar://140432279
(cherry picked from commit 448ac7d)
jasonmolenda added a commit to jasonmolenda/llvm-project that referenced this pull request Nov 28, 2024
The Mach-O load commands have an LC_SYMTAB / struct symtab_command which
represents the offset of the symbol table (nlist records) and string
table for this binary. In a mach-o binary on disk, these are file
offsets. If a mach-o binary is loaded in memory with all segments
consecutive, the `symoff` and `stroff` are the offsets from the TEXT
segment (aka the mach-o header) virtual address to the virtual address
of the start of these tables.

However, if a Mach-O binary is a part of the shared cache, then the
segments will be separated -- they will have different slide values. And
it is possible for the LINKEDIT segment to be greater than 4GB away from
the TEXT segment in the virtual address space, so these 32-bit offsets
cannot express the offset from TEXT segment to these tables.

Create separate uint64_t variables to track the offset to the symbol
table and string table, instead of reusing the 32-bit ones in the
symtab_command structure.

rdar://140432279
(cherry picked from commit 448ac7d)
JDevlieghere added a commit to swiftlang/llvm-project that referenced this pull request Dec 2, 2024
…-vm-offset-shared-cache-segments-61

[lldb][Mach-O] Handle shared cache binaries correctly (llvm#117832)
jasonmolenda added a commit to swiftlang/llvm-project that referenced this pull request Dec 2, 2024
…-vm-offset-shared-cache-segments

[lldb][Mach-O] Handle shared cache binaries correctly (llvm#117832)
adrian-prantl pushed a commit to adrian-prantl/llvm-project that referenced this pull request Dec 5, 2024
The Mach-O load commands have an LC_SYMTAB / struct symtab_command which
represents the offset of the symbol table (nlist records) and string
table for this binary. In a mach-o binary on disk, these are file
offsets. If a mach-o binary is loaded in memory with all segments
consecutive, the `symoff` and `stroff` are the offsets from the TEXT
segment (aka the mach-o header) virtual address to the virtual address
of the start of these tables.

However, if a Mach-O binary is a part of the shared cache, then the
segments will be separated -- they will have different slide values. And
it is possible for the LINKEDIT segment to be greater than 4GB away from
the TEXT segment in the virtual address space, so these 32-bit offsets
cannot express the offset from TEXT segment to these tables.

Create separate uint64_t variables to track the offset to the symbol
table and string table, instead of reusing the 32-bit ones in the
symtab_command structure.

rdar://140432279
(cherry picked from commit 448ac7d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants