
Commit 0e3e9ef

[Backtracing] Add ImageMap instead of just using an Array.
We want to be able to efficiently serialise lists of images, and to do so it makes most sense to create a separate `ImageMap` type. This also provides a useful place to put methods to e.g. find an image by address or by build ID. rdar://124913332
1 parent 760cc57 commit 0e3e9ef

16 files changed

+1501
-224
lines changed

docs/Backtracing.rst

Lines changed: 3 additions & 0 deletions
@@ -327,3 +327,6 @@ Backtraces are stored internally in a format called :download:`Compact Backtrace
327327
Format <CompactBacktraceFormat.md>`. This provides us with a way to store a
328328
large number of frames in a much smaller space than would otherwise be possible.
329329

330+
Similarly, where we need to store address-to-image mappings, we
331+
use :download:`Compact ImageMap Format <CompactImageMapFormat.md>` to minimise
332+
storage requirements.

docs/CompactBacktraceFormat.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ information byte:
2525
~~~
2626

2727
The `version` field identifies the version of CBF that is in use; this
28-
document describes version `0`. The `size` field is encoded as
28+
document describes version `0`. The `size` field is encoded as
2929
follows:
3030

3131
| `size` | Machine word size |

docs/CompactImageMapFormat.md

Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
1+
Compact ImageMap Format
2+
=======================
3+
4+
A process' address space contains (among other things) the set of
5+
dynamically loaded images that have been mapped into that address
6+
space. When generating crash logs or symbolicating backtraces, we
7+
need to be able to capture and potentially store the list of images
8+
that has been loaded, as well as some of the attributes of those
9+
images, including each image's
10+
11+
- Path
12+
- Build ID (aka UUID)
13+
- Base address
14+
- End-of-text address
15+
16+
Compact ImageMap Format (CIF) is a binary format for holding this
17+
information.
18+
19+
### General Format
20+
21+
Compact ImageMap Format data is byte aligned and starts with an
22+
information byte:
23+
24+
~~~
25+
7 6 5 4 3 2 1 0
26+
┌───────────────────────┬───────┐
27+
│ version │ size │
28+
└───────────────────────┴───────┘
29+
~~~
30+
31+
The `version` field identifies the version of CIF that is in use; this
32+
document describes version `0`. The `size` field is encoded as
33+
follows:
34+
35+
| `size` | Machine word size |
36+
| :----: | :---------------- |
37+
| 00 | 16-bit |
38+
| 01 | 32-bit |
39+
| 10 | 64-bit |
40+
| 11 | Reserved |
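
As an illustration, the information byte can be unpacked with a shift and a mask. This is a minimal Python sketch, not the project's implementation; `parse_info_byte` and `WORD_SIZES` are hypothetical names introduced here.

```python
# Hypothetical helper: split a CIF information byte into its two fields.
WORD_SIZES = {0b00: 16, 0b01: 32, 0b10: 64}  # 0b11 is reserved


def parse_info_byte(info):
    """Return (version, machine word size in bits) for an information byte."""
    version = info >> 2   # bits 7..2
    size = info & 0b11    # bits 1..0
    if size not in WORD_SIZES:
        raise ValueError("reserved size value 0b11")
    return version, WORD_SIZES[size]
```

For example, a version `0` image map for a 64-bit process would start with the byte `0x02`.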
41+
42+
This is followed immediately by a field encoding the number of images
43+
in the image map; this field is encoded as a sequence of bytes, each
44+
holding seven bits of data, with the top bit clear for the final byte.
45+
The most significant byte comes first, e.g.
46+
47+
| `count` | Encoding |
48+
| ------: | :---------- |
49+
| 0 | 00 |
50+
| 1 | 01 |
51+
| 127 | 7f |
52+
| 128 | 81 00 |
53+
| 129 | 81 01 |
54+
| 700 | 85 3c |
55+
| 1234 | 89 52 |
56+
| 16384 | 81 80 00 |
57+
| 65535 | 83 ff 7f |
58+
| 2097152 | 81 80 80 00 |
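
The rows of this table can be reproduced with a short Python sketch. The names `encode_count` and `decode_count` are hypothetical, and the code assumes a non-negative count; it only illustrates the scheme described above.

```python
def encode_count(n):
    """Encode n as 7-bit groups, most significant first.

    Every byte except the last has its top bit set.
    """
    if n < 0:
        raise ValueError("count must be non-negative")
    groups = [n & 0x7F]                    # final byte: top bit clear
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)   # earlier bytes: top bit set
        n >>= 7
    return bytes(reversed(groups))


def decode_count(data, pos=0):
    """Decode a count starting at data[pos]; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:
            return value, pos
```

Here `encode_count(700).hex()` gives `853c`, matching the table.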
59+
60+
This in turn is followed by the list of images, stored in order of
61+
increasing base address. For each image, we start with a header byte:
62+
63+
~~~
64+
7 6 5 4 3 2 1 0
65+
┌───┬───┬───────────┬───────────┐
66+
│ r │ 0 │ acount │ ecount │
67+
└───┴───┴───────────┴───────────┘
68+
~~~
69+
70+
If `r` is set, then the base address is understood to be relative to
71+
the previously computed base address.
72+
73+
This byte is followed by `acount + 1` bytes of base address, then
74+
`ecount + 1` bytes of offset to the end of text.
75+
76+
Following this is a count of bytes in the build ID,
77+
encoded using the 7-bit scheme we used to encode the image count, and
78+
then after that come the build ID bytes themselves.
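
The fixed-layout part of an image record (everything before the path) could be read along these lines. This Python sketch uses hypothetical names (`read_count`, `read_image_fields`). Because this section does not state the byte order of the address fields, the sketch returns them as raw bytes rather than guessing how to interpret them.

```python
def read_count(data, pos):
    """Decode a 7-bit-per-byte count; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:
            return value, pos


def read_image_fields(data, pos=0):
    """Read header, raw address bytes and build ID; return (fields, next position)."""
    header = data[pos]
    pos += 1
    relative = bool(header & 0x80)   # the `r` bit
    acount = (header >> 3) & 0x07    # bits 5..3
    ecount = header & 0x07           # bits 2..0

    base_raw = data[pos:pos + acount + 1]   # acount + 1 bytes of base address
    pos += acount + 1
    end_raw = data[pos:pos + ecount + 1]    # ecount + 1 bytes of end-of-text offset
    pos += ecount + 1

    build_id_len, pos = read_count(data, pos)
    build_id = data[pos:pos + build_id_len]
    pos += build_id_len

    fields = {
        "relative": relative,
        "base_raw": base_raw,
        "end_of_text_raw": end_raw,
        "build_id": build_id,
    }
    return fields, pos
```

The returned position is where the encoded path string begins.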
79+
80+
Finally, we encode the path string using the scheme below.
81+
82+
### String Encoding
83+
84+
Image paths contain a good deal of redundancy; paths are therefore
85+
encoded using a prefix compression scheme. The basic idea here is
86+
that while generating or reading the data, we maintain a mapping from
87+
small integers to path prefix segments.
88+
89+
The mapping is initialised with the following fixed list that never
90+
needs to be stored in CIF data:
91+
92+
| code | Path prefix |
93+
| :--: | :---------------------------------- |
94+
| 0 | `/lib` |
95+
| 1 | `/usr/lib` |
96+
| 2 | `/usr/local/lib` |
97+
| 3 | `/opt/lib` |
98+
| 4 | `/System/Library/Frameworks` |
99+
| 5 | `/System/Library/PrivateFrameworks` |
100+
| 6 | `/System/iOSSupport` |
101+
| 7 | `/Library/Frameworks` |
102+
| 8 | `/System/Applications` |
103+
| 9 | `/Applications` |
104+
| 10 | `C:\Windows\System32` |
105+
| 11 | `C:\Program Files\` |
106+
107+
Codes below 32 are reserved for future expansion of the fixed list.
108+
109+
Strings are encoded as a sequence of bytes, as follows:
110+
111+
| `opcode` | Mnemonic | Meaning |
112+
| :--------: | :-------- | :---------------------------------------- |
113+
| `00000000` | `end` | Marks the end of the string |
114+
| `00xxxxxx` | `str` | Raw string data |
115+
| `01xxxxxx` | `framewk` | Names a framework |
116+
| `1exxxxxx` | `expand` | Identifies a prefix in the table |
117+
118+
#### `end`
119+
120+
##### Encoding
121+
122+
~~~
123+
7 6 5 4 3 2 1 0
124+
┌───────────────────────────────┐
125+
│ 0 0 0 0 0 0 0 0 │ end
126+
└───────────────────────────────┘
127+
~~~
128+
129+
##### Meaning
130+
131+
Marks the end of the string
132+
133+
#### `str`
134+
135+
##### Encoding
136+
137+
~~~
138+
7 6 5 4 3 2 1 0
139+
┌───────┬───────────────────────┐
140+
│ 0 0 │ count │ str
141+
└───────┴───────────────────────┘
142+
~~~
143+
144+
##### Meaning
145+
146+
The next `count` bytes are included in the string verbatim.
147+
Additionally, all path prefixes of this string data will be added to
148+
the current prefix table. For instance, if the string data is
149+
`/swift/linux/x86_64/libfoo.so`, then the prefix `/swift` will be
150+
assigned the next available code, `/swift/linux` the code after that,
151+
and `/swift/linux/x86_64` the code following that one.
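
The prefix assignment in this example can be sketched in Python as follows. The helper names are hypothetical. The sketch assumes `/` is the only separator and that every prefix receives a fresh code, with no de-duplication against existing table entries; neither point is specified above.

```python
FIRST_DYNAMIC_CODE = 32  # codes below 32 are reserved for the fixed list


def path_prefixes(data):
    """Return every proper path prefix of `data`, shortest first."""
    return [data[:i] for i, ch in enumerate(data) if ch == "/" and i > 0]


def assign_codes(data, next_code=FIRST_DYNAMIC_CODE):
    """Pair each prefix of `data` with consecutive codes starting at `next_code`."""
    return {next_code + n: prefix
            for n, prefix in enumerate(path_prefixes(data))}
```

With an otherwise empty dynamic table, `assign_codes("/swift/linux/x86_64/libfoo.so")` maps 32 to `/swift`, 33 to `/swift/linux` and 34 to `/swift/linux/x86_64`.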
152+
153+
#### `framewk`
154+
155+
##### Encoding
156+
157+
~~~
158+
7 6 5 4 3 2 1 0
159+
┌───────┬───────────────────────┐
160+
│ 0 1 │ count │ framewk
161+
└───────┴───────────────────────┘
162+
~~~
163+
164+
##### Meaning
165+
166+
The next byte is a version character (normally `A`, but some
167+
frameworks use higher characters), after which there are `count + 1`
168+
bytes of name.
169+
170+
This is expanded using the pattern
171+
`/<name>.framework/Versions/<version>/<name>`. This also marks the
172+
end of the string.
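
A decoder for this opcode might look like the following Python sketch. The function names are hypothetical, and decoding the name as UTF-8 is an assumption made for illustration.

```python
def expand_framework(version, name):
    """Apply the /<name>.framework/Versions/<version>/<name> pattern."""
    return f"/{name}.framework/Versions/{version}/{name}"


def decode_framewk(data, pos=0):
    """Decode a `framewk` opcode at data[pos]; return (path suffix, next position)."""
    opcode = data[pos]
    if opcode & 0xC0 != 0x40:
        raise ValueError("not a framewk opcode")
    count = opcode & 0x3F
    version = chr(data[pos + 1])               # single version character
    start = pos + 2
    end = start + count + 1                    # count + 1 bytes of name
    name = data[start:end].decode("utf-8")     # assumption: UTF-8 name bytes
    return expand_framework(version, name), end
```

For instance, the bytes `<45> CAppKit` decode to `/AppKit.framework/Versions/C/AppKit`.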
173+
174+
#### `expand`
175+
176+
##### Encoding
177+
178+
~~~
179+
7 6 5 4 3 2 1 0
180+
┌───┬───┬───────────────────────┐
181+
│ 1 │ e │ code │ expand
182+
└───┴───┴───────────────────────┘
183+
~~~
184+
185+
##### Meaning
186+
187+
If `e` is `0`, `code` is the index into the prefix table for the
188+
prefix that should be appended to the string at this point.
189+
190+
If `e` is `1`, this opcode is followed by `code + 1` bytes that give
191+
a value `v` such that `v + 64` is the index into the prefix table for
192+
the prefix that should be appended to the string at this point.
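
As a rough Python sketch of the lookup, using hypothetical names: the table is modelled as a dictionary from code to prefix, seeded with the fixed list. The byte order of the extended value `v` is not stated above; the sketch assumes most-significant-byte first, by analogy with the image count field.

```python
FIXED_PREFIXES = [
    "/lib", "/usr/lib", "/usr/local/lib", "/opt/lib",
    "/System/Library/Frameworks", "/System/Library/PrivateFrameworks",
    "/System/iOSSupport", "/Library/Frameworks",
    "/System/Applications", "/Applications",
    "C:\\Windows\\System32", "C:\\Program Files\\",
]


def new_prefix_table():
    """Return a fresh code -> prefix mapping holding only the fixed list."""
    return dict(enumerate(FIXED_PREFIXES))


def decode_expand(data, pos, table):
    """Decode an `expand` opcode at data[pos]; return (prefix, next position)."""
    opcode = data[pos]
    if opcode & 0x80 == 0:
        raise ValueError("not an expand opcode")
    code = opcode & 0x3F
    pos += 1
    if opcode & 0x40:                 # e == 1: extended index follows
        length = code + 1
        # ASSUMPTION: v is stored most significant byte first.
        v = int.from_bytes(data[pos:pos + length], "big")
        pos += length
        index = v + 64
    else:                             # e == 0: code is the index itself
        index = code
    return table[index], pos
```

With the fixed table alone, the byte `<84>` yields `/System/Library/Frameworks` and `<81>` yields `/usr/lib`.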
193+
194+
#### Example
195+
196+
Let's say we wish to encode the following strings:
197+
198+
/System/Library/Frameworks/AppKit.framework/Versions/C/AppKit
199+
/System/Library/Frameworks/Photos.framework/Versions/A/Photos
200+
/usr/lib/libobjc.A.dylib
201+
/usr/lib/libz.1.dylib
202+
/usr/lib/swift/libswiftCore.dylib
203+
/usr/lib/libSystem.B.dylib
204+
/usr/lib/libc++.1.dylib
205+
206+
We would encode the first string as
207+
208+
<84> <45> CAppKit <00>
209+
210+
We then follow with
211+
212+
<84> <45> APhotos <00>
213+
214+
Next we have
215+
216+
<81> <10> /libobjc.A.dylib <00>
217+
<81> <0d> /libz.1.dylib <00>
218+
<81> <19> /swift/libswiftCore.dylib <00>
219+
220+
assigning code 32 to `/swift`, then
221+
222+
<81> <12> /libSystem.B.dylib <00>
223+
<81> <0f> /libc++.1.dylib <00>
224+
225+
In total the original data would have taken up 256 bytes (counting a terminating NUL for each string). Instead, we
226+
have used 122 bytes, a saving of over 50%.
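
These totals can be checked with a few lines of Python. The sketch assumes each original string is counted with one terminating NUL byte, which is what makes the first figure come to 256; the encoded sequences are copied from the example above.

```python
paths = [
    "/System/Library/Frameworks/AppKit.framework/Versions/C/AppKit",
    "/System/Library/Frameworks/Photos.framework/Versions/A/Photos",
    "/usr/lib/libobjc.A.dylib",
    "/usr/lib/libz.1.dylib",
    "/usr/lib/swift/libswiftCore.dylib",
    "/usr/lib/libSystem.B.dylib",
    "/usr/lib/libc++.1.dylib",
]

# Assumption: one NUL terminator per original string.
original_size = sum(len(p) + 1 for p in paths)

# Each entry: opcode bytes, then literal text, then the <00> end marker.
encoded = [
    bytes([0x84, 0x45]) + b"CAppKit" + b"\x00",
    bytes([0x84, 0x45]) + b"APhotos" + b"\x00",
    bytes([0x81, 0x10]) + b"/libobjc.A.dylib" + b"\x00",
    bytes([0x81, 0x0D]) + b"/libz.1.dylib" + b"\x00",
    bytes([0x81, 0x19]) + b"/swift/libswiftCore.dylib" + b"\x00",
    bytes([0x81, 0x12]) + b"/libSystem.B.dylib" + b"\x00",
    bytes([0x81, 0x0F]) + b"/libc++.1.dylib" + b"\x00",
]
encoded_size = sum(len(e) for e in encoded)

saving = 1 - encoded_size / original_size
print(original_size, encoded_size, f"{saving:.0%}")
```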
