@@ -7,27 +7,138 @@ it is a pragmatic concession to the fact that *everyone* is pretty bad at modeli
atomics. At the very least, we can benefit from existing tooling and research around
C.

- Trying to fully explain the model is fairly hopeless. If you want all the
- nitty-gritty details, you should check out [C's specification][C11-model].
- Still, we'll try to cover the basics and some of the problems Rust developers
- face.
+ Trying to fully explain the model in this book is fairly hopeless. It's defined
+ in terms of madness-inducing causality graphs that require a full book to properly
+ understand in a practical way. If you want all the nitty-gritty details, you
+ should check out [C's specification][C11-model]. Still, we'll try to cover the
+ basics and some of the problems Rust developers face.

- The C11 memory model is fundamentally about trying to bridge the gap between C's
- single-threaded semantics, common compiler optimizations, and hardware peculiarities
- in the face of a multi-threaded environment. It does this by splitting memory
- accesses into two worlds: data accesses, and atomic accesses.
+ The C11 memory model is fundamentally about trying to bridge the gap between
+ the semantics we want, the optimizations compilers want, and the inconsistent
+ chaos our hardware wants. *We* would like to just write programs and have them
+ do exactly what we said but, you know, *fast*. Wouldn't that be great?
+
+
+
+
+ # Compiler Reordering
+
+ Compilers fundamentally want to be able to do all sorts of crazy transformations
+ to reduce data dependencies and eliminate dead code. In particular, they may
+ radically change the actual order of events, or make events never occur! If we
+ write something like
+
+ ```rust,ignore
+ x = 1;
+ y = 3;
+ x = 2;
+ ```
+
+ The compiler may conclude that it would *really* be best if your program did
+
+ ```rust,ignore
+ x = 2;
+ y = 3;
+ ```
+
+ This has inverted the order of events *and* completely eliminated one event. From
+ a single-threaded perspective this is completely unobservable: after all the
+ statements have executed we are in exactly the same state. But if our program is
+ multi-threaded, we may have been relying on `x` to *actually* be assigned to 1 before
+ `y` was assigned. We would *really* like the compiler to be able to make these kinds
+ of optimizations, because they can seriously improve performance. On the other hand,
+ we'd really like to be able to depend on our program *doing the thing we said*.
+
+
+
+
+ # Hardware Reordering
+
+ On the other hand, even if the compiler totally understood what we wanted and
+ respected our wishes, our *hardware* might instead get us in trouble. Trouble comes
+ from CPUs in the form of memory hierarchies. There is indeed a global shared memory
+ space somewhere in your hardware, but from the perspective of each CPU core it is
+ *so very far away* and *so very slow*. Each CPU would rather work with its local
+ cache of the data and only go through all the *anguish* of talking to shared
+ memory when it doesn't actually have that memory in cache.
+
+ After all, that's the whole *point* of the cache, right? If every read from the
+ cache had to run back to shared memory to double check that it hadn't changed,
+ what would the point be? The end result is that the hardware doesn't guarantee
+ that events that occur in the same order on *one* thread, occur in the same order
+ on *another* thread. To guarantee this, we must issue special instructions to
+ the CPU telling it to be a bit less smart.
+
+ For instance, say we convince the compiler to emit this logic:
+
+ ```text
+ initial state: x = 0, y = 1
+
+ THREAD 1        THREAD 2
+ y = 3;          if x == 1 {
+ x = 1;              y *= 2;
+                 }
+ ```
+
+ Ideally this program has 2 possible final states:
+
+ * `y = 3`: (thread 2 did the check before thread 1 completed)
+ * `y = 6`: (thread 2 did the check after thread 1 completed)
+
+ However, there's a third potential state that the hardware enables:
+
+ * `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)
+
+ It's worth noting that different kinds of CPU provide different guarantees. It
+ is common to separate hardware into two categories: strongly-ordered and weakly-
+ ordered. Most notably x86/64 provides strong ordering guarantees, while ARM
+ provides weak ordering guarantees. This has two consequences for
+ concurrent programming:
+
+ * Asking for stronger guarantees on strongly-ordered hardware may be cheap or
+   even *free* because they already provide strong guarantees unconditionally.
+   Weaker guarantees may only yield performance wins on weakly-ordered hardware.
+
+ * Asking for guarantees that are *too* weak on strongly-ordered hardware
+   is more likely to *happen* to work, even though your program is strictly
+   incorrect. If possible, concurrent algorithms should be tested on
+   weakly-ordered hardware.
+
+
+
+
+
+ # Data Accesses
+
+ The C11 memory model attempts to bridge the gap by allowing us to talk about
+ the *causality* of our program. Generally, this is by establishing a
+ *happens before* relationship between parts of the program and the threads
+ that are running them. This gives the hardware and compiler room to optimize the
+ program more aggressively where a strict happens-before relationship isn't
+ established, but forces them to be more careful where one *is* established.
+ The way we communicate these relationships is through *data accesses* and
+ *atomic accesses*.

Data accesses are the bread-and-butter of the programming world. They are
fundamentally unsynchronized and compilers are free to aggressively optimize
- them. In particular data accesses are free to be reordered by the compiler
+ them. In particular, data accesses are free to be reordered by the compiler
on the assumption that the program is single-threaded. The hardware is also free
- to propagate the changes made in data accesses as lazily and inconsistently as
- it wants to other threads. Mostly critically, data accesses are where we get data
- races. These are pretty clearly awful semantics to try to write a multi-threaded
- program with.
+ to propagate the changes made in data accesses to other threads
+ as lazily and inconsistently as it wants. Most critically, data accesses are
+ how data races happen. Data accesses are very friendly to the hardware and
+ compiler, but as we've seen they offer *awful* semantics to try to
+ write synchronized code with.

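+ As a tiny sketch of why that's bad, imagine two threads running this plain,
+ non-atomic increment at the same time (`COUNTER` here is just a hypothetical
+ global for illustration):
+
+ ```rust,ignore
+ static mut COUNTER: usize = 0;
+
+ unsafe {
+     // This is a read-modify-write made of plain data accesses. If two
+     // threads race here, the reads and writes can interleave and lose
+     // increments -- and the program has undefined behavior.
+     COUNTER += 1;
+ }
+ ```
+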
- Atomic accesses are the answer to this. Each atomic access can be marked with
- an *ordering*. The set of orderings Rust exposes are:
+ Atomic accesses are how we tell the hardware and compiler that our program is
+ multi-threaded. Each atomic access can be marked with
+ an *ordering* that specifies what kind of relationship it establishes with
+ other accesses. In practice, this boils down to telling the compiler and hardware
+ certain things they *can't* do. For the compiler, this largely revolves
+ around re-ordering of instructions. For the hardware, this largely revolves
+ around how writes are propagated to other threads. The set of orderings Rust
+ exposes are:

* Sequentially Consistent (SeqCst)
* Release
@@ -36,11 +147,80 @@ an *ordering*. The set of orderings Rust exposes are:

(Note: We explicitly do not expose the C11 *consume* ordering)

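+ In Rust these orderings are the variants of the `std::sync::atomic::Ordering`
+ enum, and every atomic operation takes one as an explicit argument. As a quick
+ sketch of the shape of the API:
+
+ ```rust,ignore
+ use std::sync::atomic::{AtomicUsize, Ordering};
+
+ let a = AtomicUsize::new(0);
+
+ // Every atomic access names the ordering it wants.
+ a.store(1, Ordering::Release);
+ let x = a.load(Ordering::Acquire);
+ let old = a.fetch_add(1, Ordering::SeqCst);
+ ```
+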
- TODO: give simple "basic" explanation of these
- TODO: implementing Arc example (why does Drop need the trailing barrier?)
+ TODO: negative reasoning vs positive reasoning?
+
+
+
+
+ # Sequentially Consistent
+
+ Sequentially Consistent is the most powerful of all, implying the restrictions
+ of all other orderings. A Sequentially Consistent operation *cannot*
+ be reordered: all accesses on one thread that happen before and after it *stay*
+ before and after it. A program that has sequential consistency has the very nice
+ property that there is a single global execution of the program's instructions
+ that all threads agree on. This execution is also particularly nice to reason
+ about: it's just an interleaving of each thread's individual executions.
+
+ The relative developer-friendliness of sequential consistency doesn't come for
+ free. Even on strongly-ordered platforms, sequential consistency involves
+ emitting memory fences.
+
+ In practice, sequential consistency is rarely necessary for program correctness.
+ However, sequential consistency is definitely the right choice if you're not
+ confident about the other memory orders. Having your program run a bit slower
+ than it needs to is certainly better than it running incorrectly! It's also
+ completely trivial to downgrade to a weaker consistency later.
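+
+ As a rough sketch of what that single global order buys you, here's the classic
+ "store buffering" test written against `std::sync::atomic` (the statics and
+ thread structure here are just for illustration):
+
+ ```rust,ignore
+ use std::sync::atomic::{AtomicUsize, Ordering};
+ use std::thread;
+
+ static X: AtomicUsize = AtomicUsize::new(0);
+ static Y: AtomicUsize = AtomicUsize::new(0);
+
+ let t1 = thread::spawn(|| {
+     X.store(1, Ordering::SeqCst);
+     Y.load(Ordering::SeqCst) // r1
+ });
+ let t2 = thread::spawn(|| {
+     Y.store(1, Ordering::SeqCst);
+     X.load(Ordering::SeqCst) // r2
+ });
+
+ // All four accesses are SeqCst, so every thread agrees on one global order
+ // of them. In any such order at least one store comes before the other
+ // thread's load, so r1 and r2 can never *both* be 0. Weaker orderings
+ // permit exactly that outcome.
+ let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());
+ assert!(r1 == 1 || r2 == 1);
+ ```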
+
+
+
+
+ # Acquire-Release
+
+ Acquire and Release are largely intended to be paired. Their names hint at
+ their use case: they're perfectly suited for acquiring and releasing locks,
+ and ensuring that critical sections don't overlap.
+
+ An acquire access ensures that every access after it *stays* after it. However,
+ operations that occur before an acquire are free to be reordered to occur after
+ it.
+
+ A release access ensures that every access before it *stays* before it. However,
+ operations that occur after a release are free to be reordered to occur before
+ it.
+
+ Basic use of release-acquire is simple: you acquire a location of memory to
+ begin the critical section, and then release that location to end it. If
+ thread A releases a location of memory and thread B acquires that location of
+ memory, this establishes that A's critical section *happened before* B's
+ critical section. All accesses that happened before the release will be observed
+ by anything that happens after the acquire.
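+
+ As a minimal sketch of that pattern, here's a spinlock built on an
+ `AtomicBool` using `compare_exchange_weak` (how the lock actually gets
+ shared between threads is hand-waved away):
+
+ ```rust,ignore
+ use std::sync::Arc;
+ use std::sync::atomic::{AtomicBool, Ordering};
+
+ let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?"
+
+ // ... distribute lock to threads somehow ...
+
+ // Try to acquire the lock by setting it to true.
+ while lock
+     .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
+     .is_err()
+ {}
+ // Broke out of the loop, so we successfully acquired the lock!
+
+ // ... scary data accesses; they all *stay* after the Acquire ...
+
+ // Ok, we're done: release the lock. Everything the critical section did
+ // stays before this store, so the next thread to acquire the lock sees it.
+ lock.store(false, Ordering::Release);
+ ```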
+
+ On strongly-ordered platforms most accesses have release or acquire semantics,
+ making release and acquire often totally free. This is not the case on
+ weakly-ordered platforms.
+
+
+
+
+ # Relaxed
+
+ Relaxed accesses are the absolute weakest. They can be freely re-ordered and
+ provide no happens-before relationship. Still, relaxed operations *are*
+ atomic, which is valuable. Relaxed operations are appropriate for things that
+ you definitely want to happen, but don't particularly care about much else. For
+ instance, incrementing a counter can be relaxed if you're not using the
+ counter to synchronize any other accesses.
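+
+ A sketch of that counter case (the thread count and loop bounds here are
+ arbitrary):
+
+ ```rust,ignore
+ use std::sync::Arc;
+ use std::sync::atomic::{AtomicUsize, Ordering};
+ use std::thread;
+
+ let counter = Arc::new(AtomicUsize::new(0));
+ let handles: Vec<_> = (0..10).map(|_| {
+     let counter = counter.clone();
+     thread::spawn(move || {
+         for _ in 0..1000 {
+             // We only need the increment itself to be atomic; we aren't
+             // using the counter to publish any other data, so Relaxed is fine.
+             counter.fetch_add(1, Ordering::Relaxed);
+         }
+     })
+ }).collect();
+
+ // `join` synchronizes with each thread's completion, so by this point
+ // every increment is visible: the total is exactly 10,000.
+ for h in handles { h.join().unwrap(); }
+ assert_eq!(counter.load(Ordering::Relaxed), 10_000);
+ ```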
+
+ There's rarely a benefit in making an operation relaxed on strongly-ordered
+ platforms, since they usually provide release-acquire semantics anyway. However,
+ relaxed operations can be cheaper on weakly-ordered platforms.
+
+
+
+
+
+ TODO: implementing Arc example (why does Drop need the trailing barrier?)


[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf