Rust pretty blatantly just inherits C11's memory model for atomics. This is not
due to this model being particularly excellent or easy to understand. Indeed,
this model is quite complex and known to have [several flaws][C11-busted].
Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad
at modeling atomics. At the very least, we can benefit from existing tooling
and research around C.

Trying to fully explain the model in this book is fairly hopeless. It's defined
in terms of madness-inducing causality graphs that require a full book to
properly understand in a practical way. If you want all the nitty-gritty
details, you should check out [C's specification (Section 7.17)][C11-model].
Still, we'll try to cover the basics and some of the problems Rust developers
face.

The C11 memory model is fundamentally about trying to bridge the gap between
the semantics we want, the optimizations compilers want, and the inconsistent
chaos our hardware wants. *We* would like to just write programs and have them
do exactly what we said but, you know, *fast*. Wouldn't that be great?

# Compiler Reordering

Compilers fundamentally want to be able to do all sorts of complicated
transformations to reduce data dependencies and eliminate dead code. In
particular, they may radically change the actual order of events, or make
events never occur! If we write something like

```
x = 1;
y = 3;
x = 2;
```

The compiler may conclude that it would be best if your program did

```
x = 2;
y = 3;
```

This has inverted the order of events *and* completely eliminated one event.
From a single-threaded perspective this is completely unobservable: after all
the statements have executed we are in exactly the same state. But if our
program is multi-threaded, we may have been relying on `x` to *actually* be
assigned to 1 before `y` was assigned. We would *really* like the compiler to
be able to make these kinds of optimizations, because they can seriously
improve performance. On the other hand, we'd really like to be able to depend
on our program *doing the thing we said*.
# Hardware Reordering
On the other hand, even if the compiler totally understood what we wanted and
respected our wishes, our *hardware* might instead get us in trouble. Trouble
comes from CPUs in the form of memory hierarchies. There is indeed a global
shared memory space somewhere in your hardware, but from the perspective of
each CPU core it is *so very far away* and *so very slow*. Each CPU would
rather work with its local cache of the data and go through all the *anguish*
of talking to shared memory *only* when it doesn't actually have that memory
in cache.

After all, that's the whole *point* of the cache, right? If every read from the
cache had to run back to shared memory to double check that it hadn't changed,
what would the point be? The end result is that the hardware doesn't guarantee
that events that occur in the same order on *one* thread occur in the same
order on *another* thread. To guarantee this, we must issue special
instructions to the CPU telling it to be a bit less smart.
For instance, say we convince the compiler to emit this logic:

```
initial state: x = 0, y = 1

THREAD 1        THREAD 2
y = 3;          if x == 1 {
x = 1;              y *= 2;
                }
```
Ideally this program has 2 possible final states:

* `y = 3`: (thread 2 did the check before thread 1 completed)
* `y = 6`: (thread 2 did the check after thread 1 completed)
However there's a third potential state that the hardware enables:

* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)
It's worth noting that different kinds of CPU provide different guarantees. It
is common to separate hardware into two categories: strongly-ordered and
weakly-ordered. Most notably x86/64 provides strong ordering guarantees, while
ARM provides weak ordering guarantees. This has two consequences for
concurrent programming:
* Asking for stronger guarantees on strongly-ordered hardware may be cheap or
  even *free* because they already provide strong guarantees unconditionally.
  Weaker guarantees may only yield performance wins on weakly-ordered hardware.

* Asking for guarantees that are *too* weak on strongly-ordered hardware is
  more likely to *happen* to work, even though your program is strictly
  incorrect. If possible, concurrent algorithms should be tested on
  weakly-ordered hardware.
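
To make the first point concrete, here's a small sketch (our example, not the
original text's) of how the ordering you ask for maps to hardware. The exact
instructions depend on the compiler, but the commonly cited mappings look like
this:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub static X: AtomicUsize = AtomicUsize::new(0);

// On x86-64 all three of these typically compile to the same plain `mov`:
// the hardware provides acquire semantics on loads unconditionally, so the
// stronger orderings are free here (and too-weak code may "happen" to work).
pub fn load_relaxed() -> usize { X.load(Ordering::Relaxed) }
pub fn load_acquire() -> usize { X.load(Ordering::Acquire) }
pub fn load_seqcst() -> usize { X.load(Ordering::SeqCst) }

// On AArch64 the distinction is real: the Relaxed load can be a plain `ldr`,
// while the Acquire/SeqCst loads become the more expensive `ldar`.
```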
# Data Accesses
The C11 memory model attempts to bridge the gap by allowing us to talk about
the *causality* of our program. Generally, this is by establishing a *happens
before* relationship between parts of the program and the threads that are
running them. This gives the hardware and compiler room to optimize the
program more aggressively where a strict happens-before relationship isn't
established, but forces them to be more careful where one *is* established.
The way we communicate these relationships is through *data accesses* and
*atomic accesses*.
Data accesses are the bread-and-butter of the programming world. They are
fundamentally unsynchronized and compilers are free to aggressively optimize
them. In particular, data accesses are free to be reordered by the compiler on
the assumption that the program is single-threaded. The hardware is also free
to propagate the changes made in data accesses to other threads as lazily and
inconsistently as it wants. Most critically, data accesses are how data races
happen. Data accesses are very friendly to the hardware and compiler, but as
we've seen they offer *awful* semantics to try to write synchronized code
with. Actually, that's too weak. *It is literally impossible to write correct
synchronized code using only data accesses*.
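
To make that concrete, here's a deliberately broken sketch (the `FLAG`/`DATA`
names and setup are ours, purely for illustration) of a flag-based handoff
written with plain data accesses. Nothing prevents the compiler or hardware
from reordering the writes or hoisting the reads, so if two threads ran these
functions it would be a data race, and therefore Undefined Behavior:

```rust
// DELIBERATELY WRONG: plain, unsynchronized data accesses shared between
// threads. There is no way to fix this without atomics.
static mut DATA: u32 = 0;
static mut FLAG: bool = false;

fn producer() {
    unsafe {
        DATA = 42;   // these two writes may be reordered by the compiler
        FLAG = true; // or hardware, so FLAG can become visible before DATA
    }
}

fn consumer() -> u32 {
    unsafe {
        while !FLAG {} // may be "optimized" into `if !FLAG { loop {} }`
        DATA           // no happens-before edge: may see 0, 42, or anything
    }
}
```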
Atomic accesses are how we tell the hardware and compiler that our program is
multi-threaded. Each atomic access can be marked with an *ordering* that
specifies what kind of relationship it establishes with other accesses. In
practice, this boils down to telling the compiler and hardware certain things
they *can't* do. For the compiler, this largely revolves around re-ordering of
instructions. For the hardware, this largely revolves around how writes are
propagated to other threads. The set of orderings Rust exposes are:

* Sequentially Consistent (SeqCst)
* Release
* Acquire
* Relaxed

(Note: We explicitly do not expose the C11 *consume* ordering)

TODO: negative reasoning vs positive reasoning?
TODO: "can't forget to synchronize"
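
For reference, here's a minimal sketch of where these show up in the API:
every operation on the `std::sync::atomic` types takes an explicit `Ordering`
(the particular operations below are arbitrary, just to show the shape):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let x = AtomicUsize::new(0);

    x.store(1, Ordering::SeqCst);              // strongest: a single global order
    let _ = x.load(Ordering::Acquire);         // one half of a release/acquire pair
    x.store(2, Ordering::Release);             // the other half
    let _ = x.fetch_add(1, Ordering::Relaxed); // weakest: atomicity, nothing more
}
```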
# Sequentially Consistent
Sequentially Consistent is the most powerful of all, implying the restrictions
of all other orderings. Intuitively, a sequentially consistent operation
*cannot* be reordered: all accesses on one thread that happen before and after
a SeqCst access *stay* before and after it. A data-race-free program that uses
only sequentially consistent atomics and data accesses has the very nice
property that there is a single global execution of the program's instructions
that all threads agree on. This execution is also particularly nice to reason
about: it's just an interleaving of each thread's individual executions. This
*does not* hold if you start using the weaker atomic orderings.
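
For instance, consider this sketch (our own example: the classic "store
buffering" litmus test). Because all four SeqCst operations belong to a single
global order that respects each thread's program order, at least one of the
loads must observe the other thread's store:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static X: AtomicUsize = AtomicUsize::new(0);
static Y: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let t1 = thread::spawn(|| {
        X.store(1, Ordering::SeqCst);
        Y.load(Ordering::SeqCst) // r1
    });
    let t2 = thread::spawn(|| {
        Y.store(1, Ordering::SeqCst);
        X.load(Ordering::SeqCst) // r2
    });
    let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());

    // In the single global execution, whichever store comes second already
    // has the other thread's store before it, so both loads can't miss:
    assert!(r1 == 1 || r2 == 1);
}
```

With `Relaxed` (or even `Acquire`/`Release`) instead of `SeqCst`, both loads
reading 0 becomes a legal outcome.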
The relative developer-friendliness of sequential consistency doesn't come for
free. Even on strongly-ordered platforms sequential consistency involves
emitting memory fences.

In practice, sequential consistency is rarely necessary for program
correctness. However sequential consistency is definitely the right choice if
you're not confident about the other memory orders. Having your program run a
bit slower
than it needs to is certainly better than it running incorrectly! It's also
*mechanically* trivial to downgrade atomic operations to have a weaker
consistency later on. Just change `SeqCst` to e.g. `Relaxed` and you're done!
Of course, proving that this transformation is *correct* is a whole other
matter.
# Acquire-Release

Acquire and Release are largely intended to be paired. Their names hint at
their use case: they're perfectly suited for acquiring and releasing locks,
and ensuring that critical sections don't overlap.

Intuitively, an acquire access ensures that every access after it *stays* after
it. However operations that occur before an acquire are free to be reordered to
occur after it. Similarly, a release access ensures that every access before it
*stays* before it. However operations that occur after a release are free to
be reordered to occur before it.
When thread A releases a location in memory and then thread B subsequently
acquires *the same* location in memory, causality is established. Every write
that happened *before* A's release will be observed by B *after* its release.
However no causality is established with any other threads. Similarly, no
causality is established if A and B access *different* locations in memory.
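
The basic use of release-acquire is therefore simple: you acquire a location
of memory to begin the critical section, and then release that location to end
it. As a minimal sketch (using `compare_exchange`; in real code the lock would
be shared between threads, e.g. with an `Arc`), a simple spinlock might look
like:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

fn main() {
    let lock = AtomicBool::new(false); // false means "unlocked"

    // Acquire the lock by swapping in `true`. The Acquire ordering ensures
    // the critical section stays after this point, and pairs with the
    // Release store below (and with releases done by other threads).
    while lock
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {}

    // ... critical section: we have exclusive access here ...

    // Release the lock. The Release ordering ensures the critical section
    // stays before this point, publishing it to the next acquirer.
    lock.store(false, Ordering::Release);
}
```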