% Atomics

Rust pretty blatantly just inherits C11's memory model for atomics. This is not
due to this model being particularly excellent or easy to understand. Indeed,
this model is quite complex and known to have [several flaws][C11-busted].
Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad
at modeling atomics. At the very least, we can benefit from existing tooling
and research around C.

Trying to fully explain the model in this book is fairly hopeless. It's defined
in terms of madness-inducing causality graphs that require a full book to
properly understand in a practical way. If you want all the nitty-gritty
details, you should check out [C's specification (Section 7.17)][C11-model].
Still, we'll try to cover the basics and some of the problems Rust developers
face.

The C11 memory model is fundamentally about trying to bridge the gap between
the semantics we want, the optimizations compilers want, and the inconsistent
chaos our hardware wants. *We* would like to just write programs and have them
do exactly what we said but, you know, *fast*. Wouldn't that be great?



# Compiler Reordering

Compilers fundamentally want to be able to do all sorts of crazy
transformations to reduce data dependencies and eliminate dead code. In
particular, they may radically change the actual order of events, or make
events never occur! If we write something like

```rust,ignore
x = 1;
y = 3;
x = 2;
```

The compiler may conclude that it would *really* be best if your program did

```rust,ignore
x = 2;
y = 3;
```

This has inverted the order of events *and* completely eliminated one event.
From a single-threaded perspective this is completely unobservable: after all
the statements have executed we are in exactly the same state. But if our
program is multi-threaded, we may have been relying on `x` to *actually* be
assigned to 1 before `y` was assigned. We would *really* like the compiler to
be able to make these kinds of optimizations, because they can seriously
improve performance. On the other hand, we'd really like to be able to depend
on our program *doing the thing we said*.



# Hardware Reordering

On the other hand, even if the compiler totally understood what we wanted and
respected our wishes, our *hardware* might instead get us in trouble. Trouble
comes from CPUs in the form of memory hierarchies. There is indeed a global
shared memory space somewhere in your hardware, but from the perspective of
each CPU core it is *so very far away* and *so very slow*. Each CPU would
rather work with its local cache of the data and only go through all the
*anguish* of talking to shared memory when it doesn't actually have that
memory in cache.

After all, that's the whole *point* of the cache, right? If every read from the
cache had to run back to shared memory to double check that it hadn't changed,
what would the point be? The end result is that the hardware doesn't guarantee
that events that occur in the same order on *one* thread, occur in the same
order on *another* thread. To guarantee this, we must issue special
instructions to the CPU telling it to be a bit less smart.

For instance, say we convince the compiler to emit this logic:

```text
initial state: x = 0, y = 1

THREAD 1        THREAD 2
y = 3;          if x == 1 {
x = 1;              y *= 2;
                }
```

Ideally this program has 2 possible final states:

* `y = 3`: (thread 2 did the check before thread 1 completed)
* `y = 6`: (thread 2 did the check after thread 1 completed)

However there's a third potential state that the hardware enables:

* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)

It's worth noting that different kinds of CPU provide different guarantees. It
is common to separate hardware into two categories: strongly-ordered and
weakly-ordered. Most notably x86/64 provides strong ordering guarantees, while
ARM provides weak ordering guarantees. This has two consequences for
concurrent programming:

* Asking for stronger guarantees on strongly-ordered hardware may be cheap or
  even *free* because they already provide strong guarantees unconditionally.
  Weaker guarantees may only yield performance wins on weakly-ordered hardware.

* Asking for guarantees that are *too* weak on strongly-ordered hardware is
  more likely to *happen* to work, even though your program is strictly
  incorrect. If possible, concurrent algorithms should be tested on
  weakly-ordered hardware.




# Data Accesses

The C11 memory model attempts to bridge the gap by allowing us to talk about
the *causality* of our program. Generally, this is by establishing a
*happens before* relationship between parts of the program and the threads
that are running them. This gives the hardware and compiler room to optimize
the program more aggressively where a strict happens-before relationship isn't
established, but forces them to be more careful where one *is* established.
The way we communicate these relationships is through *data accesses* and
*atomic accesses*.

Data accesses are the bread-and-butter of the programming world. They are
fundamentally unsynchronized and compilers are free to aggressively optimize
them. In particular, data accesses are free to be reordered by the compiler on
the assumption that the program is single-threaded. The hardware is also free
to propagate the changes made in data accesses to other threads as lazily and
inconsistently as it wants. Most critically, data accesses are how data races
happen. Data accesses are very friendly to the hardware and compiler, but as
we've seen they offer *awful* semantics to try to write synchronized code
with. Actually, that's too weak. *It is literally impossible to write correct
synchronized code using only data accesses*.

Atomic accesses are how we tell the hardware and compiler that our program is
multi-threaded. Each atomic access can be marked with an *ordering* that
specifies what kind of relationship it establishes with other accesses. In
practice, this boils down to telling the compiler and hardware certain things
they *can't* do. For the compiler, this largely revolves around re-ordering of
instructions. For the hardware, this largely revolves around how writes are
propagated to other threads. The set of orderings Rust exposes is:

* Sequentially Consistent (SeqCst)
* Release
* Acquire
* Relaxed

(Note: We explicitly do not expose the C11 *consume* ordering)
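
In Rust, these orderings show up as the `Ordering` argument that every atomic
operation takes. As a minimal illustrative sketch (ours, not from the text),
each kind of operation accepts whichever orderings make sense for it:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

let a = AtomicUsize::new(0);
a.store(1, Ordering::Release);         // stores may be Release
let _x = a.load(Ordering::Acquire);    // loads may be Acquire
a.fetch_add(1, Ordering::Relaxed);     // read-modify-writes take one too
let _y = a.swap(5, Ordering::SeqCst);  // SeqCst is valid everywhere
```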

TODO: negative reasoning vs positive reasoning?
TODO: "can't forget to synchronize"



# Sequentially Consistent

Sequentially Consistent is the most powerful of all, implying the restrictions
of all other orderings. Intuitively, a sequentially consistent operation
*cannot* be reordered: all accesses on one thread that happen before and after
it *stay* before and after it. A data-race-free program that uses only
sequentially consistent atomics and data accesses has the very nice property
that there is a single global execution of the program's instructions that all
threads agree on. This execution is also particularly nice to reason about:
it's just an interleaving of each thread's individual executions. This *does
not* hold if you start using the weaker atomic orderings.
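
To see what that buys you, here's a classic litmus test, sketched by us (the
statics and thread structure are illustrative, not from the text). Because
every access is `SeqCst`, all threads agree on a single global order of the
two stores, so the final count can never be 0:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);
static Z: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let t1 = thread::spawn(|| X.store(true, Ordering::SeqCst));
    let t2 = thread::spawn(|| Y.store(true, Ordering::SeqCst));
    let t3 = thread::spawn(|| {
        while !X.load(Ordering::SeqCst) {} // wait until X is set
        if Y.load(Ordering::SeqCst) {
            Z.fetch_add(1, Ordering::SeqCst);
        }
    });
    let t4 = thread::spawn(|| {
        while !Y.load(Ordering::SeqCst) {} // wait until Y is set
        if X.load(Ordering::SeqCst) {
            Z.fetch_add(1, Ordering::SeqCst);
        }
    });
    for t in vec![t1, t2, t3, t4] {
        t.join().unwrap();
    }
    // Whichever of X and Y was stored *second* in the global order must be
    // seen by at least one of t3/t4, so at least one increment happened.
    assert_ne!(Z.load(Ordering::SeqCst), 0);
}
```

With weaker orderings, t3 and t4 could disagree about which store happened
first and *both* skip the increment.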

The relative developer-friendliness of sequential consistency doesn't come for
free. Even on strongly-ordered platforms sequential consistency involves
emitting memory fences.

In practice, sequential consistency is rarely necessary for program
correctness. However sequential consistency is definitely the right choice if
you're not confident about the other memory orders. Having your program run a
bit slower than it needs to is certainly better than it running incorrectly!
It's also *mechanically* trivial to downgrade atomic operations to have a
weaker consistency later on. Just change `SeqCst` to e.g. `Relaxed` and you're
done! Of course, proving that this transformation is *correct* is a whole
other matter.




# Acquire-Release

Acquire and Release are largely intended to be paired. Their names hint at
their use case: they're perfectly suited for acquiring and releasing locks,
and ensuring that critical sections don't overlap.

Intuitively, an acquire access ensures that every access after it *stays*
after it. However operations that occur before an acquire are free to be
reordered to occur after it. Similarly, a release access ensures that every
access before it *stays* before it. However operations that occur after a
release are free to be reordered to occur before it.

When thread A releases a location in memory and then thread B subsequently
acquires *the same* location in memory, causality is established. Every write
that happened *before* A's release will be observed by B *after* its acquire.
However no causality is established with any other threads. Similarly, no
causality is established if A and B access *different* locations in memory.

Basic use of release-acquire is therefore simple: you acquire a location of
memory to begin the critical section, and then release that location to end
it. For instance, a simple spinlock might look like:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

fn main() {
    let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?"

    // ... distribute lock to threads somehow ...

    // Try to acquire the lock by setting it to true
    while lock.compare_and_swap(false, true, Ordering::Acquire) { }
    // broke out of the loop, so we successfully acquired the lock!

    // ... scary data accesses ...

    // ok we're done, release the lock
    lock.store(false, Ordering::Release);
}
```
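
The same acquire/release pairing also gives you message passing: publish data
with a release store of a flag, then consume it with an acquire load of *the
same* flag. A minimal sketch (the `DATA`/`READY` names and structure are ours,
not from the text):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);     // (1) write the payload
        READY.store(true, Ordering::Release);  // (2) publish it
    });
    let consumer = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) {} // (3) wait for the publish
        // (1) happened before (2), and (3) acquired (2) on the same
        // location, so this read is guaranteed to see 42.
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
```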

On strongly-ordered platforms most accesses have release or acquire semantics,
making release and acquire often totally free. This is not the case on
weakly-ordered platforms.



# Relaxed

Relaxed accesses are the absolute weakest. They can be freely re-ordered and
provide no happens-before relationship. Still, relaxed operations *are*
atomic. That is, they don't count as data accesses, and any read-modify-write
operations done to them occur atomically. Relaxed operations are appropriate
for things that you definitely want to happen, but don't particularly
otherwise care about. For instance, incrementing a counter can be safely done
by multiple threads using a relaxed `fetch_add` if you're not using the
counter to synchronize any other accesses.
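
Here's a minimal sketch of such a counter (the names are ours, not from the
text). No increment is lost, even though Relaxed says nothing about how these
operations are ordered relative to any *other* memory:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1000 {
                    // Atomic read-modify-write: no updates can be lost.
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    assert_eq!(COUNTER.load(Ordering::Relaxed), 4000);
}
```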

There's rarely a benefit in making an operation relaxed on strongly-ordered
platforms, since they usually provide release-acquire semantics anyway.
However relaxed operations can be cheaper on weakly-ordered platforms.





[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf
[C11-model]: http://www.open-std.org/jtc1/sc22/wg14/www/standards.html#9899