Commit 26d86b8

committed
---
yaml --- r: 228999 b: refs/heads/try c: 668bdd3 h: refs/heads/master i: 228997: 5ad7385 228995: 73f5eb7 228991: 8ca3f99 v: v3
1 parent f19ce30 commit 26d86b8

File tree

2 files changed: +198 −18 lines

[refs]

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 refs/heads/master: aca2057ed5fb7af3f8905b2bc01f72fa001c35c8
 refs/heads/snap-stage3: 1af31d4974e33027a68126fa5a5a3c2c6491824f
-refs/heads/try: d8f460c29d6b7e07775562e2090f3f36c6d651e0
+refs/heads/try: 668bdd3650f8fcce956d8e47ae74247a50da3e46
 refs/tags/release-0.1: 1f5c5126e96c79d22cb7862f75304136e204f105
 refs/tags/release-0.2: c870d2dffb391e14efb05aa27898f1f6333a9596
 refs/tags/release-0.3: b5f0d0f648d9a6153664837026ba1be43d3e2503

branches/try/atomics.md

Lines changed: 197 additions & 17 deletions
@@ -7,27 +7,138 @@ it is a pragmatic concession to the fact that *everyone* is pretty bad at modeli
 atomics. At very least, we can benefit from existing tooling and research around
 C.
 
-Trying to fully explain the model is fairly hopeless. If you want all the
-nitty-gritty details, you should check out [C's specification][C11-model].
-Still, we'll try to cover the basics and some of the problems Rust developers
-face.
+Trying to fully explain the model in this book is fairly hopeless. It's defined
+in terms of madness-inducing causality graphs that require a full book to
+understand in a practical way. If you want all the nitty-gritty details, you
+should check out [C's specification][C11-model]. Still, we'll try to cover the
+basics and some of the problems Rust developers face.
 
-The C11 memory model is fundamentally about trying to bridge the gap between C's
-single-threaded semantics, common compiler optimizations, and hardware peculiarities
-in the face of a multi-threaded environment. It does this by splitting memory
-accesses into two worlds: data accesses, and atomic accesses.
+The C11 memory model is fundamentally about trying to bridge the gap between
+the semantics we want, the optimizations compilers want, and the inconsistent
+chaos our hardware wants. *We* would like to just write programs and have them
+do exactly what we said but, you know, *fast*. Wouldn't that be great?
+
+
+
+# Compiler Reordering
+
+Compilers fundamentally want to be able to do all sorts of crazy transformations
+to reduce data dependencies and eliminate dead code. In particular, they may
+radically change the actual order of events, or make events never occur! If we
+write something like
+
+```rust,ignore
+x = 1;
+y = 3;
+x = 2;
+```
+
+The compiler may conclude that it would *really* be best if your program did
+
+```rust,ignore
+x = 2;
+y = 3;
+```
+
+This has inverted the order of events *and* completely eliminated one event.
+From a single-threaded perspective this is completely unobservable: after all
+the statements have executed we are in exactly the same state. But if our
+program is multi-threaded, we may have been relying on `x` to *actually* be
+assigned 1 before `y` was assigned. We would *really* like the compiler to be
+able to make these kinds of optimizations, because they can seriously improve
+performance. On the other hand, we'd really like to be able to depend on our
+program *doing the thing we said*.
+
+
+# Hardware Reordering
+
+On the other hand, even if the compiler totally understood what we wanted and
+respected our wishes, our *hardware* might instead get us in trouble. Trouble
+comes from CPUs in the form of memory hierarchies. There is indeed a global
+shared memory space somewhere in your hardware, but from the perspective of
+each CPU core it is *so very far away* and *so very slow*. Each CPU would
+rather work with its local cache of the data and go through all the *anguish*
+of talking to shared memory *only* when it doesn't actually have that memory
+in cache.
+
+After all, that's the whole *point* of the cache, right? If every read from the
+cache had to run back to shared memory to double check that it hadn't changed,
+what would the point be? The end result is that the hardware doesn't guarantee
+that events that occur in some order on *one* thread occur in the same order
+on *another* thread. To guarantee this, we must issue special instructions to
+the CPU telling it to be a bit less smart.
+
+For instance, say we convince the compiler to emit this logic:
+
+```text
+initial state: x = 0, y = 1
+
+THREAD 1        THREAD 2
+y = 3;          if x == 1 {
+x = 1;              y *= 2;
+                }
+```
+
+Ideally this program has 2 possible final states:
+
+* `y = 3`: (thread 2 did the check before thread 1 completed)
+* `y = 6`: (thread 2 did the check after thread 1 completed)
+
+However there's a third potential state that the hardware enables:
+
+* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`)
+
+It's worth noting that different kinds of CPU provide different guarantees. It
+is common to separate hardware into two categories: strongly-ordered and
+weakly-ordered. Most notably x86/64 provides strong ordering guarantees, while
+ARM provides weak ordering guarantees. This has two consequences for
+concurrent programming:
+
+* Asking for stronger guarantees on strongly-ordered hardware may be cheap or
+  even *free* because they already provide strong guarantees unconditionally.
+  Weaker guarantees may only yield performance wins on weakly-ordered hardware.
+
+* Asking for guarantees that are *too* weak on strongly-ordered hardware
+  is more likely to *happen* to work, even though your program is strictly
+  incorrect. If possible, concurrent algorithms should be tested on
+  weakly-ordered hardware.
+
+
+
+# Data Accesses
+
+The C11 memory model attempts to bridge the gap by allowing us to talk about
+the *causality* of our program. Generally, this is by establishing a
+*happens before* relationship between parts of the program and the threads
+that are running them. This gives the hardware and compiler room to optimize
+the program more aggressively where a strict happens-before relationship isn't
+established, but forces them to be more careful where one *is* established.
+The way we communicate these relationships is through *data accesses* and
+*atomic accesses*.
 
 Data accesses are the bread-and-butter of the programming world. They are
 fundamentally unsynchronized and compilers are free to aggressively optimize
-them. In particular data accesses are free to be reordered by the compiler
+them. In particular, data accesses are free to be reordered by the compiler
 on the assumption that the program is single-threaded. The hardware is also free
-to propagate the changes made in data accesses as lazily and inconsistently as
-it wants to other threads. Mostly critically, data accesses are where we get data
-races. These are pretty clearly awful semantics to try to write a multi-threaded
-program with.
+to propagate the changes made in data accesses to other threads as lazily and
+inconsistently as it wants. Most critically, data accesses are how data races
+happen. Data accesses are very friendly to the hardware and compiler, but as
+we've seen they offer *awful* semantics to try to write synchronized code with.
 
-Atomic accesses are the answer to this. Each atomic access can be marked with
-an *ordering*. The set of orderings Rust exposes are:
+Atomic accesses are how we tell the hardware and compiler that our program is
+multi-threaded. Each atomic access can be marked with an *ordering* that
+specifies what kind of relationship it establishes with other accesses. In
+practice, this boils down to telling the compiler and hardware certain things
+they *can't* do. For the compiler, this largely revolves around re-ordering of
+instructions. For the hardware, this largely revolves around how writes are
+propagated to other threads. The set of orderings Rust exposes are:
 
 * Sequentially Consistent (SeqCst)
 * Release
@@ -36,11 +147,80 @@ an *ordering*. The set of orderings Rust exposes are:
 
 (Note: We explicitly do not expose the C11 *consume* ordering)
 
-TODO: give simple "basic" explanation of these
-TODO: implementing Arc example (why does Drop need the trailing barrier?)
+TODO: negative reasoning vs positive reasoning?
+
+
+
+# Sequentially Consistent
+
+Sequentially Consistent is the most powerful of all, implying the restrictions
+of all other orderings. A Sequentially Consistent operation *cannot* be
+reordered: all accesses on one thread that happen before and after it *stay*
+before and after it. A program that has sequential consistency has the very
+nice property that there is a single global execution of the program's
+instructions that all threads agree on. This execution is also particularly
+nice to reason about: it's just an interleaving of each thread's individual
+executions.
+
+The relative developer-friendliness of sequential consistency doesn't come for
+free. Even on strongly-ordered platforms, sequential consistency involves
+emitting memory fences.
+
+In practice, sequential consistency is rarely necessary for program correctness.
+However, sequential consistency is definitely the right choice if you're not
+confident about the other memory orders. Having your program run a bit slower
+than it needs to is certainly better than it running incorrectly! It's also
+completely trivial to downgrade to a weaker consistency later.
+
+
+
+# Acquire-Release
+
+Acquire and Release are largely intended to be paired. Their names hint at
+their use case: they're perfectly suited for acquiring and releasing locks,
+and ensuring that critical sections don't overlap.
+
+An acquire access ensures that every access after it *stays* after it. However,
+operations that occur before an acquire are free to be reordered to occur
+after it.
+
+A release access ensures that every access before it *stays* before it. However,
+operations that occur after a release are free to be reordered to occur before
+it.
+
+Basic use of release-acquire is simple: you acquire a location of memory to
+begin the critical section, and then release that location to end it. If
+thread A releases a location of memory and thread B subsequently acquires that
+location of memory, this establishes that A's critical section *happened
+before* B's critical section. All accesses that happened before the release
+will be observed by anything that happens after the acquire.
+
+On strongly-ordered platforms most accesses have release or acquire semantics,
+making release and acquire often totally free. This is not the case on
+weakly-ordered platforms.
+
+
+
+# Relaxed
+
+Relaxed accesses are the absolute weakest. They can be freely re-ordered and
+provide no happens-before relationship. Still, relaxed operations *are*
+atomic, which is valuable. Relaxed operations are appropriate for things that
+you definitely want to happen, but don't particularly care about much else.
+For instance, incrementing a counter can be relaxed if you're not using the
+counter to synchronize any other accesses.
+
+There's rarely a benefit in making an operation relaxed on strongly-ordered
+platforms, since they usually provide release-acquire semantics anyway.
+However, relaxed operations can be cheaper on weakly-ordered platforms.
+
+
+
+TODO: implementing Arc example (why does Drop need the trailing barrier?)
 
 
 [C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf
