Commit 454b588

Updated SPEC.md and DESIGN.md based on recent changes
- Added math behind CTZ limits
- Added documentation over atomic moves
1 parent f3578e3 commit 454b588

2 files changed

+232
-53
lines changed

DESIGN.md

Lines changed: 211 additions & 41 deletions
@@ -200,7 +200,7 @@ Now we could just leave files here, copying the entire file on write
200200
provides the synchronization without the duplicated memory requirements
201201
of the metadata blocks. However, we can do a bit better.
202202

203-
## CTZ linked-lists
203+
## CTZ skip-lists
204204

205205
There are many different data structures for representing the actual
206206
files in filesystems. Of these, the littlefs uses a rather unique [COW](https://upload.wikimedia.org/wikipedia/commons/0/0c/Cow_female_black_white.jpg)
@@ -246,26 +246,29 @@ runtime to just _read_ a file? That's awful. Keep in mind reading files are
246246
usually the most common filesystem operation.
247247

248248
To avoid this problem, the littlefs uses a multilayered linked-list. For
249-
every block that is divisible by a power of two, the block contains an
250-
additional pointer that points back by that power of two. Another way of
251-
thinking about this design is that there are actually many linked-lists
252-
threaded together, with each linked-lists skipping an increasing number
253-
of blocks. If you're familiar with data-structures, you may have also
254-
recognized that this is a deterministic skip-list.
249+
every nth block where n is divisible by 2^x, the block contains a pointer
250+
to block n-2^x. So each block contains anywhere from 1 to log2(n)+1 pointers
251+
that skip to various sections of the preceding list. If you're familiar with
252+
data-structures, you may have recognized that this is a type of deterministic
253+
skip-list.
255254

256-
To find the power of two factors efficiently, we can use the instruction
257-
[count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros),
258-
which is where this linked-list's name comes from.
255+
The name comes from the use of the
256+
[count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros)
257+
instruction, which allows us to calculate the power-of-two factors efficiently.
258+
For a given block n, the block contains ctz(n)+1 pointers.
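
As a minimal sketch of the ctz(n)+1 rule, the pointer count and pointer targets of a block can be computed from its index alone. The function names here are hypothetical and are not taken from the littlefs source. The sketch assumes that block 0 stores no pointers (ctz(0) is undefined, and Exhibit C shows no outgoing pointers on data 0), and it uses a portable loop where real code would likely use a compiler builtin or a CPU instruction.

```c
#include <assert.h>
#include <stdint.h>

// Count the trailing zero bits of a nonzero 32-bit value.
static unsigned ctz32(uint32_t x) {
    assert(x != 0); // ctz(0) is undefined
    unsigned count = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        count++;
    }
    return count;
}

// Number of back-pointers stored in block n.
// Assumption: block 0 is the start of the file and stores none.
static unsigned skiplist_pointer_count(uint32_t n) {
    return n == 0 ? 0 : ctz32(n) + 1;
}

// Index of the block that pointer i of block n refers to: n - 2^i.
static uint32_t skiplist_pointer_target(uint32_t n, unsigned i) {
    assert(i < skiplist_pointer_count(n));
    return n - (UINT32_C(1) << i);
}
```

For example, block 4 holds three pointers (to blocks 3, 2, and 0), while block 5 holds a single pointer to block 4.
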
259259

260260
```
261-
Exhibit C: A backwards CTZ linked-list
261+
Exhibit C: A backwards CTZ skip-list
262262
.--------. .--------. .--------. .--------. .--------. .--------.
263263
| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |<-| data 4 |<-| data 5 |
264264
| |<-| |--| |<-| |--| | | |
265265
| |<-| |--| |--| |--| | | |
266266
'--------' '--------' '--------' '--------' '--------' '--------'
267267
```
268268

269+
The additional pointers allow us to navigate the data-structure on disk
270+
much more efficiently than in a single linked-list.
271+
269272
Taking exhibit C for example, here is the path from data block 5 to data
270273
block 1. You can see how data block 3 was completely skipped:
271274
```
@@ -285,15 +288,57 @@ The path to data block 0 is even more quick, requiring only two jumps:
285288
'--------' '--------' '--------' '--------' '--------' '--------'
286289
```
287290

288-
The CTZ linked-list has quite a few interesting properties. All of the pointers
289-
in the block can be found by just knowing the index in the list of the current
290-
block, and, with a bit of math, the amortized overhead for the linked-list is
291-
only two pointers per block. Most importantly, the CTZ linked-list has a
292-
worst case lookup runtime of O(logn), which brings the runtime of reading a
293-
file down to O(n logn). Given that the constant runtime is divided by the
294-
amount of data we can store in a block, this is pretty reasonable.
295-
296-
Here is what it might look like to update a file stored with a CTZ linked-list:
291+
We can find the runtime complexity by looking at the path to any block from
292+
the block containing the most pointers. Every step along the path divides
293+
the search space for the block in half. This gives us a runtime of O(log n).
294+
To get to the block with the most pointers, we can perform the same steps
295+
backwards, which keeps the asymptotic runtime at O(log n). The interesting
296+
part about this data structure is that this optimal path occurs naturally
297+
if we greedily choose the pointer that covers the most distance without passing
298+
our target block.
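
The greedy rule can be sketched as a small simulation that works only on block indices, with no disk access. This is an illustration under assumptions, not the littlefs implementation: `skiplist_hops` and `trailing_zeros` are hypothetical names, and the sketch assumes each block n > 0 stores pointers to n-2^i for every i from 0 to ctz(n).

```c
#include <assert.h>
#include <stdint.h>

// Count the trailing zero bits of a nonzero 32-bit value.
static unsigned trailing_zeros(uint32_t x) {
    assert(x != 0);
    unsigned count = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        count++;
    }
    return count;
}

// Walk backwards from block `from` to block `to`, always taking the
// largest stored skip that does not pass the target. Returns the
// number of pointers followed.
static unsigned skiplist_hops(uint32_t from, uint32_t to) {
    assert(to <= from);
    unsigned hops = 0;
    uint32_t current = from;
    while (current > to) {
        // The largest skip stored in this block is 2^ctz(current).
        unsigned skip = trailing_zeros(current);
        // Shrink the skip until it no longer passes the target.
        while ((UINT32_C(1) << skip) > current - to) {
            skip--;
        }
        current -= UINT32_C(1) << skip;
        hops++;
    }
    return hops;
}
```

With the six blocks of exhibit C, `skiplist_hops(5, 1)` follows 5, 4, 2, 1 (three jumps, skipping block 3), and `skiplist_hops(5, 0)` follows 5, 4, 0 (two jumps).
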
299+
300+
So now we have a representation of files that can be appended trivially with
301+
a runtime of O(1), and can be read with a worst case runtime of O(n logn).
302+
Given that the runtime is also divided by the amount of data we can store
303+
in a block, this is pretty reasonable.
304+
305+
Unfortunately, the CTZ skip-list comes with a few questions that aren't
306+
straightforward to answer. What is the overhead? How do we handle more
307+
pointers than we can store in a block?
308+
309+
One way to find the overhead per block is to look at the data structure as
310+
multiple layers of linked-lists. Each linked-list skips twice as many blocks
311+
as the previous linked-list. Or another way of looking at it is that each
312+
linked-list uses half as much storage per block as the previous linked-list.
313+
As we approach infinity, the number of pointers per block forms a geometric
314+
series. Solving this geometric series gives us an average of only 2 pointers
315+
per block.
316+
317+
![overhead per block](https://latex.codecogs.com/gif.latex?%5Clim_%7Bn%5Cto%5Cinfty%7D%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D0%7D%5E%7Bn%7D%5Cleft%28%5Ctext%7Bctz%7D%28i%29&plus;1%5Cright%29%20%3D%20%5Csum_%7Bi%3D0%7D%5E%7B%5Cinfty%7D%5Cfrac%7B1%7D%7B2%5Ei%7D%20%3D%202)
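
The limit of 2 pointers per block can be checked numerically with a short sketch. The names are hypothetical, and the sum starts at block 1 because ctz(0) is undefined. When n is a power of two, the total for blocks 1 through n works out to exactly 2n-1, so the average (2n-1)/n approaches 2 from below.

```c
#include <assert.h>
#include <stdint.h>

// Count the trailing zero bits of a nonzero 32-bit value.
static unsigned ctz_nonzero(uint32_t x) {
    assert(x != 0);
    unsigned count = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        count++;
    }
    return count;
}

// Total number of pointers stored in blocks 1..n, where block i
// stores ctz(i)+1 pointers. Block 0 is excluded.
static uint64_t total_pointers(uint32_t n) {
    uint64_t total = 0;
    for (uint64_t i = 1; i <= n; i++) {
        total += ctz_nonzero((uint32_t)i) + 1;
    }
    return total;
}
```

For instance, `total_pointers(1024)` is 2047, an average of just under 2 pointers per block.
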
318+
319+
Finding the maximum number of pointers in a block is a bit more complicated,
320+
but since our file size is limited by the integer width we use to store the
321+
size, we can solve for it. Setting the overhead of the maximum pointers equal
322+
to the block size we get the following equation. Note that a smaller block size
323+
results in more pointers, and a larger word width results in larger pointers.
324+
325+
![maximum overhead](https://latex.codecogs.com/gif.latex?B%20%3D%20%5Cfrac%7Bw%7D%7B8%7D%5Cleft%5Clceil%5Clog_2%5Cleft%28%5Cfrac%7B2%5Ew%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D%5Cright%29%5Cright%5Crceil)
326+
327+
where:
328+
B = block size in bytes
329+
w = word width in bits
330+
331+
Solving the equation for B gives us the minimum block size for various word
332+
widths:
333+
32 bit CTZ skip-list = minimum block size of 104 bytes
334+
64 bit CTZ skip-list = minimum block size of 448 bytes
335+
336+
Since littlefs uses a 32 bit word size, we are limited to a minimum block
337+
size of 104 bytes. This is a perfectly reasonable minimum block size, with most
338+
block sizes starting around 512 bytes. So we can avoid the additional logic
339+
needed to avoid overflowing our block's capacity in the CTZ skip-list.
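
The two minimum block sizes can be reproduced with a small search over B. This is a sketch with hypothetical names. To stay in integer arithmetic, it rewrites the ceiling term of the equation as w minus floor(log2(B - 2w/8)), which is an equivalent form because w is an integer. It then returns the smallest B that is at least as large as the right-hand side of the equation.

```c
#include <assert.h>
#include <stdint.h>

// Integer floor(log2(x)) for nonzero x.
static unsigned floor_log2(uint64_t x) {
    assert(x != 0);
    unsigned result = 0;
    while (x >>= 1) {
        result++;
    }
    return result;
}

// Right-hand side of the equation for a candidate block size:
// (w/8) * ceil(log2(2^w / (B - 2*(w/8))))
static uint64_t max_pointer_bytes(uint64_t block_size, unsigned word_bits) {
    uint64_t word_bytes = word_bits / 8;
    assert(block_size > 2 * word_bytes);
    uint64_t denominator = block_size - 2 * word_bytes;
    // ceil(w - log2(d)) == w - floor(log2(d)) for integer w
    return word_bytes * (word_bits - floor_log2(denominator));
}

// Smallest block size whose capacity covers the largest pointer set.
static uint64_t min_block_size(unsigned word_bits) {
    uint64_t block_size = 2 * (uint64_t)(word_bits / 8) + 1;
    while (block_size < max_pointer_bytes(block_size, word_bits)) {
        block_size++;
    }
    return block_size;
}
```

Under these assumptions, `min_block_size(32)` gives 104 and `min_block_size(64)` gives 448, matching the values listed above.
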
340+
341+
Here is what it might look like to update a file stored with a CTZ skip-list:
297342
```
298343
block 1 block 2
299344
.---------.---------.
@@ -367,7 +412,7 @@ v
367412
## Block allocation
368413

369414
So those two ideas provide the grounds for the filesystem. The metadata pairs
370-
give us directories, and the CTZ linked-lists give us files. But this leaves
415+
give us directories, and the CTZ skip-lists give us files. But this leaves
371416
one big [elephant](https://upload.wikimedia.org/wikipedia/commons/3/37/African_Bush_Elephant.jpg)
372417
of a question. How do we get those blocks in the first place?
373418

@@ -653,9 +698,17 @@ deorphan step that simply iterates through every directory in the linked-list
653698
and checks it against every directory entry in the filesystem to see if it
654699
has a parent. The deorphan step occurs on the first block allocation after
655700
boot, so orphans should never cause the littlefs to run out of storage
656-
prematurely.
701+
prematurely. Note that the deorphan step never needs to run in a read-only
702+
filesystem.
703+
704+
## The move problem
657705

658-
And for my final trick, moving a directory:
706+
Now we have a real problem. How do we move things between directories while
707+
remaining power resilient? Even looking at the problem from a high level,
708+
it seems impossible. We can update directory blocks atomically, but
709+
updating two independent directory blocks is not an atomic operation.
710+
711+
Here are the steps the filesystem may go through to move a directory:
659712
```
660713
.--------.
661714
|root dir|-.
@@ -716,18 +769,135 @@ v
716769
'--------'
717770
```
718771

719-
Note that once again we don't care about the ordering of directories in the
720-
linked-list, so we can simply leave directories in their old positions. This
721-
does make the diagrams a bit hard to draw, but the littlefs doesn't really
722-
care.
772+
We can leave any orphans up to the deorphan step to collect, but that doesn't
773+
help the case where dir A has both dir B and the root dir as parents if we
774+
lose power inconveniently.
775+
776+
Initially, you might think this is fine. Dir A _might_ end up with two parents,
777+
but the filesystem will still work as intended. But then this raises the
778+
question: what do we do when dir A wears out? For other directory blocks
779+
we can update the parent pointer, but for a dir with two parents we would need
780+
to work out how to update both parents. And the check for multiple parents would
781+
need to be carried out for every directory, even if the directory has never
782+
been moved.
783+
784+
It also presents a bad user experience. Since the condition of ending up with
785+
two parents is rare, it's unlikely user-level code will be prepared. Just think
786+
about how a user would recover from a multi-parented directory. They can't just
787+
remove one directory, since remove would report the directory as "not empty".
788+
789+
Other atomic filesystems simply COW the entire directory tree. But this
790+
introduces a significant bit of complexity, which increases code size, along
791+
with a surprisingly expensive runtime cost during what most users assume is
792+
a single pointer update.
793+
794+
Another option is to update the directory block we're moving from to point
795+
to the destination with a sort of predicate that we have moved if the
796+
destination exists. Unfortunately, the omnipresent concern of wear could
797+
cause any of these directory entries to change blocks, and changing the
798+
entry size before a move introduces complications if it spills out of
799+
the current directory block.
800+
801+
So how do we go about moving a directory atomically?
802+
803+
We rely on the improbability of power loss.
804+
805+
Power loss during a move is certainly possible, but it's actually relatively
806+
rare. Unless a device is writing to a filesystem constantly, it's unlikely that
807+
a power loss will occur during filesystem activity. We still need to handle
808+
the condition, but runtime during a power loss takes a back seat to the runtime
809+
during normal operations.
810+
811+
So what littlefs does is inelegantly simple. When littlefs moves a file, it
812+
marks the file as "moving". This is stored as a single bit in the directory
813+
entry and doesn't take up much space. Then littlefs moves the directory,
814+
finishing with the complete removal of the "moving" directory entry.
815+
816+
```
817+
.--------.
818+
|root dir|-.
819+
| pair 0 | |
820+
.--------| |-'
821+
| '--------'
822+
| .-' '-.
823+
| v v
824+
| .--------. .--------.
825+
'->| dir A |->| dir B |
826+
| pair 0 | | pair 0 |
827+
| | | |
828+
'--------' '--------'
829+
830+
| update root directory to mark directory A as moving
831+
v
832+
833+
.----------.
834+
|root dir |-.
835+
| pair 0 | |
836+
.-------| moving A!|-'
837+
| '----------'
838+
| .-' '-.
839+
| v v
840+
| .--------. .--------.
841+
'->| dir A |->| dir B |
842+
| pair 0 | | pair 0 |
843+
| | | |
844+
'--------' '--------'
845+
846+
| update directory B to point to directory A
847+
v
848+
849+
.----------.
850+
|root dir |-.
851+
| pair 0 | |
852+
.-------| moving A!|-'
853+
| '----------'
854+
| .-----' '-.
855+
| | v
856+
| | .--------.
857+
| | .->| dir B |
858+
| | | | pair 0 |
859+
| | | | |
860+
| | | '--------'
861+
| | .-------'
862+
| v v |
863+
| .--------. |
864+
'->| dir A |-'
865+
| pair 0 |
866+
| |
867+
'--------'
868+
869+
| update root to no longer contain directory A
870+
v
871+
.--------.
872+
|root dir|-.
873+
| pair 0 | |
874+
.----| |-'
875+
| '--------'
876+
| |
877+
| v
878+
| .--------.
879+
| .->| dir B |
880+
| | | pair 0 |
881+
| '--| |-.
882+
| '--------' |
883+
| | |
884+
| v |
885+
| .--------. |
886+
'--->| dir A |-'
887+
| pair 0 |
888+
| |
889+
'--------'
890+
```
891+
892+
Now, if we run into a directory entry that has been marked as "moved", one
893+
of two things is possible. Either the directory entry exists elsewhere in the
894+
filesystem, or it doesn't. Checking which is an O(n) operation, but it only occurs in the
895+
unlikely case we lost power during a move.
723896

724-
It's also worth noting that once again we have an operation that isn't actually
725-
atomic. After we add directory A to directory B, we could lose power, leaving
726-
directory A as a part of both the root directory and directory B. However,
727-
there isn't anything inherent to the littlefs that prevents a directory from
728-
having multiple parents, so in this case, we just allow that to happen. Extra
729-
care is taken to only remove a directory from the linked-list if there are
730-
no parents left in the filesystem.
897+
And we can easily fix the "moved" directory entry. Since we're already scanning
898+
the filesystem during the deorphan step, we can also check for moved entries.
899+
If we find one, we either remove the "moved" marking or remove the whole entry
900+
if it exists elsewhere in the filesystem.
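
The fix-up rule can be sketched with a toy in-memory model. Everything here is hypothetical: the `toy_` structures and functions are illustrative only and do not reflect the on-disk littlefs format or its API. The sketch assumes that each entry carries a single mark bit and an id for the directory it references, and that "exists elsewhere" means another in-use entry references the same id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENTRIES 4

// Hypothetical directory entry: not the littlefs on-disk layout.
struct toy_entry {
    bool used;   // slot holds an entry
    bool moving; // the single-bit move mark
    int target;  // id of the directory this entry references
};

struct toy_dir {
    struct toy_entry entries[TOY_ENTRIES];
};

// O(n) scan: does any other in-use entry reference the same target?
static bool toy_exists_elsewhere(const struct toy_dir *dirs, size_t ndirs,
                                 const struct toy_entry *marked) {
    for (size_t d = 0; d < ndirs; d++) {
        for (size_t i = 0; i < TOY_ENTRIES; i++) {
            const struct toy_entry *e = &dirs[d].entries[i];
            if (e != marked && e->used && e->target == marked->target) {
                return true;
            }
        }
    }
    return false;
}

// Fix-up pass, run alongside the deorphan scan in this model.
static void toy_fix_moves(struct toy_dir *dirs, size_t ndirs) {
    for (size_t d = 0; d < ndirs; d++) {
        for (size_t i = 0; i < TOY_ENTRIES; i++) {
            struct toy_entry *e = &dirs[d].entries[i];
            if (!e->used || !e->moving) {
                continue;
            }
            if (toy_exists_elsewhere(dirs, ndirs, e)) {
                e->used = false;   // destination was written: drop the source entry
            } else {
                e->moving = false; // destination was never written: clear the mark
            }
        }
    }
}
```

In this model, power loss after the destination is written leaves two entries for the same directory, and the pass removes the marked one. Power loss before the destination is written leaves only the marked entry, and the pass simply clears the mark.
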
731901

732902
## Wear awareness
733903

@@ -955,18 +1125,18 @@ So, to summarize:
9551125

9561126
1. The littlefs is composed of directory blocks
9571127
2. Each directory is a linked-list of metadata pairs
958-
3. These metadata pairs can be updated atomically by alternative which
1128+
3. These metadata pairs can be updated atomically by alternating which
9591129
metadata block is active
9601130
4. Directory blocks contain either references to other directories or files
961-
5. Files are represented by copy-on-write CTZ linked-lists
962-
6. The CTZ linked-lists support appending in O(1) and reading in O(n logn)
963-
7. Blocks are allocated by scanning the filesystem for used blocks in a
1131+
5. Files are represented by copy-on-write CTZ skip-lists which support O(1)
1132+
append and O(n logn) reading
1133+
6. Blocks are allocated by scanning the filesystem for used blocks in a
9641134
fixed-size lookahead region that is stored in a bit-vector
965-
8. To facilitate scanning the filesystem, all directories are part of a
1135+
7. To facilitate scanning the filesystem, all directories are part of a
9661136
linked-list that is threaded through the entire filesystem
967-
9. If a block develops an error, the littlefs allocates a new block, and
1137+
8. If a block develops an error, the littlefs allocates a new block, and
9681138
moves the data and references of the old block to the new.
969-
10. Any case where an atomic operation is not possible, it is taken care of
1139+
9. In any case where an atomic operation is not possible, mistakes are resolved
9701140
by a deorphan step that occurs on the first allocation after boot
9711141

9721142
That's the little filesystem. Thanks for reading!
