mirror of https://github.com/eledio-devices/thirdparty-littlefs.git, synced 2025-10-31 00:32:38 +01:00

Updated SPEC.md and DESIGN.md based on recent changes

- Added math behind CTZ limits
- Added documentation over atomic moves
DESIGN.md
							| @@ -200,7 +200,7 @@ Now we could just leave files here, copying the entire file on write | ||||
| provides the synchronization without the duplicated memory requirements | ||||
| of the metadata blocks. However, we can do a bit better. | ||||
|  | ||||
| ## CTZ linked-lists | ||||
| ## CTZ skip-lists | ||||
|  | ||||
| There are many different data structures for representing the actual | ||||
| files in filesystems. Of these, the littlefs uses a rather unique [COW](https://upload.wikimedia.org/wikipedia/commons/0/0c/Cow_female_black_white.jpg) | ||||
| @@ -246,19 +246,19 @@ runtime to just _read_ a file? That's awful. Keep in mind reading files are | ||||
| usually the most common filesystem operation. | ||||
|  | ||||
| To avoid this problem, the littlefs uses a multilayered linked-list. For | ||||
| every block that is divisible by a power of two, the block contains an | ||||
| additional pointer that points back by that power of two. Another way of | ||||
| thinking about this design is that there are actually many linked-lists | ||||
| threaded together, with each linked-lists skipping an increasing number | ||||
| of blocks. If you're familiar with data-structures, you may have also | ||||
| recognized that this is a deterministic skip-list. | ||||
| every nth block where n is divisible by 2^x, the block contains a pointer | ||||
| to block n-2^x. So each block contains anywhere from 1 to log2(n)+1 pointers | ||||
| that skip to various sections of the preceding list. If you're familiar with | ||||
| data-structures, you may have recognized that this is a type of deterministic | ||||
| skip-list. | ||||
|  | ||||
| To find the power of two factors efficiently, we can use the instruction | ||||
| [count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros), | ||||
| which is where this linked-list's name comes from. | ||||
| The name comes from the use of the | ||||
| [count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros) | ||||
| instruction, which allows us to calculate the power-of-two factors efficiently. | ||||
| For a given block n, the block contains ctz(n)+1 pointers. | ||||
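As a minimal sketch of the rule above, the following C computes how many pointers block n holds and where each one points. The function names are hypothetical and not taken from littlefs, and treating block 0 as having no pointers is an assumption (it has nothing to point back to).

```c
#include <assert.h>
#include <stdint.h>

/* Count trailing zeros of a nonzero 32-bit value. This is a portable
 * fallback; many compilers and CPUs provide a dedicated instruction. */
unsigned ctz32(uint32_t n) {
    assert(n != 0);
    unsigned count = 0;
    while ((n & 1u) == 0) {
        n >>= 1;
        count++;
    }
    return count;
}

/* Number of pointers stored in block n: ctz(n)+1 for n > 0.
 * Assumption: block 0 stores no pointers. */
unsigned skiplist_pointer_count(uint32_t n) {
    return n == 0 ? 0 : ctz32(n) + 1;
}

/* Block index that pointer x of block n refers to: n - 2^x.
 * Only valid for 0 <= x < skiplist_pointer_count(n). */
uint32_t skiplist_pointer_target(uint32_t n, unsigned x) {
    assert(x < skiplist_pointer_count(n));
    return n - ((uint32_t)1 << x);
}
```

For block 4 this gives three pointers, to blocks 3, 2 and 0, while block 5 has a single pointer to block 4.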
|  | ||||
| ``` | ||||
| Exhibit C: A backwards CTZ linked-list | ||||
| Exhibit C: A backwards CTZ skip-list | ||||
| .--------.  .--------.  .--------.  .--------.  .--------.  .--------. | ||||
| | data 0 |<-| data 1 |<-| data 2 |<-| data 3 |<-| data 4 |<-| data 5 | | ||||
| |        |<-|        |--|        |<-|        |--|        |  |        | | ||||
| @@ -266,6 +266,9 @@ Exhibit C: A backwards CTZ linked-list | ||||
| '--------'  '--------'  '--------'  '--------'  '--------'  '--------' | ||||
| ``` | ||||
|  | ||||
| The additional pointers allow us to navigate the data-structure on disk | ||||
| much more efficiently than in a single linked-list. | ||||
|  | ||||
| Taking exhibit C for example, here is the path from data block 5 to data | ||||
| block 1. You can see how data block 3 was completely skipped: | ||||
| ``` | ||||
| @@ -285,15 +288,57 @@ The path to data block 0 is even more quick, requiring only two jumps: | ||||
| '--------'  '--------'  '--------'  '--------'  '--------'  '--------' | ||||
| ``` | ||||
|  | ||||
| The CTZ linked-list has quite a few interesting properties. All of the pointers | ||||
| in the block can be found by just knowing the index in the list of the current | ||||
| block, and, with a bit of math, the amortized overhead for the linked-list is | ||||
| only two pointers per block.  Most importantly, the CTZ linked-list has a | ||||
| worst case lookup runtime of O(logn), which brings the runtime of reading a | ||||
| file down to O(n logn). Given that the constant runtime is divided by the | ||||
| amount of data we can store in a block, this is pretty reasonable. | ||||
| We can find the runtime complexity by looking at the path to any block from | ||||
| the block containing the most pointers. Every step along the path divides | ||||
| the search space for the block in half. This gives us a runtime of O(log n). | ||||
| To get to the block with the most pointers, we can perform the same steps | ||||
| backwards, which keeps the asymptotic runtime at O(log n). The interesting | ||||
| part about this data structure is that this optimal path occurs naturally | ||||
| if we greedily choose the pointer that covers the most distance without passing | ||||
| our target block. | ||||
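The greedy choice described above can be sketched in C as follows. This is an illustrative model that only works on block indices and counts hops; it does not read any storage, and the function names are hypothetical rather than littlefs's own.

```c
#include <assert.h>
#include <stdint.h>

/* Count trailing zeros of a nonzero 32-bit value (portable fallback). */
unsigned ctz32(uint32_t n) {
    assert(n != 0);
    unsigned count = 0;
    while ((n & 1u) == 0) {
        n >>= 1;
        count++;
    }
    return count;
}

/* Walk backwards from block `current` to block `target`, always taking
 * the longest pointer that does not pass the target. Returns the number
 * of pointers followed. */
unsigned skiplist_hops(uint32_t current, uint32_t target) {
    assert(target <= current);
    unsigned hops = 0;
    while (current > target) {
        /* Block `current` holds pointers x = 0..ctz(current). */
        unsigned x = ctz32(current);
        /* Shrink the jump until it no longer overshoots the target.
         * x = 0 always fits because current - target >= 1. */
        while (((uint32_t)1 << x) > current - target) {
            x--;
        }
        current -= (uint32_t)1 << x;
        hops++;
    }
    return hops;
}
```

With the blocks from exhibit C, going from block 5 to block 1 takes three hops (5, 4, 2, 1) and going from block 5 to block 0 takes two (5, 4, 0).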
|  | ||||
| Here is what it might look like to update a file stored with a CTZ linked-list: | ||||
| So now we have a representation of files that can be appended trivially with | ||||
| a runtime of O(1), and can be read with a worst case runtime of O(n logn). | ||||
| Given that the runtime is also divided by the amount of data we can store | ||||
| in a block, this is pretty reasonable. | ||||
|  | ||||
| Unfortunately, the CTZ skip-list comes with a few questions that aren't | ||||
| straightforward to answer. What is the overhead? How do we handle more | ||||
| pointers than we can store in a block? | ||||
|  | ||||
| One way to find the overhead per block is to look at the data structure as | ||||
| multiple layers of linked-lists. Each linked-list skips twice as many blocks | ||||
| as the previous linked-list. Or another way of looking at it is that each  | ||||
| linked-list uses half as much storage per block as the previous linked-list. | ||||
| As we approach infinity, the number of pointers per block forms a geometric | ||||
| series. Solving this geometric series gives us an average of only 2 pointers | ||||
| per block. | ||||
|  | ||||
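| $$\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\left(\text{ctz}(i)+1\right) = \sum_{i=0}^{\infty}\frac{1}{2^i} = 2$$ | ||||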
|  | ||||
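The average of 2 pointers per block can also be checked numerically. The sketch below simply sums ctz(i)+1 over blocks 1 through n; the function name is hypothetical, and block 0 is left out on the assumption that it stores no pointers.

```c
#include <stdint.h>

/* Total number of pointers stored in blocks 1..n of a CTZ skip-list,
 * where block i holds ctz(i)+1 pointers. */
uint64_t skiplist_total_pointers(uint64_t n) {
    uint64_t total = 0;
    for (uint64_t i = 1; i <= n; i++) {
        uint64_t v = i;
        unsigned pointers = 1; /* every block i > 0 points to block i-1 */
        while ((v & 1u) == 0) {
            v >>= 1;
            pointers++;        /* one more pointer per trailing zero bit */
        }
        total += pointers;
    }
    return total;
}
```

For n = 1024 the total is 2047 pointers, so the average sits just under 2 per block and approaches 2 as n grows.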
| Finding the maximum number of pointers in a block is a bit more complicated, | ||||
| but since our file size is limited by the integer width we use to store the | ||||
| size, we can solve for it. Setting the overhead of the maximum pointers equal | ||||
| to the block size we get the following equation. Note that a smaller block size | ||||
| results in more pointers, and a larger word width results in larger pointers. | ||||
|  | ||||
| $$B = \frac{w}{8}\left\lceil\log_2\left(\frac{2^w}{B - 2\frac{w}{8}}\right)\right\rceil$$ | ||||
|  | ||||
| where:   | ||||
| B = block size in bytes   | ||||
| w = word width in bits   | ||||
|  | ||||
| Solving the equation for B gives us the minimum block size for various word | ||||
| widths:   | ||||
| 32 bit CTZ skip-list = minimum block size of 104 bytes   | ||||
| 64 bit CTZ skip-list = minimum block size of 448 bytes   | ||||
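The two minimum block sizes can be reproduced with a short search. The sketch assumes the overhead equation has the form B = (w/8) * ceil(log2(2^w / (B - 2w/8))); that form is an assumption here, chosen because it yields the 104 and 448 byte figures listed above. All function names are hypothetical.

```c
#include <stdint.h>

/* floor(log2(x)) for x >= 1. */
unsigned floor_log2(uint64_t x) {
    unsigned r = 0;
    while (x >>= 1) {
        r++;
    }
    return r;
}

/* Bytes taken by the largest possible pointer set in a block of
 * block_size bytes with word_bits-wide pointers, using the assumed
 * equation. For an integer w, ceil(w - log2(d)) == w - floor(log2(d)),
 * which avoids both floating point and computing 2^w directly.
 * Requires block_size > 2 * (word_bits / 8). */
uint64_t max_pointer_bytes(uint64_t block_size, unsigned word_bits) {
    uint64_t word_bytes = word_bits / 8;
    uint64_t d = block_size - 2 * word_bytes;
    return word_bytes * (word_bits - floor_log2(d));
}

/* Smallest block size whose worst-case pointer set still fits. */
uint64_t min_block_size(unsigned word_bits) {
    uint64_t b = 2 * (uint64_t)(word_bits / 8) + 1;
    while (max_pointer_bytes(b, word_bits) > b) {
        b++;
    }
    return b;
}
```

Calling `min_block_size(32)` returns 104 and `min_block_size(64)` returns 448, matching the values above.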
|  | ||||
| Since littlefs uses a 32 bit word size, we are limited to a minimum block | ||||
| size of 104 bytes. This is a perfectly reasonable minimum block size, with most | ||||
| block sizes starting around 512 bytes. So we can avoid the additional logic | ||||
| needed to avoid overflowing our block's capacity in the CTZ skip-list. | ||||
|  | ||||
| Here is what it might look like to update a file stored with a CTZ skip-list: | ||||
| ``` | ||||
|                                       block 1   block 2 | ||||
|                                     .---------.---------. | ||||
| @@ -367,7 +412,7 @@ v | ||||
| ## Block allocation | ||||
|  | ||||
| So those two ideas provide the grounds for the filesystem. The metadata pairs | ||||
| give us directories, and the CTZ linked-lists give us files. But this leaves | ||||
| give us directories, and the CTZ skip-lists give us files. But this leaves | ||||
| one big [elephant](https://upload.wikimedia.org/wikipedia/commons/3/37/African_Bush_Elephant.jpg) | ||||
| of a question. How do we get those blocks in the first place? | ||||
|  | ||||
| @@ -653,9 +698,17 @@ deorphan step that simply iterates through every directory in the linked-list | ||||
| and checks it against every directory entry in the filesystem to see if it | ||||
| has a parent. The deorphan step occurs on the first block allocation after | ||||
| boot, so orphans should never cause the littlefs to run out of storage | ||||
| prematurely. | ||||
| prematurely. Note that the deorphan step never needs to run in a read-only | ||||
| filesystem. | ||||
|  | ||||
| And for my final trick, moving a directory: | ||||
| ## The move problem | ||||
|  | ||||
| Now we have a real problem. How do we move things between directories while | ||||
| remaining power resilient? Even looking at the problem from a high level, | ||||
| it seems impossible. We can update directory blocks atomically, but | ||||
| updating two independent directory blocks is not an atomic operation. | ||||
|  | ||||
| Here are the steps the filesystem may go through to move a directory: | ||||
| ``` | ||||
|          .--------. | ||||
|          |root dir|-. | ||||
| @@ -716,18 +769,135 @@ v | ||||
|      '--------' | ||||
| ``` | ||||
|  | ||||
| Note that once again we don't care about the ordering of directories in the | ||||
| linked-list, so we can simply leave directories in their old positions. This | ||||
| does make the diagrams a bit hard to draw, but the littlefs doesn't really | ||||
| care. | ||||
| We can leave any orphans up to the deorphan step to collect, but that doesn't | ||||
| help the case where dir A has both dir B and the root dir as parents if we | ||||
| lose power inconveniently. | ||||
|  | ||||
| It's also worth noting that once again we have an operation that isn't actually | ||||
| atomic. After we add directory A to directory B, we could lose power, leaving | ||||
| directory A as a part of both the root directory and directory B. However, | ||||
| there isn't anything inherent to the littlefs that prevents a directory from | ||||
| having multiple parents, so in this case, we just allow that to happen. Extra | ||||
| care is taken to only remove a directory from the linked-list if there are | ||||
| no parents left in the filesystem. | ||||
| Initially, you might think this is fine. Dir A _might_ end up with two parents, | ||||
| but the filesystem will still work as intended. But then this raises the | ||||
| question of what do we do when dir A wears out? For other directory blocks | ||||
| we can update the parent pointer, but for a dir with two parents we would need | ||||
| to work out how to update both parents. And the check for multiple parents would | ||||
| need to be carried out for every directory, even if the directory has never | ||||
| been moved. | ||||
|  | ||||
| It also presents a bad user-experience. Since the condition of ending up with | ||||
| two parents is rare, it's unlikely user-level code will be prepared. Just think | ||||
| about how a user would recover from a multi-parented directory. They can't just | ||||
| remove one directory, since remove would report the directory as "not empty". | ||||
|  | ||||
| Other atomic filesystems simply COW the entire directory tree. But this | ||||
| introduces a significant bit of complexity, which leads to code size, along | ||||
| with a surprisingly expensive runtime cost during what most users assume is | ||||
| a single pointer update. | ||||
|  | ||||
| Another option is to update the directory block we're moving from to point | ||||
| to the destination with a sort of predicate that we have moved if the | ||||
| destination exists. Unfortunately, the omnipresent concern of wear could | ||||
| cause any of these directory entries to change blocks, and changing the | ||||
| entry size before a move introduces complications if it spills out of | ||||
| the current directory block. | ||||
|  | ||||
| So how do we go about moving a directory atomically? | ||||
|  | ||||
| We rely on the improbability of power loss. | ||||
|  | ||||
| Power loss during a move is certainly possible, but it's actually relatively | ||||
| rare. Unless a device is writing to a filesystem constantly, it's unlikely that | ||||
| a power loss will occur during filesystem activity. We still need to handle | ||||
| the condition, but runtime during a power loss takes a back seat to the runtime | ||||
| during normal operations. | ||||
|  | ||||
| So what littlefs does is inelegantly simple. When littlefs moves a file, it | ||||
| marks the file as "moving". This is stored as a single bit in the directory | ||||
| entry and doesn't take up much space. Then littlefs moves the directory, | ||||
| finishing with the complete removal of the "moving" directory entry. | ||||
|  | ||||
| ``` | ||||
|          .--------. | ||||
|          |root dir|-. | ||||
|          | pair 0 | | | ||||
| .--------|        |-' | ||||
| |        '--------' | ||||
| |        .-'    '-. | ||||
| |       v          v | ||||
| |  .--------.  .--------. | ||||
| '->| dir A  |->| dir B  | | ||||
|    | pair 0 |  | pair 0 | | ||||
|    |        |  |        | | ||||
|    '--------'  '--------' | ||||
|  | ||||
| |  update root directory to mark directory A as moving | ||||
| v | ||||
|  | ||||
|         .----------. | ||||
|         |root dir  |-. | ||||
|         | pair 0   | | | ||||
| .-------| moving A!|-' | ||||
| |       '----------' | ||||
| |        .-'    '-. | ||||
| |       v          v | ||||
| |  .--------.  .--------. | ||||
| '->| dir A  |->| dir B  | | ||||
|    | pair 0 |  | pair 0 | | ||||
|    |        |  |        | | ||||
|    '--------'  '--------' | ||||
|  | ||||
| |  update directory B to point to directory A | ||||
| v | ||||
|  | ||||
|         .----------. | ||||
|         |root dir  |-. | ||||
|         | pair 0   | | | ||||
| .-------| moving A!|-' | ||||
| |       '----------' | ||||
| |    .-----'    '-. | ||||
| |    |             v | ||||
| |    |           .--------. | ||||
| |    |        .->| dir B  | | ||||
| |    |        |  | pair 0 | | ||||
| |    |        |  |        | | ||||
| |    |        |  '--------' | ||||
| |    |     .-------' | ||||
| |    v    v   | | ||||
| |  .--------. | | ||||
| '->| dir A  |-' | ||||
|    | pair 0 | | ||||
|    |        | | ||||
|    '--------' | ||||
|  | ||||
| |  update root to no longer contain directory A | ||||
| v | ||||
|      .--------. | ||||
|      |root dir|-. | ||||
|      | pair 0 | | | ||||
| .----|        |-' | ||||
| |    '--------' | ||||
| |        | | ||||
| |        v | ||||
| |    .--------. | ||||
| | .->| dir B  | | ||||
| | |  | pair 0 | | ||||
| | '--|        |-. | ||||
| |    '--------' | | ||||
| |        |      | | ||||
| |        v      | | ||||
| |    .--------. | | ||||
| '--->| dir A  |-' | ||||
|      | pair 0 | | ||||
|      |        | | ||||
|      '--------' | ||||
| ``` | ||||
|  | ||||
| Now, if we run into a directory entry that has been marked as "moved", one | ||||
| of two things is possible. Either the directory entry exists elsewhere in the | ||||
| filesystem, or it doesn't. Checking which is an O(n) operation, but it only | ||||
| occurs in the unlikely case we lost power during a move. | ||||
|  | ||||
| And we can easily fix the "moved" directory entry. Since we're already scanning | ||||
| the filesystem during the deorphan step, we can also check for moved entries. | ||||
| If we find one, we either remove the "moved" marking or remove the whole entry | ||||
| if it exists elsewhere in the filesystem. | ||||
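The fix-up rule can be sketched with a small in-memory model. Everything below is hypothetical: the `struct entry` layout, the `id` field standing in for "what the entry points at", and the function names are illustrative assumptions, not littlefs's on-disk format or API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for a directory entry. */
struct entry {
    int  id;      /* identifies what the entry points at (e.g. dir A) */
    bool present; /* false once the entry has been removed */
    bool moved;   /* the "moved" marking */
};

/* O(n) scan: does any other present entry refer to the same id? */
bool exists_elsewhere(const struct entry *entries, size_t count, size_t self) {
    for (size_t i = 0; i < count; i++) {
        if (i != self && entries[i].present &&
            entries[i].id == entries[self].id) {
            return true;
        }
    }
    return false;
}

/* Fix-up pass over all entries. Returns how many entries were repaired. */
size_t fix_moved_entries(struct entry *entries, size_t count) {
    size_t fixed = 0;
    for (size_t i = 0; i < count; i++) {
        if (!entries[i].present || !entries[i].moved) {
            continue;
        }
        if (exists_elsewhere(entries, count, i)) {
            /* The destination was written: drop the stale source. */
            entries[i].present = false;
        } else {
            /* The destination was never written: undo the marking. */
            entries[i].moved = false;
        }
        fixed++;
    }
    return fixed;
}
```

Either branch leaves exactly one live entry for the moved directory, whichever point of the move power was lost at.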
|  | ||||
| ## Wear awareness | ||||
|  | ||||
| @@ -955,18 +1125,18 @@ So, to summarize: | ||||
|  | ||||
| 1. The littlefs is composed of directory blocks | ||||
| 2. Each directory is a linked-list of metadata pairs | ||||
| 3. These metadata pairs can be updated atomically by alternative which | ||||
| 3. These metadata pairs can be updated atomically by alternating which | ||||
|    metadata block is active | ||||
| 4. Directory blocks contain either references to other directories or files | ||||
| 5. Files are represented by copy-on-write CTZ linked-lists | ||||
| 6. The CTZ linked-lists support appending in O(1) and reading in O(n logn) | ||||
| 7. Blocks are allocated by scanning the filesystem for used blocks in a | ||||
| 5. Files are represented by copy-on-write CTZ skip-lists which support O(1) | ||||
|    append and O(n logn) reading | ||||
| 6. Blocks are allocated by scanning the filesystem for used blocks in a | ||||
|    fixed-size lookahead region that is stored in a bit-vector | ||||
| 8. To facilitate scanning the filesystem, all directories are part of a | ||||
| 7. To facilitate scanning the filesystem, all directories are part of a | ||||
|    linked-list that is threaded through the entire filesystem | ||||
| 9. If a block develops an error, the littlefs allocates a new block, and | ||||
| 8. If a block develops an error, the littlefs allocates a new block, and | ||||
|    moves the data and references of the old block to the new. | ||||
| 10. Any case where an atomic operation is not possible, it is taken care of | ||||
| 9. In any case where an atomic operation is not possible, mistakes are resolved | ||||
|    by a deorphan step that occurs on the first allocation after boot | ||||
|  | ||||
| That's the little filesystem. Thanks for reading! | ||||
|   | ||||
							
								
								
									
SPEC.md
							| @@ -121,13 +121,18 @@ Here's the layout of entries on disk: | ||||
| **Entry type** - Type of the entry, currently this is limited to the following: | ||||
| - 0x11 - file entry | ||||
| - 0x22 - directory entry | ||||
| - 0xe2 - superblock entry | ||||
| - 0x2e - superblock entry | ||||
|  | ||||
| Additionally, the type is broken into two 4 bit nibbles, with the lower nibble | ||||
| Additionally, the type is broken into two 4 bit nibbles, with the upper nibble | ||||
| specifying the type's data structure used when scanning the filesystem. The | ||||
| upper nibble clarifies the type further when multiple entries share the same | ||||
| lower nibble clarifies the type further when multiple entries share the same | ||||
| data structure. | ||||
|  | ||||
| The highest bit is reserved for marking the entry as "moved". If an entry | ||||
| is marked as "moved", the entry may also exist somewhere else in the | ||||
| filesystem. If the entry exists elsewhere, this entry must be treated as | ||||
| though it does not exist. | ||||
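A minimal sketch of how the type byte might be decoded, using the values listed above. The constant and function names are hypothetical, and masking the "moved" bit out before reading the upper nibble is an assumption about how the two fields share the byte.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the values come from the entry type list above. */
enum {
    ENTRY_FILE       = 0x11,
    ENTRY_DIR        = 0x22,
    ENTRY_SUPERBLOCK = 0x2e,
    ENTRY_MOVED_BIT  = 0x80  /* highest bit marks the entry as "moved" */
};

/* True if the entry is marked as "moved". */
bool entry_is_moved(uint8_t type) {
    return (type & ENTRY_MOVED_BIT) != 0;
}

/* Type with the "moved" marking stripped. */
uint8_t entry_base_type(uint8_t type) {
    return (uint8_t)(type & 0x7f);
}

/* Upper nibble: the data structure used when scanning the filesystem.
 * Assumption: the "moved" bit is not part of this field. */
uint8_t entry_struct_nibble(uint8_t type) {
    return (uint8_t)(entry_base_type(type) >> 4);
}

/* Lower nibble: distinguishes entries sharing a data structure. */
uint8_t entry_subtype_nibble(uint8_t type) {
    return (uint8_t)(type & 0x0f);
}
```

Under this reading, the directory entry (0x22) and the superblock entry (0x2e) share the upper nibble 0x2 and differ only in the lower nibble.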
|  | ||||
| **Entry length** - Length in bytes of the entry-specific data. This does | ||||
| not include the entry type size, attributes, or name. The full size in bytes | ||||
| of the entry is 4 + entry length + attribute length + name length. | ||||
| @@ -175,7 +180,7 @@ Here's the layout of the superblock entry: | ||||
|  | ||||
| | offset | size                   | description                            | | ||||
| |--------|------------------------|----------------------------------------| | ||||
| | 0x00   | 8 bits                 | entry type (0xe2 for superblock entry) | | ||||
| | 0x00   | 8 bits                 | entry type (0x2e for superblock entry) | | ||||
| | 0x01   | 8 bits                 | entry length (20 bytes)                | | ||||
| | 0x02   | 8 bits                 | attribute length                       | | ||||
| | 0x03   | 8 bits                 | name length (8 bytes)                  | | ||||
| @@ -208,7 +213,7 @@ Here's an example of a complete superblock: | ||||
| (32 bits) revision count   = 3                    (0x00000003) | ||||
| (32 bits) dir size         = 52 bytes, end of dir (0x00000034) | ||||
| (64 bits) tail pointer     = 3, 2                 (0x00000003, 0x00000002) | ||||
| (8 bits)  entry type       = superblock           (0xe2) | ||||
| (8 bits)  entry type       = superblock           (0x2e) | ||||
| (8 bits)  entry length     = 20 bytes             (0x14) | ||||
| (8 bits)  attribute length = 0 bytes              (0x00) | ||||
| (8 bits)  name length      = 8 bytes              (0x08) | ||||
| @@ -220,7 +225,7 @@ Here's an example of a complete superblock: | ||||
| (32 bits) crc              = 0xc50b74fa | ||||
|  | ||||
| 00000000: 03 00 00 00 34 00 00 00 03 00 00 00 02 00 00 00  ....4........... | ||||
| 00000010: e2 14 00 08 03 00 00 00 02 00 00 00 00 02 00 00  ................ | ||||
| 00000010: 2e 14 00 08 03 00 00 00 02 00 00 00 00 02 00 00  ................ | ||||
| 00000020: 00 04 00 00 01 00 01 00 6c 69 74 74 6c 65 66 73  ........littlefs | ||||
| 00000030: fa 74 0b c5                                      .t.. | ||||
| ``` | ||||
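As a cross-check of the example above, the four header bytes at offset 0x10 of the dump can be decoded and the full entry size computed with the formula 4 + entry length + attribute length + name length. The struct and function names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* The four one-byte fields at the start of every entry. */
struct entry_header {
    uint8_t type;
    uint8_t entry_len;
    uint8_t attr_len;
    uint8_t name_len;
};

/* Read the header from the first four bytes of an entry. */
struct entry_header parse_entry_header(const uint8_t *p) {
    struct entry_header h = { p[0], p[1], p[2], p[3] };
    return h;
}

/* Full entry size in bytes, including the 4-byte header itself. */
size_t entry_full_size(struct entry_header h) {
    return (size_t)4 + h.entry_len + h.attr_len + h.name_len;
}
```

For the bytes `2e 14 00 08` this gives a 32-byte entry. Adding the 16 bytes before it (revision count, dir size, tail pointer) and the 4-byte CRC gives 52 bytes, which matches the dir size in the example.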
| @@ -262,15 +267,19 @@ Here's an example of a directory entry: | ||||
|  | ||||
| Files are stored in entries with a pointer to the head of the file and the | ||||
| size of the file. This is enough information to determine the state of the | ||||
| CTZ linked-list that is being referenced. | ||||
| CTZ skip-list that is being referenced. | ||||
|  | ||||
| How files are actually stored on disk is a bit complicated. The full | ||||
| explanation of CTZ linked-lists can be found in [DESIGN.md](DESIGN.md#ctz-linked-lists). | ||||
| explanation of CTZ skip-lists can be found in [DESIGN.md](DESIGN.md#ctz-skip-lists). | ||||
|  | ||||
| A terribly quick summary: For every nth block where n is divisible by 2^x, | ||||
| the block contains a pointer that points x blocks towards the beginning of the | ||||
| file. These pointers are stored in order of x in each block of the file | ||||
| immediately before the data in the block. | ||||
| the block contains a pointer to block n-2^x. These pointers are stored in | ||||
| increasing order of x in each block of the file preceding the data in the | ||||
| block. | ||||
|  | ||||
| The maximum number of pointers in a block is bounded by the base-2 logarithm | ||||
| of the maximum file size divided by the block size. With 32 bits for file | ||||
| size, this results in a minimum block size of 104 bytes. | ||||
|  | ||||
| Here's the layout of a file entry: | ||||
|  | ||||
| @@ -286,7 +295,7 @@ Here's the layout of a file entry: | ||||
| | 0xc+a  | name length bytes      | directory name                     | | ||||
|  | ||||
| **File head** - Pointer to the block that is the head of the file's CTZ | ||||
| linked-list. | ||||
| skip-list. | ||||
|  | ||||
| **File size** - Size of file in bytes. | ||||
|  | ||||
|   | ||||