The power-cycled-relocation test with random renames has been the most
aggressive test applied to littlefs so far, with:
- Random nested directory creation
- Random nested directory removal
- Random nested directory renames (this could make the
threaded linked-list very interesting)
- Relocating blocks every write (maximum wear-leveling)
- Incrementally cycling power every write
Also added a couple other tests to test_orphans and test_relocations.
The good news is the added testing worked well, it found quite a number
of complex and subtle bugs that have been difficult to find.
1. It's actually possible for our parent to be relocated and go out of
sync in lfs_mkdir. This can happen if our predecessor's predecessor
is our parent as we are threading ourselves into the filesystem's
threaded list. (note this doesn't happen if our predecessor _is_ our
parent, as we then update our parent in a single commit).
This is annoying because it only happens if our parent is a long (>1
pair) directory, otherwise we wouldn't need to catch relocations.
Fortunately we can reuse the internal open file/dir linked-list to
catch relocations easily, as long as we're careful to unhook our
parent whenever lfs_mkdir returns.
2. Even more surprising, it's possible for the child in lfs_remove
to be relocated while we delete the entry from our parent. This
can happen if we are our own parent's predecessor, since we need
to be updated then if our parent relocates.
Fortunately we can also hook into the open linked-list here.
Note this same issue was present in lfs_rename.
Fortunately, this means now all fetched dirs are hooked into the
open linked-list if they are needed across a commit. This means
we shouldn't need assumptions about tree movement for correctness.
3. lfs_rename("deja/vu", "deja/vu") with the same source and destination
was broken and tried to delete the entry twice.
4. Managing gstate deltas when we lose power during relocations was
broken. And unfortunately complicated.
The issue happens when we lose power during a relocation while
removing a directory.
When we remove a directory, we need to move the contents of its
gstate delta to another directory or we'll corrupt littlefs gstate.
(gstate is an xor of all deltas on the filesystem). We used to just
xor the gstate into our parent's gstate, however this isn't correct.
The gstate isn't built out of the directory tree, but rather out of
the threaded linked-list (which exists to make collecting this
gstate efficient).
Because we have to remove our dir in two operations, there's a point
were both the updated parent and child can exist in threaded
linked-list and duplicate the child's gstate delta.
.--------.
->| parent |-.
| gstate | |
.-| a |-'
| '--------'
| X <- child is orphaned
| .--------.
'>| child |->
| gstate |
| a |
'--------'
What we need to do is save our child's gstate and only give it to our
predecessor, since this finalizes the removal of the child.
However we still need to make valid updates to the gstate to mark
that we've created an orphan when we start removing the child.
This led to a small rework of how the gstate is handled. Now we have
a separation of the gpending state that should be written out ASAP
and the gdelta state that is collected from orphans awaiting
deletion.
5. lfs_deorphan wasn't actually able to handle deorphaning/desyncing
more than one orphan after a power-cycle. Having more than one orphan
is very rare, but of course very possible. Fortunately this was just
a mistake with using a break the in the deorphan, perhaps left from
v1 where multiple orphans weren't possible?
Note that we use a continue to force a refetch of the orphaned block.
This is needed in the case of a half-orphan, since the fetched
half-orphan may have an outdated tail pointer.
The root of the problem was some assumptions about what tags could be
sent to lfs_dir_commit.
- The first assumption is that there could be only one splice (create/delete)
tag at a time, which is trivially broken by the core commit in lfs_rename.
- The second assumption is that there is at most one create and one delete in
a single commit. This is less obvious but turns out to not be true in
the case that we rename a file such that it overwrites another file in
the same directory (1 delete for source file, 1 delete for destination).
- The third assumption was that there was an ordering to the
delete/creates passed to lfs_dir_commit. It may be possible to force all
deletes to follow creates by rearranging the tags in lfs_rename, but
this risks overflowing tag ids.
The way the lfs_dir_commit first collected the "deletetag" and "createtag"
broke all three of these assumptions. And because we lose the ordering
information we can no longer apply the directory changes to open files
correctly. The file ids may be shifted in a way that doesn't reflect the
actual operations on disk.
These problems were made worst by lfs_dir_commit cleaning up moves
implicitly, which also creates deletes implicitly. While cleaning up moves
in lfs_dir_commit may save some code size, it makes the commit logic much more
difficult to implement correctly.
This bug turned into pulling out a dead tree stump, roots and all.
I ended up reworking how lfs_dir_commit updates open files so that it
has less assumptions, now it just traverses the commit tags multiple
times in order to update file ids after a successful commit in the
correct order.
This also got rid of the dir copy by carefully updating split dirs
after all files have an up-to-date copy of the original dir.
I also just removed the implicit move cleanup. It turns out the only
commits that can occur before we have cleaned up the move is in
lfs_fs_relocate, so it was simple enough to explicitly handle this case
when we update our parent and pred during a relocate.
Cases where we may need to fix moves:
- In lfs_rename when we move a file/dir
- In lfs_demove if we lose power
- In lfs_fs_relocate if we have to relocate our parent and we find it
had a pending move (or else the move will be outdated)
- In lfs_fs_relocate if we have to relocate our predecessor and we find it
had a pending move (or else the move will be outdated)
Note the two cases in lfs_fs_relocate may be recursive. But
lfs_fs_relocate can only trigger other lfs_fs_relocates so it's not
possible for pending moves to spill out into other filesystem commits
And of couse, I added several tests to cover these situations. Hopefully
the rename-with-open-files logic should be fairly locked down now.
found with initial fix by eastmoutain
Both test_move and test_orphan needed internal knowledge which comes
with the addition of the "in" attribute. This was in the plan for the
test-revamp from the beginning as it really opens up the ability to
write more unit-style-tests using internal knowledge of how littlefs
works. More unit-style-tests should help _fix_ bugs by limiting the
scope of the test and where the bug could be hiding.
The "in" attribute effectively runs tests _inside_ the .c file
specified, giving the test access to all static members without
needed to change their visibility.