Commit Graph

4 Commits

Author SHA1 Message Date
Christopher Haster
d04b077506 Fixed minor things to get CI passing again
- Added caching to Travis install dirs, because otherwise
  pip3 install fails randomly
- Increased size of littlefs-fuse disk because test script has
  a larger footprint now
- Skip a couple of reentrant tests under byte-level writes because
  the tests just take too long and cause Travis to bail due to no
  output for 10m
- Fixed various Valgrind errors
  - Suppressed uninit checks for tests where LFS_BLOCK_ERASE_VALUE == -1.
    In this case rambd goes uninitialized, which is fine for rambd's
    purposes. Note I couldn't figure out how to limit this suppression
    to only the malloc in rambd, this doesn't seem possible with Valgrind.
  - Fixed memory leaks in exhaustion tests
  - Fixed off-by-1 string null-terminator issue in paths tests
- Fixed lfs_file_sync issue caused by revealed by fixing memory leaks
  in exhaustion tests. Getting ENOSPC during a file write puts the file
  in a bad state where littlefs doesn't know how to write it out safely.
  In this case, lfs_file_sync and lfs_file_close return 0 without
  writing out state so that device-side resources can still be cleaned
  up. To recover from ENOSPC, the file needs to be reopened and the
  writes recreated. Not sure if there is a better way to handle this.
- Added some quality-of-life improvements to Valgrind testing
  - Fit Valgrind messages into truncated output when not in verbose mode
  - Turned on origin tracking
2020-02-18 18:05:03 -06:00
Christopher Haster
02c84ac5f4 Cleaned up dependent fixes on branch
These should probably have been cleaned up in each commit to allow
cherry-picking, but due to time I haven't been able to.

- Went with creating an mdir copy in lfs_dir_commit. This handles a
  number of related cleanup issues in lfs_dir_compact and it does so
  more robustly. As a plus we can use the copy to update dependencies
  in the mlist.

- Eliminated code left by the ENOSPC file outlining

- Cleaned up TODOs and lingering comments

- Changed the reentrant many directory create/rename/remove test to use
  a smaller set of directories because of space issues when
  READ/PROG_SIZE=512
2020-02-09 12:37:39 -06:00
Christopher Haster
517d3414c5 Fixed more bugs, mostly related to ENOSPC on different geometries
Fixes:
- Fixed reproducability issue when we can't read a directory revision
- Fixed incorrect erase assumption if lfs_dir_fetch exceeds block size
- Fixed cleanup issue caused by lfs_fs_relocate failing when trying to
  outline a file in lfs_file_sync
- Fixed cleanup issue if we run out of space while extending a CTZ skip-list
- Fixed missing half-orphans when allocating blocks during lfs_fs_deorphan

Also:
- Added cycle-detection to readtree.py
- Allowed pseudo-C expressions in test conditions (and it's
  beautifully hacky, see line 187 of test.py)
- Better handling of ctrl-C during test runs
- Added build-only mode to test.py
- Limited stdout of test failures to 5 lines unless in verbose mode

Explanation of fixes below

1. Fixed reproducability issue when we can't read a directory revision

   An interesting subtlety of the block-device layer is that the
   block-device is allowed to return LFS_ERR_CORRUPT on reads to
   untouched blocks. This can easily happen if a user is using ECC or
   some sort of CMAC on their blocks. Normally we never run into this,
   except for the optimization around directory revisions where we use
   uninitialized data to start our revision count.

   We correctly handle this case by ignoring whats on disk if the read
   fails, but end up using unitialized RAM instead. This is not an issue
   for normal use, though it can lead to a small information leak.
   However it creates a big problem for reproducability, which is very
   helpful for debugging.

   I ended up running into a case where the RAM values for the revision
   count was different, causing two identical runs to wear-level at
   different times, leading to one version running out of space before a
   bug occured because it expanded the superblock early.

2. Fixed incorrect erase assumption if lfs_dir_fetch exceeds block size

   This could be caused if the previous tag was a valid commit and we
   lost power causing a partially written tag as the start of a new
   commit.

   Fortunately we already have a separate condition for exceeding the
   block size, so we can force that case to always treat the mdir as
   unerased.

3. Fixed cleanup issue caused by lfs_fs_relocate failing when trying to
   outline a file in lfs_file_sync

   Most operations involving metadata-pairs treat the mdir struct as
   entirely temporary and throw it out if any error occurs. Except for
   lfs_file_sync since the mdir is also a part of the file struct.

   This is relevant because of a cleanup issue in lfs_dir_compact that
   usually doesn't have side-effects. The issue is that lfs_fs_relocate
   can fail. It needs to allocate new blocks to relocate to, and as the
   disk reaches its end of life, it can fail with ENOSPC quite often.

   If lfs_fs_relocate fails, the containing lfs_dir_compact would return
   immediately without restoring the previous state of the mdir. If a new
   commit comes in on the same mdir, the old state left there could
   corrupt the filesystem.

   It's interesting to note this is forced to happen in lfs_file_sync,
   since it always tries to outline the file if it gets ENOSPC (ENOSPC
   can mean both no blocks to allocate and that the mdir is full). I'm
   not actually sure this bit of code is necessary anymore, we may be
   able to remove it.

4. Fixed cleanup issue if we run out of space while extending a CTZ
   skip-list

   The actually CTZ skip-list logic itself hasn't been touched in more
   than a year at this point, so I was surprised to find a bug here. But
   it turns out the CTZ skip-list could be put in an invalid state if we
   run out of space while trying to extend the skip-list.

   This only becomes a problem if we keep the file open, clean up some
   space elsewhere, and then continue to write to the open file without
   modifying it. Fortunately an easy fix.

5. Fixed missing half-orphans when allocating blocks during
   lfs_fs_deorphan

   This was a really interesting bug. Normally, we don't have to worry
   about allocations, since we force consistency before we are allowed
   to allocate blocks. But what about the deorphan operation itself?
   Don't we need to allocate blocks if we relocate while deorphaning?

   It turns out the deorphan operation can lead to allocating blocks
   while there's still orphans and half-orphans on the threaded
   linked-list. Orphans aren't an issue, but half-orphans may contain
   references to blocks in the outdated half, which doesn't get scanned
   during the normal allocation pass.

   Fortunately we already fetch directory entries to check CTZ lists, so
   we can also check half-orphans here. However this causes
   lfs_fs_traverse to duplicate all metadata-pairs, not sure what to do
   about this yet.
2020-02-09 11:54:22 -06:00
Christopher Haster
aab6aa0ed9 Cleaned up test script and directory naming
- Removed old tests and test scripts
- Reorganize the block devices to live under one directory
- Plugged new test framework into Makefile

renamed:
- scripts/test_.py -> scripts/test.py
- tests_ -> tests
- {file,ram,test}bd/* -> bd/*

It took a surprising amount of effort to make the Makefile behave since
it turns out the "test_%" rule could override "tests/test_%.toml.test"
which is generated as part of test.py.
2020-01-27 10:16:29 -06:00