User:Thedabbingduck/OStrichFS

From OSDev Wiki
Jump to navigation Jump to search

OstrichFS (version alpha) is a crash-consistent, snapshot-capable, flash-optimized file system designed for use on SSDs and SD cards. It is a "roll your own" filesystem, designed and written from scratch. It takes various features from other popular filesystem designs, as described below. It was developed as part of the OStrich project by a group of students at the University of Texas at Austin.

Overview

OstrichFS offers:

  • Crash consistency: every metadata and data change is logged and committed out-of-place. On reboot it replays only the necessary operations to restore a valid state. This ensures that losing power at any point will not leave the system in an unrecoverable state.
  • Snapshots: on-demand checkpoints capture the complete inode-mapping, letting you mount any past state read-only without duplicating file data.
  • Flash-friendly writes: avoids small in-place updates by always allocating new blocks, reducing wear and write amplification.
  • Rich file support: multi-level inode (15 direct, 10 single-indirect, 2 double-indirect) lets you store very large files.
  • Simple allocation: bitmaps for free inodes and blocks keep allocation code minimal and fast.
  • Journaling area: a fixed region (e.g. 64 blocks) holds both operation logs and checkpoint chains.

Drawbacks & limits:

  • Log/checkpoint area is fixed-size—may overflow if you go too long without checkpointing.
  • No automatic garbage collection—old data and inodes accumulate until manually reclaimed.
  • Snapshot array capped at 128—older checkpoints will be overwritten.
  • No built-in wear-leveling or configurable parameters yet.
  • Early development—bug-prone, feature-incomplete and under active development.
  • Minimal caching capabilities (only structures like bitmaps / inode tables are cached at the filesystem level, other caching must be implemented by the operating system)

Design & Key Concepts

Disk Regions

Block 0 is the superblock (magic, region offsets, free-counts, checkpoint pointers). It contains pointers to the following regions:

  • Inode region – array of fixed-size inodes.
  • Inode table – maps logical inode numbers to physical inode slots (enables copy-on-write updates).
  • Bitmaps – track free inodes and data blocks.
  • Data blocks – hold file and directory contents.
  • Log/checkpoint area – sequential operation log entries plus checkpoint block chains.

Copy-on-Write & Logging

Every change (file write, directory update, metadata tweak) follows:

  1. Modify in memory → write to a new block
  2. Update parent pointers (indirect blocks or inode) via new blocks
  3. Log the final inode mapping change before switching the live mapping

On mount, OstrichFS loads the last checkpoint and replays any log entries past that point.

Snapshotting

// Mount a snapshot read-only:
fs_req_mount_snapshot(checkpointID);

1. Freeze the current inode-table mapping.

2. Serialize all <inodeNumber, inodeLocation> pairs into checkpoint blocks.

3. Record the checkpoint linked list's first checkpoint block in the superblock.

4. Mounting that checkpoint reconstructs the old mapping in memory and prohibits writes.

On-Disk Layout

OStrichFS utilizes a generally simplified on-disk layout to focus more on user-level features instead of disk-usage or performance optimization.

The superblock contains the following fields:

Field Name Size (bytes) Type Usage
magic 8 uint64_t Stores the magic number for OStrichFS (0xCA5CADEDBA5EBA11)
size 8 uint64_t Stores the max number of data bytes based on the number of data blocks in the filesystem
totalBlockCount 4 uint32_t Stores the total number of blocks on the partition
dataBlockCount 4 uint32_t Stores the total number of data blocks in the filesystem
inodeCount 4 uint32_t Stores the total number of inodes in the filesystem
freeDataBlockCount 4 uint32_t Stores the number of free data blocks in the filesystem
freeInodeCount 4 uint32_t Stores the number of free inodes in the filesystem
dataBlockBitmap 4 uint32_t Stores the block number at which the data block bitmap starts
dataBlockBitmapSize 4 uint32_t Stores the size of the data block bitmap (in blocks)
inodeBitmap 4 uint32_t Stores the block number at which the inode bitmap starts
inodeBitmapSize 4 uint32_t Stores the size of the inode bitmap (in blocks)
inodeTable 4 uint32_t Stores the block number at which the inode table starts
inodeTableSize 4 uint32_t Stores the size of the inode table (in blocks)
dataBlockRegionStart 4 uint32_t Stores the block number at which the data block region starts
inodeRegionStart 4 uint32_t Stores the block number at which the inode region starts
inodeRegionSize 4 uint32_t Stores the size of the inode region (in blocks)
logAreaStart 4 uint32_t Stores the block number at which the log area starts
logAreaSize 4 uint32_t Stores the size of the log area (in blocks)
systemStateSeqNum 8 uint64_t Stores the sequence number of the system to verify against for consistency on system reboot
latestCheckpointIndex 2 uint16_t Stores the index of the latest checkpoint of the system
checkpointArr 512 uint32_t[] Stores the locations of all checkpoints on the system
readOnly 1 bool Determines whether or not the filesystem is readOnly (mainly used for snapshots in memory, should always be false on disk)

The inode structure is relatively simple with the minimal fields required for a basic filesystem:

Field Name Size (bytes) Type Usage
size 8 uint64_t Size of the file/directory (in bytes)
blockCount 4 uint32_t The number of blocks allocated to this file/directory
uid 2 uint16_t UID for user owning the file/directory
gid 2 uint16_t GID for group owning the file/directory
permissions 2 uint16_t Permission bits (POSIX-style) for the file/directory
numFiles 2 uint16_t Number of files in the directory (not used for file inodes)
directBlocks 60 uint32_t[] Locations of direct blocks
indirectBlocks 40 uint32_t[] Location of indirect blocks
doubleIndirectBlocks 8 uint32_t[] Location of double indirect blocks

The inodes are specifically structured to require 128 bytes per inode, allowing for 32 inodes per block. The permission bits of the inode are used for rwx permission for user/group/other and also to determine whether or not a file is a directory. The remaining 6 bits are unused at the moment, but could be utilized for additional flags (like the immutable flag in linux).

Design Considerations

The inode bitmap and data block bitmaps referenced by the superblock are on-disk structures to keep track of which blocks are currently in use. The filesystem utilizes a write-through cache for recent bitmap blocks to minimize excess reads (it still always writes to disk when allocating a block/inode for the sake of consistency).

The inode table provides a level of indirection for inode access. All inodes are referenced via their inode number which uniquely identifies a file or directory on the system. This number is then mapped to an inode location via the inode table. When inodes or blocks are edited, the copy of write logic of the filesystem updates (and logs) the inode table to change the inode location while keeping the inode number the same. This minimizes the amount of chained copy on writes needed and allows us to mount a snapshot by simply loading an older version of the inode table (since all of the old data blocks and inodes should still be on disk at their original locations). This logic applies to all files, even the root directory (which is provided the inode number of zero but will be allocated a new location for every change made to it). The inode table also contributes to consistency guarantees because until the inode table is updated, changes made to blocks or inode do not affect the running filesystem. By enforcing the constraint that the inode table is updated last in every file operation, we are guaranteed a level of consistency (assuming the block write operation is considered atomic).

The fields referring to locations or sizes of on-disk regions (e.g., dataBlockRegionStart) are initialized at filesystem creation and expected to remain constant. The filesystem allocates inode and data block regions assuming a ratio of 1 inode per 16KiB of data (which translates to 1 inode for 4 blocks with 4KiB blocks). Using this ratio, the filesystem calculates the size of bitmap and inode table required and stores them in the superblock.

Note that OStrichFS currently uses 32-bit unsigned integers to refer to block numbers which limits the max filesystem size to 2^32 blocks (~16 TiB). While the actual size of the filesystem would be slightly smaller due to the overhead of bitmaps, inodes, and other metadata structures, this limit is significantly higher than any situation which OStrichFS was designed for. Switching the 64 bit unsigned integers would not be a significant design challenge, but it would greatly increase the overhead for metadata structures and thus be a detriment on smaller disks.

Supported POSIX-style operations

Operations that aren't listed are not supported at this moment:

Syscall Status Notes
open(path, flags, mode) Partially Maps to fs_req_open(); returns an inode index rather than a file descriptor; only basic read/write flags; mode used only at creation.
read(fd, buf, count) Supported Via File::read_at(offset,…); OS must manage file offsets.
write(fd, buf, count) Supported Via File::write_at(offset,…); OS must manage and update offsets.
stat(path, statbuf) Partially Inode provides size, uid/gid, permissions; OS must read inode fields directly.
fstat(fd, statbuf) Partially OS must translate fd→inode_index then read inode.
mkdir(path, mode) Supported fs_req_create_file(..., is_dir=true); mode stored but no chmod.
rmdir(path) Supported fs_req_remove_file(); only empties directories.
unlink(path) Supported fs_req_remove_file(); data blocks not reclaimed until GC.
truncate(path, length) Partially write_at can extend or overwrite data; no dedicated ftruncate() syscall.

Note: OstrichFS relies on the OS for descriptor management, path parsing, and advanced flag handling.


Similarities with other file systems

  • Copy-on-write metadata: Like ZFS and Btrfs, OstrichFS never overwrites in-place but allocates new blocks for metadata (indirect blocks, inodes), enabling safe snapshots.
  • Transaction logging: Similar to EXT4’s journal, OstrichFS records critical inode mapping changes in a write-ahead log to guarantee atomicity and fast crash recovery.
  • Checkpoint-based snapshots: As in NILFS2, OstrichFS can capture point-in-time filesystem states by checkpointing its inode-table mapping, providing read-only mounts of past states.
  • Inode mapping table (NAT): Echoing F2FS’s Node Address Table, OstrichFS centralizes inode number → physical location indirection, simplifying copy-on-write updates without rewriting parent structures.
  • Multi-level block pointers: Like ext2/EXT4, OstrichFS inodes support direct, single-indirect, and double-indirect pointers to accommodate large files.
  • Bitmap allocation: Free-space management via bitmaps for inodes and data blocks follows the model used by many Unix-style file systems (ext2/EXT4, FAT variants), offering simplicity and speed.


Development Status

OstrichFS is in alpha. It is functional but not battle-tested and lacks many polishing features. You will encounter bugs and missing checks. If you’re an OSDev enthusiast or student looking to learn and contribute, your patches and feedback are welcome.

Roadmap & Future Work

  • Garbage collection: reclaim orphaned blocks and stale inodes.
  • fsdebug tool: create and verify disk images, inspect logs and checkpoints.
  • Configurable parameters: make log area size, checkpoint frequency, and bitmap layouts tunable.
  • Wear-leveling hooks: integrate with flash controllers for better longevity.

See Also

External Links

Last updated: 2025-04-28