User:Thedabbingduck/OStrichFS
OstrichFS (version alpha) is a crash-consistent, snapshot-capable, flash-optimized file system designed for use on SSDs and SD cards. It is a "roll your own" filesystem, designed and written from scratch. It takes various features from other popular filesystem designs, as described below. It was developed as part of the OStrich project by a group of students at the University of Texas at Austin.
Overview
OstrichFS offers:
- Crash consistency: every metadata and data change is logged and committed out-of-place. On reboot it replays only the necessary operations to restore a valid state. This ensures that losing power at any point will not leave the system in an unrecoverable state.
- Snapshots: on-demand checkpoints capture the complete inode-mapping, letting you mount any past state read-only without duplicating file data.
- Flash-friendly writes: avoids small in-place updates by always allocating new blocks, reducing wear and write amplification.
- Rich file support: multi-level inode (15 direct, 10 single-indirect, 2 double-indirect) lets you store very large files.
- Simple allocation: bitmaps for free inodes and blocks keep allocation code minimal and fast.
- Journaling area: a fixed region (e.g. 64 blocks) holds both operation logs and checkpoint chains.
Drawbacks & limits:
- Log/checkpoint area is fixed-size—may overflow if you go too long without checkpointing.
- No automatic garbage collection—old data and inodes accumulate until manually reclaimed.
- Snapshot array capped at 128—older checkpoints will be overwritten.
- No built-in wear-leveling or configurable parameters yet.
- Early development—bug-prone, feature-incomplete and under active development.
- Minimal caching capabilities (only structures like bitmaps / inode tables are cached at the filesystem level, other caching must be implemented by the operating system)
Design & Key Concepts
Disk Regions
Block 0 is the superblock (magic, region offsets, free-counts, checkpoint pointers). It contains pointers to the following regions:
- Inode region – array of fixed-size inodes.
- Inode table – maps logical inode numbers to physical inode slots (enables copy-on-write updates).
- Bitmaps – track free inodes and data blocks.
- Data blocks – hold file and directory contents.
- Log/checkpoint area – sequential operation log entries plus checkpoint block chains.
Copy-on-Write & Logging
Every change (file write, directory update, metadata tweak) follows:
- Modify in memory → write to a new block
- Update parent pointers (indirect blocks or inode) via new blocks
- Log the final inode mapping change before switching the live mapping
On mount, OstrichFS loads the last checkpoint and replays any log entries past that point.
Snapshotting
// Mount a snapshot read-only:
fs_req_mount_snapshot(checkpointID);
1. Freeze the current inode-table mapping.
2. Serialize all <inodeNumber, inodeLocation> pairs into checkpoint blocks.
3. Record the checkpoint linked list's first checkpoint block in the superblock.
4. Mounting that checkpoint reconstructs the old mapping in memory and prohibits writes.
On-Disk Layout
OStrichFS utilizes a generally simplified on-disk layout to focus more on user-level features instead of disk-usage or performance optimization.
The superblock contains the following fields:
Field Name | Size (bytes) | Type | Usage |
---|---|---|---|
magic | 8 | uint64_t | Stores the magic number for OStrichFS (0xCA5CADEDBA5EBA11) |
size | 8 | uint64_t | Stores the max number of data bytes based on the number of data blocks in the filesystem |
totalBlockCount | 4 | uint32_t | Stores the total number of blocks on the partition |
dataBlockCount | 4 | uint32_t | Stores the total number of data blocks in the filesystem |
inodeCount | 4 | uint32_t | Stores the total number of inodes in the filesystem |
freeDataBlockCount | 4 | uint32_t | Stores the number of free data blocks in the filesystem |
freeInodeCount | 4 | uint32_t | Stores the number of free inodes in the filesystem |
dataBlockBitmap | 4 | uint32_t | Stores the block number at which the data block bitmap starts |
dataBlockBitmapSize | 4 | uint32_t | Stores the size of the data block bitmap (in blocks) |
inodeBitmap | 4 | uint32_t | Stores the block number at which the inode bitmap starts |
inodeBitmapSize | 4 | uint32_t | Stores the size of the inode bitmap (in blocks) |
inodeTable | 4 | uint32_t | Stores the block number at which the inode table starts |
inodeTableSize | 4 | uint32_t | Stores the size of the inode table (in blocks) |
dataBlockRegionStart | 4 | uint32_t | Stores the block number at which the data block region starts |
inodeRegionStart | 4 | uint32_t | Stores the block number at which the inode region starts |
inodeRegionSize | 4 | uint32_t | Stores the size of the inode region (in blocks) |
logAreaStart | 4 | uint32_t | Stores the block number at which the log area starts |
logAreaSize | 4 | uint32_t | Stores the size of the log area (in blocks) |
systemStateSeqNum | 8 | uint64_t | Stores the sequence number of the system to verify against for consistency on system reboot |
latestCheckpointIndex | 2 | uint16_t | Stores the index of the latest checkpoint of the system |
checkpointArr | 512 | uint32_t[] | Stores the locations of all checkpoints on the system |
readOnly | 1 | bool | Determines whether or not the filesystem is readOnly (mainly used for snapshots in memory, should always be false on disk) |
The inode structure is relatively simple with the minimal fields required for a basic filesystem:
Field Name | Size (bytes) | Type | Usage |
---|---|---|---|
size | 8 | uint64_t | Size of the file/directory (in bytes) |
blockCount | 4 | uint32_t | The number of blocks allocated to this file/directory |
uid | 2 | uint16_t | UID for user owning the file/directory |
gid | 2 | uint16_t | GID for group owning the file/directory |
permissions | 2 | uint16_t | Permission bits (POSIX-style) for the file/directory |
numFiles | 2 | uint16_t | Number of files in the directory (not used for file inodes) |
directBlocks | 60 | uint32_t[] | Locations of direct blocks |
indirectBlocks | 40 | uint32_t[] | Location of indirect blocks |
doubleIndirectBlocks | 8 | uint32_t[] | Location of double indirect blocks |
The inodes are specifically structured to require 128 bytes per inode, allowing for 32 inodes per block. The permission bits of the inode are used for rwx permission for user/group/other and also to determine whether or not a file is a directory. The remaining 6 bits are unused at the moment, but could be utilized for additional flags (like the immutable flag in linux).
Design Considerations
The inode bitmap and data block bitmaps referenced by the superblock are on-disk structures to keep track of which blocks are currently in use. The filesystem utilizes a write-through cache for recent bitmap blocks to minimize excess reads (it still always writes to disk when allocating a block/inode for the sake of consistency).
The inode table provides a level of indirection for inode access. All inodes are referenced via their inode number which uniquely identifies a file or directory on the system. This number is then mapped to an inode location via the inode table. When inodes or blocks are edited, the copy of write logic of the filesystem updates (and logs) the inode table to change the inode location while keeping the inode number the same. This minimizes the amount of chained copy on writes needed and allows us to mount a snapshot by simply loading an older version of the inode table (since all of the old data blocks and inodes should still be on disk at their original locations). This logic applies to all files, even the root directory (which is provided the inode number of zero but will be allocated a new location for every change made to it). The inode table also contributes to consistency guarantees because until the inode table is updated, changes made to blocks or inode do not affect the running filesystem. By enforcing the constraint that the inode table is updated last in every file operation, we are guaranteed a level of consistency (assuming the block write operation is considered atomic).
The fields referring to locations or sizes of on-disk regions (e.g., dataBlockRegionStart) are initialized at filesystem creation and expected to remain constant. The filesystem allocates inode and data block regions assuming a ratio of 1 inode per 16KiB of data (which translates to 1 inode for 4 blocks with 4KiB blocks). Using this ratio, the filesystem calculates the size of bitmap and inode table required and stores them in the superblock.
Note that OStrichFS currently uses 32-bit unsigned integers to refer to block numbers which limits the max filesystem size to 2^32 blocks (~16 TiB). While the actual size of the filesystem would be slightly smaller due to the overhead of bitmaps, inodes, and other metadata structures, this limit is significantly higher than any situation which OStrichFS was designed for. Switching the 64 bit unsigned integers would not be a significant design challenge, but it would greatly increase the overhead for metadata structures and thus be a detriment on smaller disks.
Supported POSIX-style operations
Operations that aren't listed are not supported at this moment:
Syscall | Status | Notes |
---|---|---|
open(path, flags, mode) | Partially | Maps to fs_req_open(); returns an inode index rather than a file descriptor; only basic read/write flags; mode used only at creation. |
read(fd, buf, count) | Supported | Via File::read_at(offset,…); OS must manage file offsets. |
write(fd, buf, count) | Supported | Via File::write_at(offset,…); OS must manage and update offsets. |
stat(path, statbuf) | Partially | Inode provides size, uid/gid, permissions; OS must read inode fields directly. |
fstat(fd, statbuf) | Partially | OS must translate fd→inode_index then read inode. |
mkdir(path, mode) | Supported | fs_req_create_file(..., is_dir=true); mode stored but no chmod. |
rmdir(path) | Supported | fs_req_remove_file(); only empties directories. |
unlink(path) | Supported | fs_req_remove_file(); data blocks not reclaimed until GC. |
truncate(path, length) | Partially | write_at can extend or overwrite data; no dedicated ftruncate() syscall. |
Note: OstrichFS relies on the OS for descriptor management, path parsing, and advanced flag handling.
Similarities with other file systems
- Copy-on-write metadata: Like ZFS and Btrfs, OstrichFS never overwrites in-place but allocates new blocks for metadata (indirect blocks, inodes), enabling safe snapshots.
- Transaction logging: Similar to EXT4’s journal, OstrichFS records critical inode mapping changes in a write-ahead log to guarantee atomicity and fast crash recovery.
- Checkpoint-based snapshots: As in NILFS2, OstrichFS can capture point-in-time filesystem states by checkpointing its inode-table mapping, providing read-only mounts of past states.
- Inode mapping table (NAT): Echoing F2FS’s Node Address Table, OstrichFS centralizes inode number → physical location indirection, simplifying copy-on-write updates without rewriting parent structures.
- Multi-level block pointers: Like ext2/EXT4, OstrichFS inodes support direct, single-indirect, and double-indirect pointers to accommodate large files.
- Bitmap allocation: Free-space management via bitmaps for inodes and data blocks follows the model used by many Unix-style file systems (ext2/EXT4, FAT variants), offering simplicity and speed.
Development Status
OstrichFS is in alpha. It is functional but not battle-tested and lacks many polishing features. You will encounter bugs and missing checks. If you’re an OSDev enthusiast or student looking to learn and contribute, your patches and feedback are welcome.
Roadmap & Future Work
- Garbage collection: reclaim orphaned blocks and stale inodes.
- fsdebug tool: create and verify disk images, inspect logs and checkpoints.
- Configurable parameters: make log area size, checkpoint frequency, and bitmap layouts tunable.
- Wear-leveling hooks: integrate with flash controllers for better longevity.
See Also
External Links
Last updated: 2025-04-28