Ext4


This article is a stub! This page or section is a stub. You can help the wiki by accurately contributing to it.

The ext4 filesystem originated as a series of patches to the ext3 filesystem, but it was later rebranded as a dedicated filesystem that shares much of its design with ext2 and ext3. Like ext3, it supports journaling. Among the upgrades are larger maximums (file size, filesystem size, files per directory, subdirectories per directory, etc.) and features inspired by existing filesystems such as XFS.

This information is based on the ext4 implementation as of Linux 5.9rc3. Ext4 is subject to change, so check the latest kernel headers for new information.

Basic Concepts

This article is a stub! This page or section is a stub. You can help the wiki by accurately contributing to it.

Superblock

See Ext2 wiki page for an easier introduction to the concept of a superblock. All values are little endian unless otherwise specified.

Starting Byte   Ending Byte   Size in Bytes   Description
0 3 4 Total number of inodes in file system
4 7 4 Total number of blocks in file system
8 11 4 Number of reserved blocks
12 15 4 Total number of unallocated blocks
16 19 4 Total number of unallocated inodes
20 23 4 Block number of the block containing the superblock. This is 1 on 1024 byte block size filesystems, and 0 for all others.
24 27 4 log2 (block size) - 10. (In other words, the number to shift 1,024 to the left by to obtain the block size)
28 31 4 log2 (fragment size) - 10. (In other words, the number to shift 1,024 to the left by to obtain the fragment size)
32 35 4 Number of blocks in each block group
36 39 4 Number of fragments in each block group
40 43 4 Number of inodes in each block group
44 47 4 Last mount time (in POSIX time)
48 51 4 Last written time (in POSIX time)
52 53 2 Number of times the volume has been mounted since its last consistency check (fsck)
54 55 2 Number of mounts allowed before a consistency check (fsck) must be done
56 57 2 Magic signature (0xef53), used to help confirm the presence of Ext4 on a volume
58 59 2 File system state.
60 61 2 What to do when an error is detected
62 63 2 Minor portion of version (combine with Major portion below to construct full version field)
64 67 4 POSIX time of last consistency check (fsck)
68 71 4 Interval (in POSIX time) between forced consistency checks (fsck)
72 75 4 Operating system ID from which the filesystem on this volume was created (see below)
76 79 4 Major portion of version (combine with Minor portion above to construct full version field)
80 81 2 User ID that can use reserved blocks
82 83 2 Group ID that can use reserved blocks
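
As a rough illustration of the table above, the sketch below decodes a few of the base fields with Python's `struct` module and validates the magic signature. The function name, the dictionary keys and the `SUPERBLOCK_OFFSET` constant are invented for this example. It assumes the primary superblock starts 1024 bytes into the volume, as in Ext2.

```python
import struct

EXT4_MAGIC = 0xEF53
SUPERBLOCK_OFFSET = 1024  # assumption: primary superblock starts 1024 bytes into the volume


def parse_base_superblock(sb: bytes) -> dict:
    """Decode a handful of base fields from a raw (little-endian) superblock image."""
    (inodes, blocks_lo, reserved_lo, free_blocks_lo, free_inodes,
     first_data_block, log_block_size) = struct.unpack_from("<7I", sb, 0)
    (blocks_per_group,) = struct.unpack_from("<I", sb, 32)
    (inodes_per_group,) = struct.unpack_from("<I", sb, 40)
    (magic,) = struct.unpack_from("<H", sb, 56)
    if magic != EXT4_MAGIC:
        raise ValueError(f"bad magic 0x{magic:04x}: not an ext2/3/4 superblock")
    return {
        "inodes": inodes,
        "blocks_lo": blocks_lo,
        "reserved_blocks_lo": reserved_lo,
        "free_blocks_lo": free_blocks_lo,
        "free_inodes": free_inodes,
        "first_data_block": first_data_block,
        # Bytes 24..27 store log2(block size) - 10, so shift 1024 left by that amount.
        "block_size": 1024 << log_block_size,
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
    }
```

A caller would typically seek to `SUPERBLOCK_OFFSET` in the volume image, read 1024 bytes and pass them to `parse_base_superblock`.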

These fields are present in ext4 dynamic superblocks only. If a driver does not recognize a bit that is set in the required feature set, it must refuse to mount the filesystem. Filesystem checkers (fsck) are stricter: they must abort on any unrecognized flag in either the optional or the required feature set.
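
The mounting rule can be sketched as a small decision function. This is only an illustration: `mount_mode` and its parameter names are hypothetical, and the `supported_*` masks stand for whatever feature bits a particular driver implements. The read-only case reflects the "must be mounted read-only" feature field at bytes 100 to 103.

```python
def mount_mode(required: int, ro_required: int,
               supported_required: int, supported_ro: int) -> str:
    """Decide how a driver may mount a volume, given its feature bitmasks.

    required / ro_required come from superblock bytes 96..99 and 100..103;
    supported_* are the bits this (hypothetical) driver understands.
    """
    if required & ~supported_required:
        return "refuse"        # unknown required feature: do not mount at all
    if ro_required & ~supported_ro:
        return "read-only"     # unknown read-only feature: writing is unsafe
    return "read-write"        # unknown optional features may be ignored
```

For example, a driver that only understands the directory type field (0x00002) and extents (0x00040) would refuse a volume that also sets the 64 bit flag (0x00080).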

Starting Byte   Ending Byte   Size in Bytes   Description
84 87 4 First non-reserved inode in file system.
88 89 2 Size of each inode structure in bytes.
90 91 2 Block group that this superblock is part of for backup copies.
92 95 4 Optional features present.
96 99 4 Required features present.
100 103 4 Features that if not supported the volume must be mounted read-only.
104 119 16 File system UUID.
120 135 16 Volume name.
136 199 64 Path Volume was last mounted to.
200 203 4 Compression algorithm used.
204 204 1 Amount of blocks to preallocate for files
205 205 1 Amount of blocks to preallocate for directories.
206 207 2 Amount of reserved GDT entries for filesystem expansion.
208 223 16 Journal UUID.
224 227 4 Journal Inode.
228 231 4 Journal Device number.
232 235 4 Head of orphan inode list.
236 251 16 HTREE hash seed in an array of 32 bit integers.
252 252 1 Hash algorithm to use for directories.
253 253 1 Journal backup type. Indicates that the journal inode backup field (bytes 268 to 335) contains a copy of the journal inode's block array and size.
254 255 2 Size of group descriptors in bytes, for 64 bit mode.
256 259 4 Mount options.
260 263 4 First metablock block group, if enabled.
264 267 4 Filesystem Creation Time.
268 335 68 Journal Inode Backup in an array of 32 bit integers.

Valid if the 64bit feature is set.

Starting Byte   Ending Byte   Size in Bytes   Description
336 339 4 High 32-bits of the total number of blocks.
340 343 4 High 32-bits of the total number of reserved blocks.
344 347 4 High 32-bits of the total number of unallocated blocks.
348 349 2 Minimum inode size.
350 351 2 Minimum inode reservation size.
352 355 4 Misc flags, such as sign of directory hash or development status.
356 357 2 Number of logical blocks read or written per disk in a RAID array (the stride).
358 359 2 Number of seconds to wait in multi-mount prevention checking.
360 367 8 Block number of the multi-mount prevention (MMP) block.
368 371 4 Number of blocks to read or write before returning to the current disk in a RAID array (number of disks * stride).
372 372 1 log2 (groups per flex). (In other words, the number to shift 1 to the left by to obtain the number of block groups per flex block group)
373 373 1 Metadata checksum algorithm used. Linux only supports crc32c.
374 374 1 Encryption version level.
375 375 1 Reserved padding.
376 383 8 Amount of kilobytes written over the filesystem's lifetime.
384 387 4 Inode number of the active snapshot.
388 391 4 Sequential ID of active snapshot.
392 399 8 Number of blocks reserved for active snapshot.
400 403 4 Inode number of the head of the disk snapshot list.
404 407 4 Amount of errors detected.
408 411 4 First time an error occurred in POSIX time.
412 415 4 Inode number in the first error.
416 423 8 Block number in the first error.
424 455 32 Function where the first error occurred.
456 459 4 Line number where the first error occurred.
460 463 4 Most recent time an error occurred in POSIX time.
464 467 4 Inode number in the last error.
468 475 8 Block number in the last error.
476 507 32 Function where the most recent error occurred.
508 511 4 Line number where the most recent error occurred.
512 575 64 Mount options. (C-style string: characters terminated by a 0 byte)
576 579 4 Inode number for user quota file.
580 583 4 Inode number for group quota file.
584 587 4 Overhead blocks/clusters in filesystem. Zero means the kernel calculates it at runtime.
588 595 8 Block groups with backup superblocks, as an array of two 32 bit integers, if the sparse_super2 optional feature (0x0200) is set.
596 599 4 Encryption algorithms used, as an array of unsigned char.
600 615 16 Salt for the `string2key` algorithm.
616 619 4 Inode number of the lost+found directory.
620 623 4 Inode number of the project quota tracker.
624 627 4 Checksum of the UUID, used for the checksum seed. (crc32c(~0, UUID))
628 628 1 High 8-bits of the last written time field.
629 629 1 High 8-bits of the last mount time field.
630 630 1 High 8-bits of the Filesystem creation time field.
631 631 1 High 8-bits of the last consistency check time field.
632 632 1 High 8-bits of the first time an error occurred time field.
633 633 1 High 8-bits of the latest time an error occurred time field.
634 634 1 Error code of the first error.
635 635 1 Error code of the latest error.
636 637 2 Filename charset encoding.
638 639 2 Filename charset encoding flags.
640 1019 380 Padding.
1020 1023 4 Checksum of the superblock.
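
The sketch below shows one way to verify the superblock checksum. It is written under two assumptions that should be checked against the kernel source: that the checksum covers bytes 0 to 1019 of the superblock, and that it is the crc32c variant seeded with ~0 without a final bit inversion (the same form as the `crc32c(~0, UUID)` seed above). `crc32c_raw` is a slow bitwise implementation written for this example, not a library call, and the checksum is only meaningful on filesystems that enable metadata checksums.

```python
def crc32c_raw(seed: int, data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).

    Unlike the textbook definition this applies no final XOR, so the
    standard CRC32C equals crc32c_raw(0xFFFFFFFF, data) ^ 0xFFFFFFFF.
    """
    crc = seed & 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc


def superblock_checksum_ok(sb: bytes) -> bool:
    """Compare the stored checksum (bytes 1020..1023) with a fresh one."""
    stored = int.from_bytes(sb[1020:1024], "little")
    # Assumption: the checksum covers everything before the checksum field.
    return crc32c_raw(0xFFFFFFFF, sb[:1020]) == stored
```

A table-driven or hardware-accelerated crc32c would be preferable in a real driver; the bitwise loop is used here only to keep the example self-contained.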

Required Features:

Flag Value   Description
0x00001 Compression is used.
0x00002 Directory entries contain a type field.
0x00004 Filesystem needs to replay the Journal to recover data.
0x00008 Filesystem uses a journal device.
0x00010 Filesystem uses Meta Block Groups.
0x00040 Filesystem uses extents for files.
0x00080 Filesystem uses 64 bit features.
0x00100 Filesystem uses Multiple Mount Protection.
0x00200 Filesystem uses Flex Block Groups.
0x00400 Filesystem uses Extended Attributes in Inodes.
0x01000 Filesystem uses Data in Directory Entries. This is not implemented as of Linux 5.9rc3.
0x02000 Filesystem stores the metadata checksum seed in the superblock. This allows for changing the UUID without rewriting all of the metadata blocks.
0x04000 Directories may be larger than 4GiB and have a maximum HTREE depth of 3.
0x08000 Data may be stored in the inode. See Inline Data for a discussion of this feature.
0x10000 Filesystem uses Encryption.
0x20000 Filesystem uses case folding, storing the filesystem-wide encoding in inodes.
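
When debugging a driver it helps to print the required feature word in readable form. The following sketch maps each bit from the table above to a short label and reports any bits it does not know. The labels are informal names chosen for this example, and `decode_required` is a hypothetical helper.

```python
REQUIRED_FEATURES = {
    0x00001: "compression",
    0x00002: "filetype",
    0x00004: "recover",
    0x00008: "journal_dev",
    0x00010: "meta_bg",
    0x00040: "extents",
    0x00080: "64bit",
    0x00100: "mmp",
    0x00200: "flex_bg",
    0x00400: "ea_inode",
    0x01000: "dirdata",
    0x02000: "csum_seed",
    0x04000: "largedir",
    0x08000: "inline_data",
    0x10000: "encrypt",
    0x20000: "casefold",
}


def decode_required(flags: int):
    """Return (list of known feature labels, mask of unrecognized bits)."""
    names = [name for bit, name in REQUIRED_FEATURES.items() if flags & bit]
    known_mask = 0
    for bit in REQUIRED_FEATURES:
        known_mask |= bit
    return names, flags & ~known_mask
```

A non-zero second element means the volume uses a required feature this table does not list, in which case a driver following the rules above must refuse to mount.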

Optional Features:

Flag Value   Description
0x0001 Preallocate some number of blocks (see byte 205 in the superblock) to a directory when creating a new one.
0x0002 Possibly unused, "imagic inodes"
0x0004 Filesystem uses a Journal
0x0008 Inodes have Extended Attributes.
0x0010 Filesystem can resize itself for larger partitions.
0x0020 Directories use hash index.
0x0200 Backup the superblock in other block groups.
0x0800 Inode numbers do not change during resize.

Block Group Descriptor

See Ext2 wiki page for an introduction to block group descriptor tables.

In Ext4, the block group descriptors still record the location of each group's important data structures, as in Ext2, and additionally support new features such as flex block groups and meta block groups.

With flex block groups, several consecutive block groups are tied together into one logical flex block group. Because each group descriptor records the location of both bitmaps and the inode table, this metadata does not have to live inside its own group and can be packed together for the whole flex block group. This allows for better data locality.

With meta block groups, the filesystem is partitioned into multiple 'metablock' groups, each a cluster of block groups whose descriptors can be stored in one block. Without metablock groups, all block group descriptors must fit in the first block group, which limits the filesystem to 256TiB at the default block group size. With metablock groups the maximum filesystem size increases to 512PiB.
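
The two size limits can be reproduced with a little arithmetic. This sketch assumes the usual defaults of 128MiB block groups (32768 blocks of 4KiB) and 64-byte group descriptors, and assumes that block group numbers are 32 bits wide when metablock groups are in use. The variable names are made up for illustration.

```python
BLOCK_GROUP_BYTES = 2 ** 27   # assumed default: 32768 blocks * 4096 bytes = 128 MiB
DESCRIPTOR_BYTES = 64         # assumed descriptor size in 64 bit mode
TIB = 2 ** 40
PIB = 2 ** 50

# Without metablock groups every descriptor must fit inside the first block group.
max_groups_plain = BLOCK_GROUP_BYTES // DESCRIPTOR_BYTES      # 2**21 groups
max_bytes_plain = max_groups_plain * BLOCK_GROUP_BYTES        # 2**48 bytes

# With metablock groups the limit is the 32-bit block group number instead.
max_groups_meta = 2 ** 32
max_bytes_meta = max_groups_meta * BLOCK_GROUP_BYTES          # 2**59 bytes

print(max_bytes_plain // TIB, "TiB without metablock groups")
print(max_bytes_meta // PIB, "PiB with metablock groups")
```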

Locating the Block Group Descriptors

Locating the Block Group Descriptors is similar to Ext2, except for metablock and flex block group locations.
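
For the plain case (no metablock groups), the Ext2 rule can be sketched as follows: the descriptor table begins in the block right after the one holding the superblock, and descriptors are packed back to back. The helper names are hypothetical. The descriptor size is assumed to be 32 bytes unless the 64 bit feature is set, in which case it is read from superblock bytes 254 to 255.

```python
def group_count(total_blocks: int, first_data_block: int, blocks_per_group: int) -> int:
    """Number of block groups, rounding the last partial group up."""
    return (total_blocks - first_data_block + blocks_per_group - 1) // blocks_per_group


def descriptor_offset(group: int, first_data_block: int,
                      block_size: int, desc_size: int = 32) -> int:
    """Byte offset of a group's descriptor in the primary descriptor table.

    Assumes the classic layout without metablock groups: the table starts in
    the block after the superblock's block (superblock bytes 20..23).
    """
    table_start = (first_data_block + 1) * block_size
    return table_start + group * desc_size
```

With 1024 byte blocks the superblock lives in block 1, so the table starts at byte 2048. With larger blocks the superblock lives in block 0 and the table starts at the beginning of block 1.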


One can check for flex block groups by testing the required feature flag 'Filesystem uses Flex Block Groups' (0x00200). The number of block groups in each flex block group is derived from the superblock (byte 372).

Flex block group info structure. These fields are atomic integers, which Linux keeps in memory as running totals for each flex block group:

Starting Byte   Ending Byte   Size in Bytes   Description
0 7 8 Atomic 64 bit free clusters.
8 11 4 Atomic free inodes.
12 15 4 Atomic used directories.

Block Group Descriptor

Block group descriptor Structure:

Starting Byte   Ending Byte   Size in Bytes   Description
0 3 4 Low 32bits of block address of block usage bitmap.
4 7 4 Low 32bits of block address of inode usage bitmap.
8 11 4 Low 32bits of starting block address of inode table.
12 13 2 Low 16bits of number of unallocated blocks in group.
14 15 2 Low 16bits of number of unallocated inodes in group.
16 17 2 Low 16bits of number of directories in group.
18 19 2 Block group features present.
20 23 4 Low 32-bits of block address of snapshot exclude bitmap.
24 25 2 Low 16-bits of Checksum of the block usage bitmap.
26 27 2 Low 16-bits of Checksum of the inode usage bitmap.
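28 29 2 Low 16-bits of number of unused inodes at the end of the group's inode table. This allows us to optimize inode searching, since a scan can stop before the unused entries.
30 31 2 Checksum of the block group descriptor, CRC16(UUID+group+desc).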

These fields are valid if the 64bit feature is set and the superblock's group descriptor size is greater than 32.

Starting Byte   Ending Byte   Size in Bytes   Description
32 35 4 High 32-bits of block address of block usage bitmap.
36 39 4 High 32-bits of block address of inode usage bitmap.
40 43 4 High 32-bits of starting block address of inode table.
44 45 2 High 16-bits of number of unallocated blocks in group.
46 47 2 High 16-bits of number of unallocated inodes in group.
48 49 2 High 16-bits of number of directories in group.
50 51 2 High 16-bits of number of unused inodes in the group's inode table.
52 55 4 High 32-bits of block address of snapshot exclude bitmap.
56 57 2 High 16-bits of checksum of the block usage bitmap.
58 59 2 High 16-bits of checksum of the inode usage bitmap.
60 63 4 Reserved as of Linux 5.9rc3.
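
The low and high halves in the two tables above have to be combined before use. The sketch below decodes a descriptor and merges the halves when the buffer is at least 64 bytes long. The function and key names are illustrative only, and the caller is assumed to pass exactly one descriptor of the size given in the superblock.

```python
import struct


def parse_group_descriptor(desc: bytes) -> dict:
    """Decode a 32- or 64-byte block group descriptor (little-endian)."""
    (block_bitmap, inode_bitmap, inode_table,
     free_blocks, free_inodes, used_dirs, flags) = struct.unpack_from("<3I4H", desc, 0)
    (itable_unused,) = struct.unpack_from("<H", desc, 28)

    if len(desc) >= 64:
        # High halves start at byte 32 and mirror the order of the low halves.
        (bb_hi, ib_hi, it_hi,
         fb_hi, fi_hi, ud_hi, iu_hi) = struct.unpack_from("<3I4H", desc, 32)
        block_bitmap |= bb_hi << 32
        inode_bitmap |= ib_hi << 32
        inode_table |= it_hi << 32
        free_blocks |= fb_hi << 16
        free_inodes |= fi_hi << 16
        used_dirs |= ud_hi << 16
        itable_unused |= iu_hi << 16

    return {
        "block_bitmap": block_bitmap,
        "inode_bitmap": inode_bitmap,
        "inode_table": inode_table,
        "free_blocks": free_blocks,
        "free_inodes": free_inodes,
        "used_dirs": used_dirs,
        "flags": flags,
        "itable_unused": itable_unused,
    }
```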

Block group flags:

Flag Value   Description
0x0001 Block group's inode bitmap/table is unused.
0x0002 Block group's block bitmap is unused.
0x0004 Block group's inode table is zeroed.

Multiple Mount Protection

Multiple Mount Protection (MMP) protects against a filesystem being mounted by multiple hosts at once, which would lead to dangerous data races. This feature writes a sequence number into the block referenced by the MMP block superblock field.

To check for MMP, check the required feature flag Multiple Mount Protection, and the magic field in the block referenced by the MMP block superblock field.

When mounting, the driver checks the sequence number in the MMP block. If the sequence number is the special 'fsck running' value, or any unknown code above the maximum MMP sequence value, the filesystem is not safe to mount, even if the timestamp is outdated.

While running, the driver checks the MMP block sequence number at the interval specified in the MMP check superblock field. If it does not match the in-memory sequence number, a different host has mounted the filesystem, and the driver must remount read-only. If it does match, the driver increments the number in memory and on disk. The driver also writes the hostname and device name into the MMP block when open() succeeds.

The minimum MMP check interval is 5 seconds and the maximum is 300 seconds.

MMP structure

Starting Byte   Ending Byte   Size in Bytes   Description
0 3 4 MMP signature (0x004d4d50)
4 7 4 Sequence Number (see below)
8 15 8 Last updated time (does not affect algorithm)
16 79 64 Hostname of the system that opened the filesystem (does not affect algorithm)
80 111 32 Block device name of the filesystem that was opened (does not affect algorithm)
112 113 2 Interval to check MMP block
114 115 2 Padding.
116 1019 904 Padding.
1020 1023 4 Checksum (crc32c(UUID+MMP Block number))
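
A minimal sketch of reading the MMP block and classifying its sequence number is shown below. The three special sequence values are quoted from memory of the Linux ext4 headers and should be treated as assumptions to verify. The function names, key names and state labels are invented for this example.

```python
import struct

MMP_MAGIC = 0x004D4D50
# Assumed values, recalled from the Linux ext4 headers; verify before relying on them.
MMP_SEQ_CLEAN = 0xFF4D4D50   # filesystem was unmounted cleanly
MMP_SEQ_FSCK = 0xE24D4D50    # a filesystem check is running
MMP_SEQ_MAX = 0xE24D4D4F     # largest ordinary sequence number


def parse_mmp(block: bytes) -> dict:
    """Decode the leading fields of an MMP block (little-endian)."""
    magic, seq, updated = struct.unpack_from("<IIQ", block, 0)
    if magic != MMP_MAGIC:
        raise ValueError("not an MMP block")
    (interval,) = struct.unpack_from("<H", block, 112)
    return {
        "seq": seq,
        "time": updated,
        "nodename": block[16:80].split(b"\0", 1)[0].decode("ascii", "replace"),
        "bdevname": block[80:112].split(b"\0", 1)[0].decode("ascii", "replace"),
        "check_interval": interval,
    }


def mmp_state(seq: int) -> str:
    """Classify a sequence number before attempting to mount."""
    if seq == MMP_SEQ_CLEAN:
        return "clean"
    if seq == MMP_SEQ_FSCK:
        return "fsck"              # never safe to mount
    if seq > MMP_SEQ_MAX:
        return "unknown"           # unrecognized special code: not safe
    return "in-use-or-stale"       # wait and re-read to see whether it changes
```

For the last state a driver would wait for some multiple of the check interval and re-read the block; an unchanged sequence number suggests the previous owner is gone.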

Journaling

See Journaling for a high level description of a filesystem journal.

Ext4 uses the Jbd2 Journaling layer.

Ext4 defines the journal inode as inode 8. The superblock keeps a backup copy of the journal inode's block array and size (68 bytes). The journal is a hidden file in the filesystem, usually occupying an entire block group, and is preferably placed in the middle of the volume.

The optional filesystem journal feature protects against filesystem corruption if the system crashes. Important writes first go to the journal, a small contiguous sliver of disk. Once a transaction's data has been flushed to the journal, the driver writes a commit record for it. Later, the driver can write the transaction's blocks to their final locations on disk. If the system crashes during this second write, the driver can simply replay the journal up to the last commit record. If the write succeeds, the transaction's record is removed from the journal.

The default journaling strategy is 'ordered', which writes only filesystem metadata through the journal but flushes the associated data to disk before the metadata is committed. If stronger guarantees are preferred, the filesystem can use the 'journal' strategy, which writes both metadata and data through the journal at the cost of slower operation. The filesystem can also use the 'writeback' strategy, where data is not flushed to the disk before a metadata update.

Ext4 may also use a separate journal device, identified by the superblock's journal UUID. The separate journal device has 1024 bytes of padding, then an ext4 superblock with a matching UUID. The journal follows on the next complete block.

Journal Superblock Fields

Ext4 uses Jbd2 as the journal.

All fields are big endian unless otherwise specified.

Starting Byte   Ending Byte   Size in Bytes   Description
0 11 12 Journal header (see below)
12 15 4 Block size of the journal device.
16 19 4 Total number of blocks in the journal device.
20 23 4 First block of journal information.
24 27 4 First journal transaction expected.
28 31 4 First block of the journal. A value of zero means the journal is clean.
32 35 4 Errno, if the journal has an error.
36 39 4 Optional features present.
40 43 4 Required features present.
44 47 4 Features that if not supported the journal must be mounted read-only. There are no read-only features as of Linux 5.9rc3.
48 63 16 Journal UUID.
64 67 4 Number of filesystems using this journal.
68 71 4 Block number of the journal superblock copy.
72 75 4 Maximum journal blocks per transaction. This is unused as of Linux 5.9rc3.
76 79 4 Maximum data blocks per transaction. This is unused as of Linux 5.9rc3.
80 80 1 Checksum algorithm.
81 83 3 Padding.
84 251 168 Padding.
252 255 4 Checksum of the journal superblock.
256 1023 768 UUIDs of the filesystems sharing this journal (16 bytes each, up to 48 entries).
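
Unlike the ext4 structures, the journal superblock is big endian, which is easy to get wrong. The sketch below decodes a few of its fields with big-endian format strings. The function and key names are hypothetical, and it assumes the area at bytes 256 to 1023 holds one 16-byte UUID per filesystem using the journal.

```python
import struct

JBD2_MAGIC = 0xC03B3998


def parse_journal_superblock(jsb: bytes) -> dict:
    """Decode selected journal superblock fields (note the '>' for big endian)."""
    magic, block_type, sequence = struct.unpack_from(">3I", jsb, 0)
    if magic != JBD2_MAGIC:
        raise ValueError("journal header magic mismatch")
    (block_size, total_blocks, first_info_block,
     first_transaction, start_block, errno) = struct.unpack_from(">6I", jsb, 12)
    (nr_users,) = struct.unpack_from(">I", jsb, 64)
    # Assumption: one 16-byte UUID per user, at most 48 entries in 768 bytes.
    users = [jsb[256 + 16 * i: 272 + 16 * i] for i in range(min(nr_users, 48))]
    return {
        "block_type": block_type,
        "block_size": block_size,
        "total_blocks": total_blocks,
        "first_info_block": first_info_block,
        "first_transaction": first_transaction,
        "start_block": start_block,
        "errno": errno,
        "journal_uuid": jsb[48:64],
        "users": users,
    }
```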

Journal Header

All fields are big endian unless otherwise specified.

Starting Byte   Ending Byte   Size in Bytes   Description
0 3 4 Magic signature (0xc03b3998)
4 7 4 Block Type (see below)
8 11 4 Journal transaction for this block.
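
Since every journal metadata block starts with this 12-byte header, a quick way to inspect a journal image is to walk it block by block and list the blocks that carry the magic signature. The generator below is a simplified sketch with a made-up name. It does not interpret the block type codes or follow transaction boundaries.

```python
import struct

JBD2_MAGIC = 0xC03B3998


def scan_journal_headers(journal: bytes, block_size: int):
    """Yield (block index, block type, transaction id) for header-bearing blocks."""
    for index in range(len(journal) // block_size):
        # All three header fields are big endian.
        magic, block_type, sequence = struct.unpack_from(">3I", journal, index * block_size)
        if magic == JBD2_MAGIC:
            yield index, block_type, sequence
```

Blocks without the signature are skipped; in a real journal these would hold the logged data for a transaction.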


See Also

External Links