Since directories can be nested, a file system can be viewed as a tree, with a single root directory.
Files are always leaves; directories can be interior nodes (if they are non-empty) or leaves (if they are empty).
After saving, the file should remain until it is deleted. This aspect is called persistent storage.
Secondary storage is the first state information we have seen that does not disappear at shutdown: it is persistent storage. Therefore, it is where all important state ultimately resides.
It is called secondary because the processor cannot access its data directly with load/store instructions; data must first be brought into main memory.
Data consists of a sequence of numbered bytes, i.e. a process can request the \(i^{th}\) byte.
Files may change size and content over time.
Upside: it has larger capacity than main memory; Downside: it is slower to access
File systems: the data structures and algorithms used to store, retrieve, and access files.
lseek does not modify the file and does not check if the new position is valid. However, the subsequent read or write operation will.
A directory maps file names (strings) to i-numbers. An i-number (index number) is a unique (within a file system) identifier for a file or a directory.
A directory is a special type of file that:
Stores a list of mappings from file names (strings) to i-numbers (also called inodes or index numbers).
Has its own i-number, just like any regular file.
Files may be identified by pathnames, which describe a path through the directory tree from the root directory to the file.
Q: Only the kernel is permitted to directly edit directories. Why?
In general, the kernel should not trust the user with direct access to data structures that the kernel relies on.
A hard link is an association between a file name (string) and an i-number.
Each entry in a directory is a hard link.
Each file has a unique i-number, but may have multiple pathnames.
In order to avoid cycles, it is not possible to (hard) link to a directory.
The file system ensures that hard links have referential integrity, which means that if the link exists, then the file that it refers to also exists.
A file is deleted when its last hard link is removed (i.e., when its hard-link count becomes zero).
A symbolic link, or soft link, is an association between a file name (string) and a pathname.
Referential integrity is not preserved for symbolic links.
An inode number uniquely identifies a file only within the file system it belongs to. It cannot be used across different file systems.
Hard Link
Definition: A direct association between a file name and an inode (i-number).
Note: An inode (short for index node) is a data structure used by many file systems (like ext4 in Linux) to store information about a file.
Points to: The actual file data (inode).
Can have multiple: Yes — multiple names (hard links) can point to the same file.
Same inode number: Yes — all hard links to a file share the same inode.
File stays even if original name is deleted: Yes — the file remains as long as at least one hard link exists.
Can only link to files: Yes — cannot hard link to directories (to avoid cycles).
Same file system only: Yes — cannot create a hard link across different file systems.
Command: ln existing.txt new_hardlink.txt
Soft Link (Symbolic Link)
Definition: An association between a file name (string) and a pathname
Points to: The name (path) of the target file.
Can break: Yes — if the target file is deleted or moved, the soft link becomes a dangling link.
Different inode: Yes — the symlink has its own inode and just stores the path.
Removed if target is deleted: No — the symlink itself remains, but following it fails (it becomes a dangling link).
Can link to directories: Yes — can link to files or directories.
Can cross file systems: Yes — works across file systems.
Command: ln -s target.txt symlink.txt
Differences between them: A hard link directly points to the file’s underlying inode, meaning it refers to the actual file data and metadata. All hard links to the same file share the same inode, so the file remains accessible as long as one hard link exists. In contrast, a soft link (symbolic link) points to a file’s pathname as a string. It acts as a shortcut, and if the target file is deleted or moved, the soft link becomes broken because it doesn’t point to the underlying inode (the soft link still exists, but accessing it fails with an error).
Mounting does not combine the two file systems into one file system.
It creates a single, hierarchical namespace that combines the namespaces of two file systems.
The new namespace is temporary - it exists only until the file system is unmounted.
In Windows, when you plug in a USB flash drive it is auto-mounted and assigned a letter. In Linux, you may auto-mount or specify where it is mounted.
RAM is usually byte addressable, i.e. can read or write to any byte. Drives are usually sector addressable, i.e. can read or write to any sector.
An inode is a fixed size index structure (i.e. for a given file system, they are all the same size) that holds the file meta-data and pointers to its data blocks.
In VSFS, an inode is 256 bytes
Reserve room for 12 pointers to blocks.
These are direct pointers.
Each pointer points to a different block for storing user data.
Pointers are ordered: the 1st pointer points to the 1st block in the file, the second points to the 2nd block, etc.
Direct pointers alone suffice for small files: with 4 KB blocks, the maximum file size is \(12 × 4KB = 48KB\).
For bigger files, in addition to the 12 direct pointers, we can introduce an indirect pointer.
An indirect pointer points to a block full of direct pointers.
Maximum file size: \((12+1024)×4 KB=1036×4 KB=4144 KB\)
We can even add a double indirect pointer.
A double indirect pointer points to a 4 KB block of indirect pointers. Each of those points to a block that holds 1024 direct pointers, giving \(1024×1024=1{,}048{,}576\) data blocks.
Maximum file size using 12 direct, 1 indirect and 1 double indirect pointers is: \((12 + 1024 + 1024^2) \times 4 \text{ KB}= (12 + 1024 + 1{,}048{,}576) \times 4 \text{ KB}\approx 1{,}049{,}612 \times 4 \text{ KB}\approx 4 \text{ GB}\)
If not enough, use a triple indirect pointer
A triple indirect pointer points to a block of double indirect pointers.
That supports: \(1024^3\) data blocks.
Directories are implemented as files that contain directory entries, each pairing an i-number with a file name (the last component of a pathname). Directory files are updated only by the kernel, in response to file system operations, e.g., create file, create link.
For each process, the kernel maintains a descriptor table, which tracks: which file descriptors the process has open, which file each open descriptor refers to, and the current file position of each descriptor.
System-wide, the kernel maintains:
Open File Table: files that are currently open by any process
Inode Cache: in-memory copies of recently-used inodes
Block Cache: in-memory copies of recently-used data blocks and indirect blocks
When writing a partial block, that block must be read first. When writing (or overwriting) an entire block, no read is required. Generally, only data blocks are written in entire blocks.
Globbing refers to expanding filenames that contain wildcards, such as *.
File names that do not begin with “/” are interpreted as being relative to the cwd (current working directory); otherwise pathname translation happens as before
Shells, such as bash, maintain a search path: an ordered list of directories in which to look for programs you run. The shell checks each directory in order and uses the first match.
Users can escape the search path by using explicit paths, i.e. using either a full (/bin/foo) or relative path (./foo)
After a crash, fields in inodes may be wrong. So, for each inode, count the directory entries referring to it across the entire file system to verify the link count; if there are no entries but the count ≠ 0, move the file to lost+found.
When designing a file system, make sure fsck can recover it.
The OS must always update the file system components in a consistent order
Write a new inode to disk before creating the directory entry
Remove directory name before deallocating inode
Write cleared inode to disk before updating free map
When the kernel has already queued a block (e.g., a directory block) to be written to the disk, that block is in flight—it is sitting in the device’s request queue waiting for the disk head. While it is in that queue:
the in-memory buffer representing that block is marked “busy / under-write”
the kernel must not modify it again until the first write has finished, or it would be rewriting data that the disk driver is about to send out in its old form.
Crash recovery needs updates to happen one at a time, in the right order.
If you write multiple things without waiting (asynchronously), a crash might leave things only partially updated, which causes file system errors.
So recovery only works safely if the writes are synchronous.
File system crash recoverability ensures that the file system can be safely restored after a crash or power loss. To achieve this, the system must update data in a specific, careful order and wait for each write to finish before continuing. This guarantees consistency, but it makes operations like extracting a tar file (untar) much slower—sometimes 10 to 20 times slower—because there are many small, synchronous metadata writes. The performance cost is high, but it’s necessary because a crash can happen at any time.
Soft updates rules:
Never write a pointer before initializing the structure it points to
When creating a file, always write the inode (file data structure) before writing the directory entry (which points to it).
If you write the pointer first and a crash happens, the system could end up pointing to an uninitialized or incorrect file — leading to serious file system corruption.
Never reuse a resource before nullifying all pointers to it
Never clear the last pointer to live resource before setting the new one
Y → X, pronounced “Y depends on X”, which means write X to disk before Y. The arrow is pointing to the one that must be written first. X is called dependee, Y is the depender.
The system can write the blocks in any order.
But must temporarily rollback (i.e. undo) updates with pending dependencies.
You cannot write a block with any arrows pointing to unwritten dependees, but you can temporarily undo whatever change requires the arrow.
Must lock rolled-back version so applications don’t see it, i.e. applications are using the new version but it may not be recorded on disk yet. The system may then choose a convenient order to write the blocks.
Note: when the system writes to disk, it writes the entire block, not just one field or entry inside it.
Don’t write a block to disk if it references (points to) something that hasn’t been safely written to disk yet.
Some dependencies are better handled by postponing updates rather than rolling them back.
Operations requiring soft updates:
Link addition: write directory entry → write bitmap, free inode map (if new inode)
Block deallocation: free block bitmap → write cleared block pointer to disk
You must make sure the pointer is actually null (or 0) on the disk, not just in memory.
Block allocation: block pointer → update the free block bitmap, write data to data block
Link removal: bitmap, inode → directory entry
fsync is the Linux system call to flush (i.e., write) a file’s changes to disk.
Must also flush directory entries, parent directories, etc
unmount flushes all changes to disk on shutdown.
Some buffers must be flushed multiple times to become clean because of dependencies. (remember the rollback stuff?)
Deleting large directory trees is frighteningly fast because the directory entries are nullified first.
The unlink system call returns even if the inode/indirect block is not cached
The biggest crash-recovery challenge is inconsistency. Most of these problematic writes are to metadata. An error in writing file data will ruin one file, but an error in metadata can ruin many files.
Use write-ahead log to journal the metadata.
Reserve a portion of the disk for the log.
Write any metadata operation first to the log, then make the change on the disk.
After crash/reboot, re-play the log (efficient).
May re-do already committed change, but won’t miss anything.
Group multiple operations into one log entry (a transaction), so that, e.g., clearing a directory entry, clearing an inode, and updating the free map either all happen after recovery or none do.
Advantages:
The log is a consecutive portion of disk
Multiple operations can be logged at disk bandwidth (the maximum rate at which data can be transferred between the disk and memory), i.e., logging proceeds as fast as the disk allows, because the log writes are sequential.
It is safe to consider updates committed when written to the log
But we must find the oldest relevant log entry; otherwise, replaying the whole log is redundant and slow.
When the system reboots after a crash, it replays the log to recover.
But without knowing where to start, it might replay the whole log, which is slow and redundant.
The solution is to use checkpoints. A checkpoint records that all updates up to log entry N have been fully written to disk. Once everything up to entry N is safe, the system records N as a checkpoint. During recovery, the system can skip everything before N and replay only the log entries after the last checkpoint.
Most logs are written as a circular buffer (like a ring), so:
You must know where the current log ends.
Don’t replay old records out of order.
You must never overwrite unprocessed log entries (still needed for recovery).
Can include begin transaction/ end transaction records.
A checksum is added to log entries to detect corruption (e.g. if a sector is damaged). This ensures only valid log entries are replayed.