Virtual Memory¶
Overview¶
- Physical addresses are provided directly by the hardware; the range of valid physical addresses is determined by the amount of installed RAM.
- Virtual addresses (or logical addresses) are addresses provided by the kernel to the processes. There is one virtual address space per process. Processes use virtual addresses.
- As a process runs, the hardware converts each virtual address to a physical address. The conversion of a virtual address to a physical address is called address translation.
- With multitasking, processes share the processor over time and share physical memory in space, i.e. different processes share the same physical memory.
- B always means byte and b always means bit; 8b = 1B. For CS350, K, M, G and T will always be powers of 2: 1K = \(2^{10}\), 1M = \(2^{20}\), 1G = \(2^{30}\), 1T = \(2^{40}\)
- The hardware that translates virtual addresses to physical addresses is called the Memory Management Unit (MMU). Typically there is one MMU per core; it is managed by the kernel and accessed with privileged (kernel-mode) instructions.
- Address translation methods:
- load-time linking: add the base physical address to the virtual address
- Base+Bound registers: use two registers, base and bound. Add the base to the virtual address to get the physical address; the virtual address must be less than bound (see the sketch below).
- Note: Virtual addresses that cannot be translated produce exceptions.
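A minimal sketch of Base+Bound translation in C (the struct and helper names are illustrative, not any particular architecture's registers):

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative MMU state: one base and one bound register per process. */
struct mmu {
    uint32_t base;   /* start of the process's physical memory block */
    uint32_t bound;  /* size of that block, in bytes */
};

/* Translate a virtual address; returns false where the hardware would
 * raise an exception, true with the physical address in *paddr otherwise. */
bool bb_translate(const struct mmu *m, uint32_t vaddr, uint32_t *paddr)
{
    if (vaddr >= m->bound)
        return false;            /* out of bounds: exception */
    *paddr = m->base + vaddr;    /* in bounds: relocate by the base */
    return true;
}
```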
- Disadvantages:
- In Base + Bound, the process must occupy one contiguous block of physical memory. So growing a process is expensive or impossible if the process needs more memory.
- No way to share code or data (E.g., many programs share the stdio library.)
- Solution: use multiple segments
- Segmentation: Let processes have base and bound registers for (1) text (2) data and (3) stack. Could further break data into (2a) readonly data (2b) global variables and heap. With this, we must specify the segment as part of virtual address.
- Each process has its own segment table. Each virtual address is divided into a segment number and an offset.
- Allows multiple segments per process, allows sharing, and removes the need to keep the entire process in memory. However, it requires address translation hardware, which could limit performance, and it makes fragmentation a problem.
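A minimal sketch of segment-based translation, assuming (purely for illustration) that the top two bits of the virtual address select one of four segments:

```c
#include <stdint.h>
#include <stdbool.h>

#define NSEGS 4   /* e.g. text, read-only data, data+heap, stack */

/* One base+bound pair per segment in the process's segment table. */
struct segment { uint32_t base, bound; };

/* Top 2 bits = segment number, remaining 30 bits = offset (assumed split). */
bool seg_translate(const struct segment table[NSEGS],
                   uint32_t vaddr, uint32_t *paddr)
{
    uint32_t seg    = vaddr >> 30;          /* segment number */
    uint32_t offset = vaddr & 0x3FFFFFFFu;  /* offset within the segment */

    if (offset >= table[seg].bound)
        return false;                       /* exception: past the bound */
    *paddr = table[seg].base + offset;
    return true;
}
```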
- Fragmentation: Inability to use free memory because it is in too many small pieces.
- Internal: Occurs when memory is allocated in fixed-sized slices; the unused space at the end of a slice is wasted.
- External: Occurs when memory is allocated in variable-sized slices; free memory ends up scattered in small gaps between allocations.
- Paging: Divide memory up into small fixed-sized slices (e.g. 4KB) called pages. Pages in physical memory are called frames or physical pages, and pages in virtual memory are called virtual pages. For 4KB pages, the 12 least significant bits are called the page offset and the remaining most significant bits are called the page number. For each process, the kernel maps each of its virtual page numbers (VPN) to a physical page number (PPN) using a kernel data structure called a page table.
- Each process has its own page table.
- The page table maps pages (from virtual memory) to frames (in physical memory).
- Each row of a page table is called a page table entry (PTE).
- Not all available pages are used.
- The valid bit is used to indicate if the PTE is used or not.
- Number of PTEs = Maximum Virtual Memory Size / Page Size
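As a worked example: with 32-bit virtual addresses and 4KB pages, the offset takes 12 bits, leaving a 20-bit VPN, so a single-level table needs \(2^{32}/2^{12} = 2^{20}\) PTEs. A small C sketch of the split:

```c
#include <stdint.h>

#define PAGE_SHIFT 12                         /* 4 KB pages => 12 offset bits  */
#define PAGE_SIZE  (1u << PAGE_SHIFT)         /* 4096 bytes                    */
#define NUM_PTES   (1u << (32 - PAGE_SHIFT))  /* 2^20 entries for a 32-bit VA  */

/* Split a 32-bit virtual address into (VPN, offset). */
static inline uint32_t vpn(uint32_t vaddr)    { return vaddr >> PAGE_SHIFT; }
static inline uint32_t offset(uint32_t vaddr) { return vaddr & (PAGE_SIZE - 1); }
```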
- MMU (Memory Management Unit) doesn’t store the whole page table internally (because it can be large), but instead uses a page table base register to find where the page table is stored in memory, and uses it to translate addresses during execution.
- Single-level page tables can be very, very large.
- Two-level paging: A large, contiguous table is replaced with multiple smaller tables, each fitting onto a single page. If a table contains no valid PTEs, do not create that table.
- Multi-level paging: The MMU’s page table base register points to the level 1 table for the current process. Each virtual address v has n+1 parts: (p1, p2, ..., pn, offset)
- A goal of multi-level paging is to reduce the size of individual page tables.
- When the number of entries required in the directory is so large that they no longer fit on a page (in this case 4KB) ⇒ add more levels to map larger virtual memories efficiently (see the walk sketched below).
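A sketch of a two-level walk in C, assuming the classic (10, 10, 12) split of a 32-bit address, so each table holds 1024 four-byte entries and fits exactly on one 4 KB page. Real MMUs do this walk in hardware or microcode; here entries are treated as directly usable pointers for simplicity:

```c
#include <stdint.h>

#define PTE_VALID 0x1u

/* pt_base plays the role of the MMU's page table base register. */
uint32_t walk(const uint32_t *pt_base, uint32_t vaddr, int *fault)
{
    uint32_t p1     = (vaddr >> 22) & 0x3FFu;  /* index into level-1 table */
    uint32_t p2     = (vaddr >> 12) & 0x3FFu;  /* index into level-2 table */
    uint32_t offset = vaddr & 0xFFFu;

    uint32_t pde = pt_base[p1];                /* level-1 (directory) entry */
    if (!(pde & PTE_VALID)) { *fault = 1; return 0; }

    const uint32_t *l2 = (const uint32_t *)(uintptr_t)(pde & ~0xFFFu);
    uint32_t pte = l2[p2];                     /* level-2 entry */
    if (!(pte & PTE_VALID)) { *fault = 1; return 0; }

    *fault = 0;
    return (pte & ~0xFFFu) | offset;           /* frame address + offset */
}
```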
The MMU and TLB¶
- TLB stands for Translation Lookaside Buffer, which is a cache for PTEs in the MMU
- The TLB is a small, fast, dedicated cache of address translations in the MMU.
- Each TLB entry stores a single (virtual page # → physical frame #) mapping.
- I.e. store the pair (p,f) which means translate from p to f.
- There are two types of TLBs: hardware-managed TLBs and software-managed TLBs.
Algorithm for a Hardware-Managed TLB:

```
if (p has an entry (p,f) in the TLB) then
    return f    // TLB hit!
else
    find p's frame number (f) from the page table
    add (p,f) to the TLB, evicting another entry if the TLB is full
    return f    // TLB miss
```

- On a context switch (when the CPU switches from one process to another), the kernel must flush (clear) the TLB so the new process doesn't use leftover entries from the old one.
- MIPS has a software-managed TLB, which translates a virtual address on page p like this:
```
if (p has an entry (p,f) in the TLB) then
    return f    // TLB hit!
else
    raise exception    // TLB miss
```

- If the TLB misses, the kernel must (see the handler sketched after this list):
- determine the frame number for p
- add (p,f) to the TLB, evicting another entry if necessary.
- After the miss is handled, the instruction that caused the exception is re-tried.
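A sketch of such a miss handler in C, assuming 4 KB pages; `pagetable_lookup`, `tlb_write_random`, and `cur_pagetable` are hypothetical helpers, not a real MIPS or OS/161 API:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers (not a real kernel API): */
struct pagetable;
extern struct pagetable *cur_pagetable;
bool pagetable_lookup(struct pagetable *pt, uint32_t vpn, uint32_t *pfn);
void tlb_write_random(uint32_t vpn, uint32_t pfn);  /* evicts a random entry */
void handle_page_fault(uint32_t vaddr);

/* Kernel's TLB-miss exception handler, following the algorithm above. */
void tlb_miss_handler(uint32_t vaddr)
{
    uint32_t p = vaddr >> 12;   /* faulting virtual page number */
    uint32_t f;

    if (!pagetable_lookup(cur_pagetable, p, &f)) {
        handle_page_fault(vaddr);   /* no valid mapping at all */
        return;
    }
    tlb_write_random(p, f);   /* add (p,f), evicting an entry if necessary */
    /* Returning from the exception retries the faulting instruction. */
}
```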
- Software-managed TLBs use simpler hardware and rely on the OS to handle TLB misses, giving flexibility but requiring fast software support. In contrast, hardware-managed TLBs shift the work to the processor itself for faster automatic translation, at the cost of more complex hardware — making them ideal for modern architectures.
- MIPS uses a software-managed TLB with 64 entries, meaning the kernel handles TLB misses manually. Each entry maps a virtual page number to a physical frame, along with process ID and flags. The kernel runs in a separate, unpaged high-memory region, and is free to implement page tables however it chooses. Efficient TLB miss handling is critical, as even the page table handler itself might need to access paged-out memory.
- Unpaged means the kernel’s memory is always in RAM, never swapped out.
- In MIPS and similar systems, paged and unpaged refer to how virtual memory addresses are translated to physical addresses. Paged memory requires translation through the page table and TLB (Translation Lookaside Buffer), allowing for flexible virtual-to-physical mappings and support for features like memory protection and swapping. This is commonly used for user programs and some kernel space (like `useg` and `kseg2`), but comes with some overhead due to the translation process. In contrast, unpaged memory uses a direct mapping, meaning the virtual address maps straight to a physical address without going through the TLB or page tables. This makes memory access very fast and predictable, which is why it's used in kernel segments like `kseg0` and `kseg1`. However, it is less flexible and always stays resident in physical memory. This trade-off between flexibility and speed is why systems often use a combination of paged and unpaged regions.
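For reference, the standard MIPS32 address-space layout behind these segment names, sketched as C constants (`useg` is also called `kuseg`):

```c
#include <stdint.h>

/* Standard MIPS32 virtual address layout: */
#define KUSEG_BASE 0x00000000u  /* useg:  user space, TLB-mapped, cached      */
#define KSEG0_BASE 0x80000000u  /* kseg0: unmapped (paddr = vaddr - 0x80000000), cached   */
#define KSEG1_BASE 0xA0000000u  /* kseg1: unmapped (paddr = vaddr - 0xA0000000), uncached */
#define KSEG2_BASE 0xC0000000u  /* kseg2: kernel space, TLB-mapped, cached    */
```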

- TLB Lookup Algorithm (note: "set" means the bit is 1; "not set" means it is 0). A C sketch follows this list.
- If the most significant bit (MSB) is 1 (i.e. virtual address >= 0x80000000, so it is a kernel address) and the core is in user mode → address error exception.
- If no VPN match → TLB miss exception: the translation is not in the TLB.
- If the PID mismatches and the global bit is not set → TLB miss exception: the translation is not in the TLB for this process.
- If the valid bit is not set → TLB miss exception. The entry is stale.
- Write to read-only page → TLB mod(ification) exception.
- If the N bit is set → directly access device memory (disable cache).
- This is important for memory-mapped I/O (MMIO), where:
- You write to memory addresses that actually control hardware.
- Caching would cause incorrect behavior (stale reads, skipped writes).
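A simplified C sketch of that decision sequence; the entry layout and field names are illustrative (a real MIPS TLB packs these bits into the EntryHi/EntryLo registers):

```c
#include <stdint.h>
#include <stdbool.h>

/* One TLB entry, roughly in the shape of a MIPS entry (illustrative). */
struct tlb_entry {
    uint32_t vpn, pfn;
    uint8_t  pid;
    bool     global, valid, writable, nocache;  /* G, V, D (writable), N */
};

enum result { OK, OK_UNCACHED, ADDR_ERROR, TLB_MISS, TLB_MOD };

/* Mirrors the lookup steps in the list above. */
enum result tlb_lookup(const struct tlb_entry tlb[], int n,
                       uint32_t vaddr, uint8_t cur_pid,
                       bool user_mode, bool is_write, uint32_t *paddr)
{
    if (user_mode && (vaddr & 0x80000000u))
        return ADDR_ERROR;                      /* user touched a kernel address */

    for (int i = 0; i < n; i++) {
        const struct tlb_entry *e = &tlb[i];
        if (e->vpn != vaddr >> 12) continue;            /* no VPN match */
        if (e->pid != cur_pid && !e->global) continue;  /* wrong process */
        if (!e->valid) return TLB_MISS;                 /* stale entry */
        if (is_write && !e->writable) return TLB_MOD;   /* read-only page */
        *paddr = (e->pfn << 12) | (vaddr & 0xFFFu);
        return e->nocache ? OK_UNCACHED : OK;           /* N bit set? */
    }
    return TLB_MISS;                                    /* translation not in TLB */
}
```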
- Hardware-Managed MMU: The TLB is managed by hardware and microcode; the TLB is automatically reloaded from the page table by the processor. If the entry is missing in the page tables, the result is a page fault.
- The kernel doesn’t live in a totally separate address space because:
- Most hardware doesn’t support switching spaces during syscalls.
- It would make system calls inefficient and complicated, especially when accessing user memory.
- Kernel addresses are in the same address space as the user process.
- Use protection bits to prohibit user code from reading and writing kernel addresses.
- The kernel is typically mapped into the upper half of every virtual address space, always at the same virtual address, to ensure consistency and efficiency. This mapping is set up during boot and stays the same across processes.
Paging¶
- Paging with swapping allows the OS to provide the illusion of a large memory space by temporarily storing unused pages on disk, and bringing them into RAM only when needed. This enables running more and larger programs than physical memory alone would allow.
- When swapping is used, some pages of a process's virtual memory will be in main memory while others will be on disk. The subset of pages currently in physical memory is called the resident set, and it changes as pages are swapped in and out. To manage this, each page table entry (PTE) includes a present bit that indicates whether the page is in RAM or swapped out. If `valid=1` and `present=1`, the page is in main memory. If `valid=1` but `present=0`, the page is valid but currently stored on disk. If `valid=0`, the page is entirely invalid and cannot be accessed. This mechanism lets the operating system track and manage which virtual pages are actually resident in RAM.
- When a process accesses a virtual page that has its present bit set to 0, it means that the page is not in main memory. On a system with a hardware-managed TLB, the MMU (Memory Management Unit) checks the page table entry (PTE) and triggers an exception if the page is not present; this exception is then handled by the kernel. On a system with a software-managed TLB, it is the kernel that checks the PTE (usually after a TLB miss), and it ensures that no TLB entry maps to a page that isn't currently in RAM. In both cases, the act of trying to use a page that is not resident in physical memory causes a page fault, prompting the OS to load the missing page from disk.
- When a page fault occurs, the kernel is responsible for handling it. First, it loads the missing page from secondary storage (such as an SSD or HDD) into RAM. If RAM is full, the kernel may need to evict an existing page—moving it back to disk to make room. Next, the kernel updates the page table entry (PTE) for the newly loaded page, setting the present bit to indicate that the page is now in memory. Finally, the kernel resumes the application, allowing it to retry the memory access that initially caused the page fault.
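A sketch of those steps in C; `try_alloc_frame`, `choose_victim`, and the other helpers are hypothetical, not a real kernel API:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical kernel helpers for the steps described above: */
bool     try_alloc_frame(uint32_t *frame);  /* succeeds if RAM has a free frame */
uint32_t choose_victim(void);               /* pick a resident page's frame     */
bool     frame_is_dirty(uint32_t frame);
void     write_to_disk(uint32_t frame);
void     read_from_disk(uint32_t vpn, uint32_t frame);
void     pte_set_present(uint32_t vpn, uint32_t frame, bool present);

void page_fault(uint32_t faulting_vpn)
{
    uint32_t frame;
    if (!try_alloc_frame(&frame)) {        /* RAM is full: evict something */
        frame = choose_victim();
        if (frame_is_dirty(frame))
            write_to_disk(frame);          /* write back modified contents */
        /* (the victim's own PTE would also be marked not-present) */
    }
    read_from_disk(faulting_vpn, frame);   /* load the missing page */
    pte_set_present(faulting_vpn, frame, true);
    /* On return, the faulting memory access is retried and now succeeds. */
}
```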
- Secondary storage is much, much slower than memory: about 1000x slower for a typical SSD.
- 80/20 rule: 20% of memory gets 80% of memory accesses
- Keep the hot 20% in memory
- Keep the cold 80% on disk
- When a page fault occurs, the hardware provides the kernel with key information such as the faulting virtual address, the instruction that caused the fault, the type of access (read, write, or fetch), and whether it was an illegal user-mode access to kernel memory. This helps the kernel determine how to respond. The system must also be able to resume execution after handling the fault. Instructions that are idempotent—such as load or store operations—are easy to handle in this context because they can be safely retried without changing the program’s behavior. The key idea is that any instruction accessing only one memory location can be re-executed without side effects, ensuring correctness even if interrupted by a page fault.
- When a page fault occurs, the operating system must at minimum load the page that caused the fault. However, to improve performance, it can pre-fetch surrounding pages as well. Since reading multiple disk blocks is nearly as fast as reading one, this can be beneficial if the application exhibits spatial locality—meaning it accesses nearby memory locations. In such cases, prefetching can significantly reduce future page faults. Additionally, operating systems often pre-zero unused pages during idle CPU time. Zero-filled pages are needed for memory allocations such as stack, heap, or anonymous mappings, and preparing them in advance avoids the delay of zeroing on demand. This helps improve responsiveness. Finally, when memory becomes full, the OS may need to evict some pages—moving them from RAM to secondary storage—to make space for new pages. These optimizations help virtual memory systems perform more efficiently.
- Operating systems can improve memory performance by using large page mappings, such as the 2 MB or 4 MB pages supported by x64 systems. These larger pages reduce the number of entries required in the Translation Lookaside Buffer (TLB), which helps avoid costly TLB misses that would otherwise require access to main memory. Since the number of pages in the L2 cache can exceed the number of TLB entries, reducing TLB pressure is important. One approach is to use a two-level TLB, where a small, fast L1 TLB handles most lookups and a larger L2 TLB provides backup. Additionally, the operating system can transparently support superpages by combining contiguous pages into larger ones when they are frequently accessed together. This can improve efficiency, especially for applications with predictable memory access patterns. However, managing superpages introduces challenges, such as the need to reserve contiguous physical pages and the added complexity of evicting large pages, particularly when some of their contents have been modified.
- The main idea is to reduce the number of I/O operations that occur along the critical path of execution. When a page fault happens, the OS typically has to evict a page to make room, and that evicted page needs to be written to disk if it is dirty. This write operation is slow and blocks the process. Instead, the OS can maintain a pool of free physical pages. When a page fault occurs, the OS selects a victim page to evict as usual, but it does not immediately reuse that page frame. Instead, it brings in the new page from disk and places it into one of the already free frames from the pool. This allows the process to resume right away. Meanwhile, in the background, the OS writes the victim page to disk and eventually adds its frame back to the free pool. Additionally, the OS can sometimes recover pages from the free pool if they were recently evicted but are still clean (i.e., unmodified). If a process references such a page again, the OS can avoid reloading it from disk by simply restoring it from the free pool. This reuse further reduces disk I/O. Overall, this strategy helps keep execution fast by decoupling slow disk writes from the critical path of servicing a page fault.
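A sketch of the free-pool variant, again with hypothetical helpers; the point is that the dirty write-back is queued for the background, off the critical path:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers for the free-pool scheme: */
uint32_t free_pool_pop(void);                   /* take a frame off the pool    */
bool     free_pool_recover(uint32_t vpn, uint32_t *frame); /* clean recent evictee? */
uint32_t choose_victim(void);
void     bgwriter_enqueue(uint32_t frame);      /* written back in background   */
void     read_from_disk(uint32_t vpn, uint32_t frame);
void     pte_set_present(uint32_t vpn, uint32_t frame, bool present);

void page_fault_with_pool(uint32_t vpn)
{
    uint32_t frame;
    if (free_pool_recover(vpn, &frame)) {       /* still in the pool and clean: */
        pte_set_present(vpn, frame, true);      /* reuse it, no disk I/O at all */
        return;
    }
    bgwriter_enqueue(choose_victim());          /* evict, but write back later  */
    frame = free_pool_pop();                    /* use an already-free frame now */
    read_from_disk(vpn, frame);
    pte_set_present(vpn, frame, true);
}
```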
Eviction Policies¶
- Thrashing happens when a program is using more memory than what is available in physical RAM, causing it to spend most of its time swapping pages in and out of memory rather than doing useful work.
- Global and local allocation are two strategies the operating system can use to decide which pages to evict when memory is full. In global allocation, all processes share a common pool of physical memory, and the OS evicts the least recently used page from any process, regardless of ownership. This approach can be efficient when processes use memory in proportion to their needs, but it can become unfair if one process behaves aggressively, such as constantly accessing a large array. This "memory hog" can evict pages from other well-behaved processes, causing unnecessary page faults and degrading system performance. In contrast, local allocation assigns each process a fixed portion of memory and only evicts pages within that process. This protects processes from interfering with each other and ensures more predictable performance, but may be less flexible in handling dynamic workloads.
- According to the 80/20 model (20% of the memory gets 80% of the accesses), some portion of a program's address space will be heavily used and the remainder will not. This heavily used portion is called the working set of the process. The set of pages that are currently in main memory is called the resident set.
- The operating system can determine how much memory to give each process by monitoring its Page Fault Frequency (PFF). PFF is calculated as the number of page faults divided by the number of instructions executed. A high PFF indicates that the process is experiencing frequent page faults and likely needs more memory. If the PFF exceeds a certain upper threshold, the OS may allocate more memory to the process or swap out other processes to free up space. Conversely, if the PFF falls below a lower threshold, the OS can take memory away from that process and give it to others. This dynamic adjustment helps balance memory usage and prevent thrashing.
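A minimal sketch of this adjustment loop; the thresholds are made-up values and `give_more_frames`/`take_frames` are hypothetical helpers:

```c
#include <stdint.h>

/* Hypothetical per-process bookkeeping: */
struct proc { uint64_t page_faults, instructions; };
void give_more_frames(struct proc *p);   /* hypothetical */
void take_frames(struct proc *p);        /* hypothetical */

#define PFF_HIGH 0.001     /* > 1 fault per 1,000 instructions: starving     */
#define PFF_LOW  0.00001   /* < 1 fault per 100,000 instructions: has spare  */

void adjust_allocation(struct proc *p)
{
    if (p->instructions == 0)
        return;                          /* nothing measured yet */
    double pff = (double)p->page_faults / (double)p->instructions;
    if (pff > PFF_HIGH)
        give_more_frames(p);             /* faulting too often: grow its share  */
    else if (pff < PFF_LOW)
        take_frames(p);                  /* plenty of headroom: shrink its share */
}
```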
- The working set changes across phases of using a program.
- Belady’s Anomaly is a surprising and unintuitive behavior in some page replacement algorithms — particularly FIFO — where increasing the number of page frames actually results in more page faults. Adding more physical memory does not always mean fewer faults.
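The classic demonstration uses the reference string 1,2,3,4,1,2,5,1,2,3,4,5: FIFO takes 9 faults with 3 frames but 10 faults with 4. The short C simulation below reproduces those counts:

```c
#include <stdio.h>
#include <stdbool.h>

/* Count page faults for FIFO replacement with nframes frames (max 16). */
static int fifo_faults(const int *refs, int n, int nframes)
{
    int frames[16], head = 0, used = 0, faults = 0;
    for (int i = 0; i < n; i++) {
        bool hit = false;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) { hit = true; break; }
        if (hit) continue;
        faults++;
        if (used < nframes)
            frames[used++] = refs[i];           /* free frame available */
        else {
            frames[head] = refs[i];             /* replace the oldest page */
            head = (head + 1) % nframes;
        }
    }
    return faults;
}

int main(void)
{
    int refs[] = {1,2,3,4,1,2,5,1,2,3,4,5};     /* classic anomaly example */
    int n = sizeof refs / sizeof refs[0];
    printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));  /* prints 9  */
    printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));  /* prints 10 */
    return 0;
}
```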
- Optimal Page Replacement: Replace page that will not be used for longest period of time in the future. This algorithm is not implementable in practice, because the OS cannot predict the future. It is used as a theoretical benchmark to compare other practical algorithms like FIFO or LRU and see how close they come to optimal behavior.
- LRU page replacement: Approximate the optimal answer with Least Recently Used (LRU) page replacement.
- The second clock hand improves the basic Clock page replacement algorithm by making it more effective at identifying inactive pages, especially in systems with large memory. In this enhanced approach, two hands move together around the circular list of pages. The leading hand clears the Access (A) bits to track which pages have been used recently, while the trailing hand evicts pages whose A bits remain zero, indicating they have not been accessed since the last sweep. This delay helps distinguish truly inactive pages from recently used ones. Additionally, the system can use the hardware Dirty bit to prioritize evicting clean pages, which avoids the cost of writing data back to disk. Overall, this two-handed strategy leads to smarter page replacement and better performance.
- The Clock algorithm can use an n-bit access count instead of just a single A bit to better track how frequently a page has been used over time. On each sweep: count = (A << (n − 1)) | (count >> 1), i.e. the current access bit is shifted into the highest bit and the existing counter is shifted right, gradually aging the page's usage history. This way, pages that are accessed repeatedly maintain higher counter values, while inactive pages slowly decay toward zero. When a page's counter reaches zero, it indicates that the page has not been accessed for a long time and is a good candidate for eviction. This approach provides a more accurate and efficient approximation of Least Recently Used (LRU) behavior than using a single bit.
- Every page has metadata that includes an access bit A and an n-bit count (a sweep-update sketch in C follows).
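A minimal sketch of the sweep update, assuming an 8-bit counter (NBITS and the struct layout are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define NBITS 8   /* width of the per-page aging counter */

struct page_meta {
    bool    A;       /* hardware-set access bit        */
    uint8_t count;   /* aging counter, NBITS bits wide */
};

/* On each clock sweep: count = (A << (n-1)) | (count >> 1). */
void sweep(struct page_meta *pg)
{
    pg->count = (uint8_t)((pg->A << (NBITS - 1)) | (pg->count >> 1));
    pg->A = false;   /* clear A so the next sweep sees fresh activity */
    /* A page whose count has decayed to 0 is a good eviction candidate. */
}
```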

- Stack policy: If a page would be in memory with N frames, it will still be in memory with N+1 frames.
- As you give a process more memory, it will never increase the number of page faults.
- This property ensures that the set of pages in memory with N frames is always a subset of the set with N+1 frames.
- Stack policies do not exhibit Belady’s Anomaly (the weird situation where adding memory causes more page faults like FIFO).

- Note: a page table entry doesn't contain the virtual page number; the VPN is used as the index into the page table.

