Service Design¶
Server Design & System Model¶

System Model & Load Balancing¶
The Architecture
The standard web architecture consists of Clients connecting through an IP network to a "Single-site server" cluster. Crucially, a Load Manager sits at the entry point to distribute traffic across the backend nodes.
Load Manager Options
The component that does the work: the physical or software element that sits in the network and makes the routing decision (e.g., a hardware switch, a DNS server, or the client app itself).
- DNS-based: Simple, but slow to adapt (hours due to caching) and often unavailable to small clusters.
- The DNS server isn't actually checking whether Server A is healthy; it just blindly hands out the address. If Server A crashes, the DNS server keeps sending people there until a human updates the configuration, and if a user has already cached Server A's address, it may take hours for that entry to expire.
- Appliance (L4): Hardware switches functioning at Layer 4 (Transport).
- Uses specialized hardware switches to route traffic based strictly on Transport Layer data (IP addresses and TCP/UDP ports) for maximum speed.
- Smart Client (L7): Software-based logic at Layer 7 (Application).
- Uses software logic embedded in the client to select servers based on Application Layer data (such as specific URLs, cookies, or headers).
Balancing Techniques
The algorithm used to make decisions: the rule the Load Manager applies to decide where to send the next request (e.g., "pick the shortest line" or "pick the fastest server"). A minimal sketch of two of these policies follows the list.
- Round Robin: Cycles through servers (assumes equal workload/hardware).
- Least Connections: Sends traffic to the server with the fewest active users.
- Response Time: Sends traffic to the fastest responder.
- IP Hash: Ensures a specific user always lands on the same server (session stickiness).
- SDN Based: Uses a software controller to program network switches dynamically, directing traffic based on real-time network conditions rather than fixed hardware rules.
- Chained Failover: Organizes servers into a linear sequence where a request is automatically passed to the next server in the chain if the primary one fails.
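To make these policies concrete, here is a minimal sketch (not any particular load balancer's implementation) of Round Robin and Least Connections selection; the `Backend` record, server names, and connection counts are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical backend descriptor: a name plus a live connection counter.
record Backend(String name, AtomicInteger activeConnections) {}

public class BalancerSketch {
    private final List<Backend> backends;
    private final AtomicInteger nextIndex = new AtomicInteger();

    BalancerSketch(List<Backend> backends) {
        this.backends = backends;
    }

    // Round Robin: cycle through servers, assuming roughly equal capacity.
    Backend roundRobin() {
        int i = Math.floorMod(nextIndex.getAndIncrement(), backends.size());
        return backends.get(i);
    }

    // Least Connections: pick the server with the fewest active users right now.
    Backend leastConnections() {
        return backends.stream()
                .min((a, b) -> Integer.compare(a.activeConnections().get(),
                                               b.activeConnections().get()))
                .orElseThrow();
    }

    public static void main(String[] args) {
        var lb = new BalancerSketch(List.of(
                new Backend("server-a", new AtomicInteger(3)),
                new Backend("server-b", new AtomicInteger(1)),
                new Backend("server-c", new AtomicInteger(7))));
        System.out.println("round robin picks:       " + lb.roundRobin().name());
        System.out.println("least connections picks: " + lb.leastConnections().name());
    }
}
```

An IP Hash policy would replace the rotating index with something like `hash(clientIp) % backends.size()`, so a given client always maps to the same server (session stickiness).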
Performance Metrics¶
Architects evaluate server designs based on four criteria:
- Throughput: Requests processed per second.
- Latency: Response time. Crucially, systems care about Tail Latency (the slowest 1% of requests) rather than just the average, because predictability is vital.
A CDF graph plots Latency on the X-axis and Percentile (0% to 100%) on the Y-axis.
- The Curve: It shows "What percentage of requests finished faster than time X?"
- The Connection: If you want to know the "Tail Latency," you look at the 99th percentile (0.99) on the Y-axis and trace it down to the X-axis.
- Why it matters: In a bad system, the line from 99% to 100% might stretch way out to the right (e.g., from 500ms to 5000ms). This "Long Tail" is what destroys predictability.
- Fairness: Does every request get a turn, or do short requests get stuck behind long ones?
- Programmability: How easy is it to debug and build?
Predictability Challenges
Achieving consistent latency is hard due to workload variance (random spikes in traffic) and cache effects (a miss in a CPU, OS, or disk cache causes massive delays).
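To make the Tail Latency / CDF discussion above concrete, here is a small sketch that computes p50, p99, and p99.9 from a synthetic batch of request latencies; the workload numbers and the naive sort-based percentile are illustrative only.

```java
import java.util.Arrays;
import java.util.Random;

public class TailLatencySketch {
    // Latency at percentile p (0.0-1.0): sort, then index into the sorted array.
    // This is the X-axis value you read off a CDF at height p.
    static double percentile(double[] latenciesMs, double p) {
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, index)];
    }

    public static void main(String[] args) {
        // Fake workload: most requests take ~40-60 ms, but ~1% hit a slow path
        // (cache miss, queueing spike, ...) and stretch out to 0.5-5 seconds.
        Random rng = new Random(42);
        double[] latencies = new double[10_000];
        for (int i = 0; i < latencies.length; i++) {
            latencies[i] = rng.nextInt(100) < 99
                    ? 40 + rng.nextDouble() * 20
                    : 500 + rng.nextDouble() * 4500;
        }
        System.out.printf("p50   = %.0f ms%n", percentile(latencies, 0.50));
        System.out.printf("p99   = %.0f ms%n", percentile(latencies, 0.99));
        System.out.printf("p99.9 = %.0f ms%n", percentile(latencies, 0.999));
    }
}
```

The median looks healthy, but the p99 value exposes the long tail that the average hides.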
Server Design Architectures¶
The slides present the evolutionary stages of server design, from a single sequential process to SEDA.
Single Process (The Loop)¶

- Design: A single process handles all requests sequentially.
- Pros: Very easy to build and debug.
- Cons: Blocking I/O. If one user needs to read a file from disk, the entire server stops. This leads to low throughput and poor resource utilization.
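A minimal sketch of this single-process loop, assuming a toy echo protocol and an arbitrary port; because `accept()` and request handling share one thread, a single slow client stalls every other client.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Single-process design: one loop, one request at a time.
// While one request is blocked on I/O, no other client is served.
public class SingleProcessServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket listener = new ServerSocket(8080)) {    // port is arbitrary
            while (true) {
                try (Socket client = listener.accept()) {         // blocks until a client connects
                    handleRequest(client);                        // everyone else waits meanwhile
                } catch (Exception e) {
                    e.printStackTrace();                          // a bad client should not kill the loop
                }
            }
        }
    }

    static void handleRequest(Socket client) throws Exception {
        var in = new BufferedReader(new InputStreamReader(client.getInputStream()));
        var out = new PrintWriter(client.getOutputStream(), true);
        out.println("echo: " + in.readLine());                    // a slow read here stalls the whole server
    }
}
```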
Multi-Threaded (Thread-per-Request)¶

- Design: The dispatcher creates a new thread for every incoming request.
- Pros: Overlaps I/O with Compute. If Thread A blocks on disk, Thread B can use the CPU.
- Cons (Under Heavy Load):
- Throughput collapses as thread count rises (due to context switching overhead).
- Latency spikes exponentially.
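A minimal thread-per-request sketch under the same toy assumptions (arbitrary port, trivial handler); note that nothing bounds how many threads get created, which is exactly what collapses under heavy load.

```java
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Thread-per-request: every accepted connection gets its own thread.
// I/O in one thread overlaps with compute in another, but under heavy load
// thousands of threads mean context-switching overhead and memory pressure.
public class ThreadPerRequestServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket listener = new ServerSocket(8080)) {    // port is arbitrary
            while (true) {
                Socket client = listener.accept();                // the dispatcher loop
                new Thread(() -> handle(client)).start();         // unbounded thread creation
            }
        }
    }

    static void handle(Socket client) {
        try (client; var out = new PrintWriter(client.getOutputStream(), true)) {
            out.println("hello from " + Thread.currentThread().getName());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```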
Multi-Process (Fork-per-Request)¶

- Design: Similar to threads, but uses `fork()` to create a separate process for each request.
- Pros: Isolation. If one request crashes, the server stays alive, and security is better.
- Cons: Extremely heavy overhead. Forking is slow, consumes a lot of memory, and scheduling is expensive.
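Java has no direct `fork()`, so as a rough analogue this sketch launches a whole separate OS process per "request" with `ProcessBuilder` (running `java -version` as hypothetical stand-in work) to make the isolation-versus-overhead trade-off visible.

```java
import java.util.List;

// Process-per-request analogue: each "request" is handled by a freshly launched
// OS process. The isolation is real (a crash in the child cannot take down this
// parent), but so is the cost: every request pays for a full process start-up.
public class ProcessPerRequestSketch {
    public static void main(String[] args) throws Exception {
        for (int request = 1; request <= 3; request++) {           // pretend 3 requests arrive
            long start = System.nanoTime();
            Process child = new ProcessBuilder(List.of("java", "-version"))
                    .inheritIO()                                    // child writes to our console
                    .start();                                       // the expensive part
            int exitCode = child.waitFor();
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("request %d handled by a separate process (exit %d) in %d ms%n",
                              request, exitCode, ms);
        }
    }
}
```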
Thread/Process Pool¶

- Design: Create a fixed number of threads (e.g., 100) at startup. An Acceptor puts requests into a queue, and the threads grab work from the queue in a loop.
- Latency grows as the queue gets longer, since queued requests must wait their turn.
- Why it wins: It gains the concurrency of multi-threading but bounds the overhead. You never crash from having "too many threads" because the pool size is fixed.
- Challenges (see the sketch after this list):
  - If a thread calls `Read file` (Disk I/O), it stops working and goes to sleep until the disk is finished.
  - If you have a pool of 100 threads and all 100 receive requests that need to read from a slow disk, all 100 threads will block (sleep), and the server makes no progress until the disk responds.
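A minimal thread-pool sketch using `ThreadPoolExecutor` with a bounded queue; the pool size, queue capacity, and sleep-as-disk-I/O are assumptions for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Thread-pool design: a fixed number of worker threads pull requests from a queue.
// The pool bounds the overhead, but if every worker blocks on slow disk I/O,
// new requests simply sit in the queue and latency climbs.
public class ThreadPoolServerSketch {
    public static void main(String[] args) throws Exception {
        // 4 workers, at most 100 queued requests (both numbers are arbitrary).
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(100));

        // The "Acceptor": in a real server this loop would call ServerSocket.accept().
        for (int request = 1; request <= 10; request++) {
            int id = request;
            pool.execute(() -> {
                try {
                    Thread.sleep(200);   // stand-in for a blocking disk read
                    System.out.println("request " + id + " done, queue length = "
                            + pool.getQueue().size());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```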
Single Process Event-Based¶

Non-blocking I/O server structure:
```
// Server structure
while (true) {
    events = select()    // list of events that are ready
    process(events)
}
```
- Idea: I/O is the primary bottleneck, not CPU. A single process is sufficient for the compute workload if and only if that process never blocks.
- Mechanism: Uses non-blocking I/O primitives like `select()` or `poll()`. The server runs an infinite loop that checks a list of events (e.g., "new connection," "data ready") and processes them immediately.
- Cons:
  - Hard to program: The developer must manually maintain state machines for every connection using logic like `Key.attachment`.
  - Bad for Disk I/O: Standard non-blocking tools work well for network sockets but often fail to prevent blocking on disk operations, which stalls the entire server.
- Example: To read a 1GB file in a Single Process Event-Based architecture, the server establishes a persistent connection (specifically a `SocketChannel`) with the client, but it cannot simply pause and wait for the full gigabyte to arrive, because that would block the single thread and freeze the application for everyone else. Instead, the system uses a State Machine attached to that specific connection (e.g., via `Key.attachment`) to track progress across thousands of distinct, non-blocking Read Events. When the `select()` loop detects that data has physically arrived in the network buffer, it fires a "read ready" event; the server then wakes up to handle just that specific event, reads only the small chunk of data currently available (perhaps just a few kilobytes), updates the state object (e.g., "received 4KB of 1GB"), and immediately returns to the main loop to process other clients. This cycle repeats (wait for the next read event on the connection, read the new fragment, update the state, return to the loop) until the state machine confirms that all 1GB has been received, allowing the server to handle massive concurrent uploads without ever stalling on a single slow connection.
  - The "Read Events" are the specific, individual notifications from the `select()` loop saying, "Wake up! Another tiny piece of that 1GB puzzle has just arrived."
- A Read Event is considered "non-blocking" because the actual function call to read data (e.g., `read()` or `recv()`) is configured to return immediately, regardless of whether the full data is ready or not. Here is the technical reason why this distinction matters:
  - The "Wait" is Removed: In a standard blocking system, if you ask to read data and the network is slow, the Operating System forces your thread to sleep (block) until the data arrives. You lose control of the CPU.
  - The "Check" Replaces the Wait: In the event-based model, you never call `read()` blindly. You first wait on the `select()` function.
  - The Guarantee: When `select()` returns and tells you a connection is "Ready," it guarantees that data is already sitting in the kernel's memory buffer.
  - The Result: Because the data is already there, the subsequent `read()` call copies that data instantly and returns control to your code in microseconds. The thread never pauses to wait for a packet to travel across the internet; it only harvests what has already arrived.
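Putting the pieces above together, here is a compact sketch of the event loop in Java NIO, the API the notes allude to (`Selector`, `SocketChannel`, and per-connection state attached via `SelectionKey`, i.e. `Key.attachment`); the port, buffer size, and upload-counting state machine are illustrative.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Single-process event-based server: one thread, one select() loop,
// one small state object attached to each connection.
public class EventLoopServerSketch {
    // The per-connection "state machine": how many bytes we have seen so far.
    static final class UploadState {
        long bytesReceived = 0;
        final ByteBuffer scratch = ByteBuffer.allocate(8 * 1024);
    }

    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel listener = ServerSocketChannel.open();
        listener.bind(new InetSocketAddress(8080));               // port is arbitrary
        listener.configureBlocking(false);
        listener.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                                    // wait until something is ready
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    SocketChannel client = listener.accept();     // "new connection" event
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ, new UploadState());
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    UploadState state = (UploadState) key.attachment();
                    state.scratch.clear();
                    int n = client.read(state.scratch);           // returns immediately: data is already buffered
                    if (n == -1) {                                // client finished sending
                        System.out.println("upload complete: " + state.bytesReceived + " bytes");
                        client.close();
                    } else {
                        state.bytesReceived += n;                 // update the state machine, then back to the loop
                    }
                }
            }
            selector.selectedKeys().clear();
        }
    }
}
```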
Asymmetric Multi-Process Event Driven (AMPED)¶

- Design: A hybrid approach that fixes the disk blocking problem of the single-process model.
- The core is still a Single Process Event-Based loop for network tasks.
- It adds Helper Threads specifically to handle blocking Disk I/O tasks.
- Pros: Efficiently uses resources and strictly separates compute (event loop) from I/O (helpers).
- Cons: Remains difficult to program due to the complexity of the event logic.
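A minimal sketch of the AMPED split: the single event loop (simulated here) never performs disk I/O itself; blocking file reads are handed to a couple of helper threads, and completions flow back to the loop through a queue. The file path, pool size, and request IDs are assumptions.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// AMPED sketch: the event loop stays non-blocking; disk work goes to helpers.
public class AmpedSketch {
    record DiskResult(String requestId, int bytes) {}

    public static void main(String[] args) throws Exception {
        ExecutorService diskHelpers = Executors.newFixedThreadPool(2);      // helper threads
        ConcurrentLinkedQueue<DiskResult> completions = new ConcurrentLinkedQueue<>();

        // Pretend the select() loop saw three requests that each need a file from disk.
        for (String requestId : new String[] {"req-1", "req-2", "req-3"}) {
            diskHelpers.submit(() -> {
                try {
                    byte[] data = Files.readAllBytes(Path.of("/etc/hosts")); // blocking read, off the main loop
                    completions.add(new DiskResult(requestId, data.length));
                } catch (Exception e) {
                    completions.add(new DiskResult(requestId, -1));
                }
            });
        }

        // The event loop keeps spinning and drains disk completions like any other event source.
        int handled = 0;
        while (handled < 3) {
            DiskResult done = completions.poll();                 // non-blocking check
            if (done != null) {
                System.out.println(done.requestId() + ": sending " + done.bytes() + " bytes to the client");
                handled++;
            }
            // ... a real loop would also call select() on its network channels here ...
        }
        diskHelpers.shutdown();
    }
}
```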
Staged Event Driven Architecture (SEDA)¶

- Goal: To create a "Well-Conditioned" service—one where performance stays stable (throughput doesn't crash) even when the workload exceeds capacity.
- Design: Breaks the application into a network of distinct Stages that communicate only via explicit queues.
- Internal Structure: Each stage operates like a mini-server with its own incoming Event Queue, Thread Pool, and Event Handler.

- Benefits:
- Modular: Components are isolated, making them easier to design, test, and debug.
- Granular Control: You can assign 100 threads to the "File I/O" stage and only 1 thread to the "HTTP Parse" stage, optimizing based on specific bottlenecks.
- Automated Tuning: Controllers can dynamically adjust thread counts or batch sizes based on the current length of the stage's queue.
Good design abstracts the complex, generic parts, letting the programmer focus on the application-specific parts.
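A minimal SEDA-style sketch: each stage owns an explicit incoming queue and its own thread pool, and stages communicate only by enqueueing events for the next stage. Stage roles, pool sizes, and the toy handlers are assumptions.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// SEDA sketch: each stage = its own event queue + thread pool + event handler.
public class SedaSketch {
    static class Stage {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<>(1_000);  // explicit, bounded queue
        final ExecutorService pool;

        Stage(int threads, Consumer<String> handler) {
            pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
                pool.submit(() -> {
                    while (true) {
                        try {
                            handler.accept(queue.take());   // pull the next event and run the handler
                        } catch (InterruptedException e) {
                            return;                         // pool shutdown
                        }
                    }
                });
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Downstream stage: pretend file I/O is the bottleneck, so it gets the bigger pool.
        Stage fileIo = new Stage(4, req -> System.out.println("file-io served " + req));
        // Upstream stage: cheap parsing, one thread is enough; it forwards into file-io's queue.
        Stage httpParse = new Stage(1, req -> fileIo.queue.add("parsed(" + req + ")"));

        for (int i = 1; i <= 5; i++) {
            httpParse.queue.add("request-" + i);            // events enter the first stage's queue
        }
        Thread.sleep(500);                                   // let the pipeline drain
        httpParse.pool.shutdownNow();
        fileIo.pool.shutdownNow();
    }
}
```

Because each stage has its own pool and queue, a controller could watch a stage's queue length and resize that stage's pool without touching the rest of the pipeline, which is the granular control the benefits list describes.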
Comparison¶
| Architecture | Throughput | Response Time | Programmability | Notes |
|---|---|---|---|---|
| Single Process | Low | High | Easy | - |
| Multi-Threaded | Low | High | Medium | - |
| Multi-Process | Low | High | Medium | - |
| Thread/Process Pool | High | Medium | Medium | Config is hard |
| Single Process Event-Based | High | Low | Hard | - |
| SEDA | High | Low | Medium | - |
