Introduction to Data Centers¶
Data Center Design and Architecture Summary¶
Overview and Metrics¶
**Hierarchy and Motivation.** Data centers leverage economies of scale to amortize capital and maintenance costs. The physical organization typically follows a hierarchy: Machines \(\rightarrow\) Server Racks \(\rightarrow\) Cluster.

**Key Design Metrics.** Architects optimize for three primary variables:
- Performance: Measured in requests per second.
- Cost: Measured in requests per dollar (capital + operation).
- Power: Measured in requests per Watt.
Compute Node Design Options¶
Option 1: SMP (Symmetric Multi-processor)
- Architecture: A shared memory multiprocessor where a set of CPUs (each with its own cache) share main memory over a single bus.
- Pros/Cons: Offers high performance per node but is expensive.
- Because every CPU must use the same shared bus to fetch data, they compete for bandwidth. As you add more CPUs, contention increases; if CPU A is using the bus to read memory, CPU B may have to wait. This physical limitation makes SMP expensive and difficult to scale beyond a certain number of processors, whereas commodity clusters (many distinct machines) can scale much larger because they do not share a single memory bus.
Option 2: Commodity Nodes
- Architecture: Uses standard "off-the-shelf" components to build the cluster.
- Pros/Cons: Significantly lower cost, and a cluster of commodity nodes can match an SMP's aggregate performance at scale, but individual components fail more frequently.
- Instead of building one massive, expensive machine with a shared memory bus (like SMP), you build a Cluster using standard, affordable hardware components that you could buy off the shelf. You organize these individual machines into Server Racks, which are then connected via cluster switches to form a large system. Unlike SMP, where CPUs share memory, every commodity node has its own independent CPU and memory. They talk to each other over a network, not a memory bus.
Option 3: Wimpy Nodes
- Architecture: Uses low-end, low-power CPUs (e.g., ARM processors).
- Pros/Cons: Lower cost and energy consumption, but harder to utilize efficiently.
Performance Analysis and Math¶
**Execution Time Model.** The total execution time is the sum of computation time and communication time:
- Execution Time = Computation Time + Communication Time
**Communication Cost (SMP vs. Commodity).** Communication time depends heavily on whether data is local or remote. Assuming local access takes \(100\,\text{ns}\) and remote access takes \(100\,\mu\text{s}\), the cost model is:
- Communication Time = #ops \(\times \left[\frac{1}{\#nodes} \times 100\,\text{ns} + \left(1 - \frac{1}{\#nodes}\right) \times 100\,\mu\text{s}\right]\)
The term \(\left(1 - \frac{1}{\#nodes}\right)\) is the probability that the data you need is located on a remote machine rather than your own:
- Random Distribution Assumption: the model assumes the data you access is distributed randomly and evenly across all nodes in the cluster.
- The "Local" Chance (\(\frac{1}{\#nodes}\)): with \(N\) nodes, there is a 1-in-\(N\) chance that the piece of data you need happens to be sitting in your own local memory.
- The "Remote" Chance: since the probabilities must sum to 1, the probability that the data is not local is simply 1 minus the local probability.
- \(Pr(Remote) = 1 - Pr(Local)\)
- \(Pr(Remote) = 1 - \frac{1}{\#nodes}\)
- Implication: As the cluster size grows, the probability of remote access increases (\(1 - \frac{1}{\#nodes}\)), causing performance to drop sharply for high-communication workloads.
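To make the model concrete, here is a small Python sketch (an illustration only, assuming the 100 ns local / 100 µs remote figures and the random, even data-placement assumption above) that computes the expected per-access cost as the cluster grows:

```python
# Expected per-operation access time under random data placement.
# Assumes: local access = 100 ns, remote access = 100 us (from the model above).
LOCAL_NS = 100            # 100 ns local memory access
REMOTE_NS = 100_000       # 100 us remote access, expressed in ns

def expected_access_ns(num_nodes: int) -> float:
    """P(local) = 1/N, P(remote) = 1 - 1/N."""
    p_local = 1 / num_nodes
    return p_local * LOCAL_NS + (1 - p_local) * REMOTE_NS

for n in (1, 2, 10, 100, 1000):
    print(f"{n:>5} nodes: {expected_access_ns(n):,.0f} ns per access")
# As N grows, the expected cost approaches the remote latency (100,000 ns),
# which is why high-communication workloads slow down sharply on large clusters.
```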
Amdahl's Law (Wimpy Node Limitations)
The potential speed-up is bounded by the sequential part of the code. If \(p\) is the ratio of code that can run in parallel and \(s\) is the number of cores:
Task execution is \(T=(1-p)T+pT\)
After parallelization on \(s\) cores, \(T' = (1-p)T+\frac{p}{s}T\)
Speed-up is defined as the Original Time divided by the New Time: \(\text{Speed-up} = \frac{T}{T'} = \frac{1}{(1-p)+\frac{p}{s}}\)
If \(s \rightarrow \infty\), the maximum theoretical speed-up is limited to \(\frac{1}{1-p}\) (see the sketch at the end of this subsection).
- Disadvantages: Wimpy nodes require a much larger degree of parallelism (a larger \(s\)), which increases serialization/communication costs and makes programming harder.
- You use many "wimpy" nodes (increasing \(s\)) to try to match the speed of a powerful processor. The math says increasing \(s\) improves the theoretical speed-up. In reality, there is a time penalty: the more nodes you have to coordinate, the more time the system spends "talking" (network overhead) instead of "working" (computing). If that communication overhead gets too high, the system can actually become slower than a single powerful node, despite having more CPUs.
The trade-off is essentially:
- You save: Money on hardware and electricity.
- You pay: Complexity in software and latency in the network.
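Here is a minimal sketch of Amdahl's law using the speed-up formula above; the parallel fraction \(p = 0.95\) is an arbitrary illustration, not a number from the slides:

```python
# Amdahl's law: speedup(s) = 1 / ((1 - p) + p / s)
# p is the parallelizable fraction of the work, s the number of cores/nodes.
def speedup(p: float, s: int) -> float:
    return 1.0 / ((1.0 - p) + p / s)

p = 0.95  # illustrative value, not from the slides
for s in (1, 4, 16, 64, 256, 1024):
    print(f"s = {s:>4}: speedup = {speedup(p, s):5.2f}x")
print(f"upper bound (s -> infinity): {1 / (1 - p):.0f}x")
# Even with unlimited wimpy nodes, the serial 5% caps the speedup at 20x,
# and real systems fall short of this because coordination adds overhead.
```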
Storage Design Options¶
Option 1: Network Attached Storage (NAS)
- Concept: A dedicated storage appliance attached to the network.
- What it is: Instead of putting hard drives inside every single server node, you buy a specialized, high-performance machine (the "appliance") whose only job is to store data.
- How it connects: All the compute nodes (Commodity or SMP) connect to this central appliance over the network to read and write files.
- Pros: Simpler deployment, better management (QoS: Quality of Service), and lower network overhead, since replication happens inside the appliance (or with a backup appliance), keeping that "chatter" off the main cluster network.
Option 2: Distributed Storage
- Concept: Aggregates storage space from the local disks of nodes in the cluster.
- Save money by using the cheap, empty hard drive space found inside the compute nodes themselves, aggregating them into one virtual pool.
- Pros: Lower cost (uses cheap disks), higher availability (via software replication, even though individual components fail more often), and higher data locality (compute moves to data).
- Cons: Higher network overhead and lower component reliability.
Data Center Network Design¶
The Challenge
The primary goal is to build a high-speed, scalable network while keeping costs low.
Network Architecture (The Hierarchy)¶

The diagram illustrates a hierarchical tree topology commonly used in data centers:
- Edge Layer: Top-of-Rack (ToR) switches connecting directly to the servers in the racks (organized into "Pods").
- Aggregation Layer: Switches that aggregate traffic from multiple Pods.
- Core Layer: High-capacity switches that handle traffic between different aggregation blocks and the outside world.
Optimization Tricks
To manage the immense cost of bandwidth, architects use "Oversubscription":
- 5:1 Ratio: It is common to oversubscribe the links from the edge toward the core. A 5:1 ratio means that for every 5 Gbps of aggregate bandwidth entering the access (edge) layer, only 1 Gbps of uplink bandwidth is provisioned toward the core. This works because not all servers transmit at maximum speed simultaneously (a short sketch follows this list).
- Imagine 5 servers at the bottom each have a 1 Gbps cable (Total = 5 Gbps capability).
- The switch connecting them only has one 1 Gbps cable going up to the Core.
- Result: If all 5 servers try to send or receive traffic through the core at full speed simultaneously, they are bottlenecked at the uplink and each effectively gets about 1/5 of its link speed.
- Why do this? Because it saves massive amounts of money. Architects assume that not every server runs at 100% speed at the exact same second.
- Traffic Separation: Using multiple distinct physical networks for different traffic types (e.g., a separate Storage Area Network (SAN) vs. a Control Network).
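As referenced above, here is a tiny sketch of the oversubscription arithmetic; the 1 Gbps per-server links and the 5:1 ratio are the illustrative numbers from the bullet above:

```python
# Per-server bandwidth under oversubscription when everyone transmits at once.
servers_per_uplink = 5
server_link_gbps = 1.0          # each server's access link
uplink_gbps = 1.0               # single uplink shared by the 5 servers

oversubscription = servers_per_uplink * server_link_gbps / uplink_gbps
worst_case_per_server = uplink_gbps / servers_per_uplink

print(f"oversubscription ratio: {oversubscription:.0f}:1")
print(f"worst-case cross-layer bandwidth per server: {worst_case_per_server * 1000:.0f} Mbps")
# 5:1 ratio -> each server gets ~200 Mbps toward the core if all 5 burst together,
# which is usually acceptable because they rarely do.
```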
"Numbers Everyone Should Know" (Latency Hierarchy)¶
Software performance is dominated by the physical distance data must travel. The slides present Jeff Dean's famous latency scale:
- L1 Cache Reference: \(0.5 \text{ ns}\) (The baseline).
- Main Memory Reference: \(100 \text{ ns}\) (~200x slower than L1).
- Network (Send 2K bytes): \(20,000 \text{ ns}\) (~200x slower than memory).
- Disk Seek: \(10,000,000 \text{ ns}\) (10 million ns - massive latency cliff).
Key Takeaway:
The gap between accessing local memory and accessing the network or disk spans several orders of magnitude. Software must be designed to avoid these penalties whenever possible.

Design Implications for Software¶
Because of the hardware realities (latency gaps and commodity failures), the software layer must adapt.
1. Hierarchy Awareness Software cannot treat all memory as equal. It must be Network/Locality aware, moving computation to where the data is (RAM/Local Disk) rather than moving data to the computation.
2. Fault Tolerance Since commodity nodes fail frequently, software fault tolerance is necessary. The application itself must handle retries and replication, rather than relying on perfect hardware.
3. Abstraction Frameworks To prevent every programmer from dealing with these complexities manually, systems use programming frameworks (like MapReduce or Spark) to hide the complexity of distributed storage, synchronization, and failure recovery.
Examples and Case Studies¶
Example 1¶

With a 4:1 oversubscription ratio, each rack effectively gets only ¼ of the within-rack bandwidth when communicating across racks (128 MB/s within the rack vs. 32 MB/s across the data center, as the table below shows).
| | RAM | Disk | Rack RAM (B) | Rack Disk (B) | DC RAM (C) | DC Disk (C) |
|---|---|---|---|---|---|---|
| Time | 100 ns (1) | 10 ms (4) | 70 µs (2) | 10 ms (5) | 500 µs (3) | 10 ms (6) |
| BW | 20 GB/s (1) | 80 MB/s (3) | 128 MB/s (2) | 80 MB/s (3) | 32 MB/s (4) | 32 MB/s (4) |
Order: ns < µs < ms (fast → slow)
In numbers:
- 1 ms = 1,000 µs
- 1 µs = 1,000 ns
- Therefore 1 ms = 1,000,000 ns
We use the performance numbers from the slide and apply one simple rule:
👉 The slowest component in the path dominates the result. (i.e., latency ≈ max(latencies), bandwidth ≈ min(bandwidths))
Column 1 — RAM (local)
Path: **CPU → local RAM**
Only one component:
- Time = **100 ns**
- BW = **20 GB/s**
👉 No bottleneck besides RAM itself.
Column 2 — Disk (local)
Path: CPU → local disk
Only one component:
- Time = 10 ms
- BW = 80 MB/s
👉 Disk is much slower than RAM, so this is the cost.
Column 3 — Rack RAM (B)
Path: CPU → same-rack network → RAM of B
Components:
- Network: 70 µs, 128 MB/s
- RAM: 100 ns, 20 GB/s
Apply bottleneck rule:
- Time = max(70 µs, 100 ns) ≈ 70 µs
- BW = min(128 MB/s, 20 GB/s) = 128 MB/s
👉 Network dominates, RAM is effectively “free” compared to it.
Column 4 — Rack Disk (B)
Path: CPU → same-rack network → disk of B
Components:
- Network: 70 µs, 128 MB/s
- Disk: 10 ms, 80 MB/s
Bottleneck:
- Time = max(70 µs, 10 ms) ≈ 10 ms
- BW = min(128 MB/s, 80 MB/s) = 80 MB/s
👉 Disk dominates. The network is fast compared to disk.
Column 5 — DC RAM (C)
Path: CPU → cross-rack (DC) network → RAM of C
Components:
- Network: 500 µs, 32 MB/s
- RAM: 100 ns, 20 GB/s
Bottleneck:
- Time = max(500 µs, 100 ns) ≈ 500 µs
- BW = min(32 MB/s, 20 GB/s) = 32 MB/s
👉 DC network dominates.
Column 6 — DC Disk (C)
Path: CPU → cross-rack network → disk of C
Components:
- Network: 500 µs, 32 MB/s
- Disk: 10 ms, 80 MB/s
Bottleneck:
- Time = max(500 µs, 10 ms) ≈ 10 ms
- BW = min(32 MB/s, 80 MB/s) = 32 MB/s
👉 Disk dominates latency, network dominates bandwidth.
One-sentence intuition
Locality is king: every extra “hop” (disk or network) adds a much slower bottleneck.
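The bottleneck rule is easy to mechanize. Here is a small Python sketch, a rough illustration using the latency and bandwidth figures from the table above, that reproduces the six columns:

```python
# Latency ~ max over the components in the path; bandwidth ~ min over them.
# Figures taken from the table above: (latency in seconds, bandwidth in MB/s).
RAM      = (100e-9, 20_000)   # 100 ns, 20 GB/s
DISK     = (10e-3, 80)        # 10 ms, 80 MB/s
RACK_NET = (70e-6, 128)       # 70 us, 128 MB/s
DC_NET   = (500e-6, 32)       # 500 us, 32 MB/s

def path_cost(*components):
    latency = max(lat for lat, _ in components)
    bandwidth = min(bw for _, bw in components)
    return latency, bandwidth

paths = {
    "local RAM":     [RAM],
    "local Disk":    [DISK],
    "Rack RAM (B)":  [RACK_NET, RAM],
    "Rack Disk (B)": [RACK_NET, DISK],
    "DC RAM (C)":    [DC_NET, RAM],
    "DC Disk (C)":   [DC_NET, DISK],
}
for name, comps in paths.items():
    lat, bw = path_cost(*comps)
    print(f"{name:>13}: time ~ {lat * 1e6:>9.1f} us, bw ~ {bw} MB/s")
```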
Special Cases: Pipeline vs Stop-and-Forward¶
Assume the link bandwidth is 128 MB/s on both hops: A → B and B → C.

Q1 (Stop-and-Forward)
How long does it take to transfer 1 GB of data from A’s memory to B’s memory? Then, after the entire data is in B’s memory, transfer it from B’s memory to C’s memory?
Q2 (Pipeline)
How long does it take to transfer 1 GB of data from A’s memory to B’s memory, then from B’s memory to C’s memory, if the data is pipelined through B (i.e., B forwards data to C as it arrives)?
Solution
Let 1 GB = 1024 MB (1 GiB, common in systems).
Bandwidth on each link: 128 MB/s
Step 1: Time to send 1 GB over one 128 MB/s link
Time = Data / Bandwidth
= 1024 MB / 128 MB/s
= 8 s
Q1 Answer: Stop-and-Forward (no pipelining)
You do A→B first (8 s), then B→C (another 8 s):
Total = 8 s + 8 s = 16 s
Q2 Answer: Pipelining through B
B can forward to C while receiving from A.
- The end-to-end throughput is limited by the slowest stage.
- Both stages are the same: 128 MB/s.
- So after a brief startup, the whole transfer completes in approximately the time of one hop, not two.
Total ≈ 1024 MB / 128 MB/s = 8 s
(If you accounted for non-negligible propagation/startup latency, you’d add a small constant, but none is given here.)
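A short sketch of the two transfer modes, assuming the 1 GiB payload and 128 MB/s links from this problem and ignoring propagation/startup latency as the solution does:

```python
# Stop-and-forward: the hops happen one after the other, so their times add.
# Pipelined: B forwards as it receives, so throughput is set by the slowest hop.
SIZE_MB = 1024                     # 1 GiB
hops_mbps = [128, 128]             # A->B and B->C link bandwidths

stop_and_forward = sum(SIZE_MB / bw for bw in hops_mbps)
pipelined = SIZE_MB / min(hops_mbps)

print(f"stop-and-forward: {stop_and_forward:.0f} s")   # 8 s + 8 s = 16 s
print(f"pipelined:        {pipelined:.0f} s")          # limited by slowest hop = 8 s
```

Changing `hops_mbps` to `[128, 64]` reproduces the bottleneck-link case in the next subsection (16 s).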
Special Cases: Bottleneck Link¶

Assume 1 GB = 1024 MB.
Bandwidths:
- A → B: 128 MB/s
- B → C: 64 MB/s (this is the bottleneck)
Question.
How long does it take to transfer 1 GB of data from A’s memory to B’s memory, then from B’s memory to C’s memory, if the data is pipelined through B?
Solution.
With pipelining, the end-to-end transfer rate is limited by the slowest link (the bottleneck).
The slowest link is B → C at 64 MB/s.
Time = Data / Bandwidth
= 1024 MB / 64 MB/s
= 16 seconds
So the total time to move 1 GB from A → B → C (with pipelining through B) is 16 s.
Example 2¶

Assume 1 GB = 1024 MB.
Given:
- Network (same rack): 70 µs latency, 128 MB/s throughput
- Disk (hard disk): 10 ms latency, 80 MB/s throughput
Question:
How long does it take to complete all of the following?
- Transfer 1 GB of data from A’s memory to B’s memory.
- After receiving the entire data in memory, write the data on B’s disk.
Solution:
- A (RAM) → B (RAM) is a same-rack network transfer. Time_transfer = latency + (size / bandwidth) = 70 µs + (1024 MB / 128 MB/s) = 70 µs + 8 s ≈ 8.00007 s (≈ 8 s)
- Write 1 GB from B’s memory to B’s disk is a disk write. Time_disk = latency + (size / bandwidth) = 10 ms + (1024 MB / 80 MB/s) = 0.010 s + 12.8 s = 12.81 s
Total time (sequential, because you must receive all data before writing): = 8.00007 s + 12.81 s = 20.81007 s ≈ 20.81 s
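The same latency-plus-transfer accounting in code, a sketch using the figures from this example; the two steps are added because the disk write starts only after the full transfer arrives:

```python
# Sequential: receive 1 GiB over the rack network, then write it to local disk.
SIZE_MB = 1024

def step_time(latency_s: float, bw_mbps: float) -> float:
    return latency_s + SIZE_MB / bw_mbps

network = step_time(70e-6, 128)   # ~8.00007 s
disk    = step_time(10e-3, 80)    # ~12.81 s
print(f"total: {network + disk:.2f} s")  # ~20.81 s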
Example III — Concurrent Transfers (A)¶

Assume 1 GB = 1024 MB.
Given (same-rack network):
- Latency = 70 µs
- Throughput = 128 MB/s
Question:
How long does it take to replicate 1 GB of data from A’s memory to B and D’s memories, concurrently?
Solution:
A is sending to two receivers (B and D) at the same time over the same outgoing network link. So A’s effective sending bandwidth is shared between the two transfers:
Per-transfer bandwidth = 128 MB/s ÷ 2 = 64 MB/s
Time for each transfer: = latency + (size / per-transfer_bandwidth) = 70 µs + (1024 MB / 64 MB/s) = 70 µs + 16 s ≈ 16.00007 s
Because the two transfers run concurrently and finish at the same time, the total wall-clock time is: ≈ 16.00007 s ≈ 16 s
Example III — Concurrent Transfers (B)¶

Assume 1 GB = 1024 MB.
A has one same-rack uplink of 128 MB/s total.
It is sending data simultaneously to:
- B (same rack)
- C (different rack)
So A must split its outgoing capacity between the two receivers.
However, the split is not 64/64 because the two paths are not symmetric:
- B is limited by same-rack network: 128 MB/s
- C is limited by DC network: 32 MB/s
So C can only ever receive 32 MB/s, even if A wanted to send more.
That leaves: 128 − 32 = 96 MB/s available for B
So the effective per-receiver rates become:
- A → C: 32 MB/s (network bottleneck)
- A → B: 96 MB/s (remaining rack bandwidth)
Question:
How long does it take to replicate 1 GB of data from A’s memory to B and C’s memories, concurrently?
Solution:
The two transfers happen at the same time, so the total wall-clock time is the maximum of the two individual transfer times.
A → B Time = latency + size / bandwidth
= 70 µs + (1024 MB / 96 MB/s)
≈ 70 µs + 10.67 s
≈ 10.67 s
A → C Time = 500 µs + (1024 MB / 32 MB/s)
≈ 500 µs + 32 s
≈ 32.0005 s
Concurrent total time: = max(10.67 s, 32.0005 s) = 32.0005 s ≈ 32 s
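A small sketch of the asymmetric split, assuming (as the solution above does) that A caps the A → C stream at the 32 MB/s DC rate and gives the remaining uplink capacity to B:

```python
# A's 128 MB/s uplink is shared; the cross-rack stream is capped by the DC network.
SIZE_MB = 1024
UPLINK_MBPS = 128
DC_NET_MBPS = 32

rate_to_c = min(UPLINK_MBPS, DC_NET_MBPS)   # 32 MB/s, network bottleneck
rate_to_b = UPLINK_MBPS - rate_to_c         # 96 MB/s left for the same-rack peer

t_b = 70e-6 + SIZE_MB / rate_to_b           # ~10.67 s
t_c = 500e-6 + SIZE_MB / rate_to_c          # ~32.0005 s
print(f"A->B: {t_b:.2f} s, A->C: {t_c:.2f} s, total (concurrent): {max(t_b, t_c):.2f} s")
```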
Back-of-the-Envelope Calculations — Explanation¶
Problem:
How long does it take to generate an image results page with 30 thumbnails, where each thumbnail is 256 KB, and disk read latency is 10 ms per seek, with 30 MB/s disk throughput?
We analyze two designs.
Design 1 — Read serially (one by one)¶
You read each thumbnail one at a time from disk.
For each image you pay:
- 1 disk seek = 10 ms
- 1 data transfer = 256 KB / 30 MB/s
Compute transfer time per image:
256 KB = 0.256 MB
0.256 MB / 30 MB/s ≈ 0.0085 s ≈ 8.5 ms
So per image: 10 ms (seek) + ~8.5 ms (read) ≈ 18.5 ms
For 30 images (done serially): 30 × 18.5 ms ≈ 555 ms ≈ 560 ms
This matches the slide’s calculation:
30 seeks × 10 ms/seek + 30 × 256K / 30 MB/s = 560 ms
Key idea:
👉 Dominated by seek latency, because you pay it 30 times.
Design 2 — Issue reads in parallel¶
Now you issue all 30 disk reads in parallel (e.g., the thumbnails are spread across different disks/spindles), so the reads overlap instead of queueing one after another:
- Each read still costs one seek (10 ms) plus one 256 KB transfer (≈ 8.5 ms).
- Because the reads proceed concurrently, the wall-clock time is roughly the cost of a single read, not 30 of them.
So total ≈
10 ms (one seek) + 8.5 ms (data transfer) ≈ 18 ms
This matches the slide:
10 ms/seek + 256K read / 30 MB/s = 18 ms
(The slide quotes the per-read numbers; with full overlap, that is also the total.)
The slide notes:
“Ignores variance, so really more like 30–60 ms, probably”
Because in reality:
- Disks may not perfectly parallelize all requests
- There may be queuing delays
Moral:
👉 When many small files are involved, minimizing disk seeks (by batching/parallelizing) matters much more than raw bandwidth.
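A back-of-the-envelope sketch of both designs, using the numbers from the problem; the parallel case assumes the 30 reads fully overlap, which the slide itself flags as optimistic:

```python
# 30 thumbnails of 256 KB each; disk: 10 ms per seek, 30 MB/s throughput.
N_IMAGES = 30
IMAGE_MB = 0.256         # 256 KB
SEEK_S = 0.010
DISK_MBPS = 30

per_image = SEEK_S + IMAGE_MB / DISK_MBPS          # ~18.5 ms per thumbnail

serial = N_IMAGES * per_image                      # pay the seek 30 times
parallel = per_image                               # overlapping reads ~ one image's cost

print(f"serial reads:   {serial * 1000:.0f} ms")   # ~555-560 ms
print(f"parallel reads: {parallel * 1000:.1f} ms") # ~18 ms (ignoring variance/queuing)
```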