Real-Time Analytics¶
Algorithms¶
Reservoir Sampling¶
Goal:
Maintain a uniform random sample of size \(S\) from an infinite or unknown stream of \(N\) items.
- Storage: \(O(S)\) space.
-
Logic:
- Store the first \(S\) items.
- For every \(k\)-th item (\(k > S\)), keep it with probability \(P = \frac{S}{k}\).
- If kept, randomly replace one of the existing \(S\) items.
-
Key Property: Every item in the stream has an equal \(\frac{S}{N}\) probability of being in the final sample.
HyperLogLog (HLL)¶
Goal:
Estimate the cardinality (number of unique elements) in a massive dataset.
- Storage: \(O(\log_2(\log_2(N)))\) space (Extremely tiny).
- Logic:
- Hash each input to a binary string.
- Record the maximum number of leading zeros (\(x\)) observed in the hashes.
- Estimate the count as \(E \approx 2^x\).
Bloom Filter¶
Goal:
Test membership ("Is this item in the set?") with zero False Negatives.
- Storage: A bit array of size \(m\) and \(k\) hash functions.
- Logic:
- Start with an array of all \(0\)s.
- To Add: Hash the item \(k\) times and set those bit positions to \(1\).
- To Query: If any of the \(k\) hashed positions are \(0\), the item is definitely NOT in the set. If all are \(1\), the item is PROBABLY in the set.
- False Positive Rate: \(\approx (1 - e^{-kn/m})^k\)
- Can tune false positive rate by adjusting values for \(m\) and \(k\)
Count-Min Sketch¶
Goal:
Estimate the frequency of elements in a stream (Frequency Estimation).
- Storage: A 2D array (matrix) of \(d\) rows and \(w\) columns of integer counters.
-
Logic:
- For every arriving item \(i\), apply \(d\) different hash functions.
- Increment the counter at \(\text{Count}[j, h_j(i)]\) for each row \(j\).
- To Query the frequency of \(i\), return the minimum value among all its hashed cells:
\[ \hat{f_i} = \min_{j} \text{Count}[j, h_j(i)] \] -
Key Property: The estimate is always \(\ge\) the true frequency, but the "min" helps ignore collisions.
Kafka¶
- Maximum parallelism of a consumer group: #consumers (in the group) <= #partitions
- Guarantees when reading data from Kafka
- A message is only ever read by a single consumer in a group.
- A consumer sees messages in the order they were stored in the log.
- The order of messages is only guaranteed within a partition.
-
When a Kafka broker "meets" its consumers through rebalancing, it is essentially performing a dynamic "Musical Chairs" operation to ensure that no data is left unread and no consumer is overwhelmed.
1. Independent "Viewers" (Consumer Groups)
One of the most powerful things about Kafka is that it allows multiple applications to look at the same data simultaneously.
- You can have one Analytics Group that is 2 hours behind, calculating daily trends.
- You can have a Real-time Group that is processing messages as they arrive.
- These groups never interfere with each other because each group manages its own "bookmark" (the Offset) in the partition.
2. The Scaling Rule
Inside a single group, Kafka enforces a strict rule: one partition can only be read by one consumer at a time.
- This is why the formula #consumers <= #partitions matters.
- If you have 4 partitions but 6 consumers, 2 of those consumers will sit idle.
- This restriction is a safety feature—it ensures that messages are processed in the correct order (e.g., you don't "Withdraw" before the "Deposit" message is finished).
3. How Rebalancing Happens
Rebalancing is the "traffic control" event that kicks in whenever the group membership changes (like when a consumer crashes or a new server is added).
- The Pause: The broker signals a rebalance, and all consumers stop reading.
- The Leader: One consumer is chosen as the "Leader" to decide the new map of who owns which partition.
- The Resumption: Consumers receive their new assignments and check the last committed offset (the "save point") to pick up exactly where the previous owner left off.
When a Kafka group rebalances, the Broker (acting as Coordinator) manages the communication, but the Consumer Leader is actually the one that calculates the new partition assignments based on the group's chosen strategy. This "leader-side" assignment allows for flexible, custom logic without needing to change the broker itself. Once the Leader shares the plan, the Broker distributes the assignments to all members, who then query the Broker for the last committed offset—the shared "save point"—stored in the system. This ensures that the new owner of a partition picks up exactly where the previous one left off, maintaining a continuous and ordered stream of data.